
Practical Privacy-Preserving Data Mining

2008


Proceedings of SDM'08 International Workshop on Practical Privacy-Preserving Data Mining

Edited by Kun Liu, IBM Almaden Research Center, USA, and Ran Wolff, University of Haifa, Israel

April 26, 2008, Atlanta, Georgia, USA

Copyright Notice: This volume is published electronically and is available from the workshop web page at http://www.cs.umbc.edu/~kunliu1/p3dm08/. The copyright of each paper in this volume resides with its authors. These papers appear in these electronic workshop proceedings by the authors' permission, implicitly granted by their submission to the workshop.

Contents

Acknowledgements
Message from the Workshop Chairs - Kun Liu and Ran Wolff
Keynote - Privacy & Data Protection: Policy and Business Trends - Harriet P. Pearson
Towards Application-Oriented Data Anonymization - Li Xiong, Kumudhavalli Rangachari
Efficient Algorithms for Masking and Finding Quasi-Identifiers - Rajeev Motwani, Ying Xu
On the Lindell-Pinkas Secure Computation of Logarithms: From Theory to Practice - Raphael S. Ryger, Onur Kardes, Rebecca N. Wright
Constrained k-Anonymity: Privacy with Generalization Boundaries - John Miller, Alina Campan, Traian Marius Truta
Privacy-Preserving Predictive Models for Lung Cancer Survival Analysis - Glenn Fung, Shipeng Yu, Cary Dehing-Oberije, Dirk De Ruysscher, Philippe Lambin, Sriram Krishnan, R. Bharat Rao

Acknowledgements

We owe thanks to many people for their contributions to the success of the workshop. We would like to thank first the organizers of SDM'08 for hosting the workshop, and IBM Almaden Research Center for its generous sponsorship. We thank all the authors who submitted papers to the workshop, which allowed us to put together an impressive program. We would also like to express our thanks to our invited speaker, Ms. Harriet P. Pearson, the Chief Privacy Officer of IBM Corporation, for her talk titled "Privacy & Data Protection: Policy and Business Trends". We would especially like to thank the members of the program committee for giving up their time to review submissions. Despite the short period of time available, they provided thorough, objective evaluations of their allocated papers, ensuring a high standard of presentations at the workshop. Their names are gratefully listed below.

Workshop Co-chairs: Kun Liu and Ran Wolff

Program Committee:
Osman Abul, TOBB University, Turkey
Elisa Bertino, Purdue University, USA
Francesco Bonchi, KDD Lab, ISTI-C.N.R., Pisa, Italy
Alexandre Evfimievski, IBM Almaden Research Center, USA
Chris Giannella, Loyola College in Baltimore, Maryland, USA
Murat Kantarcioglu, University of Texas, Dallas, USA
Hillol Kargupta, University of Maryland Baltimore County, USA
Bradley Malin, Vanderbilt University, Nashville, USA
Taneli Mielikäinen, Nokia Research Center Palo Alto, USA
Benny Pinkas, University of Haifa, Israel
Yucel Saygin, Sabanci University, Istanbul, Turkey
Evimaria Terzi, IBM Almaden Research Center, USA
Ke Wang, Simon Fraser University, Canada
Rebecca Wright, Rutgers University, USA
Xintao Wu, University of North Carolina at Charlotte, USA

Sponsored by the SIAM Conference on Data Mining (SDM'08) and IBM Almaden Research Center

Message from the P3DM'08 Workshop Chairs
Kun Liu, IBM Almaden Research Center, USA
Ran Wolff, University of Haifa, Israel

Governmental and commercial organizations today capture large amounts of data on individual behavior and increasingly apply data mining to it. This has raised serious concerns for individuals' civil liberties as well as their economic well-being.
In 2003, concerns over the U.S. Total Information Awareness (also known as Terrorism Information Awareness) project led to the introduction of a bill in the U.S. Senate that would have banned any data mining programs in the U.S. Department of Defense. Debates over the need for privacy protection vs. service to national security and business interests were held in newspapers, magazines, research articles, television talk shows and elsewhere. Currently, both the public and businesses seem to hold polarized opinions: there are those who think an organization can analyze information it has gathered for any purpose it desires, and those who think that every type of data mining should be forbidden. Neither position does the issue much justice: the former promotes public fear (notably, Sun's Scott McNealy's '99 remark "You have no privacy, get over it!") and the latter overly restrictive legislation. The truth of the matter is not that technology has progressed to the point where privacy is not feasible, but rather the opposite: privacy preservation technology has to advance to the point where privacy no longer relies on an accidental lack of information but rather on an intentional and engineered inability to know. This belief is at the heart of privacy-preserving data mining (PPDM). Pioneered by Agrawal & Srikant's and Lindell & Pinkas' work from 2000, the area has seen an explosive number of publications. Many privacy-preserving data mining techniques have been proposed, questioned, and improved. However, compared with the active and fruitful research in academia, applications of privacy-preserving data mining to real-life problems are quite rare. Without practice, it is feared that research in privacy-preserving data mining will stagnate. Furthermore, lack of practice may hint at serious problems with the underlying concepts of privacy-preserving data mining. Identifying and rectifying these problems must be a top priority for advancing the field. Following these understandings, we set out to arrange a workshop on the practical aspects of privacy-preserving data mining. We were encouraged by the enthusiastic response of our PC members, to whom we would like to convey our immense gratitude. The workshop drew eight submissions, of which five were selected for presentation. As you will find in this collection, they range from real PPDM applications to efficiency improvements of known algorithms. Additionally, we take pride in the participation of Harriet P. Pearson, the Chief Privacy Officer of IBM. In our perception, CPOs of large businesses such as IBM are likely to be important stakeholders in any application of PPDM, and their view should be highly relevant for our community.

Keynote - Privacy & Data Protection: Policy and Business Trends
Harriet P. Pearson, VP Regulatory Policy & Chief Privacy Officer, IBM Corporation

Business, individuals and the public sector continue to take advantage of the rapid development and adoption of Web-based and other technologies. With innovation in business models and processes, as well as changing individual behaviors in venues such as online social networking, comes the need to address privacy and data protection. This phenomenon occurs every time society has embraced significant new technologies -- whether it be the printing press, telephone or home video rentals. But faster introduction and uptake of technologies in our time results in a larger number of issues and challenges to sort out.
This talk will outline the major privacy and data protection trends and their likely effect on the development of public policies, industry practices and technology design.

Towards Application-Oriented Data Anonymization∗†
Li Xiong‡  Kumudhavalli Rangachari§

∗ P3DM'08, April 26, 2008, Atlanta, Georgia, USA.
† This research is partially supported by an Emory URC grant.
‡ Dept. of Math & Computer Science, Emory University
§ Dept. of Math & Computer Science, Emory University

Abstract

Data anonymization is of increasing importance for allowing sharing of individual data for a variety of data analysis and mining applications. Most existing work on data anonymization optimizes the anonymization in terms of data utility through one-size-fits-all measures such as data discernibility. Our primary viewpoint in this paper is that each target application may have a unique need of the data, and that the best way of measuring data utility is based on the analysis task for which the anonymized data will ultimately be used. We take a top-down analysis of typical application scenarios and derive application-oriented anonymization criteria. We propose a prioritized anonymization scheme where we prioritize the attributes for anonymization based on how important and critical they are to the application needs. Finally, we present preliminary results that show the benefits of our approach.

1 Introduction

Data privacy and identity protection is a very important issue in this day and age, when huge databases containing a population's information need to be stored and distributed for research or other purposes. For example, the National Cancer Institute initiated the Shared Pathology Informatics Network (SPIN)¹ for researchers throughout the country to share pathology-based data sets annotated with clinical information to discover and validate new diagnostic tests and therapies, and ultimately to improve patient care. However, individually identifiable health information is protected under the Health Insurance Portability and Accountability Act (HIPAA)². The data have to be sufficiently anonymized before being shared over the network.

These scenarios can be generalized into the problem of privacy preserving data publishing, where a data custodian needs to distribute an anonymized view of the data that does not contain individually identifiable information to data recipient(s) for various data analysis and mining tasks. Privacy preserving data publishing has been extensively studied in recent years and a few principles have been proposed that serve as criteria for judging whether a published dataset provides sufficient privacy protection [40, 34, 43, 3, 32, 53, 35, 37]. Notably, the earliest principle, k-anonymity [40], requires a set of k records (entities) to be indistinguishable from each other based on a quasi-identifier set, and its extension, l-diversity [34], requires every group to contain at least l well-represented sensitive values. A large body of work contributes to transforming a dataset to meet a privacy principle (dominantly k-anonymity) using techniques such as generalization, suppression (removal), permutation and swapping of certain data values while minimizing certain cost metrics [20, 50, 36, 9, 2, 17, 10, 59, 29, 30, 31, 49, 27, 51, 58].

Most of these methods aim to optimize the data utility measured through a one-size-fits-all cost metric such as general discernibility or information loss. A few works have considered targeted applications like classification and regression [21, 50, 17, 31] but do not model other kinds of applications nor provide a systematic or adaptive approach for handling various needs.

¹ Shared Pathology Informatics Network. http://www.cancerdiagnosis.nci.nih.gov/spin/
² Health Insurance Portability and Accountability Act (HIPAA). http://www.hhs.gov/ocr/hipaa/. State law or institutional policy may differ from the HIPAA standard and should be considered as well.
Contributions. Our primary viewpoint in this paper is that each target application may have a unique need of the data and the best way of measuring data utility is based on the analysis task for which the anonymized data will ultimately be used. We aim to adapt existing methods by incorporating the application needs into the anonymization process, thereby increasing its utility to the target applications. The paper makes a number of contributions. First, we take a top-down analysis of potential application scenarios and devise models and schemes to represent application requirements in terms of relative attribute importance, which can be specified by users or learned from targeted analysis and mining tasks. Second, we propose a prioritized anonymization scheme where we prioritize the attributes for anonymization based on how important and critical they are to the application needs. We devise a prioritized cost metric that allows users to assign different weights to different attributes and adapt existing generalization-based anonymization methods in order to produce an optimized view for the user applications. Finally, we present preliminary results that show the benefits of our approach.

2 Related Work

Our research is inspired and informed by a number of related areas. We discuss them briefly below.

Privacy Preserving Access Control and Statistical Databases. Previous work on multilevel secure relational databases [22] provides many valuable insights for designing a fine-grained secure data model. Hippocratic databases [7, 28, 5] incorporate privacy protection within relational database systems. Byun et al. presented a comprehensive approach for privacy preserving access control based on the notion of purpose [14]. While these mechanisms enable multilevel access of sensitive information through access control at a granularity level up to a single attribute value for a single tuple, micro-views of the data are desired where even a single value of a tuple attribute may have different views [13]. Research in statistical databases has focused on enabling queries on aggregate information (e.g. sum, count) from a database without revealing individual records [1]. The approaches can be broadly classified into data perturbation and query restriction. Data perturbation involves either altering the input databases or altering the query results returned. Query restriction includes schemes that check for possible privacy breaches by keeping audit trails and controlling the overlap of successive aggregate queries. The techniques developed have focused only on aggregate queries and relational data types.

Privacy Preserving Data Mining. One data sharing model is the mining-as-a-service model, in which individual data owners submit the data to a data collector for mining, or a data custodian outsources mining to an untrusted service provider. The main approach is random perturbation, which transforms data by adding random noise in a principled way [8, 48]. The main notion of privacy studied in this context is data uncertainty, as opposed to individual identifiability. There are studies focusing on specific mining tasks such as decision trees [8, 12] and association rule mining [39, 15, 16], and on disclosure analysis [26, 19, 42, 12]. A main advantage of data anonymization, as opposed to data perturbation, is that the released data remain "truthful", though at a coarse level of granularity. This allows various analyses to be carried out using the data, including selection.

Another related area is distributed privacy preserving data sharing and mining, which deals with data sharing for specific tasks across multiple data sources in a distributed manner [33, 44, 23, 25, 46, 56, 45, 4, 6, 47, 24, 54, 11, 55]. The main goal is to ensure data is not disclosed among participating parties. Common approaches include the data approach, which involves data perturbation, and the protocol approach, which applies random-response techniques.
Data Anonymization. The work in this paper has its closest roots in data anonymization, which provides a micro-view of the data while preserving the privacy of individuals. The work in this area can be classified into a number of categories. The first one aims at devising generalization principles, in that a generalized table is considered privacy preserving if it satisfies a generalization principle [40, 34, 43, 3, 32, 53, 35, 37]. Recent work [52] also considered personalized anonymity to guarantee minimum generalization for every individual in the dataset. Another large body of work contributes to the algorithms for transforming a dataset to one that meets a generalization principle and minimizes certain quality metrics. Several hardness results [36, 2] show that computing the optimal generalized table is NP-hard and that the result suffers severe information loss when the number of quasi-identifier attributes is high. Optimal solutions [9, 29] enumerate all possible generalized relations with certain constraints, using heuristics to prune the search space. Greedy solutions [20, 50, 17, 10, 59, 30, 31, 49] have been proposed to obtain a suboptimal solution much faster. A few works suggest new approaches in addition to generalization, such as releasing marginals [27], the anatomy technique [51], and the permutation technique [58], to improve the utility of the published dataset. Another thread of research is focused on disclosure analysis [35]. A few works considered targeted classification and regression applications [20, 50, 17, 31].

Our work builds on top of the existing generalization principles and anonymization techniques, and aims to adapt existing solutions for application-oriented anonymization that provides an optimal view for targeted applications.
3 Privacy Model

Among the many identifiability-based privacy principles, k-anonymity [41] and its extension l-diversity [34] are the two most widely accepted and serve as the basis for many others; hence, they will be used in our discussions and illustrations. Our work is orthogonal to these privacy principles. Below we introduce some terminology and illustrate the basic ideas behind these principles.

In defining anonymization, the attributes of a given relational table T are characterized into three types. Unique identifiers are attributes that identify individuals. Known identifiers are typically removed entirely from released micro-data. A quasi-identifier set is a minimal set of attributes (X1, ..., Xd) that can be joined with external information to re-identify individual records. We assume that a quasi-identifier is recognized based on domain knowledge. Sensitive attributes are those attributes whose values an adversary should not be permitted to uniquely associate with a unique identifier.

Table 1: Illustration of Anonymization: Original Data and Anonymized Data

Original Data:
Name  | Age | Gender | Zipcode | Diagnosis
Henry | 25  | Male   | 53710   | Influenza
Irene | 28  | Female | 53712   | Lymphoma
Dan   | 28  | Male   | 53711   | Bronchitis
Erica | 26  | Female | 53712   | Influenza

Anonymized Data:
Name | Age     | Gender | Zipcode       | Disease
*    | [25-28] | Male   | [53710-53711] | Influenza
*    | [25-28] | Female | 53712         | Lymphoma
*    | [25-28] | Male   | [53710-53711] | Bronchitis
*    | [25-28] | Female | 53712         | Influenza

Table 1 illustrates an original relational table of personal information. Among the attributes, Name is considered an identifier, (Age, Gender, Zipcode) is considered a quasi-identifier set, and Diagnosis is considered a sensitive attribute. The k-anonymity model provides an intuitive requirement for privacy by stipulating that no individual record should be uniquely identifiable from a group of k with respect to the quasi-identifier set. The set of all tuples in T containing identical values for the quasi-identifier set X1, ..., Xd is referred to as an Equivalence Class. T is k-anonymous with respect to X1, ..., Xd if every tuple is in an equivalence class of size at least k. A k-anonymization of T is a transformation or generalization of the data T such that the transformation is k-anonymous. The l-diversity model provides a natural extension to incorporate a nominal sensitive attribute S. It requires that each equivalence class also contains at least l well-represented distinct values for S. Typical techniques to transform a dataset to satisfy k-anonymity include data generalization, data suppression, and data swapping. Table 1 also illustrates one possible anonymization with respect to the quasi-identifier set (Age, Gender, Zipcode) using data generalization that satisfies 2-anonymity and 2-diversity.
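To make these definitions concrete, the following sketch (ours, not from the paper) checks k-anonymity and l-diversity for a table represented as a list of dictionaries; it treats "well-represented" simply as "distinct", and the data reproduces the anonymized view of Table 1.

```python
from collections import defaultdict

def equivalence_classes(rows, quasi_identifiers):
    """Group tuples by their (possibly generalized) quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in quasi_identifiers)].append(row)
    return list(classes.values())

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every tuple must fall in an equivalence class of size >= k."""
    return all(len(c) >= k for c in equivalence_classes(rows, quasi_identifiers))

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Every equivalence class must contain >= l distinct sensitive values
    (distinct counts used as a simple stand-in for 'well-represented')."""
    return all(len({row[sensitive] for row in c}) >= l
               for c in equivalence_classes(rows, quasi_identifiers))

# The anonymized view of Table 1: 2-anonymous and 2-diverse.
anonymized = [
    {"Age": "[25-28]", "Gender": "Male",   "Zipcode": "[53710-53711]", "Diagnosis": "Influenza"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712",         "Diagnosis": "Lymphoma"},
    {"Age": "[25-28]", "Gender": "Male",   "Zipcode": "[53710-53711]", "Diagnosis": "Bronchitis"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712",         "Diagnosis": "Influenza"},
]
qi = ["Age", "Gender", "Zipcode"]
print(is_k_anonymous(anonymized, qi, k=2))              # True
print(is_l_diverse(anonymized, qi, "Diagnosis", l=2))   # True
```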
4 Application-Oriented Anonymization

Our key hypothesis is that by considering important application requirements, the data anonymization process will achieve a better tradeoff between general data utility and application-specific data utility. We first take a top-down analysis of typical application scenarios and analyze what requirements and implications they pose for the anonymization process. We then present our prioritized optimization metric and anonymization techniques, which aim to prioritize the anonymization of individual attributes based on how important they are to target applications.

4.1 Anonymization Goals

There are different types of target applications for sharing anonymized data, including: 1) query applications supporting ad-hoc queries, 2) applications with a specific mining task such as classification or clustering, and 3) exploratory applications without a specific mining task. We consider two typical scenarios of these applications on anonymized medical data and analyze their implications for the anonymization algorithms.

Scenario 1. Disease-specific public health study. In this study, researchers select a subpopulation with a certain health condition (e.g. Diagnosis = "Lymphoma"), and study their geographic and demographic distribution, reaction to certain treatment, or survival rate. An example is to identify geographical patterns for the health condition that may be associated with features of the geographic environment.

Scenario 2. Demographic / population study. In this study, researchers may want to study a certain demographic subpopulation (e.g. Gender = Male and Age > 50), and perform exploratory analysis or learn classification models based on demographic information and clinical symptoms to predict diagnosis.

The data analysis for the mentioned applications is typically conducted in two steps: 1) subpopulation identification through a selection predicate, and 2) analysis of the identified subpopulation, including mining tasks such as clustering or classification of the population with respect to certain class labels. Given such a two-step process, we identify two requirements for optimizing the anonymization for applications: 1) maximize the precision and recall of subpopulation identification, and 2) maximize the quality of the analysis.

We first categorize the attributes with respect to the applications on the anonymized data and then explain how the application requirement and optimization goal translate into concrete criteria for application-oriented anonymization. Given an anonymized relational table, each attribute can be characterized as one of the following types with respect to the target applications.

• Selection attributes are those attributes used to identify a subpopulation (e.g. Diagnosis in Scenario 1; Gender and Age in Scenario 2).

• Feature attributes are those attributes used to perform analysis such as classifying or clustering data (e.g. Zipcode in Scenario 1 for geographic location based analysis).

• Target attributes are the class label or the attributes that the classification or prediction task is trying to predict (e.g. Diagnosis in Scenario 2). Target attributes are not applicable for unsupervised learning tasks such as clustering.

Given the above categorization and the goals in optimizing anonymization for target applications, we derive a set of generalization criteria for the different types of attributes in our anonymization model.

• Discernibility of selection attributes or predicates. If a selection attribute is part of the quasi-identifier set and is subject to generalization, it may result in an imprecise query selection. For example, if the Age attribute is generalized into ranges of [0-40] and [40 and above], the selection predicate Age > 50 in Scenario 2 will result in an imprecise subpopulation. In order to maximize the precision of the population identification, the generalization of the selection attributes should be minimized or adapted to the selection predicates so that the discernibility of selection attributes or predicates is maximized.

• Discernibility of feature attributes. For most mining tasks, the anonymized dataset needs to maintain as much information about feature attributes as possible, in order to derive accurate classification models or achieve high quality clustering. As a result, the discernibility of feature attributes needs to be maximized in order to increase data utility.

• Homogeneity of target attributes. For classification tasks, an additional criterion is to produce homogeneous partitions or equivalence classes of class labels. The few works specializing in optimizing anonymization for classification applications [21, 50, 17, 31] are mainly focused on this objective. However, it is important to note that if the class label is a sensitive attribute, this criterion conflicts with the goal of l-diversity and other principles that attempt to achieve a guaranteed level of diversity in sensitive attributes, and the question certainly warrants further investigation to achieve the best tradeoff.

4.2 Attribute Priorities

Based on the above discussion, and considering the variety of applications, the first idea we explored is to represent the application requirements using a list of attribute-weight pairs, where each attribute is associated with a priority weight based on how important it is to the target applications. We envision that these priority weights can be either explicitly specified by users or implicitly learned by the system based on a set of sample queries and analyses. If the target applications can be fully specified by the users with feature attributes, target attributes, or selection attributes, these can be assigned a higher weight than other attributes in the quasi-identifier set. For instance, in Scenario 1, the attribute-weight list can be represented as (Age, 0), (Gender, 0), (Zipcode, 1), where Zipcode is the feature attribute for the location-based study.

Alternatively, the attribute priorities can be learned implicitly from sample queries and analyses. For example, statistics can be collected from query loads on attribute frequencies for projection and selection. In many cases, the attributes in the SELECT clause (projection) correspond to feature attributes, while attributes in the WHERE clause (selection) correspond to selection attributes. The more frequently an attribute is queried, the more important it is to the application, and the less it should be generalized. Attributes can then be ordered by their frequencies, where the weight is a normalized frequency. Another interesting idea is to use a min-term predicate set derived from the query load and use that in the anonymization process, similar to the data fragmentation techniques in distributed databases. This is on our future research agenda.
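As one concrete illustration of the frequency-based learning idea, here is a minimal sketch (our own; the query-log format, the regex matching, and the example queries are assumptions) that turns a query load into normalized attribute weights:

```python
import re
from collections import Counter

def learn_attribute_weights(query_log, attributes):
    """Estimate attribute priorities from a query load: count how often each
    quasi-identifier attribute appears in the queries (SELECT projections or
    WHERE selections), then normalize the counts into weights."""
    counts = Counter()
    for q in query_log:
        q_upper = q.upper()
        for attr in attributes:
            # Whole-word match so that e.g. 'Age' does not match 'Wage'.
            if re.search(r"\b%s\b" % attr.upper(), q_upper):
                counts[attr] += 1
    total = sum(counts.values()) or 1
    return {attr: counts[attr] / total for attr in attributes}

# A hypothetical query load for Scenario 1 (location-based study).
log = [
    "SELECT Zipcode FROM patients WHERE Diagnosis = 'Lymphoma'",
    "SELECT Zipcode, Diagnosis FROM patients",
    "SELECT Age FROM patients WHERE Zipcode = '53712'",
]
print(learn_attribute_weights(log, ["Age", "Gender", "Zipcode"]))
# {'Age': 0.25, 'Gender': 0.0, 'Zipcode': 0.75}
```

A production version would parse the SQL properly and could weight SELECT and WHERE occurrences differently, since they indicate feature and selection attributes respectively.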
4.3 Anonymization Metric

Before we can devise algorithms to optimize the solution for the application, we first need to define the optimization objective, or cost function. When the query and analysis semantics are known, a suitable metric for the subpopulation identification process is the Precision of the relevant subpopulation, similar to the precision of relevant documents in Information Retrieval. Note that a generalized dataset will often produce a larger result set than the original table does with respect to a set of predicates consisting of quasi-identifiers. This is similar to the imprecision metric defined in [31]. For analysis tasks, appropriate metrics for the specific analysis tasks should be used as the ultimate optimization goal. This includes accuracy for classification applications, and intra-cluster similarity and inter-cluster dissimilarity for clustering applications. The majority metric [25] is a class-aware metric introduced to optimize a dataset for classification applications.

When the query and analysis semantics are not specified, we need a general metric that measures the data utility. Intuitively, the anonymization process should generalize the original data as little as is necessary to satisfy the given privacy principle. There are mainly three cost metrics that have been used in the literature [38]: the general loss metric, the majority metric, and the discernibility metric. Among the three, the discernibility metric, denoted by C_DM, is the most commonly used and is defined based on the sizes of the equivalence classes E:

(4.1)  $C_{DM} = \sum_m |E^m|^2$

To facilitate application-oriented anonymization, we devise a prioritized cost metric that allows users to incorporate attribute priorities in order to achieve more granularity for more important attributes. Given a quasi-identifier X_i, let $|E^m_{X_i}|$ denote the size of the mth equivalence class with respect to X_i, and let $weight_i$ denote the attribute priority associated with attribute X_i; the metric is defined as follows:

(4.2)  $C_{WDM} = \sum_i weight_i \cdot \sum_m |E^m_{X_i}|^2$

Consider our example Scenario 1: given an anonymized dataset such as in Table 1, the discernibility of equivalence classes along attribute Zipcode will be penalized more than along the other two attributes because of the importance of geographic location. This metric corresponds well with our weighted attribute list representation of the application requirements. It provides a general judgement of the anonymization for exploratory analysis when there is some knowledge about attribute importance in applications, but not sufficient knowledge about specific subpopulations or applications.
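Both metrics are straightforward to compute by grouping tuples. The sketch below is our own rendering of Equations 4.1 and 4.2, evaluated on the anonymized view of Table 1 with the Scenario 1 weights:

```python
from collections import Counter

def discernibility(rows, attributes):
    """General discernibility C_DM (Eq. 4.1): sum of squared sizes of the
    equivalence classes induced by the full quasi-identifier set."""
    sizes = Counter(tuple(r[a] for a in attributes) for r in rows)
    return sum(s * s for s in sizes.values())

def weighted_discernibility(rows, weights):
    """Prioritized metric C_WDM (Eq. 4.2): for each attribute X_i, sum the
    squared sizes of the equivalence classes induced by X_i alone, scaled
    by the attribute's priority weight."""
    total = 0.0
    for attr, w in weights.items():
        sizes = Counter(r[attr] for r in rows)
        total += w * sum(s * s for s in sizes.values())
    return total

# Anonymized Table 1, with Zipcode prioritized as in Scenario 1.
rows = [
    {"Age": "[25-28]", "Gender": "Male",   "Zipcode": "[53710-53711]"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712"},
    {"Age": "[25-28]", "Gender": "Male",   "Zipcode": "[53710-53711]"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712"},
]
print(discernibility(rows, ["Age", "Gender", "Zipcode"]))                     # 2^2 + 2^2 = 8
print(weighted_discernibility(rows, {"Age": 0, "Gender": 0, "Zipcode": 1}))  # 8.0
```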
4.4 Anonymization

A large number of algorithms have been developed for privacy preserving data anonymization. They can be roughly classified into top-down and bottom-up approaches, and into single-dimensional and multidimensional approaches. Most of the techniques take a greedy approach and rely on certain heuristics at each step or iteration for selecting an attribute for partitioning (top-down) or generalization (bottom-up). In this study, we adapt the greedy top-down Mondrian multidimensional approach [30] and investigate heuristics for adapting it based on our prioritized optimization metric. It is on our future research agenda to explore various anonymization approaches and investigate systematic ways of adapting them towards application-oriented anonymization.

The Mondrian algorithm (based on the k-anonymity principle) uses greedy recursive partitioning of the (multidimensional) quasi-identifier domain space. In order to obtain approximately uniform partition occupancy, it recursively chooses the split attribute with the largest normalized range of values (for continuous or ordinal attributes), referred to as spread, and partitions the data around the median value of the split attribute. This process is repeated until no allowable split remains, meaning that a particular region cannot be further divided without violating the anonymity constraint, or constraints imposed by value generalization hierarchies.

The key of the algorithm is to select the best attribute for splitting (partitioning) during each iteration. In addition to the spread (range) of the values of each attribute i, denoted as spread_i and used in the original algorithm, our approach explores additional metrics.

Attribute priority. Since our main generalization criterion is to maximize the discernibility of important attributes, including selection attributes, feature attributes and class attributes for target applications, we use the attribute priority weight for attribute i, denoted by weight_i, as an important selection criterion. Attributes with a larger weight will be selected for partitioning so that important attributes will have a more precise view in the anonymized data.

Information gain. When target applications are well specified a priori, another important generalization criterion for classification applications is to maximize the homogeneity of class attributes within each equivalence class. This is reminiscent of decision tree construction, where each path of the decision tree leads to a homogeneous group of class labels [18]. Similarly, information gain can be used as a scoring metric for selecting the best attribute for partitioning in order to produce equivalence classes of homogeneous class labels. The information gain for a given attribute i, denoted by infogain_i, is computed as the weighted entropy of the resultant partitions based on the split of attribute i:

(4.3)  $infogain_i = \sum_{P'} \frac{|P'|}{|P|} \left( \sum_{c \in D_c} -p(c|P') \log p(c|P') \right)$

where P denotes the current partition, P' denotes the set of resultant partitions of the iteration, p(c|P') is the fraction of tuples in P' with class label c, and D_c is the domain of the class variable c.

The attribute selection criterion for each iteration selects the best attribute based on an overall scoring metric determined by an aggregation of the above metrics. In this study, we use a linear combination of the individual metrics, denoted by O_i for attribute i:

(4.4)  $O_i = \sum_j (w_j \cdot metric_{ji}) / \sum_j w_j$

where $metric_{ji} \in \{spread_i, infogain_i, weight_i\}$, and w_j is the weight of the individual metric j (w_j >= 0).
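The following sketch is one possible reading of this prioritized Mondrian variant, and is illustrative rather than the authors' code: it is restricted to numeric attributes, uses no generalization hierarchies, and limits the combined score of Equation 4.4 to spread and priority weight (the combination the authors report implementing in Section 5).

```python
def prioritized_mondrian(rows, attributes, weights, k,
                         w_spread=0.5, w_priority=0.5):
    """Greedy top-down multidimensional partitioning in the style of Mondrian:
    recursively split a partition around the median of the best-scoring
    attribute; a split is allowable only if both halves keep >= k tuples.
    The split score is a linear combination (Eq. 4.4) of the normalized
    spread within the partition and the attribute's priority weight."""
    # Spread of each attribute over the whole table, for normalization.
    full_ranges = {a: (max(r[a] for r in rows) - min(r[a] for r in rows)) or 1
                   for a in attributes}

    def score(a, part):
        vals = [r[a] for r in part]
        spread = (max(vals) - min(vals)) / full_ranges[a]
        return (w_spread * spread + w_priority * weights.get(a, 0.0)) \
               / (w_spread + w_priority)

    def split(part):
        # Try attributes from best score to worst until an allowable split works.
        for a in sorted(attributes, key=lambda a: score(a, part), reverse=True):
            vals = sorted(r[a] for r in part)
            median = vals[len(vals) // 2]
            left = [r for r in part if r[a] < median]
            right = [r for r in part if r[a] >= median]
            if len(left) >= k and len(right) >= k:
                return split(left) + split(right)
        return [part]  # no allowable split: this region is a final partition

    return split(rows)

# Each returned partition would then be recoded, e.g. every quasi-identifier
# attribute generalized to its [min-max] range within the partition.
```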
5 Experiments

We performed a set of preliminary experiments evaluating our approach. The main questions we would like to answer are: 1) does the prioritized anonymization metric (the weighted discernibility metric) correlate with good data utility from the application's point of view? and 2) does the prioritized anonymization scheme provide better data utility than general approaches?

We implemented a prioritized anonymization algorithm based on the Mondrian algorithm [30]. It uses a combined heuristic of spread and attribute priorities (without information gain) and aims to minimize the prioritized cost metric (instead of the general discernibility metric). We conducted two sets of experiments, for exploratory and classification applications respectively.

5.1 Exploratory Applications

For exploratory applications, we used the Adults dataset from the UC Irvine Machine Learning Repository, configured as in [30]. We considered a simple application scenario that requires precise information on a single demographic attribute (Age and Sex respectively), which is hence assigned a higher weight than the other attributes in the experiment. The dataset was anonymized using the Mondrian and prioritized approaches respectively, and we compare the weighted discernibility as well as the general discernibility of the two anonymized datasets.

[Figure 1: Adult Dataset (Sex-Prioritized) - general discernibility and weighted discernibility of the Mondrian and prioritized approaches for k from 2 to 200.]

[Figure 2: Adult Dataset (Age-Prioritized) - general discernibility and weighted discernibility of the Mondrian and prioritized approaches for k from 2 to 200.]
Figures 1 and 2 compare the prioritized approach and the Mondrian approach in terms of general discernibility and weighted discernibility with respect to different values of k, for Sex-prioritized and Age-prioritized anonymization respectively. We observe that even though the prioritized approach has a general discernibility comparable to the Mondrian's, it achieves a much improved weighted discernibility in both cases, which is directly correlated with the user-desired data utility (i.e. having a more fine-grained view of the Age or Sex attribute for exploratory query or mining purposes).

5.2 Classification Applications

For classification applications, we used the Japanese Credit Screening dataset, also from the UCI Machine Learning Repository. The dataset consists of 653 instances, 15 attributes and a 2-valued class attribute (A16) that corresponds to a positive/negative (+/-) credit. The instances with missing values were removed, and the experiments were carried out considering only the continuous attributes (A2, A3, A8, A11, A14 and A15). The dataset was anonymized using the prioritized approach and the Mondrian approach, and the resultant anonymized data as well as the original data were used for classification and prediction. The Weka implementation of the simple Naive-Bayes classifier was used for the classification, with 10-fold cross-validation for determining classification accuracy.

For classification, the class attribute was recoded as 1.0/0.0. Different feature attributes were selected and given varying weights (both arbitrary and assuming user knowledge) to examine their effect on classification accuracy. For prediction, attributes other than the class attribute were recoded into ranges using an equi-width³ approach. A target attribute is selected as the prediction attribute, and the rest of the attributes are anonymized and used to predict the target attribute. We assume the users have some domain knowledge of which attributes will be used as feature attributes for their classification, and we then assigned higher priority weights to these attributes. In addition, we also experimented with a set of single-attribute classifications, selecting one feature attribute each time and assigning weights to the attributes based on their classification accuracy. The results are similar, and we report the first set of results below.

³ Equal spread ranges for the recoded attributes.

[Figure 3: Japanese Credit Screening Dataset - Classification: (a) discernibility and (b) weighted discernibility of the Mondrian and prioritized approaches, and (c) classification accuracy of the original, Mondrian and prioritized datasets, for k from 2 to 200.]

[Figure 4: Japanese Credit Screening Dataset - Prediction (A3): (a) discernibility, (b) weighted discernibility, and (c) prediction accuracy, for k from 2 to 200.]

Figures 3(a) and 3(b) compare the prioritized and Mondrian approaches in terms of general discernibility and weighted discernibility of the anonymized dataset respectively. Figure 3(c) compares the anonymized datasets as well as the original dataset in terms of accuracy for the class attribute. Similarly, Figure 4 presents the results for prediction of attribute A3.
We observe that the prioritized approach performs better than the Mondrian for both classification and prediction in terms of accuracy, and achieves an accuracy comparable to the original dataset. In addition, a comparison of the discernibility metrics and the classification accuracy shows that the weighted discernibility metric corresponds well to the application-oriented data utility, i.e. the classification accuracy.

6 Conclusion and Discussions

We presented an application-oriented approach for data anonymization that takes into account the relative attribute importance for target applications. We derived a set of generalization criteria for application-oriented data anonymization and presented a prioritized generalization approach that aims to minimize the prioritized cost metric. Our initial results show that the prioritized anonymization metric correlates well with application-oriented data utility, and that the prioritized approach achieves better data utility than general approaches from the application's point of view.

There are a few items on our research agenda. First, the presented anonymization technique uses a specific generalization algorithm and a simple weighted heuristic. We will study different heuristics and generalize the result to more advanced privacy principles and anonymization approaches. Second, while it is not always possible for users to specify the attribute priorities beforehand, we will study how to automatically learn attribute priorities from sample queries and mining tasks, and further devise models and representations that allow application requirements to be incorporated. In addition, a more in-depth and longer-term issue that we will investigate is the notion of priorities, in particular the interaction between what data owners perceive and what the data users (applications) perceive. Finally, it is important to note that there are inference implications of releasing multiple anonymized views, where multiple data users may collude and combine their views to breach data privacy. While there is work beginning to investigate the inference problem [57], the direction certainly warrants further research.

References

[1] N. R. Adams and J. C. Wortman. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4), 1989.
[2] C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In VLDB, pages 901-909, 2005.
[3] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering. In PODS, pages 153-162, 2006.
[4] G. Aggarwal, N. Mishra, and B. Pinkas. Secure computation of the kth ranked element. In IACR Conference on Eurocrypt, 2004.
[5] R. Agrawal, P. Bird, T. Grandison, J. Kiernan, S. Logan, and W. Rjaibi. Extending relational database systems to automatically enforce privacy policies. In 21st ICDE, 2005.
[6] R. Agrawal, A. Evfimievski, and R. Srikant. Information sharing across private databases. In SIGMOD, 2003.
[7] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic databases. In VLDB, 2002.
[8] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 439-450. ACM Press, May 2000.
[9] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), pages 217-228, Washington, DC, USA, 2005. IEEE Computer Society.
[10] E. Bertino, B. Ooi, Y. Yang, and R. H. Deng. Privacy and ownership preserving of outsourced medical data. In ICDE, 2005.
[11] S. S. Bhowmick, L. Gruenwald, M. Iwaihara, and S. Chatvichienchai. Private-iye: A framework for privacy preserving data integration. In ICDE Workshops, page 91, 2006.
[12] S. Bu, L. V. S. Lakshmanan, R. T. Ng, and G. Ramesh. Preservation of patterns and input-output privacy. In ICDE, pages 696-705, 2007.
[13] J. Byun and E. Bertino. Micro-views, or on how to protect privacy while enhancing data usability - concept and challenges. SIGMOD Record, 35(1), 2006.
[14] J.-W. Byun, E. Bertino, and N. Li. Purpose based access control of complex data for privacy protection. In ACM Symposium on Access Control Models and Technologies (SACMAT), 2005.
[15] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211-222, 2003.
[16] A. V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. Inf. Syst., 29(4):343-364, 2004.
[17] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In Proc. of the 21st IEEE International Conference on Data Engineering (ICDE 2005), pages 205-216, Tokyo, Japan, April 2005.
[18] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006.
[19] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In SIGMOD Conference, pages 37-48, 2005.
[20] V. S. Iyengar. Transforming data to satisfy privacy constraints. In KDD, pages 279-288, 2002.
[21] V. S. Iyengar. Transforming data to satisfy privacy constraints. In SIGKDD, 2002.
[22] S. Jajodia and R. Sandhu. Toward a multilevel secure relational data model. In ACM SIGMOD, 1991.
[23] M. Kantarcioglu and C. Clifton. Privacy preserving data mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(9), 2004.
[24] M. Kantarcioglu and C. Clifton. Privacy preserving k-nn classifier. In ICDE, 2005.
[25] M. Kantarcioglu and J. Vaidya. Privacy preserving naive bayes classifier for horizontally partitioned data. In IEEE ICDM Workshop on Privacy Preserving Data Mining, 2003.
[26] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In ICDM, pages 99-106, 2003.
[27] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In SIGMOD Conference, pages 217-228, 2006.
[28] K. LeFevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, and D. DeWitt. Limiting disclosure in hippocratic databases. In 30th International Conference on Very Large Data Bases, 2004.
[29] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD International Conference on Management of Data, 2005.
[30] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In IEEE ICDE, 2006.
[31] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Workload-aware anonymization. In SIGKDD, 2006.
[32] N. Li and T. Li. t-closeness: Privacy beyond k-anonymity and l-diversity. In International Conference on Data Engineering (ICDE), 2007.
[33] Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3), 2002.
[34] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), page 24, 2006.
[35] D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Y. Halpern. Worst-case background knowledge for privacy-preserving data publishing. In ICDE, pages 126-135, 2007.
[36] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, pages 223-228, 2004.
[37] M. E. Nergiz, M. Atzori, and C. Clifton. Hiding the presence of individuals from shared databases. In SIGMOD Conference, pages 665-676, 2007.
[38] M. E. Nergiz and C. Clifton. Thoughts on k-anonymization. In ICDE Workshops, page 96, 2006.
[39] S. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In VLDB, pages 682-693, 2002.
[40] L. Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557-570, 2002.
[41] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 2002.
[42] Z. Teng and W. Du. Comparisons of k-anonymization and randomization schemes under linking attacks. In ICDM, pages 1091-1096, 2006.
[43] T. M. Truta and B. Vinay. Privacy protection: p-sensitive k-anonymity property. In ICDE Workshops, page 94, 2006.
[44] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In ACM SIGKDD, 2002.
[45] J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In SIGKDD, 2003.
[46] J. Vaidya and C. Clifton. Privacy preserving naive bayes classifier for vertically partitioned data. In ACM SIGKDD, 2003.
[47] J. Vaidya and C. Clifton. Privacy-preserving top-k queries. In ICDE, 2005.
[48] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis. State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1), 2004.
[49] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, 2006.
[50] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 2004.
[51] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In VLDB, pages 139-150, 2006.
[52] X. Xiao and Y. Tao. Personalized privacy preservation. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006.
[53] X. Xiao and Y. Tao. M-invariance: towards privacy preserving re-publication of dynamic datasets. In SIGMOD Conference, pages 689-700, 2007.
[54] L. Xiong, S. Chitti, and L. Liu. Top-k queries across multiple private databases. In 25th International Conference on Distributed Computing Systems (ICDCS 2005), 2005.
[55] L. Xiong, S. Chitti, and L. Liu. Mining multiple private databases using a knn classifier. In ACM Symposium on Applied Computing (SAC), pages 435-440, 2007.
[56] Z. Yang, S. Zhong, and R. N. Wright. Privacy-preserving classification of customer data without loss of accuracy. In SIAM SDM, 2005.
[57] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.
[58] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu. Aggregate query answering on anonymized tables. In ICDE, pages 116-125, 2007.
[59] S. Zhong, Z. Yang, and R. N. Wright. Privacy-enhancing k-anonymization of customer data. In PODS, 2005.
Efficient Algorithms for Masking and Finding Quasi-Identifiers∗

Rajeev Motwani†  Ying Xu‡

∗ P3DM'08, April 26, 2008, Atlanta, Georgia, USA.
† Stanford University. rajeev@cs.stanford.edu. Supported in part by NSF Grant ITR-0331640, and a grant from Media-X.
‡ Stanford University. xuying@cs.stanford.edu. Supported in part by a Stanford Graduate Fellowship and NSF Grant ITR-0331640.

Abstract

A quasi-identifier refers to a subset of attributes that can uniquely identify most tuples in a table. Incautious publication of quasi-identifiers will lead to privacy leakage. In this paper we consider the problems of finding and masking quasi-identifiers. Both problems are provably hard with severe time and space requirements. We focus on designing efficient approximation algorithms for large data sets. We first propose two natural measures for quantifying quasi-identifiers: distinct ratio and separation ratio. We develop efficient algorithms that find small quasi-identifiers with provable size and separation/distinct ratio guarantees, with space and time requirements sublinear in the number of tuples. We also propose efficient algorithms for masking quasi-identifiers, where we use a random sampling technique to greatly reduce the space and time requirements without much sacrifice in the quality of the results. Our algorithms for masking and finding quasi-identifiers naturally apply to stream databases. Extensive experimental results on real world data sets confirm the efficiency and accuracy of our algorithms.

1 Introduction

A quasi-identifier (also called a semi-key) is a subset of attributes which uniquely identifies most entities in the real world or tuples in a table. A well-known example is that the combination of gender, date of birth, and zipcode can uniquely determine about 87% of the population of the United States. Quasi-identifiers play an important role in many aspects of data management, including privacy, data cleaning, and query optimization. As pointed out in the seminal paper of Sweeney [25], publishing data with quasi-identifiers leaves open attacks that combine the data with other publicly available information to identify the represented individuals. To avoid such linking attacks via quasi-identifiers, the concept of k-anonymity was proposed [25, 24] and many algorithms for k-anonymity have been developed [23, 2, 4].

In this paper we consider the problem of masking quasi-identifiers: we want to publish a subset of attributes (we either publish the exact value of every tuple on an attribute, or do not publish the attribute at all), so that no quasi-identifier is revealed in the published data. This can be viewed as a variant of k-anonymity where suppression is only allowed at the attribute level. While this approach is admittedly too restrictive in some applications, there are two reasons we consider it. First, traditional tuple-level suppression may distort the distribution of the original data and the association between attributes, so sometimes it might be desirable to publish fewer attributes with complete and accurate information. Second, as noted in [15], the traditional k-anonymity algorithms are expensive and do not scale well to large data sets; by restricting the suppression to a coarser level we are able to design more efficient algorithms.

We also consider the problem of finding small keys and quasi-identifiers, which can be used by adversaries to perform linking attacks.
When a table which is not properly anonymized is published, an adversary would be interested in finding keys or quasi-identifiers in the table, so that once he collects other persons' information on those attributes, he will be able to link the records to real world entities. Collecting information on each attribute incurs a certain cost to the adversary (for example, he needs to look up yellow pages to collect the area codes of phone numbers, or obtain party affiliation information from the voter list), so the adversary wishes to find a subset of attributes with a small size or weight that is a key or almost a key, in order to minimize the attack cost.

Finding quasi-identifiers also has other important applications besides privacy. One application is data cleaning. Integration of heterogeneous databases sometimes causes the same real-world entity to be represented by multiple records in the integrated database, due to spelling mistakes, inconsistent conventions, etc. A critical task in data cleaning is to identify and remove such fuzzy duplicates [3, 6]. We can estimate the ratio of fuzzy duplicates, for example by checking some samples manually or plotting the distribution of pairwise similarity; now if we can find a quasi-identifier whose "quasiness" is similar to the fuzzy duplicate ratio, then those tuples which collide on the quasi-identifier are likely to be fuzzy duplicates. Finally, quasi-identifiers are a special case of approximate functional dependencies [13, 22], and their automatic discovery is valuable for query optimization and indexing [9].

In this paper, we study the problems of finding and masking quasi-identifiers in given tables. Both problems are provably hard with severe time and space requirements, so we focus on designing efficient approximation algorithms for large data sets. First we define measures for quantifying the "quasiness" of quasi-identifiers. We propose two natural measures - separation ratio and distinct ratio. Then we consider the problem of finding the minimum key. The problem is NP-hard and the best-known approximation algorithm is a greedy algorithm with approximation ratio O(ln n) (n is the number of tuples); however, even this greedy algorithm requires multiple scans of the table, which are expensive for large databases that cannot reside in main memory and prohibitive for stream databases. To enable more efficient algorithms, we sacrifice accuracy by allowing approximate answers (quasi-identifiers). We develop efficient algorithms that find small quasi-identifiers with provable size and separation/distinct ratio guarantees, with both space and time complexities sublinear in the number of input tuples. Finally we present efficient algorithms for masking quasi-identifiers. We use a random sampling technique to greatly reduce the space and time requirements, without sacrificing much in the quality of the results. Our algorithms for masking and finding minimum quasi-identifiers naturally apply to stream databases: we only require one pass over the table to obtain a random sample of the tuples, and the space complexity is sublinear in the number of input tuples (at the cost of only providing approximate solutions).

1.1 Definitions and Overview of Results

A key is a subset of attributes that uniquely identifies each tuple in a table. A quasi-identifier is a subset of attributes that can distinguish almost all tuples. We propose two natural measures for quantifying a quasi-identifier.
(1) An α-distinct quasi-identifier is a subset of attributes which becomes a key in the table remaining after the removal of at most a 1 − α fraction of tuples in the original table.

(2) An α-separation quasi-identifier is a subset of attributes which separates at least an α fraction of all possible tuple pairs. We say that a subset of attributes separates a pair of tuples x and y if x and y have different values on at least one attribute in the subset.

Since keys are a special case of functional dependencies, our measures for quasi-identifiers also conform with the measures of approximate functional dependencies proposed in earlier work [13, 22, 11, 8].

Table 1. An example table. The first column labels the tuples for future reference and is not part of the table.

  | age | sex    | state
1 | 20  | Female | CA
2 | 30  | Female | CA
3 | 40  | Female | TX
4 | 20  | Male   | NY
5 | 40  | Male   | CA

We illustrate the notions with an example (Table 1). The example table has 3 attributes. The attribute age is a 0.6-distinct quasi-identifier because it has 3 distinct values in a total of 5 tuples; it is a 0.8-separation quasi-identifier because 8 out of 10 tuple pairs can be separated by age. {sex, state} is 0.8-distinct and 0.9-separation.

The separation ratio of a quasi-identifier is always larger than its distinct ratio, but there is no one-to-one mapping between the two. Consider a 0.5-distinct quasi-identifier in a table of 100 tuples. One possible scenario is that, projected on the quasi-identifier, there are 50 distinct values and each value corresponds to 2 tuples, so its separation ratio is $1 - 50/\binom{100}{2} \approx 0.99$; another possible scenario is that for 49 of the 50 distinct values there is only one tuple per value, and the other 51 tuples all share the same value; then this quasi-identifier is 0.75-separation. In general, an α-distinct quasi-identifier is an α′-separation quasi-identifier where α′ can be as small as $2\alpha - \alpha^2$ or as large as $1 - 2(1-\alpha)/n$. Both distinct ratio and separation ratio are very natural measures for quasi-identifiers and have different applications, as noted in the literature on approximate functional dependencies. In this paper we study quasi-identifiers under both measures.

Given a table with n tuples and m attributes, we consider the following problems. The size of a key (quasi-identifier) refers to the number of attributes in the key.

Minimum Key Problem: find a key of the minimum size. This problem is provably hard, so we also consider its relaxed version:

(ǫ, δ)-Separation or -Distinct Minimum Key Problem: look for a quasi-identifier of small size such that, with probability at least 1 − δ, the output quasi-identifier has separation or distinct ratio at least 1 − ǫ.

β-Separation or -Distinct Quasi-identifier Masking Problem: delete a minimum number of attributes such that no quasi-identifier with separation or distinct ratio greater than β remains among the published attributes.

In the example of Table 1, {age, state} is a minimum key, with size 2; the optimal solution to the 0.8-distinct quasi-identifier masking problem is {sex, state}; the optimal solutions to the 0.8-separation quasi-identifier masking problem are {age}, {sex} and {state}, all of size 1.
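The two ratios follow directly from the definitions. The sketch below (ours) computes both for a candidate attribute subset, using group sizes to avoid enumerating all C(n, 2) pairs, and reproduces the numbers quoted for Table 1:

```python
from collections import Counter

def distinct_ratio(rows, attrs):
    """Fraction of tuples with a distinct projection on attrs; equivalently,
    attrs become a key after removing a (1 - alpha) fraction of tuples."""
    projections = [tuple(r[a] for a in attrs) for r in rows]
    return len(set(projections)) / len(rows)

def separation_ratio(rows, attrs):
    """Fraction of the C(n,2) tuple pairs separated (differing on attrs).
    Pairs sharing a projection are the only unseparated ones, so we count
    C(s,2) within each group instead of enumerating all pairs."""
    n = len(rows)
    total_pairs = n * (n - 1) // 2
    sizes = Counter(tuple(r[a] for a in attrs) for r in rows)
    unseparated = sum(s * (s - 1) // 2 for s in sizes.values())
    return 1 - unseparated / total_pairs

table = [
    {"age": 20, "sex": "Female", "state": "CA"},
    {"age": 30, "sex": "Female", "state": "CA"},
    {"age": 40, "sex": "Female", "state": "TX"},
    {"age": 20, "sex": "Male",   "state": "NY"},
    {"age": 40, "sex": "Male",   "state": "CA"},
]
print(distinct_ratio(table, ["age"]))            # 0.6
print(separation_ratio(table, ["age"]))          # 0.8
print(distinct_ratio(table, ["sex", "state"]))   # 0.8
print(separation_ratio(table, ["sex", "state"])) # 0.9
```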
It does not provide perfect privacy, as there may still exist some tuple with a unique value; nevertheless, it provides anonymity for the majority of the tuples. The k-anonymity problem is NP-hard [17, 2]; further, Lodha and Thomas [15] note that no known efficient approximation algorithm scales well for large data sets, and they too aim at preserving privacy for the majority. We hope to provide scalable anonymization algorithms by relaxing the privacy constraints. Finally, we would like to maximize the utility of the published data, and we measure utility in terms of the number of attributes published (our solution generalizes to the case where attributes have different weights and utility is the weighted sum of published attributes).

We summarize below the contributions of this paper.

1. We propose greedy algorithms for the (ε, δ)-separation and -distinct minimum key problems, which find small quasi-identifiers with provable size and separation (distinct) ratio guarantees, with space and time requirements sublinear in n. In particular, the space complexity is O(m²) for the (ε, δ)-separation minimum key problem and O(m√(mn)) for (ε, δ)-distinct. The algorithms are particularly useful when n ≫ m, which is typical of database applications where a large table may consist of millions of tuples but only a relatively small number of attributes. We also extend the algorithms to find approximate minimum β-separation quasi-identifiers. (Section 2)

2. We present greedy algorithms for β-separation and β-distinct quasi-identifier masking. The algorithms are slow on large data sets, and we use a random sampling technique to greatly reduce the space and time requirements, without much sacrifice in the utility of the published data. (Section 3)

3. We have implemented all the above algorithms and conducted extensive experiments using real data sets. The experimental results confirm the efficiency and accuracy of our algorithms. (Section 4)

2 Finding Minimum Keys

In this section we consider the Minimum Key problem. First we show that the problem is NP-hard (Section 2.1) and that the best approximation algorithm is a greedy algorithm which gives an O(ln n)-approximate solution (Section 2.2). The greedy algorithm requires multiple scans of the table, which is expensive for large tables and prohibitive for stream databases. To enable more efficient algorithms, we relax the problem by allowing approximate answers, i.e. the (ε, δ)-Separation (Distinct) Minimum Key problem. We develop random-sampling-based algorithms with approximation guarantees and sublinear space (Sections 2.3, 2.4).

2.1 Hardness Result

The Minimum Key problem is NP-hard, which follows easily from the NP-hardness of the Minimum Test Collection problem.

Minimum Test Collection: Given a set S of elements and a collection C of subsets of S, a test collection is a subcollection of C such that for each pair of distinct elements there is some set that contains exactly one of the two elements. The Minimum Test Collection problem is to find a test collection of the smallest cardinality.

Minimum Test Collection is equivalent to a special case of the Minimum Key problem where each attribute is boolean: let S be the set of tuples and C the attributes, where each subset in C corresponds to an attribute and contains all the tuples whose value is true on that attribute; then a test collection is equivalent to a key in the table. Minimum Test Collection is known to be NP-hard [7], therefore the Minimum Key problem is also NP-hard.
2.2 A Greedy Approximation Algorithm

The best known approximation algorithm for Minimum Test Collection is a greedy algorithm with approximation ratio 1 + 2 ln |S| [18], i.e. it finds a test collection of size at most 1 + 2 ln |S| times the smallest test collection size. The algorithm can be extended to the more general Minimum Key problem, where each attribute can be from an arbitrary domain, not just boolean.

Before presenting the algorithm, let us consider a naive greedy algorithm: compute the separation (or distinct) ratio of each attribute in advance, and each time pick the attribute with the highest separation ratio among the remaining attributes, until we get a key. The algorithm is fast and easy to implement, but unfortunately it does not perform well when the attributes are correlated. For example, if there are many attributes that are pairwise highly correlated and each has a high separation ratio, then the optimal solution probably includes only one of these attributes, while the above greedy algorithm is likely to pick all of them. The approximation ratio of this algorithm can be arbitrarily bad. A fix to the naive algorithm is to pick each time the attribute which separates the largest number of tuple pairs not yet separated. To prove the approximation ratio of this algorithm, we reduce Minimum Key to the Minimum Set Cover problem. The reduction plays an important role in designing algorithms for finding and masking quasi-identifiers in later sections.

Minimum Set Cover: Given a finite set S (called the ground set) and a collection C of subsets of S, a set cover I is a subcollection of C such that every element in S belongs to at least one member of I. The Minimum Set Cover problem asks for a set cover of the smallest size.

Given an instance of Minimum Key with n tuples and m attributes, we reduce it to a set cover instance as follows: the ground set S consists of all distinct unordered pairs of tuples (|S| = C(n, 2)); each attribute c in the table is mapped to a subset containing all pairs of tuples separated by attribute c. Now a collection of subsets covers S if and only if the corresponding attributes separate all pairs of tuples, i.e., those attributes form a key; therefore there is a one-to-one map between minimum set covers and minimum keys.

Consider the example of Table 1. The ground set of the corresponding set cover instance contains 10 elements, each a pair of tuples. The column age is mapped to a subset c_age with 8 pairs: {(1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)}; the column sex is mapped to a subset c_sex with 6 pairs, and state to a subset with 7 pairs. The attribute set {age, sex} is a key; correspondingly, the collection {c_age, c_sex} is a set cover.

The Greedy Set Cover Algorithm starts with an empty collection (of subsets) and adds subsets one by one until every element in S has been covered; each time it chooses the subset covering the largest number of uncovered elements. It is well known that this greedy algorithm is a (1 + ln |S|)-approximation algorithm for Minimum Set Cover.

Lemma 2.1. [12] The Greedy Set Cover Algorithm outputs a set cover of size at most 1 + ln |S| times the minimum set cover size.

The Greedy Minimum Key Algorithm mimics the greedy set cover algorithm: start with an empty set of attributes and add attributes one by one until all tuple pairs are separated, each time choosing the attribute that separates the largest number of tuple pairs not yet separated. The running time of the algorithm is O(m³n).
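A minimal sketch of this greedy algorithm, reusing the `rows` representation from the earlier snippet (it materializes all tuple pairs for clarity, which is costlier than the O(m³n) implementation the running time above refers to):

from itertools import combinations

def greedy_minimum_key(rows, m):
    """Greedy Minimum Key sketch: repeatedly add the attribute that
    separates the largest number of not-yet-separated tuple pairs.
    Assumes the table has an exact key (no two identical rows)."""
    unseparated = set(combinations(range(len(rows)), 2))
    key = []
    while unseparated:
        best = max(range(m), key=lambda a: sum(
            1 for i, j in unseparated if rows[i][a] != rows[j][a]))
        key.append(best)
        unseparated = {(i, j) for i, j in unseparated
                       if rows[i][best] == rows[j][best]}
    return key

print(greedy_minimum_key(rows, 3))   # on Table 1: a key of size 2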
It is easy to infer the approximation ratio of this algorithm from Lemma 2.1:

Theorem 2.1. The Greedy Minimum Key Algorithm outputs a key of size at most 1 + 2 ln n times the minimum key size.

The greedy algorithms are optimal in the sense that neither problem is approximable within c ln |S| for some c > 0 [10]. Note that this is a worst-case bound; in practice the algorithms usually find much smaller set covers or keys.

2.3 (ε, δ)-Separation Minimum Key

The greedy algorithm of the last section is optimal in terms of approximation ratio; however, it requires multiple scans of the table (O(m²) scans, in fact), which is expensive for large data sets. In this and the next section, we relax the minimum key problem by allowing quasi-identifiers and design efficient algorithms with approximation guarantees. We use the standard (ε, δ) formulation: with probability at least 1 − δ, we allow an "error" of at most ε, i.e. we output a quasi-identifier with separation (distinct) ratio at least 1 − ε. The (ε, δ) Minimum Set Cover problem is defined similarly and requires the output set cover to cover at least a 1 − ε fraction of all elements.

Our algorithms are based on random sampling. We first randomly sample k elements (tuples) and reduce the input set cover (key) instance to a smaller set cover (key) instance containing only the sampled elements (tuples). We then solve the exact minimum set cover (key) problem on the smaller instance (again a hard problem, but of much smaller size, so we can afford to apply the greedy algorithms of Section 2.2), and output the solution as an approximate solution to the original problem. The number of samples k is carefully chosen so that the error probability is bounded. We present in detail the algorithm for (ε, δ)-set cover in Section 2.3.1; the (ε, δ)-Separation Minimum Key problem can be solved by reduction to (ε, δ) Minimum Set Cover (Section 2.3.2); we discuss (ε, δ)-Distinct Minimum Key in Section 2.4.

2.3.1 (ε, δ) Minimum Set Cover

The key observation underlying our algorithm is that, to check whether a given collection of subsets is a set cover, we only need to check some randomly sampled elements if we allow approximate solutions. If the collection covers only part of S, it will fail the check after enough random samples. The idea is formalized in the following lemma.

Lemma 2.2. Let s1, s2, . . . , sk be k elements independently and randomly chosen from S. If a subset S′ satisfies |S′| < α|S|, then Pr[si ∈ S′ for all i] < α^k.

The proof is straightforward: the probability that a random element of S belongs to S′ is |S′|/|S| < α, so the probability of all k random elements belonging to S′ is less than α^k.

Now we combine the idea of random-sample checking with the greedy algorithm for exact set cover. Our Greedy Approximate Set Cover algorithm is as follows:

1. Choose k elements uniformly at random from S (k is defined below);

2. Reduce the problem to a smaller set cover instance: the ground set S̃ consists of the k chosen elements; each subset of the original problem maps to a subset which is its intersection with S̃;

3. Apply the Greedy Set Cover Algorithm to find an exact set cover of S̃, and output the solution as an approximate set cover of S.

Let n be the size of the ground set S, and m the number of subsets. We say a collection of subsets is an α-set cover if it covers at least an α fraction of the elements.
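For concreteness, the sample size prescribed by Theorem 2.2 (stated next) is easy to evaluate; a sketch with illustrative parameter values (the helper name is ours):

import math

def set_cover_sample_size(m, eps, delta):
    # k = log base 1/(1-eps) of (2^m / delta), as used in Theorem 2.2
    return math.ceil((m * math.log(2) - math.log(delta))
                     / -math.log(1 - eps))

print(set_cover_sample_size(m=41, eps=0.1, delta=0.01))  # 314 samples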
Theorem 2.2. With probability 1 − δ, the above algorithm with k = log_{1/(1−ε)}(2^m/δ) outputs a (1 − ε)-set cover whose cardinality is at most (1 + ln log_{1/(1−ε)}(2^m/δ))|I*|, where I* is the optimal exact set cover.

Proof. Denote by S̃ the ground set of the reduced instance (|S̃| = k) and by Ĩ* the minimum set cover of S̃. The greedy algorithm outputs a subcollection of subsets covering all k elements of S̃, denoted Ĩ. By Lemma 2.1, |Ĩ| ≤ (1 + ln |S̃|)|Ĩ*|. Note that I*, the minimum set cover of the original set S, corresponds to a set cover of S̃, so |Ĩ*| ≤ |I*|, and hence |Ĩ| ≤ (1 + ln k)|I*|. We map Ĩ back to a subcollection I of the original problem, so that

|I| = |Ĩ| ≤ (1 + ln k)|I*| = (1 + ln log_{1/(1−ε)}(2^m/δ))|I*|.

Now we bound the probability that I is not a (1 − ε)-set cover. By Lemma 2.2, the probability that a subcollection covering less than a 1 − ε fraction of S covers all k chosen elements of S̃ is at most

(1 − ε)^k = (1 − ε)^{log_{1/(1−ε)}(2^m/δ)} = δ/2^m.

There are 2^m possible subcollections; by the union bound, the overall error probability, i.e. the probability that some subcollection is not a (1 − ε)-cover of S yet is an exact cover of S̃, is at most δ. Hence, with probability at least 1 − δ, I is a (1 − ε)-set cover of S.

If we take ε and δ as constants, the approximation ratio is essentially ln m + O(1), which is smaller than 1 + ln n when n ≫ m. The space requirement of the above algorithm is mk = O(m²) and the running time is O(m⁴).

2.3.2 (ε, δ)-Separation Minimum Key

The reduction from Minimum Key to Minimum Set Cover preserves the separation ratio: an α-separation quasi-identifier separates at least an α fraction of all pairs of tuples, so its corresponding subcollection is an α-set cover, and vice versa. Therefore, we can reduce the (ε, δ)-Separation Minimum Key problem to the (ε, δ)-Set Cover problem where |S| = O(n²). The complete algorithm is as follows.

1. Randomly choose k = log_{1/(1−ε)}(2^m/δ) pairs of tuples;

2. Reduce the problem to a set cover instance where the ground set S̃ is the set of those k pairs and each attribute maps to the subset of the k pairs separated by that attribute;

3. Apply the Greedy Set Cover Algorithm to find an exact set cover of S̃, and output the corresponding attributes as a quasi-identifier for the original table.

Theorem 2.3. With probability 1 − δ, the above algorithm outputs a (1 − ε)-separation quasi-identifier whose size is at most (1 + ln log_{1/(1−ε)}(2^m/δ))|I*|, where I* is the smallest key.

The proof follows directly from Theorem 2.2. The approximation ratio is essentially ln m + O(1). The space requirement of the above algorithm is mk = O(m²), which significantly improves upon the input size mn.

2.4 (ε, δ)-Distinct Minimum Key

Unfortunately, the reduction to set cover does not necessarily map an α-distinct quasi-identifier to an α-set cover. As pointed out in Section 1.1, an α-distinct quasi-identifier corresponds to an α′-separation quasi-identifier, and thus reduces to an α′-set cover, where α′ can be as small as 2α − α² or as large as 1 − 2(1 − α)/n. Reducing this problem directly to set cover therefore gives too loose a bound, and a new algorithm is needed.

Our algorithm for finding distinct quasi-identifiers is again based on random sampling. We reduce the input (ε, δ)-Distinct Minimum Key instance to a smaller (exact) Minimum Key instance by randomly choosing k tuples and keeping all m attributes.
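Putting the pieces together, a sketch of the complete (ε, δ)-Separation Minimum Key algorithm (our illustrative code; pairs of fully identical tuples are dropped from the sample since no attribute can separate them):

import math, random

def approx_separation_min_key(rows, m, eps, delta):
    """(eps, delta)-Separation Minimum Key sketch: greedy set cover
    over k randomly sampled tuple pairs."""
    k = math.ceil((m * math.log(2) - math.log(delta)) / -math.log(1 - eps))
    n = len(rows)
    pairs = [random.sample(range(n), 2) for _ in range(k)]
    pairs = [p for p in pairs if rows[p[0]] != rows[p[1]]]  # unseparable pairs
    key = []
    while pairs:
        best = max(range(m), key=lambda a: sum(
            1 for i, j in pairs if rows[i][a] != rows[j][a]))
        key.append(best)
        pairs = [(i, j) for i, j in pairs if rows[i][best] == rows[j][best]]
    return key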
The following lemma bounds the probability that a subset of attributes is an (exact) key in the sample table but not an α-distinct quasi-identifier in the original table.

Lemma 2.3. Randomly choose k tuples from the input table T to form a table T₁. Let p be the probability that an (exact) key of T₁ is not an α-distinct quasi-identifier in T. Then

p < e^{−(1/α − 1)k(k−1)/(2n)}.

Proof. Suppose we have n balls distributed among d = αn distinct bins. Randomly choose k balls without replacement; the probability that the k balls all come from different bins is exactly p. Let x₁, x₂, . . . , x_d be the numbers of balls in the d bins (Σᵢ xᵢ = n, xᵢ > 0); then

p = ( Σ_{all {i₁, i₂, . . . , i_k}} x_{i₁} x_{i₂} · · · x_{i_k} ) / C(n, k).

p is maximized when all the xᵢ are equal, i.e. each bin has 1/α balls. Next we compute p for this case. The first ball can be from any bin; for the second ball we have n − 1 choices, but it cannot come from the same bin as the first one, so 1/α − 1 of the n − 1 choices are infeasible; similar arguments hold for the remaining balls. Multiplying, the probability that all k balls come from distinct bins is

p = 1 · (1 − (1/α − 1)/(n − 1)) · (1 − 2(1/α − 1)/(n − 2)) · · · (1 − (k − 1)(1/α − 1)/(n − (k − 1)))
  ≤ e^{−( (1/α − 1)/(n − 1) + 2(1/α − 1)/(n − 2) + · · · + (k − 1)(1/α − 1)/(n − (k − 1)) )}
  < e^{−(1/α − 1)k(k−1)/(2n)}.  □

The Greedy (ε, δ)-Distinct Minimum Key Algorithm is as follows:

1. Randomly choose k = √( (2(1 − ε)/ε) · n · ln(2^m/δ) ) tuples and keep all attributes to form a table T₁;

2. Apply the Greedy Minimum Key Algorithm to find an exact key in T₁, and output it as a quasi-identifier for the original table.

Theorem 2.4. With probability 1 − δ, the above algorithm outputs a (1 − ε)-distinct quasi-identifier whose size is at most (1 + 2 ln k)|I*|, where I* is the smallest exact key.

The proof is similar to that of Theorem 2.2, substituting Lemma 2.3 for Lemma 2.2; k is chosen such that p ≤ δ/2^m, which guarantees that the overall error probability is less than δ. The approximation ratio is essentially ln m + ln n + O(1), which improves upon the 1 + 2 ln n result for the exact key. The space requirement is mk = O(m√(mn)), sublinear in the number of tuples of the original table.

2.5 Minimum β-Separation Quasi-identifier

In the previous sections, our goal was to find a small quasi-identifier that is almost a key. Note that ε indicates our "error tolerance", not our goal: for the (ε, δ)-Separation Minimum Key problem, our algorithm is likely to output quasi-identifiers whose separation ratios are far greater than 1 − ε. For example, suppose the minimum key of a given table consists of 100 attributes, while the minimum 0.9-separation quasi-identifier has 10 attributes; then our (0.1, 0.01)-separation algorithm may output a quasi-identifier that has, say, 98 attributes and is 0.999-separation. Sometimes, however, we may be interested in finding 0.9-separation quasi-identifiers of much smaller size. For this purpose we consider the Minimum β-Separation Quasi-identifier Problem: find a quasi-identifier of minimum size with separation ratio at least β.

The Minimum β-Separation Quasi-identifier Problem is at least as hard as Minimum Key, since the latter is the special case β = 1. So again we consider an approximate version obtained by relaxing the separation ratio: we require the algorithm to output a quasi-identifier with separation ratio at least (1 − ε)β with probability at least 1 − δ. We present the algorithm for approximate β-set cover; the β-separation quasi-identifier problem can be reduced to β-set cover as before.
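(Returning briefly to Section 2.4: the sample size prescribed there is also easy to evaluate. A sketch with illustrative values; the helper name is ours.)

import math

def distinct_sample_size(n, m, eps, delta):
    # k = sqrt( (2(1-eps)/eps) * n * ln(2^m / delta) ), chosen so that the
    # bound of Lemma 2.3 with alpha = 1 - eps is at most delta / 2^m
    return math.ceil(math.sqrt(2 * (1 - eps) / eps * n
                               * (m * math.log(2) - math.log(delta))))

# e.g. n = 233687 tuples, m = 41 attributes (the size of the ca table):
print(distinct_sample_size(233687, 41, eps=0.1, delta=0.01))  # about 11787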
The Greedy Minimum β-Set Cover algorithm works as follows: first randomly sample k = (16/(βε²)) ln(2^m/δ) elements from the ground set S and construct a smaller set cover instance defined on the k chosen elements; then run the greedy algorithm on the smaller set cover instance until we obtain a subcollection covering at least (2 − ε)βk/2 elements (start with an empty subcollection; each time add to the subcollection the subset covering the largest number of uncovered elements).

Theorem 2.5. The Greedy Minimum β-Set Cover algorithm runs in space mk = O(m²) and, with probability at least 1 − δ, outputs a (1 − ε)β-set cover of size at most (1 + ln((2 − ε)βk/2))|I*|, where I* is the minimum β-set cover of S.

The proof can be found in our technical report. This algorithm also applies to the minimum exact set cover problem (the special case β = 1), but the bound is worse than that of Theorem 2.2; see our technical report for a detailed comparison.

The minimum β-separation quasi-identifier problem can be solved by reducing to the β-set cover problem and applying the above greedy algorithm. Unfortunately, we cannot provide similar algorithms for β-distinct quasi-identifiers; the main difficulty is that it is hard to tightly bound the distinct ratio of the original table by looking only at a small sample of tuples. Negative results on distinct-ratio estimation can be found in [5].

3 Masking Quasi-Identifiers

In this section we consider the quasi-identifier masking problem: when we release a table, we want to publish a subset of the attributes subject to the privacy constraint that no β-separation (or β-distinct) quasi-identifier is published; on the other hand, we want to maximize the utility, measured by the number of published attributes. For each problem, we first present a greedy algorithm which generates good results but runs slowly on large tables, and then show how to accelerate the algorithm using random sampling. (The algorithms can easily be extended to the case where the attributes have weights and the utility is the sum of attribute weights.)

3.1 Masking β-Separation Quasi-identifiers

As in Section 2.2, we can reduce the problem to a set-cover-type problem: let the ground set S be the set of all pairs of tuples, and let each attribute correspond to the subset of tuple pairs separated by that attribute; then the problem of masking β-separation quasi-identifiers is equivalent to finding a maximum number of subsets such that at most a β fraction of the elements in S is covered by the selected subsets. We refer to this problem as the Maximum Non-Set Cover problem. Unfortunately, the Maximum Non-Set Cover problem is NP-hard, by a reduction from the Dense Subgraph problem. (See our technical report for the hardness proof.)

We propose a greedy heuristic for masking β-separation quasi-identifiers: start with an empty set of attributes, and add attributes to the set one by one as long as the separation ratio stays below β, each time picking the attribute that separates the least number of tuple pairs not yet separated. The algorithm produces a subset of attributes satisfying the privacy constraint and with good utility in practice; however, it suffers from the same efficiency issue as the greedy algorithm in Section 2.2: it requires O(m²) scans of the table and is thus slow for large data sets.
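A sketch of this exact masking heuristic (our illustrative code; it scans all tuple pairs, which is exactly the cost the next section's sampling removes):

from itertools import combinations

def greedy_separation_masking(rows, m, beta):
    """Exact masking heuristic sketch: keep adding the attribute that
    newly separates the fewest tuple pairs, while the separation ratio
    of the published set stays at or below beta."""
    pairs = list(combinations(range(len(rows)), 2))
    remaining = set(range(len(pairs)))      # indices of unseparated pairs
    published, candidates = [], set(range(m))
    while candidates:
        def newly(a):                        # pairs attribute a would separate
            return [t for t in remaining
                    if rows[pairs[t][0]][a] != rows[pairs[t][1]][a]]
        a = min(candidates, key=lambda a: len(newly(a)))
        if len(pairs) - len(remaining) + len(newly(a)) > beta * len(pairs):
            break                            # even the best attribute violates beta
        remaining.difference_update(newly(a))
        published.append(a)
        candidates.remove(a)
    return published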
We again use the random sampling technique to accelerate the algorithm: the following lemma gives a necessary condition (with high probability) for a β-separation quasi-identifier in the sample table, so looking only at the sample table and pruning all attribute sets satisfying the necessary condition will guarantee the privacy constraint. The proof of the lemma is omitted for lack of space.

Lemma 3.1. Randomly sample k pairs of tuples; then a β-separation quasi-identifier separates at least αβk of the k pairs, with probability at least 1 − e^{−(1−α)²βk/2}.

The Greedy Approximate β-Separation Masking Algorithm is as follows:

1. Randomly choose k pairs of tuples;

2. Let β′ = (1 − √(2 ln(2^m/δ)/(βk)))β. Run the following greedy algorithm on the selected pairs: start with an empty set C of attributes, and add attributes to C one by one as long as the number of separated pairs stays below β′k, each time picking the attribute separating the least number of tuple pairs not yet separated;

3. Publish the set of attributes C.

By the nature of the algorithm, the published attributes C contain no quasi-identifier with separation ratio greater than β′ on the sample pairs; by Lemma 3.1, this ensures that with probability at least 1 − 2^m e^{−(1−β′/β)²βk/2} = 1 − δ, C contains no β-separation quasi-identifier in the original table. Therefore the attribute set published by the above algorithm satisfies the privacy constraint.

Theorem 3.1. With probability at least 1 − δ, the above algorithm outputs an attribute set with separation ratio at most β.

We may over-prune, because the condition in Lemma 3.1 is not a sufficient condition, which means we may lose some utility. The parameter k in the algorithm offers a tradeoff between the time/space complexity and the utility. Both the running time and the space increase linearly with k; on the other hand, the utility (the number of published attributes) also increases with k, because the pruning condition becomes tighter as k increases. Our experimental results show that the algorithm dramatically reduces the running time and space requirements without much sacrifice in utility (see Section 4).

3.2 Masking β-Distinct Quasi-identifiers

For masking β-distinct quasi-identifiers, we can use a similar greedy heuristic: start with an empty set of attributes, and each time pick the attribute adding the least number of distinct values, as long as the distinct ratio stays below β. Similarly, we can use a sample table to trade utility for efficiency.

1. Randomly choose k tuples and keep all the columns to form a sample table T₁;

2. Let β′ = (1 − √(2 ln(2^m/δ)/(βk)))β. Run the following greedy algorithm on T₁: start with an empty set C of attributes, and add attributes to C one by one as long as the distinct ratio stays below β′, each time picking the attribute adding the least number of distinct values;

3. Publish the set of attributes C.

Lemma 3.2 and Theorem 3.2 below state the privacy guarantee of the above algorithm.
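The tightened threshold β′ used in step 2 of both masking algorithms is cheap to compute; a sketch with illustrative values (m = 40 attributes, β = 0.8, δ = 0.01, k = 50000 sampled pairs):

import math

m, beta, delta, k = 40, 0.8, 0.01, 50000
beta_prime = (1 - math.sqrt(2 * (m * math.log(2) - math.log(delta))
                            / (beta * k))) * beta
print(beta_prime)   # about 0.768: the greedy runs on the sample with this
                    # stricter bound to guarantee beta on the full table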
Lemma 3.2. Randomly sample k tuples from the input table T into a small table T₁ (k ≪ n, where n is the number of tuples in T). A β-distinct quasi-identifier of T is an αβ-distinct quasi-identifier of T₁ with probability at least 1 − e^{−(1−α)²βk/2}.

Proof. By the definition of a β-distinct quasi-identifier, the tuples take at least βn distinct values projected on the quasi-identifier. Take (any) one tuple of each distinct value, and call these representative tuples "good tuples"; there are at least βn good tuples in T. Let k₁ be the number of distinct values in T₁ projected on the quasi-identifier, and k′ the number of good tuples in T₁. We have k₁ ≥ k′ because all good tuples are distinct. (The probability that any good tuple is chosen more than once is negligible when k ≪ n.) Next we bound the probability Pr[k′ ≤ αβk]. Since each random tuple has probability at least β of being good, and the samples are chosen independently, we can use the Chernoff bound (see [19], Ch. 4) to get

Pr[k′ ≤ αβk] ≤ e^{−(1−α)²βk/2}.

Since k₁ ≥ k′, we have

Pr[k₁ ≤ αβk] ≤ Pr[k′ ≤ αβk] ≤ e^{−(1−α)²βk/2}.

Hence with probability at least 1 − e^{−(1−α)²βk/2}, the quasi-identifier has distinct ratio at least αβ in T₁.  □

Theorem 3.2. With probability at least 1 − δ, the attribute set published by the algorithm has distinct ratio at most β.

4 Experiments

We have implemented all algorithms for finding and masking quasi-identifiers, and conducted extensive experiments using real data sets. All experiments were run on a 2.4GHz Pentium PC with 1GB memory.

4.1 Data Sets

One source of data sets is the census microdata "Public-Use Microdata Samples (PUMS)" [1], provided by the US Census Bureau. We gathered the 5 percent samples of Census 2000 data from all states into a table "census". To study the performance of our algorithms on tables of different sizes, we also extracted 1 percent samples of state-level data and selected 4 states with different population sizes: Idaho, Washington, Texas and California. We extracted 41 attributes including age, sex, race, education level, salary, etc. We only use adult records (age ≥ 20), because many children are indistinguishable even with all 41 attributes. The table census has 10 million distinct adults, and the sizes of Idaho, Washington, Texas and California are 8867, 41784, 141130 and 233687, respectively.

We also use the two data sets adult and covtype provided by the UCI Machine Learning Repository [21]. The covtype table has 581012 rows and 54 attributes. We use 14 attributes of adult, including age, education level and marital status; the number of records in adult is around 30000.
[Figure 1 (two plots, omitted): performance of the masking quasi-identifier algorithms with different sample sizes on table california. Plots (a) and (b) show how the running time (left y-axis) and the utility (right y-axis) change with the sample size k in the Greedy Approximate algorithms for masking 0.5-distinct and 0.8-separation quasi-identifiers, respectively.]

4.2 Masking Quasi-identifiers

The greedy approximate algorithms for masking quasi-identifiers are randomized algorithms that guarantee satisfaction of the privacy constraints with probability 1 − δ. We set δ = 0.01, and the privacy constraints were satisfied in all experiments, which confirms the accuracy of our algorithms.

Figure 1 shows the tradeoff between the running time and the utility (the number of attributes published), using the california data set. Both the running time and the utility decrease as the sample size k decreases; however, the running time decreases linearly with k while the utility degrades very slowly. For example, running the greedy algorithm for masking 0.5-distinct quasi-identifiers on the entire table (without random sampling) takes 80 minutes and publishes 34 attributes (the rightmost point in Figure 1(a)); using a sample of 30000 tuples, the greedy algorithm takes only 10 minutes and outputs 32 attributes. Figure 1(b) shows the impact of k on the masking algorithm for separation quasi-identifiers. Running the greedy algorithm for masking 0.8-separation quasi-identifiers on the entire table takes 728 seconds (not shown in the figure); using a sample of 50000 pairs offers the same utility and takes only 30 seconds. The results show that our random sampling technique can greatly improve the time and space complexity (space is also linear in k), with only a minor sacrifice in utility.

Data Set   Greedy             Greedy Approximate
           utility  time      utility  time
adult      12       36s       -        -
covtype    -        -         46       2000s
idaho      33       172s      -        -
wa         34       880s      33       620s
texas      35       3017s     33       630s
ca         34       4628s     32       606s
census     -        -         30       755s

Table 2. Algorithms for masking 0.5-distinct quasi-identifiers. The column "Greedy" represents the greedy algorithm on the entire table; the column "Greedy Approximate" represents running the greedy algorithm on a random sample of 30000 tuples. We compare the running time and the utility (the number of published attributes) of the two algorithms on different data sets. The results of Greedy on census and covtype are not available because the algorithm does not terminate in 10 hours; the results of Greedy Approximate on adult and idaho are not available because the number of input tuples is less than 30000.

Data Set   Greedy             Greedy Approximate
           utility  time      utility  time
adult      5        19s       5        2s
covtype    38       2 hours   37       104s
idaho      24       147s      23       30s
wa         23       646s      23       35s
texas      19       1149s     19       34s
ca         16       728s      16       30s
census     -        -         17       170s

Table 3. Algorithms for masking 0.8-separation quasi-identifiers. The column "Greedy" represents the greedy algorithm on the entire table; the column "Greedy Approximate" represents running the greedy algorithm on a random sample of 50000 pairs of tuples. We compare the running time and the utility of the two algorithms on different data sets. The result of Greedy on census is unavailable because the algorithm does not terminate in 10 hours.

Tables 2 and 3 compare the running time and the utility (the number of published attributes) of running the greedy algorithm on the entire table versus on a random sample (a sample of 30000 tuples in Table 2 and a sample of 50000 pairs of tuples in Table 3). Results on all data sets confirm that the random sampling technique reduces the running time dramatically, especially for large tables, with only minor impact on the utility. For the largest data set, census, running the greedy algorithm on the entire table does not terminate in 10 hours, while with random sampling it takes no more than 13 minutes for masking 0.5-distinct quasi-identifiers and 3 minutes for masking 0.8-separation quasi-identifiers.

4.3 Approximate Minimum Key Algorithms

Finally, we examine the greedy algorithms of Section 2 for finding the minimum key and the (ε, δ)-separation or -distinct minimum key. Table 4 shows the experimental results of the Greedy Minimum Key, Greedy (0.1, 0.01)-Distinct Minimum Key, and Greedy (0.001, 0.01)-Separation Minimum Key algorithms on different data sets.
The Greedy Minimum Key algorithm (applying the greedy algorithm directly to the entire table) works well for small data sets such as adult and idaho, but becomes unaffordable as the data size increases. The approximate algorithms for separation or distinct minimum key are much faster. For the table california, the greedy minimum key algorithm takes almost one hour, while the greedy distinct algorithm takes 2.5 minutes and the greedy separation algorithm takes merely seconds; for the largest table, census, the greedy minimum key algorithm takes more than 10 hours, while the approximate algorithms take no more than 15 minutes. The space and time requirements of our approximate minimum key algorithms are sublinear in the number of input tuples, and we expect the algorithms to scale well on even larger data sets.

We measured the distinct and separation ratios of the output quasi-identifiers, and found the ratios always within the error ε. This confirms the accuracy of our algorithms. Theorems 2.3 and 2.4 provide theoretical bounds on the size of the quasi-identifiers found by our algorithms (ln m or ln mn times the minimum key size). Those are worst-case bounds, and in practice we usually get much smaller quasi-identifiers. For example, we found by exhaustive search that the minimum key size of adult is 13, and the greedy algorithms for both distinct and separation minimum key find quasi-identifiers no larger than the minimum key. (For the other data sets in Table 4, computing the minimum key exactly takes prohibitively long, so we are not able to verify the approximation ratio of our algorithms on them.) We also generated synthetic tables with known minimum key sizes, applied the greedy distinct minimum key algorithm (with ε = 0.1) to them, and were always able to find quasi-identifiers no larger than the minimum key size. These experiments show that in practice our approximate minimum key algorithms usually perform much better than the theoretical worst-case bounds, and are often able to find quasi-identifiers with high separation (distinct) ratio and size close to the minimum.

5 Related Work

The implication of quasi-identifiers for privacy was first formally studied by Sweeney, who also proposed the k-anonymity framework as a solution to this problem [25, 24]. Numerous subsequent works study the complexity of this problem [17, 2], design and implement algorithms to achieve k-anonymity [23, 4], or extend the framework [16, 14]. Our algorithm for masking quasi-identifiers can be viewed as an approximation to k-anonymity where the suppression must be conducted at the attribute level. It is also an "on average" k-anonymity, because it does not provide perfect anonymity for every individual but does so for the majority; a similar idea is used in [15]. On the other side, our algorithms for finding keys/quasi-identifiers attempt to attack the privacy of published data from the adversary's point of view, when the published data is not k-anonymized. To the best of our knowledge, there is no existing work addressing this problem.

Our algorithms exploit the idea of using random samples to trade off accuracy against space complexity, and can be viewed as streaming algorithms. Streaming algorithms emerged as a hot research topic in the last decade; see [20] for a survey of the area.

Keys are special cases of functional dependencies, and quasi-identifiers are a special case of approximate functional dependencies.
Our definitions of separation and distinct ratios for quasi-identifiers are adapted from the measures for quantifying approximations of functional dependencies proposed in [13, 22].

Data Set   Greedy            Distinct Greedy (ε = 0.1)       Separation Greedy (ε = 0.001)
           key size  time    key size  ratio    time         key size  ratio     time
adult      13        35.5s   13        1.0      8.8s         5         0.99995   3.11s
covtype    5         964s    3         0.9997   78.1s        2         0.999996  27.1s
idaho      14        50.4s   8         0.997    15.2s        3         0.9999    1.07s
wa         22        490s    8         0.995    34.1s        3         0.99993   7.14s
texas      29        2032s   14        0.995    120s         4         0.99995   13.2s
ca         29        3307s   13        0.994    145s         4         0.99998   16.3s
census     -         -       17        0.993    808s         3         0.99998   120s

Table 4. Running time and output key sizes of the Greedy Minimum Key, Greedy (0.1, 0.01)-Distinct Minimum Key, and Greedy (0.001, 0.01)-Separation Minimum Key algorithms. The result of Greedy Minimum Key on census is not available because the algorithm does not terminate in 10 hours.

6 Conclusions and Future Work

In this paper, we designed efficient algorithms for discovering and masking quasi-identifiers in large tables. We developed efficient algorithms that find small quasi-identifiers with provable size and separation/distinct ratio guarantees, with space and time complexity sublinear in the number of input tuples, as well as efficient algorithms for masking quasi-identifiers in large tables. All algorithms in the paper can be extended to the weighted case, where each attribute is associated with a weight and the size/utility of a set of attributes is defined as the sum of their weights. The idea of using random samples to trade off accuracy against space complexity can potentially be explored in other problems on large tables.

References

[1] Public-Use Microdata Samples (PUMS). http://www.census.gov/main/www/pums.html.
[2] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In ICDT, 2005.
[3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002.
[4] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In ICDE, 2005.
[5] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In PODS, 2000.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
[7] M. R. Garey and D. S. Johnson. Computers and Intractability. 1979.
[8] C. Giannella and E. Robertson. On approximation measures for functional dependencies. Information Systems, 2004.
[9] C. M. Giannella, M. M. Dalkilic, D. P. Groth, and E. L. Robertson. Using horizontal-vertical decompositions to improve query evaluation. LNCS 2405.
[10] B. Halldorsson, M. Halldorsson, and R. Ravi. Approximability of the minimum test collection problem. In ESA, 2001.
[11] Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen. Discovery of functional and approximate dependencies using partitions. In ICDE, 1998.
[12] D. Johnson. Approximation algorithms for combinatorial problems. J. Comput. System Sci., 1974.
[13] J. Kivinen and H. Mannila. Approximate dependency inference from relations. Theoretical Computer Science, 1995.
[14] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In ICDE, 2007.
[15] S. Lodha and D. Thomas. Probabilistic anonymity. Technical report.
[16] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: privacy beyond k-anonymity. In ICDE, 2006.
[17] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.
[18] B. Moret and H. Shapiro. On minimizing a set of tests. SIAM Journal on Scientific and Statistical Computing, 1985.
[19] R. Motwani and P. Raghavan. Randomized Algorithms. 1995.
[20] S. Muthukrishnan. Data streams: Algorithms and applications. 2005.
[21] D. Newman, S. Hettich, C. Blake, and C. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[22] B. Pfahringer and S. Kramer. Compression-based evaluation of partial determinations. In SIGKDD, 1995.
[23] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information. In PODS, 1998.
[24] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[25] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

On the Lindell-Pinkas Secure Computation of Logarithms: From Theory to Practice

Raphael S. Ryger (Yale University, New Haven, CT, USA, ryger@cs.yale.edu); Onur Kardes (Stevens Institute of Technology, Hoboken, NJ, USA, onur@cs.stevens.edu); Rebecca N. Wright (Rutgers University, Piscataway, NJ, USA, rebecca.wright@rutgers.edu)

(R. S. Ryger was supported in part by ONR grant N00014-01-1-0795 and by US-Israel BSF grant 2002065. O. Kardes and R. N. Wright were supported in part by NSF grant 0331584. P3DM'08, April 26, 2008, Atlanta, Georgia, USA.)

Abstract

Lindell and Pinkas demonstrated that it is feasible to preserve privacy in data mining by employing a combination of general-purpose and specialized secure-multiparty-computation (SMC) protocol components. Yet practical obstacles of several sorts have impeded a fully practical realization of this idea. In this paper, we address the correctness and practicality of one of their primary contributions, a secure natural-logarithm computation, which is a building block crucial to an SMC approach to privacy-preserving data-mining applications including the construction of ID3 trees and Bayesian networks. We first demonstrate a minor error in the Lindell-Pinkas solution, then provide a correction along with several optimizations. We explore a modest trade-off of perfect secrecy for a performance advantage, a strategy that adds flexibility in the effective application of hybrid SMC to data mining.

1 Introduction

Privacy-preservation objectives in data mining can often be framed ideally as instances of secure multiparty computation (SMC), wherein multiple parties cooperate in a computation without thereby learning each other's inputs. The characterization of SMC is very encompassing, admitting a great variety of input and output configurations, so that a general recipe for adding SMC input security to arbitrary well-specified multiparty computations would seem to solve many quite different problems in one fell swoop. Indeed, general approaches to SMC were proposed for a variety of settings already in the 1980s. Yet the framing of privacy preservation for particular data-mining tasks as SMC problems, making them amenable to the general approaches, is usually not useful. For all but the most trivial computations, the general SMC solutions have been too cumbersome to apply and would be impractical to run.
They require the computation to be represented as an algebraic circuit, with all loops unrolled to as many iterations as could possibly be needed for the supported inputs, and with all contingent branches of the logic as conventionally expressed—such as iterations that happen not to be needed—executed in every run regardless of the inputs. One may reasonably conclude that SMC is just a theoretical curiosity, not relevant for real-world privacy-preserving data mining, where inputs are not just a few bits but entire databases.

Lindell and Pinkas [LP00, LP02] have shown the latter conclusion to be inappropriate. A privacy-preserving data-mining task, they point out, need not be cast as a monolithic SMC problem to which to apply an expensive general SMC solution. Instead, the task may be decomposed into modules requiring SMC, all within a computational superstructure that may itself admissibly be left public. The modules requiring SMC may, in part, be implemented with special-purpose protocols with good performance, leaving general SMC as a fallback (at the module-implementation level) only where special approaches have not been found. The key to such construction is that we are able to ensure secure chaining of the secure protocol components. We prevent information from leaking at the seams between the SMC components by having them produce not public intermediate outputs but rather individual-party shares of the logical outputs, shares that may then be fed as inputs to further SMC components.

Lindell and Pinkas illustrate this creative, hybrid methodology by designing a two-party SMC version of the ID3 data-mining algorithm for building a classification tree: a query-sequencing strategy for predicting an unknown attribute—e.g., loan worthiness—of a new entity whose other attributes—e.g., those characterizing credit history, assets, and income—are obtainable by (cost-bearing) query. At each construction step, the ID3 algorithm enters an episode of information-theoretic analysis of the database of known-entity attributes. The privacy concern is introduced, in the Lindell-Pinkas setting, by horizontal partitioning of that database between two parties that must not share their records. The computation is to go through as if the parties had pooled their data, yet without their revealing to each other, in their computational cooperation, any more regarding their private data than is implied by the ultimate result that is to be made known to them both.

While demonstrating the potential of a modular SMC approach to prospective appliers of the theory, Lindell and Pinkas offer SMC researchers and implementors some design suggestions for the particular SMC modules needed in their structuring of the two-party ID3 computation. Strikingly, they need only three such SMC modules, all relatively small and clearly useful for building other protocols, namely, shares-to-shares logarithm and product protocols and a shares-to-public-value minindex protocol. Their intriguing recommendation for the secure logarithm protocol, critical to the accuracy and performance of SMC data mining whenever information-theoretic analysis is involved, is our focus in this paper.

The present authors have been engaged in a privacy-preserving data-mining project [YW06, KRWF05] very much inspired by Lindell and Pinkas.
Our setting is similar: a database is arbitrarily partitioned between two parties wishing to keep their portions of the data private to the extent consistent with achieving their shared objective of discovering a Bayes-net structure in their combined data. The information-theoretic considerations, and the scoring formula they lead to, are very similar to those in the ID3 algorithm for classification-strategy building, as is the external flow of control that invokes scoring on candidate next query attributes given a set of query attributes that has already been decided upon. (The details and their differences are not germane to the present discussion.) The adaptation we make for privacy preservation in our two-party setting is, not surprisingly, very similar to what Lindell and Pinkas do. Indeed, we need the same SMC components that they do, and just one more, for computing scalar products of binary-valued vectors. This additional need has more to do with the difference in setting—we admit arbitrary, rather than just horizontal, partitioning of the data—than with the difference in analytical objective. In fact, our software would not require much adjustment to serve as a privacy-preserving two-party ID3 implementation, one supporting arbitrarily partitioned data, given the incorporated scalar-product component.

Launching our investigation a few years after Lindell and Pinkas's paper, we have had the advantage of the availability of the Fairplay system [MNPS04] for actually implementing the Yao-protocol components. We have created tools to support the entire methodology, enabling us to take our protocol from a theoretical suggestion all the way to usable software. This exercise has been illuminating. On one hand, it has produced the most convincing vindication of which we are aware of Lindell and Pinkas's broad thesis regarding the practical achievability of SMC in data mining, while teaching us much about the software engineering required for complex SMC protocols. On the other hand, as is typical in implementation work, it has revealed flaws in a number of areas of the underlying theoretical work, including our own.

In this paper, we present our observations on the Lindell-Pinkas logarithm proposal. We correct a mathematical oversight and address a general modular-SMC issue that it highlights: the disposition of scaling factors that creep into intermediate results for technical reasons. We begin in Section 2 with a careful account of the Lindell-Pinkas proposal for a precision-configurable secure two-party shares-to-shares computation of natural logarithms. In Section 3, we explain the mathematical oversight in the original proposal and show that the cost of a straightforward fix by scale-up is surprisingly low, although it leaves us with a greatly inflated scale-up factor. In Sections 4 and 5, we propose efficient alternatives for doing arbitrary scaling securely. These enable a significant optimization in the first phase of the Lindell-Pinkas protocol, allowing removal of the table look-up from the Yao circuit evaluation. We briefly point out the effectiveness of a simple dodge of most of the problematics of the Lindell-Pinkas protocol in Section 6. We conclude with a discussion of our implementation of the revised Lindell-Pinkas protocol and its performance in Section 7.

2 The Lindell-Pinkas ln x protocol

The Lindell-Pinkas protocol for securely computing ln x is intended as a component in a larger secure two-party protocol.
The parties are presumed not to know, and must not hereby learn, either the argument or its logarithm. They contribute secret shares of the argument and obtain secret shares of its logarithm. The proposed design for this protocol module is itself modular, proceeding in two chained phases involving different technology. The first phase internally determines n and ε such that x = 2^n(1 + ε) with −1/4 ≤ ε < 1/2. Note that, since n is an approximate base-2 logarithm of x, the first phase gets us most of the way to the desired logarithm of x. Furthermore, this phase dominates the running time of the entire logarithm protocol: in the absence of a specialized SMC protocol for the first phase, Lindell and Pinkas fall back to dictating that it be implemented using Yao's general approach to secure two-party computation, entailing gate-by-gate cryptography-laden evaluation of an obfuscated Boolean circuit. Yet the main thrust of the Lindell-Pinkas recommendation is in the second phase, which takes (the secret shares of) ε delivered by phase one and computes an additive correction to the logarithm approximation delivered (as secret shares) by phase one.

We will return to the performance-critical considerations in implementing phase one, not addressed by Lindell and Pinkas. We assume that its Boolean circuitry reconstitutes x from its shares; consults the top 1-bit in its binary representation and the value of the bit following it to determine n and ε as defined; represents n and ε in a manner to be discussed; and returns shares of these representations to the respective parties. These values allow an additive decomposition of the sought natural logarithm of x,

(2.1)  ln x = ln(2^n(1 + ε)) = n ln 2 + ln(1 + ε)

The purpose is to take advantage of the Taylor expansion of the latter term,

(2.2)  ln(1 + ε) = Σ_{i=1}^{∞} (−1)^{i−1} ε^i / i = ε − ε²/2 + ε³/3 − ε⁴/4 + · · ·

to enable, in phase two, correction of the phase-one approximation of the logarithm with configurable precision, by choice of the number of series terms to be used—a parameter k to be agreed upon by the parties. The computation in the second, refining phase is to proceed by oblivious polynomial evaluation, a specialized SMC technology which is inexpensive compared to the Yao protocol of the first phase.

In this rough mathematical plan, the value ε to be passed from phase one to phase two is a (generally non-integer) rational, and the terms in the decomposition (2.1) of the final result are (generally non-integer) reals, whereas the values we will accept and produce in the two SMC phases are most naturally viewed as integers. We are, then, representing the rational and the reals as integers through scale-up and finite-precision approximation. We have considerable latitude in the choice of the scale-up factors, particularly considering that the scale-up of a logarithm is just the logarithm to a different base—just as good for information-theoretic purposes as long as the base is used consistently. Still, several considerations inform our choice of scale-up factors. We want the scale-ups to preserve enough precision. On the other hand, there is a performance penalty, here and elsewhere in the larger computation to which this component is contributing, especially in Yao-protocol episodes, for processing additional bits. The chosen scale-up must work mathematically within the larger computation.
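Before turning to the scaling details, the plain (non-secure) arithmetic of the two phases can be sanity-checked in a few lines; this is our illustrative sketch, with everything computed in the clear, whereas the real protocol computes all of it on secret shares:

import math

def phase_one(x):
    # n and eps with x = 2**n * (1 + eps), -1/4 <= eps < 1/2, determined
    # by the top 1-bit of x and the bit following it
    n = x.bit_length() - 1
    if x >= 3 * 2**(n - 1):   # next bit set: round n up, so eps < 0
        n += 1
    return n, x / 2**n - 1

def ln_approx(x, k=8):
    # phase-one main term plus a k-term Taylor correction, as in (2.1)/(2.2)
    n, eps = phase_one(x)
    correction = sum((-1)**(i - 1) * eps**i / i for i in range(1, k + 1))
    return n * math.log(2) + correction

print(ln_approx(1000), math.log(1000))   # both approximately 6.9078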
If an adjustment of the scaling were to be needed for compatibility with the rest of the computation—other than further scale-up by an integer factor—it would entail another secure computation. (We return to this issue in §4.) For the Lindell-Pinkas ID3 computation, or for our Bayes-net structure-discovery computation, both information-theoretic, no adjustment would be needed: all the terms added and subtracted to get scores within the larger computation would be scaled similarly, and those scaled scores serve only in comparison with each other.

We assume that the parties have common knowledge of some upper bound N on n, the approximate base-2 logarithm of the input x, and we have phase one deliver the rational ε scaled up by 2^N. This loses no information, deferring control of the precision of the correction term, ln(1 + ε) in some scale-up, to phase two. Bearing in mind that the slope of the natural-logarithm function is around 1 in the interval around 1 to which we are constraining 1 + ε, we aim for a scale-up of the correction term by at least 2^N, and plan to scale up the main term of the decomposition, n ln 2, to match. Lindell and Pinkas suggest that the mapping from n to n ln 2 · 2^N be done by table look-up within the Yao protocol of phase one. Any further integer scale-up of the main term to match the scaling of the correction term can be done autonomously by the parties, without SMC, by modular multiplication of their respective shares.

Lindell and Pinkas stipulate that the sharing be with respect to a finite field F that is large enough in a sense we discuss in more detail in Section 3. A non-field ring will do, provided that any particular needed inverses exist. This allows us, e.g., to use Paillier homomorphic encryption in a Z_pq both for the oblivious polynomial evaluation needed in phase two of this logarithm component and, subsequently in the larger computation, for the shares-to-shares secure multiplication to compute x ln x—without additional secure Yao computations to convert the sharing from one modulus to another. The only inverses Lindell and Pinkas need here are of powers of 2, and these are available in Z_pq.

The set-up for phase two, then, preserving the Lindell-Pinkas notation, is that phase one has delivered to the parties, respectively, shares β1 and β2 such that β1 +F β2 = n ln 2 · 2^N, toward (whatever ultimate scale-up of) the main term of the decomposition (2.1); and shares α1 and α2 such that α1 +F α2 = ε · 2^N, toward the phase-two computation of (the scale-up of) the correction term of the decomposition. We continue to phase two. Replacing ε in formula (2.2) with (α1 +F α2)/2^N, we get

(2.3)  ln(1 + ε) = Σ_{i=1}^{∞} (−1)^{i−1} (α1 +F α2)^i / (i · 2^{Ni})

In this infinite-series expression, the only operation to be carried out in the finite ring F is the recombination of the shares, α1 and α2, as noted. The objective in phase two is to compute the series in sufficiently good approximation through oblivious polynomial evaluation by the two parties, returning shares of the value to the parties. So we need to get from the infinite series—a specification of a limit in R for what appear to be operations in Q—to a polynomial over the finite ring F that may be evaluated so as to contribute to the sought shares. This will entail several steps of transformation.
Step 1. The computation must be finite. We take only k terms of the series.

Step 2. We deal somehow with the division that appears in the summand. We need to be sure we end up, when the transformation is complete, with a polynomial over F. We can scale up the whole formula to cancel some or all of the division. The disposition of any remaining division, as we work toward determining the coefficients of the polynomial to be evaluated, turns out to be problematic, largely occasioning this paper. (The existence of modular inverses in F for the remaining divisors is not sufficient.) For the moment, let σ be whatever scale-up factor we decide to use here.

Step 3. We reinterpret the outer summation and the multiplications, including the binomial exponentiation and the multiplication by σ, as modular addition and multiplications in F. Note that we cannot even open the parentheses by formal exponentiation, applying a distributive law, without first reinterpreting the multiplication as in F: we have no law regarding the distribution of multiplication in Z over addition in F. This requires that we assure ourselves that the reinterpretation does not alter the value of the expression. Lindell and Pinkas ensure this by requiring F to be sufficiently large, and we will review the consideration.

Step 4. We replace the occurrence of 'α2' in (2.3)—as truncated, division-resolved, and modularly reinterpreted—with the variable 'y'. Knowing α1, party 1 does the formal exponentiations and collects terms, all modulo |F|, yielding a polynomial in 'y' over F. Party 1 randomly chooses z1 ∈ F and subtracts it from the constant term of the polynomial. Where Q(y) is the resulting polynomial and z2 is its value at y = α2, to be obtained by party 2 through the oblivious polynomial evaluation to follow, we have

(2.4)  z2 = Q(y)|_{y=α2} = [ Σ_{i=1}^{k} σ(−1)^{i−1} (α1 + y)^i / (i · 2^{Ni}) ]_{y=α2} − z1

where all operations—once the approach to the division in the summand is sorted out—are in F, so that

z1 +F z2 ≈ Σ_{i=1}^{∞} σ(−1)^{i−1} (α1 +F α2)^i / (i · 2^{Ni}) = ln(1 + ε) · σ

—all operations here, except as indicated, back in R. Thus, the computation of z2 according to (2.4) by oblivious polynomial evaluation accomplishes the sharing of ln(1 + ε) · σ as z1 and z2. The parties may autonomously modularly multiply β1 and β2 by lcm(2^N, σ)/2^N, giving β1′ and β2′, respectively; modularly multiply z1 and z2 by lcm(2^N, σ)/σ, giving z1′ and z2′, respectively; and modularly add their respective results from these scale-ups. Then, per the decomposition in (2.1),

(β1′ +F z1′) +F (β2′ +F z2′) = (β1′ +F β2′) +F (z1′ +F z2′) ≈ (n ln 2 + ln(1 + ε)) · lcm(2^N, σ) = ln x · lcm(2^N, σ)

accomplishing the original goal of securely computing shares of ln x from shares of x—if with a scale-up that we hope is innocuous. But this sketch of the protocol still needs to be fleshed out. We back up now, first briefly to step 3, and then to step 2, our main focus.

By the time we get to step 3, we should be left with an expression prescribing finitely many operations in Z, viewing +F as an operation in Z and viewing division as a partially-defined operation in Z. Looking ahead to step 4, we will be replacing the occurrences of 'α2' in this expression with the variable 'y' and algebraically reorganizing it into the polynomial Q(y) (with a change to the constant term). In this step 3, we change only the semantics of the expression arrived at, not its syntactic composition.
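Before examining that claim, the share-producing arithmetic of step 4 is easy to sanity-check outside the protocol. The following is our sketch only: the oblivious polynomial evaluation itself (e.g., via homomorphic encryption) and the handling of the division are elided, and the modulus M and the coefficients are purely illustrative.

import random

M = 2**61 - 1   # a large (prime) modulus standing in for |F|

def make_shares_of_poly_value(coeffs, alpha2):
    """Party 1 holds polynomial P (coefficients mod M, with alpha1 already
    folded in); party 2 holds alpha2. Party 1 masks the constant term with
    a random z1; party 2 would learn z2 = Q(alpha2) via oblivious polynomial
    evaluation (done in the clear here), so z1 + z2 = P(alpha2) mod M."""
    z1 = random.randrange(M)
    q = list(coeffs)
    q[0] = (q[0] - z1) % M
    z2 = sum(c * pow(alpha2, i, M) for i, c in enumerate(q)) % M
    return z1, z2

coeffs = [5, 3, 7]                 # P(y) = 5 + 3y + 7y^2 mod M (illustrative)
alpha2 = random.randrange(M)
z1, z2 = make_shares_of_poly_value(coeffs, alpha2)
assert (z1 + z2) % M == sum(c * pow(alpha2, i, M)
                            for i, c in enumerate(coeffs)) % M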
But this sketch of the protocol still needs to be fleshed out. We back up now, first briefly to step 3, and then to step 2, our main focus. By the time we get to step 3, we should be left with an expression prescribing finitely many operations in Z, viewing +_F as an operation in Z and viewing division as a partially-defined operation in Z. Looking ahead to step 4, we will be replacing the occurrences of 'α2' in this expression with the variable 'y' and algebraically reorganizing it into the polynomial Q(y) (with a change to the constant term). In this step 3, we change only the semantics of the expression arrived at, not its syntactic composition. The claim to be made is that the hybrid expression at hand, involving some modular additions but otherwise non-modular operations, can be reinterpreted to involve only modular operations without change to the induced expression value—allowing the expression then to be transformed syntactically with guarantee of preservation of value, but now with respect to the new semantics. This tricky claim, made implicitly, bears explicit examination.

We can frame the issue abstractly. Suppose ϕ is an arbitrarily complex numerical expression built recursively of variables and function symbols (admitting constants as 0-ary function symbols). We have a conventional interpretation of ϕ in the domain Z. We also have an alternate interpretation of ϕ in the domain Z_m. Furthermore, we have an alternate expression, ϕ′, obtained from ϕ by transformations guaranteed to preserve the value of the whole under the interpretation in Z_m for any assignment of values from Z_m to the variables. We intend to compute ϕ′ as interpreted in Z_m. Under what circumstances can we be assured that this computation will yield the same value as does evaluation of the original expression ϕ according to the original interpretation in Z? In the case at hand, ϕ is

(2.5)   \sum_{i=1}^{k} \frac{\sigma\,(-1)^{i-1}\,(\alpha_1 +_F y)^i}{i\,2^{Ni}}

(with some decision as to how to interpret the division), whereas ϕ′ is the polynomial Q(y) + z1 which we actually compute, to be interpreted in Z_m (where m = |F|) and be so computed, with the value to be assigned to 'y' in both cases being α2.

There are obvious strong sufficient conditions under which modular reinterpretation preserves value. We do have to be careful to take into account, in generalizing, that in our instance ε may be negative, and that our summation expression has sign alternation, so we need to proceed via a "signed-modular" interpretation, wherein the mod-m integers ⌈m/2⌉ to m − 1 are viewed as "negative", i.e., they are isomorphically replaced by the integers −⌊m/2⌋ to −1. (Choosing the midpoint for the cutover here is arbitrary, in principle, but appropriate for our instance.) If (a) for the values we are contemplating assigning to the variables, the recursive evaluation of ϕ under the original interpretation assigns values to the subexpressions of ϕ that are always integers in the interval [−⌊m/2⌋, ⌊m/2⌋]; and if (b) the functions assigned to the function symbols in the signed-modular reinterpretation agree with the functions assigned by the original interpretation whenever the arguments and their image under the original function are all in that signed-mod-m interval; then the signed-modular reinterpretation will agree with the original interpretation on the whole expression ϕ for the contemplated value assignments to the variables. Note that we need not assume that the reinterpretation associates with the function symbols the signed-modular analogues of the original functions, although this would ensure (b). Nor would a stipulation of modular agreement be sufficient, in general, without condition (a), even if the original evaluation produces only (overall) values in the signed-mod-m domain for value assignments of interest. The danger is that modular reduction of intermediate values, if needed, may lose information present in the original evaluation.
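A two-line illustration of the danger condition (a) guards against (our toy numbers, not the paper's): the signed-mod-m reading recovers small negative integers exactly, but an intermediate value that escapes [−⌊m/2⌋, ⌊m/2⌋] is silently wrapped.

    def signed(v: int, m: int) -> int:
        # read a residue as a "signed-mod-m" integer in [-(m//2), m//2]
        v %= m
        return v - m if v > m // 2 else v

    m = 101
    print(signed(-7 % m, m))       # -7: faithfully recovered
    print(signed(40 * 5 % m, m))   # -2, not 200: the intermediate escaped the window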
In our case, the single variable, 'y', is assigned the value α2, which may be as large as, but no larger than, m − 1. The constant α1 is similarly less than m. We can view these mod-m values, returned by the Yao protocol in phase one, as being the corresponding signed-mod-m values instead, with +_F operating on them isomorphically. Moreover, α1 +_F y then evaluates into the interval [−¼·2^N, ½·2^N), where we can arrange for the endpoints to be much smaller in absolute value than ⌊m/2⌋. This allows Lindell and Pinkas to reason about setting m high enough so that indeed all subexpressions of our ϕ will evaluate, in the original interpretation, into the signed-mod-m domain. Note that if formal powers of 'α1' and of 'y' appeared as subexpressions in our original expression ϕ, as they do in our ϕ′, we would have concern over potential loss of information in modular reduction impeding the modular reinterpretation; but the power subexpressions appear only after we have reinterpreted and transformed ϕ, and are by then of no concern. We now return to step 2, attending to the division in the Taylor-series terms.

3 The division problem

We have already seen that choices of scaling factor are governed by several considerations, including preservation of precision, avoidance of division where it cannot be carried out exactly, and compatibility among intermediate results. For preservation of precision, we have been aiming to compute the main and correction terms of (2.1) scaled up by at least 2^N. Lindell and Pinkas incorporate this factor into their σ in preparing the polynomial. To dispose of the i factors in the denominator in (2.4), they increase the scale-up by a factor of lcm(2, . . . , k). With σ now at 2^N lcm(2, . . . , k), the truncated Taylor series we are looking at in step 2 becomes

(3.6)   \ln(1+\varepsilon)\cdot 2^N\,\mathrm{lcm}(2,\ldots,k) \approx \sum_{i=1}^{k} \frac{(-1)^{i-1}\,(\mathrm{lcm}(2,\ldots,k)/i)\,(\alpha_1 +_F \alpha_2)^i}{2^{N(i-1)}}

We know that in step 3 we will be reinterpreting the operations in this expression—more precisely, in the expression we intend this expression to suggest—as operations in F. Clearly, since k is agreed upon before the computation, the subexpression 'lcm(2, . . . , k)/i' may be replaced immediately by (a token for) its integer value. We are still left with a divisor of 2^{N(i−1)}, but Lindell and Pinkas reason that (α1 +_F α2)^i, although not determined until run time, will be divisible by 2^{N(i−1)}. After all, (α1 +_F α2)^i will be (ε · 2^N)^i, and the denominator was designed expressly to divide this to leave ε^i · 2^N. Apparently, all we need to do is allow the division bar to be reinterpreted in step 3 as the (partially defined) division operation in Z_m, i.e., multiplication by the modular inverse of the divisor. We can assume that m is not even, so that powers of 2 have inverses modulo m. Furthermore, whenever a divides b (in Z) and b < m, if a has an inverse (a ∈ Z*_m) then a^{−1}b in Z_m is just the integer b/a. It would appear that the strong sufficient conditions for reinterpretation are met.

The trouble is that, although (α1 +_F α2)^i = (ε · 2^N)^i is an integer smaller than m (given that we will ensure that m is large enough), and although the expression '(ε · 2^N)^i' appears to be formally divisible by the expression '2^{N(i−1)}', the integer (ε · 2^N)^i is not, in general, divisible by the integer 2^{N(i−1)}. In Q, the division indeed yields ε^i · 2^N, which is just the scale-up by 2^N we engineered it to achieve. That rational scale-up is an integer for i = 1, but will generally not be an integer for i > 1. (Roughly, ε^i · 2^N is an integer if the lowest-order 1 bit in the binary representation of x is within N/i digits of its highest-order 1 bit—a condition that excludes most values of x already for i = 2.) This undermines the sufficient condition Lindell and Pinkas hoped to rely on to justify the modular reinterpretation, our step 3. Without the divisibility in the integers, there is no reason to believe that reinterpretation of the division by 2^{N(i−1)} as modular multiplication by its mod-m inverse (2^{N(i−1)})^{−1} would have anything to do with the approximation we thought we were computing. The ensuing formal manipulation in step 4 to get to a polynomial to be evaluated obliviously would be irrelevant.
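The failure is easy to exhibit numerically (a sketch with our own example values): the integer (ε · 2^N)^i is generally not divisible by 2^{N(i−1)}, and "dividing" by a modular inverse instead lands nowhere near the intended rational value.

    N, m = 16, (1 << 40) + 15        # an odd modulus, so powers of 2 are invertible
    e = int(0.3 * (1 << N))          # eps * 2^N for a typical, non-dyadic eps
    i = 2
    num, den = e ** i, 1 << (N * (i - 1))
    print(num % den != 0)                 # True: no divisibility in the integers
    print(num / den)                      # the intended eps^i * 2^N, a non-integer
    print(num * pow(den, -1, m) % m)      # modular "division": an unrelated value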
The immediate brute-force recourse is to increase the scale-up factor, σ, currently at 2^N lcm(2, . . . , k), to 2^{Nk} lcm(2, . . . , k). This leaves our truncated Taylor series as

(3.7)   \ln(1+\varepsilon)\cdot 2^{Nk}\,\mathrm{lcm}(2,\ldots,k) \approx \sum_{i=1}^{k} (-1)^{i-1}\,2^{N(k-i)}\,(\mathrm{lcm}(2,\ldots,k)/i)\,(\alpha_1 +_F \alpha_2)^i

Phase one still feeds phase two shares of ε scaled up by 2^N. For compatibility with the larger scale-up of the correction term of the decomposition as now delivered (in shares) by phase two, the parties will autonomously scale up their shares of the main term of the decomposition by a further factor of 2^{N(k−1)}.

The natural concern that a scaling factor so much larger will require F to be much larger, with adverse performance implications, turns out to be unfounded. Surprisingly, the guideline given by Lindell and Pinkas for the size of F—namely, 2^{Nk+2k} or more—need not be increased by much. The original guideline actually remains sufficient for the step-3 reinterpretation of the operations to be sound. But now, with the (unshared) scaled-up correction term alone so much wider, requiring some Nk bits of representation, we are in danger of running out of room in the space for the scaled-up main term if log2 N > 2k. Raising the size requirement for F to 2^{Nk+2k+log2 N} should be sufficient. If we want to provide, in the larger protocol, for computation of x ln x, scaled up to x(σ ln x), in the same space F, we need to raise the size requirement for F to 2^{Nk+2k+log2 N+N}.

Our larger scale-up here does not carry any additional information, of course. The creeping growth in the computational space does affect performance, but only minimally. Even in Yao SMC episodes, the larger space affects only the modular addition to reconstitute shared inputs at the outset and the modular addition to share the computed results at the end. The computation proper is affected by the size of the space of the actual unshared inputs, but not by the size of the space for modular sharing.

The more significant issue is that we continue to be saddled with scaling factors that are best not incurred in building blocks intended for general use. We explore efficient ways to reverse unwanted scaling. The problem is tantamount to that of efficiently introducing wanted arbitrary—i.e., not necessarily integral—scaling. Lindell and Pinkas need such scaling to get from base-2 logarithms to natural logarithms in phase one of the protocol. A good solution to this problem of secure arbitrary scaling will enable us to do better than (even a smart implementation of) the table look-up inside the phase-one Yao protocol that they call for, in addition to allowing reversal of whatever scale-up is delivered by the entire logarithm protocol.
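Before moving on, a quick check (our sketch) that the enlarged σ of (3.7) absorbs the division i · 2^{Ni} into integer coefficients while still delivering ln(1 + ε) in the agreed scale-up:

    import math

    N, k = 16, 4
    lcm = math.lcm(*range(2, k + 1))
    e = int(0.3 * (1 << N))                      # eps * 2^N, a sample shared value

    total = 0
    for i in range(1, k + 1):
        c = (-1) ** (i - 1) * (1 << (N * (k - i))) * (lcm // i)  # integer by design
        total += c * e ** i                      # no division anywhere in the sum
    eps = e / (1 << N)
    print(total / (lcm << (N * k)), math.log(1 + eps))   # close agreement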
4 Secure non-integer scaling of shared values

Suppose parties 1 and 2 hold secret shares modulo m, respectively γ1 and γ2, of a value γ; and suppose σ = κ + ρ is a scaling factor to be applied to γ, where κ is a non-negative integer and 0 ≤ ρ < 1. σγ is not, in general, an integer, but a solution that can provide the parties shares of an integer approximation of σγ suffices. κγ may be shared exactly simply by having the parties autonomously modularly scale up their shares by κ. That leaves the sharing of (an approximation of) ργ, the shares to be added modularly to the shares of κγ to obtain shares of (an approximation of) σγ. The problem is that approximate multiplication by a non-integer does not distribute over modular addition, even approximately!

A bifurcated distributive property does hold, however. If the ordinary sum γ1 + γ2 is < m, the usual distributive law for multiplication of the sum by ρ holds approximately for approximate multiplication. If, on the other hand, the ordinary sum γ1 + γ2 is ≥ m, then the modular sum is, in ordinary terms, γ1 + γ2 − m, so that the distribution of the multiplication by ρ over the modular addition of γ1 and γ2 will need an adjustment of approximately −ρm.

This suggests the following protocol to accomplish the scaling by ρ mostly by autonomous computation by the parties on their own shares, but with a very minimal recourse to a Yao protocol to select between the two cases just enumerated. The Yao computation takes ργ1 and ργ2, each rounded to the nearest integer, as computed by the respective parties; and the original shares γ1 and γ2 as well. Party 1 also supplies a secret random input z1 < m. The circuit returns to party 2 either (ργ1 + ργ2) +_{mod m} z1 or (ργ1 + ργ2 − ρm) +_{mod m} z1, accordingly as γ1 + γ2 < m or not. Party 1's share is m − z1. The integer approximation of ρm is built into the circuit. The cumulative approximation error is less than 1.5, and usually less than 1. But an unconventional approach can allow us to do better still.
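Simulated in the clear (our sketch, with the Yao case selection replaced by a plain if, and Fraction keeping the roundings exact), the bifurcated step looks as follows:

    from fractions import Fraction
    import random

    m, rho = (1 << 30) + 3, Fraction(3, 10)
    gamma = 123_456
    g1 = random.randrange(m)
    g2 = (gamma - g1) % m

    r1, r2 = round(rho * g1), round(rho * g2)    # each party rounds its own product
    if g1 + g2 >= m:                             # the case the Yao circuit selects on
        scaled = (r1 + r2 - round(rho * m)) % m  # correct the wrap by about rho*m
    else:
        scaled = (r1 + r2) % m
    print(scaled, round(rho * gamma))            # agree to within ~1.5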
5 The practical power of imperfect secrecy

In implementing secure protocols, we tend to be induced by different considerations to choose moduli for sharing that are vastly larger than the largest value that will be shared. In the Lindell-Pinkas logarithm proposal, for instance, if N is 13, so as to accommodate ID3 database record counts of around 8,000, and k is 4, our share space is of a size greater than 10^20. Prior to our correction, logarithms are to be returned scaled up by around 10^5, making for a maximum output of around 10^6. Thus, the size of the sharing space is larger than the largest shared value by a factor of 10^14. In such a configuration, it is a bit misleading to state that the distributive law is bifurcated. The case of the shares not jointly exceeding the modulus is very improbable. If we could assume the nearly certain case of the shares being excessive—i.e., needing modular reduction—to hold, we would not need a Yao episode to select between two versions of the scaling computation. Each party would scale autonomously and party 1 would subtract ρm to correct for the excess. We could abide the very small chance of error in this assumption. But better would be to guarantee (approximate) correctness of the autonomous scaling by contriving to ensure that the shares be excessive. This turns out to be quite tricky in theory while straightforward in practice. It entails a small sacrifice of the information-theoretic perfection of the secrecy in the sharing, but the sacrifice should be of no practical significance.

Let t be the largest value to be shared, much smaller than the modulus m. We can ensure that shares are excessive by restricting the independently set share to be greater than t. But we can show that if it is agreed that the independent share will be chosen uniformly randomly from the interval [t + 1, m − 1] then, if it is actually chosen within t of either end of this interval, information will leak to the other party through the complementary share given him for certain of the values from [0, t] that might be shared—to the point of completely revealing the value to the other party in the extreme case. If the choice is at least t away from the ends of the choice interval, perfect secrecy is maintained. But if we take this to heart and agree that the independent share must be from the smaller interval [2t + 1, m − 1 − t], then the same argument can be made regarding the possibility that the choice is actually within t of the ends of this smaller interval. Recursively, to preserve secrecy, we would lop off the ends of the choice interval until nothing was left. But as in the "surprise quiz" (or "unexpected hanging") paradox, wherein we establish that it is impossible to give a surprise quiz "some day next week," the conclusion here, too, is absurd from a practical point of view. If the independent share is chosen from some huge, but undeclared, interval around m/2—huge by comparison with t but tiny by comparison with m—there simply is no problem with loss of secrecy. We can assume that the sharing is excessive, and arbitrary scaling can be accomplished by the parties completely autonomously.

We may be able to look at the random choice of the independent share from an undeclared interval instead as a non-uniform random choice, the distribution being almost flat, with the peak probability around m/2 dropping off extremely gradually to 0 as the ends of [t + 1, m − 1] are approached. As long as the probabilities are essentially the same in a cell of radius t around whatever independent share is actually chosen—and it is exceedingly unlikely that there not exist a complete such cell around the choice—secrecy is preserved. But theorizing about the epistemology here is beyond our scope. The point is that, in practice, it seems worth considering that we can gain performance by not requiring Yao episodes when non-integer scaling is needed. In the Lindell-Pinkas protocol, for scaling the approximate base-2 logarithms determined in phase one to corresponding approximate natural logarithms, this approach is fine. For getting rid of the scale-up delivered in the final result, beyond whatever scale-up is sufficient for the precision we wish to preserve, we would need to extend the size of F somewhat before using this approach, now that our correction has greatly increased the maximum value that may be delivered as shares by the oblivious polynomial evaluation. On balance, considering the added expense that would be incurred in other components of the larger protocol, it is best not to enlarge F (further) and to reverse the scaling of the result, if necessary, by the method of the preceding section.
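The practical upshot can be sketched directly (our toy parameters): with the independent share drawn from a huge undeclared interval around m/2, the sharing is excessive essentially always, and the fractional scaling needs no Yao episode at all.

    from fractions import Fraction
    import random

    m, t = (1 << 60) + 33, 1 << 20               # t: largest value ever shared
    rho = Fraction(7, 10)
    gamma = random.randrange(100, t)
    g1 = m // 2 + random.randrange(-(1 << 40), 1 << 40)  # undeclared window near m/2
    g2 = (gamma - g1) % m                        # g1 + g2 = gamma + m: "excessive"

    s1 = (round(rho * g1) - round(rho * m)) % m  # party 1 corrects for the excess
    s2 = round(rho * g2) % m                     # party 2 scales autonomously
    print((s1 + s2) % m, round(rho * gamma))     # shares of ~rho*gamma, no interaction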
6 Alternative: Pretty good precision, high performance

For many purposes, a much simpler secure computation for logarithms may offer adequate precision. The base is often not important, as noted, so base 2 may do—as indeed it would in the ID3 computation. Noting that in the interval [1, 2] the functions y = log2 x and y = x − 1 agree at the ends of the interval and deviate by only 0.085 in the middle, we have the Yao circuit determine the floor of the base-2 logarithm and then append to its binary representation the four bits of the argument following its top 1-bit. This gives a result within 1/16 of the desired base-2 logarithm. We used this approach in our Bayes-net structure computation [YW06, KRWF05] while sorting out the issues with the much more complex Lindell-Pinkas proposal. As in the Lindell-Pinkas secure ID3 computation, the logarithms inform scores that, in turn, are significant only in how they compare with other scores, not in their absolute values. As long as the sense of these score comparisons is not affected, inaccuracies in the logarithms are tolerable. We bear in mind also that, in the particular data-mining contexts we are addressing, the algorithms are based on taking the database as a predictive sample of a larger space. In so depending on the database, they are subject to what may be regarded as sampling error in any case. From that perspective, even the reversal of sense in some comparisons of close scores cannot be regarded as rendering the approach inappropriate. However, simple as this approach is, the performance consideration in its favor is considerably weakened once we remove the conversion from base-2 to scaled-up natural logarithms from the Yao portion of the Lindell-Pinkas protocol, as we now see we can do.
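In code (our rendering of the circuit's output, not the authors' implementation), the approximation is just the exponent plus the next four bits read as sixteenths; an exhaustive check over small inputs keeps the error visible.

    import math

    def approx_log2(x: int) -> float:
        n = x.bit_length() - 1                   # floor(log2 x): the integer part
        if n >= 4:
            frac4 = (x >> (n - 4)) & 0b1111      # the four bits after the top 1-bit
        else:
            frac4 = (x << (4 - n)) & 0b1111
        return n + frac4 / 16.0                  # linear y = x - 1 on [1, 2], 4-bit frac

    worst = max(abs(approx_log2(x) - math.log2(x)) for x in range(1, 1 << 13))
    print(worst)   # ~0.15 in this truncating rendering: linear error plus truncation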
7 Implementation and performance

We have evolved an array of tools to aid in developing hybrid-SMC protocols of the style demonstrated by Lindell and Pinkas. These will be documented in a Yale Computer Science Department technical report and will be made available. Among the resources are a library of Perl functions offering a level of abstraction and control we have found useful for specifying the generation of Boolean circuits; scripts for testing circuits without the overhead of secure computation; particular circuit generators, as for the phase-one Yao episode in the Lindell-Pinkas logarithm protocol and for the min-index Yao episode needed for the best-score selection in their larger secure ID3 computation; additional SMC components not involving circuits; and a library of Perl functions facilitating the coordination of an entire hybrid-SMC computation involving two parties across a network.

We have been developing and experimenting on NetBSD and Linux operating systems running on Intel Pentium 4 CPUs at 1.5 to 3.2 GHz. We use the Fairplay run-time system [MNPS04], written in Java and running over Sun JRE 1.5, to execute Yao-protocol episodes. The Yao episode in phase one of the Lindell-Pinkas logarithm protocol completely dominates the running time of the entire logarithm computation, making the performance of Fairplay itself critical.

We cannot address the performance of multiparty computations without giving special attention to the cost of communication. This element is a wildcard, dependent on link quality and sheer propagation delay across the network distance between the parties. We have done most of our experimentation with the communication component trivialized by running both parties on the same machine or on two machines on the same LAN. For a reality check, we did some experimenting with one party at Yale University in New Haven, CT and the other party at Stevens Institute of Technology in Hoboken, NJ, with a 15 ms round-trip messaging time between them. There was no significant difference in performance in Yao computations. Admittedly, this is at a relatively small network distance. But there is another way to look at this. If network distance were really making the communication cost prohibitive, the two parties anxious to accomplish the joint data-mining computation securely could arrange to run the protocol from outposts of theirs housing prepositioned copies of their respective private data, the outposts securely segregated from each other but at a small network distance. From this perspective, and recognizing that the protocols we are considering involve CPU-intensive cryptographic operations, it is meaningful to assess their performance with the communication component minimized.

With the parties running on 3.2 GHz CPUs, and working with a 60-bit modulus, it takes around 5 seconds to run the complete Lindell-Pinkas logarithm computation. In more detail, to accommodate input x of up to 17 bits (≤ 131071), with k = 3 terms of the series to be computed in phase 2 (for an absolute error within 0.0112), we generate a circuit of 1497 gates and the computation runs in around 5.0 seconds. With the same modulus, to accommodate input x of only up to 13 bits (≤ 8191), allowing k = 4 terms of the series to be computed in phase 2 (for an absolute error within 0.0044), we generate a circuit of 1386 gates and the computation runs in around 4.9 seconds. Accommodating inputs of only up to 10 bits (≤ 1023), allowing as many as k = 5 series terms (for an absolute error within 0.0018), the gate count comes down to 1314 and the running time comes down to around 4.8 seconds.

Clearly, a 5-second wait for a single result of a Lindell-Pinkas secure-logarithm computation seems quite tolerable, but it serves little purpose in itself, of course. This is a shares-to-shares protocol intended for incorporation in a larger data-mining protocol that will ultimately leave the parties with meaningful results. It is reasonable to ask, in such a larger hybrid-SMC protocol, how badly would a 5-second delay for each logarithm computation—and, presumably, comparable delays for other needed SMC building blocks—bog down the entire data-mining algorithm? We can give a rough idea, based on experiment, of the performance that appears to be possible now in an entire privacy-preserving data-mining computation based on a hybrid-SMC approach. Without fully qualifying the tasks, software versions, and hardware involved, our secure Bayes-net structure-discovery implementation has run against an arbitrarily privately partitioned database of 100,000 records of six fields in about 2.5 hours. This involved almost 500 invocations of the secure logarithm protocol, each involving a Yao-protocol episode run using the Fairplay system, as well as other component protocols. The overall time, computing against this many records, was dominated not by the Yao protocol episodes of the logarithm and min-index components but rather by the scalar-product computations needed to determine securely the numbers of records matching patterns across the private portions of the logical database. The scalar-product computations require a number of homomorphic-encryption operations linear in the number of records in the database.
In developing and using these tools over some time, we note that the room for improvement in performance as implementations are optimized is large. Improvements that do not affect complexity classes, hence of lesser interest to theoreticians, are very significant to practitioners. Improvements in complexity class are there as well; we gained a log factor in our gate counts in the logarithm circuits over our initial naive implementation. Meanwhile, it is clear that significant hybrid-SMC computations are already implementable in a maintainable, modular manner with a development effort that is not exorbitant. Performance of such computations is becoming quite reasonable for realistic application in privacy-preserving data-mining contexts.

Acknowledgments

We thank Benny Pinkas for helpful discussion of the design of the original Lindell-Pinkas logarithm protocol.

References

[KRWF05] Onur Kardes, Raphael S. Ryger, Rebecca N. Wright, and Joan Feigenbaum. Implementing privacy-preserving Bayesian-net discovery for vertically partitioned data. In Proceedings of the ICDM Workshop on Privacy and Security Aspects of Data Mining, pages 26–34, 2005.
[LP00] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Advances in Cryptology – CRYPTO '00, volume 1880 of Lecture Notes in Computer Science, pages 36–54. Springer-Verlag, 2000.
[LP02] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):177–206, 2002.
[MNPS04] Dahlia Malkhi, Noam Nisan, Benny Pinkas, and Yaron Sella. Fairplay – a secure two-party computation system. In Proceedings of the 13th USENIX Security Symposium, pages 287–302. USENIX, 2004.
[YW06] Zhiqiang Yang and Rebecca N. Wright. Privacy-preserving computation of Bayesian networks on vertically partitioned data. IEEE Transactions on Knowledge and Data Engineering, 18(9), 2006. An earlier version appeared in KDD 2004.

Constrained k-Anonymity: Privacy with Generalization Boundaries

John Miller*, Alina Campan*$, Traian Marius Truta*
* Department of Computer Science, Northern Kentucky University, U.S.A., {millerj10, campana1, trutat1}@nku.edu
$ Visiting from Department of Computer Science, Babes-Bolyai University, Romania
P3DM'08, April 26, 2008, Atlanta, Georgia, USA.
Abstract: In the last few years, due to new privacy regulations, research in data privacy has flourished. A large number of privacy models were developed, most of which are based on the k-anonymity property. Because of several shortcomings of the k-anonymity model, other privacy models were introduced (l-diversity, p-sensitive k-anonymity, (α, k)-anonymity, t-closeness, etc.). While differing in their methods and the quality of their results, they all focus first on masking the data, and then protecting the quality of the data as a whole. We consider a new approach, where requirements on the amount of distortion allowed to the initial data are imposed in order to preserve its usefulness. Our approach consists of specifying quasi-identifier generalization boundaries, and achieving k-anonymity within the imposed boundaries. We think that limiting the amount of generalization when masking microdata is indispensable for real life datasets and applications. In this paper, the constrained k-anonymity model and its properties are introduced and an algorithm for generating constrained k-anonymous microdata is presented. Our experiments have shown that the proposed algorithm is comparable with existing algorithms used for generating k-anonymity with respect to results quality, and that by using existing unconstrained k-anonymization algorithms the generalization boundaries are violated. We also discuss how the constrained k-anonymity model can be easily extended to other privacy models.

1 Introduction

A huge interest in data privacy has been generated recently within the public and media [14], as well as in the legislative body [6] and research community. Many research efforts have been directed towards finding methods to anonymize datasets to satisfy the k-anonymity property [16, 17]. These methods also consider minimizing one or more cost metrics between the initial and released microdata (a dataset where each tuple corresponds to one individual entity). Of particular interest are the cost metrics that quantify the information loss [2, 5, 19, 27]. Although producing the optimal solution for the k-anonymity problem w.r.t. various proposed cost measures has been proved to be NP-hard [9], there are several polynomial algorithms that produce good solutions for the k-anonymity problem for real life datasets [1, 2, 8, 9, 21].

Recent results have shown that k-anonymity fails to protect the privacy of individuals in all situations [12, 20, 26]. Several privacy models that extend the k-anonymity model have been proposed in the literature to avoid k-anonymity shortcomings: p-sensitive k-anonymity [20] with its extension called extended p-sensitive k-anonymity [3], l-diversity [12], (α, k)-anonymity [24], t-closeness [10], (k, e)-anonymity [28], (c, k)-safety [13], m-confidentiality [25], personalized privacy [26], etc.

In general, the existing anonymization algorithms use different quasi-identifier generalization strategies in order to obtain a masked microdata that is k-anonymous (or satisfies an extension of k-anonymity) and conserves as much information intrinsic to the initial microdata as possible. To our knowledge, a privacy model that considers the specification of the maximum allowed generalization level for quasi-identifier attributes in the masked microdata does not exist, nor does a corresponding anonymization algorithm capable of controlling the generalization amount. The ability to limit the amount of allowed generalization could be valuable, and, in fact, indispensable for real life datasets. For example, for some specific data analysis tasks, available masked microdata with the address information generalized beyond the US state level could be useless. In this case the only solution would be to ask the owner of the initial microdata to have the anonymization algorithm applied repeatedly on that data, perhaps with a decreased level of anonymity (a smaller k), until the masked microdata satisfies the maximum generalization level requirement (i.e., no address is generalized further than the US state).

In this paper, we first introduce a new anonymity model, called constrained k-anonymity, which preserves the k-anonymity requirement while specifying quasi-identifier generalization boundaries (or limits). Second, we describe an algorithm to transform a microdata set such that its corresponding masked microdata will comply with constrained k-anonymity. This algorithm relies on several properties stated and proved for the proposed privacy model.

The paper is organized as follows. Section 2 introduces basic data privacy concepts, and generalization and tuple suppression techniques as a means to achieve data privacy. Section 3 presents the new constrained k-anonymity model.
An anonymization algorithm to transform microdata to comply with constrained k-anonymity is described in Section 4. Section 5 contains comparative quality results, in terms of information loss and processing time, for our algorithm and one of the existing k-anonymization algorithms. The paper ends with future work directions and conclusions.

2 K-Anonymity, Generalization and Suppression

Let IM be the initial microdata and MM be the released (a.k.a. masked) microdata. The attributes characterizing IM are classified into the following three categories: identifier attributes, such as Name and SSN, that can be used to identify a record; key or quasi-identifier attributes, such as ZipCode and Age, that may be known by an intruder; and sensitive or confidential attributes, such as PrincipalDiagnosis and Income, that are assumed to be unknown to an intruder.

While the identifier attributes are removed from the published microdata, the quasi-identifier and confidential attributes are usually released to the researchers / analysts. A general assumption is that the values for the confidential attributes are not available from any external source. This assumption guarantees that an intruder cannot use the confidential attributes' values to increase his/her chances of disclosure, and, therefore, modifying this type of attribute values is unnecessary. Unfortunately, an intruder may use record linkage techniques [23] between quasi-identifier attributes and externally available information to glean the identity of individuals from the masked microdata. To avoid this possibility of disclosure, one frequently used solution is to modify the initial microdata, more specifically the quasi-identifier attributes' values, in order to enforce the k-anonymity property.

To rigorously and succinctly express the k-anonymity property, we use the following concept:

Definition 1 (QI-Cluster): Given a microdata set, a QI-cluster consists of all the tuples with an identical combination of quasi-identifier attribute values in that microdata.

There is no consensus in the literature over the term used to denote a QI-cluster. This term was not defined when k-anonymity was introduced [17, 18]. More recent papers use different terminologies such as equivalence class [24] and QI-group [26]. We define k-anonymity based on the minimum size of all QI-clusters.

Definition 2 (K-Anonymity Property): The k-anonymity property for a masked microdata MM is satisfied if every QI-cluster from MM contains k or more tuples.
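As a small illustration of Definitions 1 and 2 (our sketch; the data are made up), a k-anonymity check just groups tuples by their quasi-identifier combination and inspects the minimum group size:

    from collections import Counter

    def is_k_anonymous(tuples, qi_indices, k):
        # a QI-cluster is a group of tuples agreeing on every quasi-identifier value
        clusters = Counter(tuple(t[i] for i in qi_indices) for t in tuples)
        return all(size >= k for size in clusters.values())

    data = [("482**", 30), ("482**", 30), ("410**", 25), ("410**", 25), ("410**", 25)]
    print(is_k_anonymous(data, qi_indices=[0, 1], k=2))   # True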
A general method widely used for masking initial microdata to conform to the k-anonymity model is the generalization of the quasi-identifier attributes. Generalization consists in replacing the actual value of the attribute with a less specific, more general value that is faithful to the original [18]. Initially, this technique was used for categorical attributes and employed predefined domain and value generalization hierarchies [18]. Generalization was extended for numerical attributes either by using predefined hierarchies [7] or a hierarchy-free model [9]. To each categorical attribute a domain generalization hierarchy is associated. The values from different domains of this hierarchy are represented in a tree called a value generalization hierarchy. We illustrate domain and value generalization hierarchies in Figure 1 for the attributes ZipCode and Sex.

Figure 1: Examples of domain and value generalization hierarchies (for ZipCode: Z0 = {48201, 41075, 41076, 41088, 41099}, Z1 = {482**, 410**}, Z2 = {*****}; for Sex: S0 = {male (M), female (F)}, S1 = {*}).

There are several ways to perform generalization. Generalization that maps all values of a quasi-identifier categorical attribute to a more general domain in its domain generalization hierarchy is called full-domain generalization [9, 16]. Generalization can also map an attribute's values to different domains in its domain generalization hierarchy, each value being replaced by the same generalized value in the entire dataset [7]. The least restrictive generalization, called cell level generalization [11], extends the Iyengar model [7] by allowing the same value to be mapped to different generalized values, in distinct tuples.

Tuple suppression [16, 18] is the only other method used in this paper for masking the initial microdata. By eliminating entire tuples we are able to reduce the amount of generalization required for achieving the k-anonymity property in the remaining tuples. Since the constrained k-anonymity model uses generalization boundaries, for many initial microdata sets suppression has to be used in order to generate constrained k-anonymous masked microdata.

3 Constrained K-Anonymity

In order to specify a generalization boundary, we introduce the concept of a maximum allowed generalization value that is associated with each possible quasi-identifier attribute value from IM. This concept is used to express how far the owner of the data thinks that the quasi-identifier's values could be generalized, such that the resulting masked microdata would still be useful. Limiting the amount of generalization for quasi-identifier attribute values is a necessity for various uses of the data. The data owner is often aware of the way various researchers are using the data and, as a consequence, he/she is able to identify maximum allowed generalization values. For instance, when the released microdata is used to compute various statistical measures related to the US states, the data owner will select the states as maximal allowed generalization values. The desired protection level should be achieved with minimal changes to the initial microdata IM. However, minimal changes may cause generalization that surpasses the maximal allowed generalization values, and the masked microdata would become unusable. More changes are preferred in this situation if they do not contradict the generalization boundaries. At this stage, for simplicity, we use predefined hierarchies for both categorical and numerical quasi-identifier attributes when defining maximal allowed generalization values. Techniques to dynamically build hierarchies for numerical attributes exist in the literature [4] and we intend to use them in our future research.

Definition 3 (Maximum Allowed Generalization Value): Let Q be a quasi-identifier attribute (categorical or numerical), and H_Q its predefined value generalization hierarchy. For every leaf value v ∈ H_Q, the maximum allowed generalization value of v, denoted by MAGVal(v), is the value (leaf or non-leaf) in H_Q situated on the path from v to the root, such that: for any released microdata, the value v is permitted to be generalized only up to MAGVal(v); and, when several MAGVals exist on the path between v and the hierarchy root, MAGVal(v) is the first MAGVal that is reached when following the path from v to the root node.

Figure 2 contains an example of defining maximal allowed generalization values for a subset of values for the Location attribute. The MAGVals for the leaf values "San Diego" and "Lincoln" are "California" and, respectively, "Midwest" (the MAGVals are marked by * characters that delimit them). This means that the quasi-identifier Location's value "San Diego" may be generalized to itself or "California", but not to "West Coast" or "United States". Also, "Lincoln" may be generalized to itself, "Nebraska", or "Midwest", but not to "United States".

Figure 2: Examples of MAGVals (Location hierarchy: United States → West Coast → *California* → {San Diego, Los Angeles}; United States → *Midwest* → {*Kansas* → {Wichita, Kansas City}, Nebraska → {Lincoln}}; the MAGVals are delimited by * characters).
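Operationally (our sketch, not the paper's code), MAGVal lookup is a walk from a leaf toward the root that stops at the first value marked as a maximal allowed generalization value; the dict below transcribes the Figure 2 hierarchy in an encoding of our own choosing.

    parent = {"San Diego": "California", "Los Angeles": "California",
              "California": "West Coast", "West Coast": "United States",
              "Wichita": "Kansas", "Kansas City": "Kansas", "Kansas": "Midwest",
              "Lincoln": "Nebraska", "Nebraska": "Midwest", "Midwest": "United States"}
    marked = {"California", "Kansas", "Midwest"}   # the declared MAGVals

    def magval(leaf: str) -> str:
        v = leaf
        while v not in marked:    # the first marked value on the path wins
            v = parent[v]         # reaching the root would mean "no boundary set"
        return v

    print(magval("San Diego"), magval("Lincoln"))  # California Midwest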
The second requirement in the MAGVal's definition specifies that the hierarchy path between a leaf value v and MAGVal(v) can contain no node other than MAGVal(v) that is a maximum allowed generalization value. This restriction is imposed in order to avoid any ambiguity about the MAGVals of the leaf values in a sensitive attribute hierarchy. Note that several MAGVals may exist on a path between a leaf and the root as a result of defining MAGVals for other leaves within that hierarchy.

Definition 4 (Maximum Allowed Generalization Set): Let Q be a quasi-identifier attribute and H_Q its predefined value generalization hierarchy. The set of all MAGVals for attribute Q is called Q's maximum allowed generalization set, and it is denoted by MAGSet(Q) = { MAGVal(v) | ∀v ∈ leaves(H_Q) }, where leaves(H_Q) represents all the leaves of the H_Q value generalization hierarchy.

Given the hierarchy for the attribute Location presented in Figure 2, MAGSet(Location) = {California, Kansas, Midwest}. Usually, the data owner/user only has generalization restrictions for some of the quasi-identifiers in a microdata set that is to be masked. If for a particular quasi-identifier attribute Q there are no restrictions in respect to its generalization, then no maximal allowed generalization values are specified for Q's value hierarchy; in this case, each leaf value in H_Q is considered to have H_Q's root value as its maximal allowed generalization value.
Definition 5 (Constraint Violation): We say that the masked microdata MM has a constraint violation if one quasi-identifier value, v, in IM, is generalized in one tuple in MM beyond its specific maximal generalization value, MAGVal(v).

Definition 6 (Constrained K-Anonymity): The masked microdata MM satisfies the constrained k-anonymity property if it satisfies k-anonymity and it does not have any constraint violation.

We note that a k-anonymous masked microdata may have multiple constraint violations, but any masked microdata that satisfies the constrained k-anonymity property will not have any constraint violations; in other words, any quasi-identifier value, v, from the initial microdata will never be generalized beyond its MAGVal(v) in any constrained k-anonymous masked microdata.

Consider the following example. The initial microdata set IM in Figure 3 is characterized by the following attributes: Name and SSN are identifier attributes (to be removed from the masked microdata); Age, Location, Sex, and Race are the quasi-identifier attributes; and Diagnosis and Income are the sensitive attributes. The attribute Location's values and their MAGVals are described by Figure 2. The remaining quasi-identifier attributes do not have any generalization boundary requirements.

    Record  Name     SSN        Age  Location     Sex  Race  Diagnosis  Income
    r1      Alice    123456789  32   San Diego    M    W     AIDS       17,000
    r2      Bob      323232323  30   Los Angeles  M    W     Asthma     68,000
    r3      Charley  232345656  42   Wichita      M    W     Asthma     80,000
    r4      Dave     333333333  30   Kansas City  M    W     Asthma     55,000
    r5      Eva      666666666  35   Lincoln      F    W     Diabetes   23,000
    r6      John     214365879  20   Lincoln      M    B     Asthma     55,000
    r7      Casey    909090909  25   Wichita      F    B     Diabetes   23,000

    Figure 3: An initial microdata set

Figure 4 illustrates two possible masked microdata, MM1 and MM2, for the initial microdata IM. In this figure, only the quasi-identifier values are shown; the confidential attribute values are kept unchanged from the initial microdata (the Diagnosis and Income attributes from Figure 3).

    a) MM1                                  b) MM2
    Record  Age    Location    Sex  Race    Record  Age    Location    Sex  Race
    r1      30-32  California  M    W       r1      30-32  California  M    W
    r2      30-32  California  M    W       r2      30-32  California  M    W
    r3      30-42  MidWest     *    W       r3      25-42  Kansas      *    *
    r4      30-42  MidWest     *    W       r4      25-42  Kansas      *    *
    r5      30-42  MidWest     *    W       r7      25-42  Kansas      *    *
    r6      20-25  MidWest     *    B       r5      20-35  Lincoln     *    *
    r7      20-25  MidWest     *    B       r6      20-35  Lincoln     *    *

    Figure 4: Two masked microdata sets (only the quasi-identifier attribute values are shown)

The first masked microdata, MM1, satisfies 2-anonymity, but contradicts constrained 2-anonymity w.r.t. the Location attribute's maximal allowed generalization. On the other hand, the second microdata set, MM2, satisfies constrained 2-anonymity: every QI-cluster consists of at least 2 tuples, and none of the Location attribute's initial values is generalized beyond its MAGVal.

4 GreedyCKA - An Algorithm for Constrained K-Anonymization

In this section we assume that the initial microdata set IM, the generalization boundaries for its quasi-identifier attributes, expressed as MAGVals in their corresponding hierarchies, and the k value (as in k-anonymity) are given. First, we will describe a method to decide if IM can be masked to comply with constrained k-anonymity using generalization only, and second, we will introduce an algorithm for achieving constrained k-anonymity.

Our approach to constrained k-anonymization partially follows an idea found in [1] and [2], which consists in modeling and solving k-anonymization as a clustering problem. Basically, the algorithm takes an initial microdata set and establishes a "good" partitioning of it into clusters. The released microdata set is afterwards formed by generalizing the quasi-identifier attributes' values of all tuples inside each cluster to the same values (called the generalization information for a cluster). However, it is not always possible to mask an initial microdata to satisfy constrained k-anonymity only by generalization. Sometimes a solution to constrained k-anonymization has to combine generalization with suppression. In this case, our algorithm suppresses the minimal set of tuples from IM such that it is possible to build a constrained k-anonymous masked microdata for the remaining tuples. The constrained k-anonymization by clustering problem can be formally stated as follows.
Definition 7 (Constrained K-Anonymization by Clustering Problem): Given a microdata IM, the constrained k-anonymization by clustering problem for IM is to find a partition P = {cl_1, cl_2, ..., cl_v, cl_{v+1}} of IM, where cl_j ⊆ IM, j = 1..v+1, are called clusters and:

\bigcup_{j=1}^{v+1} cl_j = IM;   cl_i \cap cl_j = \emptyset, \; i, j = 1..v{+}1, \; i \neq j;   |cl_j| \geq k, \; j = 1..v;

and a cost measure is optimized.

The cluster cl_{v+1} is formed of all the tuples in IM that have to be suppressed in MM, and the tuples within every cluster cl_j, j = 1..v, will have their quasi-identifier attributes generalized in MM to common values. The generalization information of a cluster, which is introduced next, represents the minimal covering "tuple" for that cluster. Since in this paper we use predefined value generalization hierarchies for both categorical and numerical attributes, we do not have to consider a definition that distinguishes between these two types of attributes [21].

Definition 8 (Generalization Information): Let cl = {r_1, r_2, ..., r_u} be a cluster of tuples selected from IM, and QI = {Q_1, Q_2, ..., Q_s} be the set of quasi-identifier attributes. The generalization information of cl w.r.t. the quasi-identifier attribute set QI is the "tuple" gen(cl), having the scheme QI, where for each attribute Q_j ∈ QI, gen(cl)[Q_j] is the lowest common ancestor in H_{Q_j} of {r_1[Q_j], ..., r_u[Q_j]}.

For the cluster cl, its generalization information gen(cl) is the tuple having as its value, for each quasi-identifier attribute, the most specific common generalized value for all of that attribute's values from cl's tuples. In the corresponding MM, each tuple from the cluster cl will have its quasi-identifier attribute values replaced by gen(cl).
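Definition 8's per-attribute computation is the familiar lowest-common-ancestor operation on the value hierarchy; a minimal sketch (ours, reusing Figure 2's Location hierarchy in a dict encoding of our own):

    parent = {"San Diego": "California", "Los Angeles": "California",
              "California": "West Coast", "West Coast": "United States",
              "Wichita": "Kansas", "Kansas City": "Kansas", "Kansas": "Midwest",
              "Lincoln": "Nebraska", "Nebraska": "Midwest", "Midwest": "United States"}

    def path_to_root(v):
        p = [v]
        while p[-1] in parent:
            p.append(parent[p[-1]])
        return p

    def gen_value(values):
        # lowest common ancestor = first node of one path lying on all the paths
        common = set.intersection(*(set(path_to_root(v)) for v in values))
        return next(a for a in path_to_root(values[0]) if a in common)

    print(gen_value(["Wichita", "Kansas City"]))   # Kansas
    print(gen_value(["Wichita", "Lincoln"]))       # Midwest
    print(gen_value(["San Diego", "Wichita"]))     # United States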
To decide whether an initial microdata can be masked to satisfy the constrained k-anonymity property using generalization only, we introduce several properties. These properties will also allow us, in case constrained k-anonymity cannot be achieved using generalization only, to select the tuples that must be suppressed.

Property 1: Let IM be a microdata set and cl a cluster of tuples from IM. If cl contains two tuples r_i and r_j such that MAGVal(r_i[Q]) ≠ MAGVal(r_j[Q]), where Q is a quasi-identifier attribute, then the generalization of the tuples from cl to gen(cl) will create at least one constraint violation.

Proof. Assume that there are two tuples r_i and r_j within cl such that MAGVal(v_i) ≠ MAGVal(v_j), where v_i = r_i[Q] and v_j = r_j[Q], v_i, v_j ∈ leaves(H_Q). Let a be the value within H_Q that is the first common ancestor of MAGVal(v_i) and MAGVal(v_j). Depending on how MAGVal(v_i) and MAGVal(v_j) are located relative to one another in Q's value generalization hierarchy, a can be one of them, or a value on a superior tree level. In any case, a will be different from, and an ancestor of, at least one of MAGVal(v_i) and MAGVal(v_j). This is a consequence of the fact that MAGVal(v_i) ≠ MAGVal(v_j): a common ancestor of two different nodes x and y in a tree is a node which is different from at least one of the nodes x and y. Because of this fact, when cl is generalized to gen(cl), gen(cl)[Q] will be a (or, depending on the other tuples in cl, even an ancestor of a) – therefore at least one of the values v_i and v_j will be generalized further than its maximal allowed generalization value, leading to a constraint violation. // q.e.d.

Property 1 restricts the possible solutions of the constrained k-anonymization by clustering problem to those partitions of IM for which every cluster to be generalized does not show any constraint violation w.r.t. each of the quasi-identifier attributes. The following definition introduces a masked microdata that will help us express when IM can be transformed to satisfy constrained k-anonymity using generalization only.

Definition 9 (Maximum Allowed Microdata): The maximum allowed microdata MAM for a microdata IM is the masked microdata where every quasi-identifier value, v, in IM is generalized to MAGVal(v).

Property 2: For a given IM, if its maximum allowed microdata MAM is not k-anonymous, then any masked microdata obtained from IM by applying generalization only will not satisfy constrained k-anonymity.

Proof. Assume that MAM is not k-anonymous, and that there is a masked microdata MM that satisfies constrained k-anonymity. This means that every QI-cluster from MM has at least k elements and it does not have any constraint violation. Let cl_i be a cluster of elements from IM that is generalized to a QI-cluster in MM (i = 1, ..., v). Because MM satisfies constrained k-anonymity, the generalization of cl_i to gen(cl_i) does not create any constraint violation. Based on Property 1, for each quasi-identifier attribute, all entities from cl_i share the same MAGVal. As a consequence, by generalizing all quasi-identifier attribute values to their corresponding MAGVals (this is the procedure to create the MAM microdata), all entities from the cluster cl_i (for all i = 1, ..., v) will be contained within the same QI-cluster. This means that each QI-cluster in MAM contains one or more QI-clusters from MM and its size will, then, be at least k. In conclusion, MAM is k-anonymous, which is a contradiction with our initial assumption. // q.e.d.

Property 3: If MAM satisfies k-anonymity, then MAM satisfies the constrained k-anonymity property.

Proof. This follows from the definition of MAM.

Property 4: An initial microdata IM can be masked to comply with constrained k-anonymity using only generalization if and only if its corresponding MAM satisfies k-anonymity.

Proof. "If": If MAM satisfies k-anonymity, then based on Property 3, MAM is also constrained k-anonymous, and IM can be masked to MAM (in the worst case – or even to a less generalized masked microdata) to comply with constrained k-anonymity. "Only if": If MAM does not satisfy k-anonymity, then based on Property 2, any masked microdata obtained by applying generalization only to IM will not satisfy constrained k-anonymity. // q.e.d.

Now we have all the tools required to check whether an initial microdata can be masked to satisfy the constrained k-anonymity property using generalization only. We follow the next two steps. First, compute MAM for IM; this is done by replacing each quasi-identifier attribute value with its corresponding MAGVal. If all QI-clusters from MAM have at least k entities, then IM can be masked to satisfy constrained k-anonymity. Second, it is very likely that there are some QI-clusters in MAM with size less than k. We use the notation S to represent all entities from these QI-clusters (for simplicity we use the same notation to refer to entities from both IM and MAM). Unfortunately, the entities from S cannot be k-anonymized while preserving the constraint condition, as shown by Property 6. For a given IM with its corresponding MAM and S sets, the following two properties hold:

Property 5: IM \ S can be masked using generalization only to comply with constrained k-anonymity.

Proof. By the definition of the S set, all QI-clusters from MAM \ S have size k or more, which means that MAM \ S satisfies the k-anonymity property. Based on Property 4 (MAM \ S is the maximum allowed microdata for IM \ S), IM \ S can be masked using generalization only to comply with constrained k-anonymity. // q.e.d.

Property 6: Any subset of IM that contains one or more entities from S cannot be masked using generalization only to achieve constrained k-anonymity.

Proof. We assume that there is an initial microdata IM′, a subset of IM, that contains one or more entities from S and that can be masked using generalization only to comply with constrained k-anonymity. Let x ∈ IM′ ∩ S, and let MAM′ be the maximum allowed microdata for IM′. Based on Property 4, if IM′ can be masked to be constrained k-anonymous, then MAM′ is k-anonymous, and therefore x will belong to a QI-cluster with size at least k in MAM′. By construction IM′ is a subset of IM, and therefore the size of each QI-cluster from MAM is equal to or greater than the size of the corresponding QI-cluster from MAM′. This means that x will belong to a QI-cluster with size at least k in MAM. This is a contradiction with x ∈ S. // q.e.d.

Properties 5 and 6 show that S is the minimal tuple set that must be suppressed from IM such that the remaining set can be constrained k-anonymized. To compute a constrained k-anonymous masked microdata using minimum suppression and generalization only, we follow an idea found in [1] and [2], which consists in modeling and solving k-anonymization as a clustering problem. First, we suppress all tuples from the S set. Next, we create all QI-clusters in the maximum allowed microdata for IM \ S. Last, each such cluster is divided further, if possible, using the clustering approach from [1, 2], into several clusters, all with size greater than or equal to k. This approach uses a greedy technique that tries to optimize an information loss (IL) measure.
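The feasibility test behind Properties 5 and 6 is mechanical; a sketch (ours, with a toy MAGVal map): build MAM by replacing each quasi-identifier value with its MAGVal, then collect the tuples falling in QI-clusters of size less than k — these form S.

    from collections import defaultdict

    def suppression_set(tuples, magval_per_attr, k):
        # group tuples by their MAGVal-generalized quasi-identifier combination
        clusters = defaultdict(list)
        for t in tuples:
            key = tuple(magval_per_attr[j](v) for j, v in enumerate(t))
            clusters[key].append(t)
        # S = all tuples in undersized QI-clusters of the maximum allowed microdata
        return [t for cl in clusters.values() if len(cl) < k for t in cl]

    loc_mag = {"San Diego": "California", "Los Angeles": "California",
               "Wichita": "Kansas", "Kansas City": "Kansas", "Lincoln": "Midwest"}
    rows = [("San Diego",), ("Los Angeles",), ("Lincoln",)]
    print(suppression_set(rows, [loc_mag.get], k=2))   # [('Lincoln',)]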
The information loss measure we use in our algorithm implementation was introduced in [2]. We present it in Definitions 10 and 11. Note that this IL definition assumes that value generalization hierarchies are predefined for all quasi-identifier attributes.

Definition 10 (Cluster Information Loss): Let cl ∈ P be a cluster, gen(cl) its generalization information, and QI = {Q_1, Q_2, ..., Q_t} the set of quasi-identifier attributes. The cluster information loss caused by generalizing the cl tuples to gen(cl) is:

IL(cl) = |cl| \cdot \sum_{j=1}^{t} \frac{height(\Lambda(gen(cl)[Q_j]))}{height(H_{Q_j})}

where |cl| denotes the cluster cl's cardinality; Λ(w), w ∈ H_{Q_j}, is the subhierarchy of H_{Q_j} rooted in w; and height(H_{Q_j}) is the height of the tree hierarchy H_{Q_j}.

Definition 11 (Total Information Loss): The total information loss for a partition P of the initial microdata set is the sum of the information loss measure over all clusters in P.

It is worth noting that, for the constrained k-anonymization by clustering problem, the cluster of tuples to be suppressed, cl_{v+1}, will have the maximum possible IL value for a cluster of the same size as cl_{v+1}. The information loss for this cluster will be IL(cl_{v+1}) = |cl_{v+1}| · n, where n is the number of quasi-identifier attributes. When performing experiments to compare the quality of constrained k-anonymous microdata and k-anonymous microdata produced for the same IM, the information loss of the constrained k-anonymous solution includes the information loss caused by the suppressed cluster as well, and not only that of the generalized clusters. More than that, for every suppressed tuple we consider the maximum information loss that it can cause when it is masked. This way, the quality of the constrained k-anonymous solutions will not be biased because of a favored way of computing information loss for the suppressed tuples.
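Numerically (our toy heights, purely illustrative), Definition 10 weighs, per attribute, how tall a subhierarchy the cluster was generalized to, relative to the full hierarchy height, summed and scaled by the cluster size:

    def il(cluster_size, gen_values, subtree_height, full_height):
        # Definition 10: |cl| * sum_j height(subtree at gen(cl)[Qj]) / height(H_Qj)
        return cluster_size * sum(subtree_height[v] / full_height[q]
                                  for q, v in gen_values.items())

    print(il(3, {"Location": "Midwest", "Sex": "*"},
             subtree_height={"Midwest": 2, "*": 1},
             full_height={"Location": 3, "Sex": 1}))   # 3 * (2/3 + 1/1) = 5.0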
The two-stage constrained k-anonymization algorithm, called GreedyCKA, is depicted in Figure 5. We present below the pseudocode of the GreedyCKA algorithm:

    Algorithm GreedyCKA is
      Input: IM - the initial microdata; k - as in k-anonymity;
      Output: P = {cl_1, cl_2, ..., cl_v, cl_{v+1}} - a solution for the
        constrained k-anonymization by clustering problem for IM;

      Compute MAM and S;
      P = ∅;
      For each QI-cluster cl from MAM \ S do
        // By cl we refer to the entities from IM
        // that are clustered together in MAM.
        P' = Greedy_k-member_Clustering(cl, k);  // [2]
        P = P ∪ P';
      End for;
      v = |P|;
      cl_{v+1} = S;
    End GreedyCKA;

Figure 5: The two-stage process in creating constrained k-anonymous masked microdata (stage 1 forms the QI-clusters of MAM together with the set of suppressed tuples; stage 2 divides each QI-cluster into final clusters of size at least k; the original illustration uses k = 3).

This idea of dividing IM into clusters based on common MAGVals of the quasi-identifiers can be employed for other privacy models as well, not only for k-anonymity. For instance, if we use an algorithm that creates a p-sensitive k-anonymous masked microdata [20], such as EnhancedPKClustering [22], we just need to execute that algorithm instead of Greedy_k-member_Clustering for each QI-cluster from MAM \ S. The obtained masked microdata will be p-sensitive k-anonymous and will satisfy the generalization boundaries. We can define this new privacy model as constrained p-sensitive k-anonymity. Using similar modifications in the GreedyCKA algorithm, we can introduce constrained versions of other privacy models, such as constrained l-diversity [12], constrained t-closeness [10], etc., and generate their corresponding masked microdata sets.

5 Experimental Results

In this section we compare the GreedyCKA and Greedy_k-member_Clustering [2] algorithms with respect to: the quality of the results they produce, measured against the information loss measure; the algorithms' efficiency, as expressed by their running time; the number of constraint violations that k-anonymous masked microdata produced by Greedy_k-member_Clustering have; and the suppression amount performed by GreedyCKA in order to produce constrained k-anonymous masked microdata in the presence of different constraint sets.

The two algorithms were implemented in Java; tests were executed on a dual CPU machine with 3.00 GHz and 1 GB of RAM, running Windows 2003 Server. A set of experiments was performed for an IM consisting of 10,000 tuples randomly selected from the Adult dataset from the UC Irvine Machine Learning Repository [15]. In all the experiments, we considered a set of eight quasi-identifier attributes: education-num, workclass, marital-status, occupation, race, sex, age, and native-country.

The GreedyCKA and Greedy_k-member_Clustering algorithms were applied to this microdata set, for different k values, from k = 2 to k = 10. Two different generalization constraint sets were successively considered for every k value. First, only the native-country attribute's values were subject to generalization constraints, as depicted in Figure 6. Second, both native-country and age had generalization boundaries; the value generalization hierarchy and the maximum allowed generalization values for the age attribute are illustrated in Figure 7. In Figures 6 and 7, the MAGVals are depicted in bold and delimited by * characters. Of course, Greedy_k-member_Clustering proceeded without taking into consideration the generalization boundaries, as it is a "simple", unconstrained k-anonymization algorithm. This is why the masked microdata it produces will generally contain numerous constraint violations. On the other side, the k-anonymization process of GreedyCKA is conducted in respect to the specified generalization boundaries; this is why the masked microdata produced by GreedyCKA is free of constraint violations. The quasi-identifier attributes without generalization boundaries have the following heights for their corresponding value generalization hierarchies: education-num – 4, workclass – 1, marital-status – 2, occupation – 1, race – 1, and sex – 1.

However, masking microdata to comply with the more restrictive constrained k-anonymity model sometimes comes with a price.
5 Experimental Results

In this section we compare the GreedyCKA and Greedy_k-member_Clustering [2] algorithms with respect to: the quality of the results they produce, measured by the information loss measure; the algorithms' efficiency, as expressed by their running time; the number of constraint violations that the k-anonymous masked microdata produced by Greedy_k-member_Clustering contain; and the amount of suppression performed by GreedyCKA in order to produce constrained k-anonymous masked microdata in the presence of different constraint sets.

The two algorithms were implemented in Java; tests were executed on a dual-CPU 3.00 GHz machine with 1 GB of RAM, running Windows 2003 Server. A set of experiments was performed on an initial microdata set consisting of 10,000 tuples randomly selected from the Adult dataset of the UC Irvine Machine Learning Repository [15]. In all the experiments, we considered a set of eight quasi-identifier attributes: education-num, workclass, marital-status, occupation, race, sex, age, and native-country.

The GreedyCKA and Greedy_k-member_Clustering algorithms were applied to this microdata set for different k values, from k = 2 to k = 10. Two different generalization constraint sets were successively considered for every k value. First, only the native-country attribute's values were subject to generalization constraints, as depicted in Figure 6. Second, both native-country and age had generalization boundaries; the value generalization hierarchy and the maximal allowed generalization values for the age attribute are illustrated in Figure 7. In Figures 6 and 7, the MAGVals are depicted in bold and delimited by * characters.

Figure 6: The value generalization hierarchy and MAGVals for the quasi-identifier attribute Country.

Figure 7: The value generalization hierarchy and MAGVals for the quasi-identifier attribute Age.

Of course, Greedy_k-member_Clustering proceeds without taking the generalization boundaries into consideration, as it is a "simple", unconstrained k-anonymization algorithm; this is why the masked microdata it produces will generally contain numerous constraint violations. On the other hand, the k-anonymization process of GreedyCKA is conducted with respect to the specified generalization boundaries; this is why the masked microdata produced by GreedyCKA is free of constraint violations. The quasi-identifier attributes without generalization boundaries have the following heights for their corresponding value generalization hierarchies: education-num – 4, workclass – 1, marital-status – 2, occupation – 1, race – 1, and sex – 1.

However, masking microdata to comply with the more restrictive constrained k-anonymity model sometimes comes with a price. As the experiments show, it is possible to lose more of the intrinsic microdata information when masking it to satisfy constrained k-anonymity than when masking it to satisfy k-anonymity only. Figure 8 compares the information loss measure for the masked microdata created by GreedyCKA and Greedy_k-member_Clustering, with the two different constraint sets and for k values in the range 2–10. As expected, the information loss value is generally greater when constraints are considered in the k-anonymization process. Exceptions may, however, occur: for example, GreedyCKA obtained better results than Greedy_k-member_Clustering for k = 8, 9, and 10 when only native_country was constrained. The information lost is influenced, of course, by the constraint requirements and by the microdata distribution with respect to the constrained attributes. When more quasi-identifiers have generalization boundaries, or when the generalization boundaries are more restrictive, the information lost in the constrained k-anonymization process will generally increase.

Figure 8: Information loss (IL) for GreedyCKA (with native_country constrained, and with native_country and age constrained) and Greedy_k-member_Clustering.

Regarding the running time, GreedyCKA is always more efficient than Greedy_k-member_Clustering. The explanation for this fact is that, when generalization boundaries are imposed, they cause the initial microdata to be divided into several subsets (the QI-clusters of the maximally generalized microdata), on which Greedy_k-member_Clustering is afterwards applied. Greedy_k-member_Clustering has O(n²) complexity, and applying it to smaller microdata subsets reduces the processing time. The more constraints and QI-clusters there are, the more significant the reduction of the processing time for microdata masking (see Figure 9).

Figure 9: Running time for GreedyCKA (with native_country constrained, and with native_country and age constrained) and Greedy_k-member_Clustering.

Table 2 shows the number of tuples suppressed by GreedyCKA while masking the initial microdata. All in all, our experiments showed that constrained k-anonymous masked microdata can be achieved without sacrificing data quality to a significant extent, compared to the corresponding unconstrained k-anonymous masked microdata. While the constrained k-anonymity model responds to a necessity in real-life applications, the existing k-anonymization algorithms are not able to build masked microdata that comply with it. In this context, GreedyCKA takes optimal suppression decisions, based on the proved properties of the new model (Properties 5 and 6), and builds high-quality constrained k-anonymous masked microdata. As pointed out, when Greedy_k-member_Clustering is applied to k-anonymize the initial microdata, the resulting masked microdata usually contains numerous constraint violations.
Table 1 reports the number of constraint violations in the outcome of the Greedy_k-member_Clustering unconstrained k-anonymization algorithm, for the two maximal generalization requirement sets.

k                                          2     3     4     5     6     7     8     9    10
Violations, 1 constrained attribute
(native_country)                         605   991  1377  1657  1906  2198  2354  2550  2728
Violations, 2 constrained attributes
(native_country, age)                   2209  3824  5297  6163  6964  7743  8417  8931  9549

Table 1: Constraint violations in Greedy_k-member_Clustering.

k                                          2     3     4     5     6     7     8     9    10
Suppressed tuples, 1 constrained attribute
(native_country)                           0     0     0     0     0     0     0     0     0
Suppressed tuples, 2 constrained attributes
(native_country, age)                      5    15    24    28    48    60    81    97   106

Table 2: Number of tuples suppressed by GreedyCKA.

6 Conclusions and Future Work

In this paper we defined a new privacy model, called constrained k-anonymity, which takes into consideration generalization boundaries imposed by the data owner for quasi-identifier attributes. Based on the model properties, an efficient algorithm to generate a masked microdata complying with the constrained k-anonymity property was introduced. Our experiments showed that the proposed algorithm obtains information loss values comparable to those of the Greedy_k-member_Clustering algorithm, while the masked microdata sets obtained by the latter contain many constraint violations.

In this paper we used predefined hierarchies for all quasi-identifier attributes. As future work we plan to extend this concept further for numerical attributes. We plan to provide a technique to dynamically determine, for each numerical quasi-identifier value, its maximal allowed generalization, based on that attribute's values in the analyzed microdata and minimal user input. We also pointed out that the constrained k-anonymity property, and even our proposed algorithm, GreedyCKA, can be extended to other privacy models (models such as constrained l-diversity, constrained (α, k)-anonymity, constrained p-sensitive k-anonymity, etc. can easily be defined). Finding specific properties of these enhanced privacy models, and developing improved algorithms to generate masked microdata complying with such models, are subjects of future work.

Acknowledgments

This work was partially supported by the CNCSIS (Romanian National University Research Council) grant PNCDI-PN II, IDEI 550/2007.

References

[1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu, Achieving Anonymity via Clustering, in Proc. of the ACM PODS (2006), pp. 153–162.
[2] J. W. Byun, A. Kamra, E. Bertino, and N. Li, Efficient k-Anonymization Using Clustering Techniques, in Proc. of DASFAA (2006), pp. 188–200.
[3] A. Campan, T. M. Truta, Extended P-Sensitive K-Anonymity, Studia Universitatis Babes-Bolyai, Informatica, Vol. 51, No. 2 (2006), pp. 19–30.
[4] B. C. M. Fung, K. Wang, and P. S. Yu, Anonymizing Classification Data for Privacy Preservation, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 5 (2007), pp. 711–725.
[5] G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis, Fast Data Anonymization with Low Information Loss, in Proc. of VLDB (2007), pp. 758–769.
[6] HIPAA, Health Insurance Portability and Accountability Act, www.hhs.gov/ocr/hipaa, 2002.
[7] V. Iyengar, Transforming Data to Satisfy Privacy Constraints, in Proc. of the ACM SIGKDD (2002), pp. 279–288.
[8] K. LeFevre, D. DeWitt, and R. Ramakrishnan, Incognito: Efficient Full-Domain K-Anonymity, in Proc. of the ACM SIGMOD (2005), pp. 49–60.
[9] K. LeFevre, D. DeWitt, and R. Ramakrishnan, Mondrian Multidimensional K-Anonymity, in Proc. of the IEEE ICDE (2006), p. 25.
[10] N. Li, T. Li, and S. Venkatasubramanian, T-Closeness: Privacy Beyond k-Anonymity and l-Diversity, in Proc. of the IEEE ICDE (2007), pp. 106–115.
[11] M. Lunacek, D. Whitley, and I. Ray, A Crossover Operator for the k-Anonymity Problem, in Proc. of GECCO (2006), pp. 1713–1720.
[12] A. Machanavajjhala, J. Gehrke, and D. Kifer, L-Diversity: Privacy Beyond K-Anonymity, in Proc. of the IEEE ICDE (2006), p. 24.
[13] D. J. Martin, D. Kifer, A. Machanavajjhala, and J. Gehrke, Worst-Case Background Knowledge for Privacy-Preserving Data Publishing, in Proc. of the IEEE ICDE (2007), pp. 126–135.
[14] MSNBC, Privacy Lost, www.msnbc.msn.com/id/15157222, 2006.
[15] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, UCI Repository of Machine Learning Databases, www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[16] P. Samarati, Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6 (2001), pp. 1010–1027.
[17] L. Sweeney, k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, Vol. 10, No. 5 (2002), pp. 557–570.
[18] L. Sweeney, Achieving k-Anonymity Privacy Protection Using Generalization and Suppression, International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, Vol. 10, No. 5 (2002), pp. 571–588.
[19] T. M. Truta, F. Fotouhi, and D. Barth-Jones, Privacy and Confidentiality Management for the Microaggregation Disclosure Control Method, in Proc. of the PES Workshop, with ACM CCS (2003), pp. 21–30.
[20] T. M. Truta, V. Bindu, Privacy Protection: P-Sensitive K-Anonymity Property, in Proc. of the PDM Workshop, with IEEE ICDE (2006), p. 94.
[21] T. M. Truta, A. Campan, K-Anonymization Incremental Maintenance and Optimization Techniques, in Proc. of the ACM SAC (2007), pp. 380–387.
[22] T. M. Truta, A. Campan, P. Meyer, Generating Microdata with P-Sensitive K-Anonymity Property, in Proc. of the SDM Workshop, with VLDB (2007), pp. 124–141.
[23] W. Winkler, Matching and Record Linkage, Business Survey Methods, Wiley (1995), pp. 374–403.
[24] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang, (α, k)-Anonymity: An Enhanced k-Anonymity Model for Privacy-Preserving Data Publishing, in Proc. of the ACM SIGKDD (2006), pp. 754–759.
[25] R. C. W. Wong, J. Li, A. W. C. Fu, and J. Pei, Minimality Attack in Privacy-Preserving Data Publishing, in Proc. of VLDB (2007), pp. 543–554.
[26] X. Xiao, Y. Tao, Personalized Privacy Preservation, in Proc. of the ACM SIGMOD (2006), pp. 229–240.
[27] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. Fu, Utility-Based Anonymization Using Local Recoding, in Proc. of the ACM SIGKDD (2006), pp. 785–790.
[28] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu, Aggregate Query Answering on Anonymized Tables, in Proc. of the IEEE ICDE (2007), pp. 116–125.

Privacy-Preserving Predictive Models for Lung Cancer Survival Analysis

Glenn Fung¹, Shipeng Yu¹, Cary Dehing-Oberije², Dirk De Ruysscher², Philippe Lambin², Sriram Krishnan¹, R. Rao Bharat¹
¹ CAD and Knowledge Solutions, Siemens Medical Solutions USA, Inc., Malvern, PA, USA.
² MAASTRO clinic, the Netherlands.

Abstract

Privacy-preserving data mining (PPDM) is a recent, emergent research area that deals with the incorporation of privacy concerns into data mining techniques. We consider a real clinical setting where the data is horizontally distributed among different institutions.
Each one of the medical institutions involved in this work provides a database containing a subset of patients. There is recent work that shows the potential of the PPDM approach in medical applications; however, there is little work on developing and implementing PPDM for predictive personalized medicine. In this paper we use real data from several institutions across Europe to build models for survival prediction for non-small-cell lung cancer patients, while addressing the potential privacy issues that may arise when sharing data across institutions located in different countries. Our experiments in a real clinical setting show that the privacy preserving approach may result in improved models while avoiding the burdens of traditional data sharing (legal and/or anonymization expenses).

1 Introduction

Privacy-preserving data mining (PPDM) is a recent, emergent research area that deals with the incorporation of privacy concerns into data mining techniques. We are particularly interested in the scenario where the data is horizontally distributed among different institutions. In the medical domain this means that each medical institution (hospitals, clinics, etc.) provides a database containing a complete (or almost complete) subset of item sets (patients). An efficient PPDM algorithm should be able to process the data from all the sources and learn data mining/machine learning models that take into account all the information available, without explicitly sharing private information among the sources. The ultimate goal of a PPDM model is to perform similarly or identically to a model learned by having access to all the data at the same time.

There has been a push for the incorporation of electronic health records (EHR) in medical institutions worldwide. There seems to be a consensus that the availability of EHR will have several significant benefits for health systems across the world, including improvement of the quality of care by tracking performance on clinical measures, better and more accurate insurance reimbursement, and computer-assisted diagnosis (CAD) tools. As a result, a constantly increasing number of hospitals are saving huge amounts of data that can be used to build predictive models to assist doctors in the medical decision process for treatment, diagnosis, and prognosis, among others. However, sharing the data across institutions is a difficult and tedious process that also places a considerable legal and economic burden on the institutions sharing the medical data.

In this paper we explore two privacy preserving techniques applied to learn survival predictive models for non-small-cell lung cancer patients treated with (chemo)radiotherapy. We use real data collected from patients treated at three European institutions in two different countries (the Netherlands and Belgium) to build our models. The framework described in this paper allows us to design and learn improved predictive models that perform better than the individual models obtained using local data from only one institution, while avoiding the local and international privacy concerns that arise when sharing patient-related data. As far as we know, there is no previous work related to learning survival models for lung cancer radiation therapy that addresses PP concerns.

The rest of the paper is organized as follows: in the next section, we introduce the notation used in the paper.
In section 3 we present an overview of the related work. In sections 4.1 and 4.3 we present an overview of the two methods used for our predictive models: Newton-Lagrangian Support Vector Machines [5] and Cox Regression [3]. Later, in sections 4.2 and 4.4, we present the technical details of the corresponding privacy preserving (PP) algorithms. We conclude the paper by describing our application, with experimental results obtained in a real clinical setting, and the conclusions.

2 Notation

We describe our notation now. All vectors are column vectors unless transposed to a row vector by a prime ′. For a vector x ∈ R^n, the notation x_j signifies either the j-th component or the j-th block of components. The scalar (inner) product of two vectors x and y in the n-dimensional real space R^n is denoted by x′y. The notation A ∈ R^{m×n} signifies a real m×n matrix; for such a matrix, A′ denotes the transpose of A, and A_i denotes the i-th row or i-th block of rows of A. A vector of ones in a real space of arbitrary dimension is denoted by e; thus for e ∈ R^m and y ∈ R^m, the notation e′y denotes the sum of the components of y. A vector of zeros in a real space of arbitrary dimension is denoted by 0. For A ∈ R^{m×n} and B ∈ R^{k×n}, a kernel K(A, B′) maps R^{m×n} × R^{n×k} into R^{m×k}. In particular, if x and y are column vectors in R^n, then K(x′, y) is a real number, K(x′, B′) is a row vector in R^k, and K(A, B′) is an m×k matrix. The abbreviation "s.t." stands for "subject to".

3 Related Work

As a consequence of the recent advances in network computing, there has been great interest in privacy-preserving data mining techniques. An extensive review of PPDM techniques can be found in [14]. Most of the available data mining techniques require and assume complete access to all data at all times. This may not be true, for example, in a decentralized, distributed medical setting where, for each data source or institution, local procedures are in place to enforce privacy and security of the data. If this is the case, there is a need for efficient data mining and machine learning techniques that can use data across institutions while complying with the non-disclosure nature of the available data. There are two main kinds of data partitioning in distributed settings where PPDM is needed. In the first, the data is partitioned vertically, meaning that all institutions have some subset of features (predictors, variables) for all the available patients; several techniques have been proposed to address this case, including adding random perturbations to the data [2, 4]. The other popular PPDM setting occurs when the data is partitioned horizontally among institutions, meaning that different entities hold the same input features for different groups of individuals. This case has been addressed in [16, 15] by privacy-preserving SVMs and induction tree classifiers. There are several other recently proposed privacy preserving classification techniques, including cryptographically private SVMs [7] and wavelet-based distortion [10]. There is recent work that shows the potential of the approach in medical settings [6, 12]. However, there is little work on developing and implementing PPDM for predictive personalized medicine.

4 Privacy-Preserving Predictive Models (PPPM)

In this section we introduce two PP predictive models, namely PP Support Vector Machines and PP Cox Regression. We first give an overview of the two techniques in sections 4.1 and 4.3, and then present the PP versions in sections 4.2 and 4.4.

4.1 Overview of Support Vector Machines. We describe in this section the fundamental classification problem that leads to the standard quadratic support vector machine (SVM) formulation, which minimizes a quadratic convex function. We consider the problem of classifying m points in the n-dimensional real space R^n, represented by the m×n matrix A, according to the membership of each point A_i in the classes +1 or −1, as specified by a given m×m diagonal matrix D with ones or minus ones along its diagonal. For this problem, the standard support vector machine with a linear kernel AA′ [13] is given by the following quadratic program, for some ν > 0:

(4.1)  min_{(w,γ,y) ∈ R^{n+1+m}}  νe′y + (1/2)w′w
       s.t.  D(Aw − eγ) + y ≥ e,  y ≥ 0.

As depicted in Figure 1, w is the normal to the bounding planes:

(4.2)  x′w − γ = +1
       x′w − γ = −1,

and γ determines their location relative to the origin. The first plane bounds the class +1 points and the second plane bounds the class −1 points when the two classes are strictly linearly separable, that is, when the slack variable y = 0. The linear separating surface is the plane

(4.3)  x′w = γ,

midway between the bounding planes (4.2). If the classes are linearly inseparable, the two planes bound the two classes with a "soft margin" determined by a nonnegative slack variable y, that is:

(4.4)  x′w − γ + y_i ≥ +1, for x′ = A_i and D_ii = +1,
       x′w − γ − y_i ≤ −1, for x′ = A_i and D_ii = −1.

The 1-norm of the slack variable y is minimized with weight ν in (4.1). The quadratic term in (4.1), which is twice the reciprocal of the square of the 2-norm distance 2/‖w‖ between the two bounding planes of (4.2) in the n-dimensional space of w ∈ R^n for a fixed γ, maximizes that distance, often called the "margin". Figure 1 depicts the points represented by A, the bounding planes (4.2) with margin 2/‖w‖, and the separating plane (4.3), which separates A+, the points represented by rows of A with D_ii = +1, from A−, the points represented by rows of A with D_ii = −1. For this paper we used Newton-Lagrangian SVM (NSVM), an algorithm based on an essentially equivalent formulation of this classification problem [5]. In that formulation, the square of the 2-norm of the slack variable y is minimized with weight ν/2 instead of the 1-norm of y as in (4.1). In addition, the distance between the planes (4.2) is measured in the (n+1)-dimensional space of (w, γ) ∈ R^{n+1}, that is, 2/‖(w, γ)‖. Measuring the margin in this (n+1)-dimensional space instead of R^n induces strong convexity and has little or no effect in general on the problem.

Figure 1: The bounding planes (4.2) with margin 2/‖w‖, and the plane (4.3) separating A+, the points represented by rows of A with D_ii = +1, from A−, the points represented by rows of A with D_ii = −1.
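As an illustration of formulation (4.1), the following sketch solves the soft-margin linear SVM on a tiny synthetic dataset with the cvxpy convex solver. This is our own toy transcription under stated assumptions; the paper itself solves an equivalent formulation with NSVM [5], and the data and variable names here are illustrative.

```python
# Direct transcription of the soft-margin linear SVM (4.1) with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
d = np.array([+1] * 20 + [-1] * 20)          # diagonal of D
nu = 1.0

w, gamma, y = cp.Variable(2), cp.Variable(), cp.Variable(40)
objective = cp.Minimize(nu * cp.sum(y) + 0.5 * cp.sum_squares(w))
constraints = [cp.multiply(d, A @ w - gamma) + y >= 1,   # D(Aw - e*gamma) + y >= e
               y >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, gamma.value)                  # separating plane x'w = gamma
```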
4.2 Privacy Preserving SVMs. For our privacy preserving application we chose a technique based on random kernel mappings, recently proposed by Mangasarian and Wild in [11]. The algorithm is based on two simple ideas:

1. The use of reduced kernel mappings [9, 8], where the kernel centers are randomly chosen. Instead of using the complete kernel function K(A, A′): R^{m×n} → R^{m×m}, as is usually done in kernel methods, they propose the use of a reduced kernel K(A, B′): R^{m×n} → R^{m×m̃}, where B ∈ R^{m̃×n} is a completely random matrix with fewer rows than the number of available features (m̃ < n).

2. Each entity makes public only a common randomly generated linear transformation of the data, given by the matrix product of its privately held matrix of data rows multiplied by the transpose of a common random matrix B for linear kernels, and by a similar kernel function for nonlinear kernels.

In our experimental setting, we assumed that all the available patient data is normalized between 0 and 1, and the elements of B were therefore generated according to a normal distribution with mean zero and variance one. Next, we formally introduce the PPSVM algorithm as presented in [11].

Algorithm 4.1. Nonlinear PPSVM Algorithm

(I) All q entities agree on the same random matrix B ∈ R^{m̄×n} with m̄ < n, for the security reasons justified in the explanation immediately following this algorithm. All entities make public the class matrix D (labels), where D_ll = ±1, l = 1, ..., m, for each of the data matrices A_i, i = 1, ..., q, that they hold.

(II) Each entity generates its own privately held random matrix B_j ∈ R^{m̄×n_j}, j = 1, ..., p, where n_j is the number of input features held by entity j.

(III) Each entity j makes public its nonlinear kernel K(A_j, B′). This does not reveal A_j, but allows the public computation of the full nonlinear kernel by stacking the published blocks row-wise:

(4.5)  K(A, B′) = K([A_1; A_2; ...; A_q], B′) = [K(A_1, B′); K(A_2, B′); ...; K(A_q, B′)].

(IV) A publicly calculated linear classifier K(x′, B′)u − γ = 0 is computed by any linear hyperplane-based classification or regression method, such as the ones presented in sections 4.1 and 4.3.

(V) For each new x ∈ R^n obtained by an entity, that entity privately computes K(x′, B′) and classifies the given x according to the sign of K(x′, B′)u − γ.

Note that Algorithm 4.1 works for any kernel with the associative property

K([C; D], F′) = [K(C, F′); K(D, F′)],

which is, in particular, the case for the linear kernel K(A, B′) = AB′ that we use for the rest of the paper.
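A minimal numeric sketch of Algorithm 4.1 with the linear kernel K(A, B′) = AB′ might look as follows. The entity sizes, the synthetic labels, and the least-squares fit standing in for the linear classifier of step (IV) are all our own assumptions.

```python
# Sketch of Algorithm 4.1 with the linear kernel (horizontal partitioning).
import numpy as np

rng = np.random.default_rng(1)
n, m_bar = 6, 5                     # m_bar < n, as the security argument requires
B = rng.normal(0, 1, (m_bar, n))    # (I) common random matrix, agreed by all entities

# Each entity holds its own patients (rows) and publishes only A_i B'.
A1, A2 = rng.random((30, n)), rng.random((20, n))
P1, P2 = A1 @ B.T, A2 @ B.T         # (III) public; A_i cannot be recovered (Prop. 4.2)

# (IV) any linear classifier can now be trained on the stacked public data;
# a plain least-squares fit stands in for the SVM or Cox fit here.
P = np.vstack([P1, P2])
labels = np.sign(rng.normal(size=50))        # stand-in for the published D
u, *_ = np.linalg.lstsq(np.c_[P, np.ones(50)], labels, rcond=None)

# (V) a new point x is classified from its private projection K(x', B') = x'B'.
x = rng.random(n)
score = np.r_[x @ B.T, 1.0] @ u
print("class:", int(np.sign(score)))
```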
As stated in [11], it is important to note that in the above algorithm no entity j reveals its data, nor the components of a new testing data point. When m̄ < n, there is an infinite number of matrices A_i ∈ R^{m_i×n} in the solution set of the equation A_iB′ = P_i when B and P_i are given. This claim can be justified by the well-known properties of under-determined systems of linear equations. Furthermore, the following proposition, originally stated and proved in [11], formally supports this claim:

Proposition 4.2. (Infinite solutions of A_iB′ = P_i if m̄ < n) Given the matrix product P_i = A_iB′ ∈ R^{m_i×m̄}, where A_i ∈ R^{m_i×n} is unknown and B is a known matrix in R^{m̄×n} with m̄ < n, there are an infinite number of solutions, including (n choose m̄)^{m_i} = (n!/((n − m̄)! m̄!))^{m_i} possible solutions A_i ∈ R^{m_i×n} to the equation A_iB′ = P_i. Furthermore, the infinite number of matrices in the affine hull of these (n choose m̄)^{m_i} matrices also satisfy A_iB′ = P_i.

4.3 Overview of Cox Regression. Cox regression, or the Cox proportional-hazards model, is one of the most popular algorithms for survival analysis [3]. Unlike a classification algorithm, which directly deals with binary or multi-class outcomes, Cox regression defines a semi-parametric model that directly relates the predictive variables to the real outcome, which is in general the survival time (e.g., in years). Let T represent survival time. The so-called hazard function is a representation of the distribution of survival times, which assesses the instantaneous risk of demise at time t, conditional on survival to that time:

h(t) = lim_{Δt→0} Pr[(t ≤ T < t + Δt) | T ≥ t] / Δt.

The Cox regression model assumes a linear model for the log-hazard or, equivalently, a multiplicative model for the hazard:

(4.6)  log h(t) = α(t) + w′x,

where x denotes the covariates for each observation, and the baseline hazard α(t) is unspecified. This model is semi-parametric because, while the baseline hazard can take any form, the covariates enter the model linearly. Now, given any two observations x_i and x_j, from the definition of the hazard function we get

h(t_i)/h(t_j) = exp[w′(x_i − x_j)],

which is independent of time t. The baseline hazard α(t) also does not affect the hazard ratio; this is why the Cox model is a proportional-hazards model. Cox showed in [3] that, even though the baseline hazard is unspecified, the model can still be estimated by the method of partial likelihood. It is also possible to extract an estimate of the baseline hazard after having fit the model.

4.4 Privacy Preserving Cox Regression. The main idea of the privacy preserving SVM is to perform a random mapping of the original predictive variables into a new space, and then perform standard SVM in the new space. Since in Cox regression the interaction between the parameters of the model and the data is linear, we can apply the same idea presented in section 4.2 to obtain privacy preserving Cox regression. Given the random matrix B, and assuming that we are using a linear kernel, equation (4.6) is slightly changed to:

(4.7)  log h(t) = α(t) + w′Bx.

Again, it is important to note that, to our knowledge, this is the first time privacy preserving techniques have been applied to survival analysis methods.
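A small sketch of the PP Cox idea in (4.7): project the covariates with the common random matrix B, then fit an ordinary Cox model on the projection. The lifelines package and the synthetic survival data are our own choices for illustration; the paper does not name an implementation.

```python
# Sketch of PP Cox regression (4.7): fit Cox on randomly projected covariates.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n, m_bar, m = 6, 5, 120
B = rng.normal(0, 1, (m_bar, n))             # same common B as in Algorithm 4.1
X = rng.random((m, n))                       # private covariates of one entity
Z = X @ B.T                                  # public projection, K(x', B') = x'B'

df = pd.DataFrame(Z, columns=[f"z{j}" for j in range(m_bar)])
df["T"] = rng.exponential(24, m)             # synthetic survival times (months)
df["E"] = rng.integers(0, 2, m)              # 1 = event observed, 0 = right-censored

cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E") # partial likelihood, as in [3]
print(cph.summary)                           # w in log h(t) = alpha(t) + w'Bx
```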
5 Application: 2-Year Survival Prediction for Non-Small Cell Lung Cancer Patients

Radiotherapy, combined with chemotherapy, is the treatment of choice for a large group of non-small cell lung cancer (NSCLC) patients. The treatment is not restricted to patients with mediastinal lymph node metastasis, but is also indicated for patients who are inoperable because of their physical condition. In addition, the marginal role of radiotherapy and chemotherapy for the survival of NSCLC patients has changed into one of significant importance. Improved radiotherapy treatment techniques allow an increase of the radiation dose, while at the same time more effective chemoradiation schemes are being applied. These developments have led to an improved outcome in terms of survival. Although the introduction of FDG-PET scans has enabled more accurate detection of positive lymph nodes and distant metastases, leading to stage migration, the TNM staging system is still highly inaccurate for the prediction of survival outcome for this group of patients [1].

In summary, an increasing number of patients is being treated successfully with (chemo)radiation, but an accurate estimation of the survival probability for an individual patient, taking into account patient, tumor, and treatment characteristics and offering the possibility for treatment decision-making, is currently not available. At present, generally accepted prognostic factors for inoperable patients are performance status, weight loss, presence of comorbidity, use of chemotherapy in addition to radiotherapy, radiation dose, and tumor size. For other factors, such as gender and age, the literature shows inconsistent results, making it impossible to draw definitive conclusions. In these studies CT scans were used as the major staging tool. However, the increasing use of FDG-PET scans offers the possibility to identify and use new prognostic factors; in a recent study it was shown that the number of involved nodal areas quantified by PET-CT is an important prognostic factor [1]. We performed this retrospective study to develop and validate several prediction models for 2-year survival of NSCLC patients treated with (chemo)radiotherapy, taking into account all known prognostic factors. To the best of our knowledge, this is the first study of prediction models for NSCLC patients treated with (chemo)radiotherapy.

5.1 Patient Population. Between May 2002 and January 2007, a total of 455 inoperable NSCLC patients, stage I-IIIB, were referred to the MAASTRO clinic to be treated with curative intent. Clinical data of all these patients were collected retrospectively by reviewing the clinical charts. If PET was not used as a staging tool, patients were excluded from the study. This resulted in the inclusion of 399 patients. The primary gross tumor volume (GTVprimary) and nodal gross tumor volume (GTVnodal) were calculated, as delineated by the treating radiation oncologist, using a commercial radiotherapy treatment planning system (Computerized Medical Systems, Inc, CMS). The sum of GTVprimary and GTVnodal resulted in the GTV. For patients treated with sequential chemotherapy these volumes were calculated using the post-chemotherapy imaging information. The creation of the volumes was based on PET and CT information only; bronchoscopic findings were not taken into account. The number of positive lymph node stations was assessed by the nuclear medicine specialist using either an integrated FDG-PET-CT scan or a CT scan combined with an FDG-PET scan. T-stage and N-stage were assessed using pre-treatment CT, PET, and mediastinoscopy when applicable. For patients treated with sequential chemotherapy, the stage as well as the number of positive lymph node stations were assessed using pre-chemotherapy imaging information. Additionally, a smaller number of patients treated at the other two centers, the Gent hospital and the Leuven hospital, were also collected for this study. There are respectively 112 and 40 patients from the Gent and Leuven hospitals, and the same set of clinical variables as for the MAASTRO patients was measured.

5.2 Radiotherapy Treatment Variables. No elective nodal irradiation was performed, and irradiation was delivered 5 days per week. Radiotherapy planning was performed with a Focus (CMS) system, taking into account lung density, according to ICRU 50 guidelines. Four different radiotherapy treatment regimes were applied to the patients in this retrospective study; therefore, to account for the different treatment times and numbers of fractions per day, the equivalent dose in 2 Gy fractions, corrected for overall treatment time (EQD2,T), was used as a measure for the intensity of chest radiotherapy. Adjustment for dose per fraction and time factors was made as follows:

(5.8)  EQD2,T = D · (d + β)/(2 + β) − γ · max(0, T − T_k),

where D is the total radiation dose, d is the dose per fraction, β = 10 Gy, T is the overall treatment time, T_k is the accelerated repopulation kick-off time, which is 28 days, and γ is the loss in dose per day due to repopulation, which is 0.66 Gy/day.
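For reference, (5.8) transcribes directly into a small Python helper; the function name and the worked example are ours, while the parameter values follow the text.

```python
# Direct transcription of (5.8); defaults follow the text
# (beta = 10 Gy, kick-off time Tk = 28 days, gamma = 0.66 Gy/day).
def eqd2t(D, d, T, beta=10.0, gamma=0.66, Tk=28.0):
    """Equivalent dose in 2 Gy fractions, corrected for overall treatment time.

    D: total dose (Gy), d: dose per fraction (Gy), T: overall treatment time (days).
    """
    return D * (d + beta) / (2.0 + beta) - gamma * max(0.0, T - Tk)

# e.g. 60 Gy in 2 Gy fractions over 40 days loses 0.66 * 12 Gy to repopulation:
print(eqd2t(D=60, d=2, T=40))   # 60 * 12/12 - 7.92 = 52.08
```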
5.3 Experimental Setup. In this paper we focus on 2-year survival prediction for these NSCLC patients, which is the most interesting prediction from a clinical perspective. The survival status was evaluated in December 2007. The following 6 clinical predictors are used to build the prediction models: gender (two groups: male/female), WHO performance status (three groups: 0/1/≥2), lung function prior to treatment (forced expiratory volume, in the range of 17 ∼ 139), number of positive lymph node stations (five groups: 0/1/2/3/≥4), natural logarithm of the GTV (in the range of −0.17 ∼ 6.94), and the equivalent dose corrected by time (EQD2,T) from (5.8). The mean values across patients are used to impute the missing entries if some of these predictors are missing for certain patients. To account for the very different numbers of patients from the three sites, a subset of the MAASTRO patients was selected for the following study. In the following we use the names "MAASTRO", "Gent" and "Leuven" to denote the data from the three different centers.

Figure 2: AUC comparison for privacy preserving SVMs with 40% (left) and 60% (right) training patients. The error bars are calculated based on 100 random splits of the data.

Figure 3: AUC comparison between PP-SVMs and non PP-SVMs (which explicitly use all the training data from different centers, and thus upper-bound the predictive performance of PP-SVMs). We compare the two with different percentages of training patients (left), in a scatter plot (middle), and with different dimensions m̄ for PP-SVMs (right) for a 40% split.

For the SVM methods, since they can only deal with binary outcomes, we only use the patients with 2-year follow-up and create an outcome for them, with +1 meaning they survived 2 years and −1 meaning they did not survive 2 years. This setting leads to 70, 37 and 23 patients for the MAASTRO, Gent and Leuven sets, respectively.
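The data preparation described above reduces to a few lines of, e.g., pandas; the column names and the prepare_svm_data helper are hypothetical stand-ins, not the authors' code.

```python
# Sketch of the setup above: mean-impute missing predictors and build
# the +/-1 two-year survival outcome for the SVM experiments.
import pandas as pd

PREDICTORS = ["gender", "who_status", "fev", "n_pos_lymph_stations",
              "log_gtv", "eqd2t"]   # the six clinical predictors

def prepare_svm_data(df: pd.DataFrame):
    # Mean imputation, assuming the grouped predictors are numerically coded.
    X = df[PREDICTORS].fillna(df[PREDICTORS].mean())
    # Only patients with a full 2-year follow-up receive a +/-1 label.
    followed = df["followup_years"] >= 2
    y = df.loc[followed, "survived_2y"].map({True: 1, False: -1})
    return X[followed], y
```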
For the Cox regression methods, we can potentially use all the patients with the exact number of survived years, applying right censoring for those patients who are still alive. Under this setting we end up with 80, 85 and 40 patients for MAASTRO, Gent and Leuven, respectively.

Under the privacy preserving setting, we are interested in assessing the predictive performance of a model combining the patient data from the three centers, compared to models trained on each of these centers alone. The data combination needs to be done in a way that does not uncover sensitive information. Therefore, for our experiments we trained the following 4 models under each configuration:

• PP model: apply the privacy preserving techniques we have introduced and train a model using the combined data from the three centers.

• MAASTRO, Gent and Leuven models: train models using only the MAASTRO, Gent and Leuven training patients, respectively.

For each of the configurations, we vary the percentage of training patients in each of the centers, and report the Area Under the ROC Curve (AUC) for the test patients. Note that the testing was performed using all the test patients from all centers.

6 Results

In Figure 2 we show the results for the privacy preserving SVM models, with 2 example training percentages (40% and 60%). The other percentages yield similar results. The error bars are over 100 runs with random splits of training/test patients for each center, and each time a random B matrix of dimensionality 5 × 6 is used for the PP-SVM models. As can be seen, the PP-SVM models achieve the best performance compared to the single-center based models. This is mainly because the PP-SVM models are able to use more data in model training, at the same time without violating the privacy regulations.

Figure 4: AUC comparison for privacy preserving Cox regression models with 40% (left) and 60% (right) training patients. The error bars are calculated based on 100 random splits of the data.

Figure 5: AUC comparison between PP-CoxReg and non PP-CoxReg (which explicitly use all the training data from different centers, and thus upper-bound the predictive performance of PP-CoxReg). We compare the two with different percentages of training patients (left), in a scatter plot (middle), and with different dimensions m̄ for PP-CoxReg (right) in a 40% split.
This is because in PP-SVMs a random matrix B projects each patient into a lower dimensional space (for privacy preserving purpose), and thus leads to information loss. To empirically evaluate how much performance loss the PP-SVMs have, we show a more extensive comparison in Figure 3. On the left we show the comparison with different percentages of the training/test splits, and as can be seen the gaps between PP-SVMs and non PPSVMs are not very big. This indicates PP-SVMs can achieve similar predictive performance while satisfying 46 the privacy preserving requirement. The scatter plot in the middle is another way to visualize these results. On the right we vary the mapping dimensions m̄ for the B matrix we used in PP models, and as expected, bigger m̄ yield better predictive performance. Therefore, in practice we normally choose m̄ = n − 1 to maximize the performance of the PP models (which still perfectly satisfies the privacy preserving requirements). From this comparison we see that there is a big error bar for different B matrices, and one interesting future work is to identify the best B matrix for PP models. In Figure 4 we also empirically evaluate the results for privacy preserving Cox regression models, also with the 2 example training percentages (40% and 60%). They have the same trend as we have seen in Figure 2, but it is interesting that with a higher percentage of training data (e.g., 60% on the right), PPCoxReg performs the same as the model trained using only MAASTRO training patients. This indicates PP- CoxReg model is more sensitive to the different characteristics of the data from different centers. In practice, we need to carefully investigate the different data distributions to estimate the benefits of combining them. We also empirically compare the PP Cox regression models with non PP-CoxReg models in Figure 5. As can be seen, the gaps between PP-CoxReg and non PP-CoxReg models are even smaller than those between PP-SVM and non PP-SVM models, meaning PPCoxReg models are more accurate toward the non privacy preserving solutions. In practice we still need to choose m̄ = n − 1 to maximize the PP-CoxReg performance, and to choose the best B matrix if possible. 7 Discussion and Conclusions We have applied a simple recently proposed PP technique in a real clinical setting where data is shared across three European institutions in order to build more accurate predictive models than the ones obtained using only data from one institute. We have extended the previously proposed PP algorithm (originally suggested for SVM) to cox regression. As far as we know this is the first work that addresses privacy preserving concerns for survival models. The work presented here is based on preliminary results and we are already working on designing improved algorithms to address several concerns that arise when performing our experiments. One of the concerns that arise (as shown in section 6) is how to address the impact of the variability of the matrix B on the performance of the predictive models. For that, we are currently experimenting with formulations in which the B matrix is intended not only to “de-identify” the data but also to optimally improve model performance. Another relevant concern that we are looking into is, how to weight the importance of data from different institutions, assuming that the reliability of the data or the labels varies among institutions. References [1] Dehing-Oberije C, De Ruysscher D, van der Weide H, and et al. 
[1] C. Dehing-Oberije, D. De Ruysscher, H. van der Weide, et al. Tumor volume combined with number of positive lymph node stations is a more important prognostic factor than TNM stage for survival of non-small-cell lung cancer patients treated with (chemo)radiotherapy. Int J Radiat Oncol Biol Phys, (in press).
[2] K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth International Conference on Data Mining (ICDM'05), pages 589–592. IEEE, 2005.
[3] D. R. Cox. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34:187–220, 1972.
[4] W. Du, Y. Han, and S. Chen. Privacy-preserving multivariate statistical analysis: linear regression and classification. In Proceedings of the Fourth SIAM International Conference on Data Mining, pages 222–233, 2004. http://citeseer.ist.psu.edu/du04privacypreserving.html.
[5] G. Fung and O. L. Mangasarian. Finite Newton method for Lagrangian support vector machine classification. Neurocomputing, 55:39–55, 2003 (Special Issue on Support Vector Machines).
[6] G. Kou, Y. Peng, Y. Shi, and Z. Chen. Privacy-preserving data mining of medical data using data separation-based techniques. Data Science Journal, 6:429–434, 2007.
[7] S. Laur, H. Lipmaa, and T. Mielikäinen. Cryptographically private support vector machines. In L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, editors, Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 618–624, 2006.
[8] Y.-J. Lee and S.-Y. Huang. Reduced support vector machines: a statistical theory. IEEE Transactions on Neural Networks, 18:1–13, 2007.
[9] Y.-J. Lee and O. L. Mangasarian. RSVM: reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, 2001.
[10] L. Liu, J. Wang, Z. Lin, and J. Zhang. Wavelet-based data distortion for privacy-preserving collaborative analysis. Technical Report 48207, Department of Computer Science, University of Kentucky, Lexington, KY 40506, 2007. http://www.cs.uky.edu/~jzhang/pub/MINING/lianliu1.pdf.
[11] O. L. Mangasarian and E. Wild. Privacy-preserving classification of horizontally partitioned data via random kernels. Technical Report 07-03, Computer Sciences Department, University of Wisconsin - Madison, Madison, WI, 2007.
[12] G. Schadow, S. J. Grannis, and C. J. McDonald. Privacy-preserving distributed queries for a clinical case research network. Pages 55–65, 2002.
[13] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, second edition, 2000.
[14] V. Verykios, E. Bertino, I. Fovino, L. Provenza, Y. Saygin, and Y. Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD Record, 33:50–57, 2004.
[15] M.-J. Xiao, L.-S. Huang, Y.-L. Luo, and H. Shen. Privacy preserving ID3 algorithm over horizontally partitioned data. In PDCAT '05: Proceedings of the Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 239–243, Washington, DC, USA, 2005. IEEE Computer Society.
[16] H. Yu, X. Jiang, and J. Vaidya. Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. In SAC '06: Proceedings of the 2006 ACM Symposium on Applied Computing, pages 603–610, New York, NY, USA, 2006. ACM.