Obfuscation of Sensitive Data in Network Flows

Daniele Riboni∗, Antonio Villani∗∗, Domenico Vitali∗∗, Claudio Bettini∗ and Luigi V. Mancini∗∗

∗ Dipartimento di Informatica e Comunicazione, Università degli Studi di Milano, Via Comelico 39, Milan 20135, Italy. E-mail: {daniele.riboni, claudio.bettini}@unimi.it
∗∗ Dipartimento di Informatica, Università di Roma "La Sapienza", Via Salaria 113, Rome 00198, Italy. E-mail: {villani, vitali, lv.mancini}@di.uniroma1.it

Abstract—In the last decade, the release of network flows has gained significant popularity among researchers and networking communities. Indeed, network flows are a fundamental tool for modeling network behavior, identifying security attacks, and validating research results. Unfortunately, due to the sensitive nature of network flows, security and privacy concerns discourage the publication of such datasets. On the one hand, existing techniques proposed to sanitize network flows do not provide any formal guarantees. On the other hand, microdata anonymization techniques are not directly applicable to network flows. In this paper, we propose a novel obfuscation technique for network flows that provides formal guarantees under realistic assumptions about the adversary's knowledge. Our work is supported by extensive experiments with a large set of real network flows collected at an important Italian Tier II Autonomous System, hosting sensitive government and corporate sites. Experimental results show that our obfuscation technique preserves the utility of network flows for network traffic analysis.

I. INTRODUCTION

Recently, there has been a growing interest in releasing large datasets of network flows of Internet traffic. Indeed, such flows are a very valuable resource for researchers; for instance, to model network behavior, to experiment with new protocols, and to study security attacks. However, the release of network flows of Internet traffic poses serious concerns for the privacy and the security of the users of the computer networks involved. For example, by analyzing the network flows describing the Web sites visited by a given individual, an adversary may infer sensitive data, such as political preferences, religious beliefs, health status, and so on. Network flows may also reveal personal communications among specific individuals, such as the existence of email exchanges and chat sessions among them. Moreover, network flows can be exploited by adversaries to gather useful information for planning network attacks; for instance, to identify possible bottlenecks in the target network in order to increase the impact of a Denial of Service attack.

As a consequence, various research efforts have been carried out to protect privacy while preserving the practical value of the released network flows. Early techniques were based on the substitution of the real IP addresses with pseudo-IDs (for instance, in Crypto-PAn [1]). However, it has been shown that this approach is insufficient, since an adversary may reconstruct the real IPs based on the values of other fields of the flows [2], exploiting his knowledge of the characteristics of network hosts (fingerprinting attacks), or by injecting peculiar flows into the monitored network (injection attacks). For this reason, more sophisticated techniques have been proposed, based on the perturbation of other data in the flows (e.g., [3], [4], [5], [6]).
However, the techniques proposed to date do not provide any formal guarantee of protection, and it has recently been shown that they are prone to different kinds of attacks [7]. On the other hand, as we explain in Section II-B, well-known techniques proposed for microdata anonymization are not directly applicable to network flows. In this paper, we tackle the challenging research issue of sanitizing network traces while preserving data utility and providing formal guarantees of confidentiality protection. The main contributions of this work are the following:
• we formally model the problem of network flow obfuscation;
• we propose a novel defense technique, named (k, j)-obfuscation, and we formally prove that it guarantees protection of data confidentiality under realistic assumptions;
• we present algorithms to enforce (k, j)-obfuscation, and we experimentally evaluate our technique with a very large dataset of real network flows; results show that our obfuscation technique preserves the utility of data.
The dataset used for our experiments was collected from an Italian transit tier II Autonomous System (AS). This network is connected to the three main network infrastructures present in Italy (commercial, research, and public administration networks), and to several international providers.

The rest of the paper is structured as follows. Section II discusses related work. In Section III, we formally model network flow obfuscation and the adversary knowledge. In Section IV, we introduce our defense and formally prove its guarantees of confidentiality protection. In Section V, we present the algorithm to enforce our defense. Section VI illustrates the experimental evaluation. Section VII concludes the paper.

II. RELATED WORK

Early techniques for network flow obfuscation were based on the encryption of source and destination IP addresses. However, those techniques proved to be ineffective, since an adversary might be able to re-identify message source and destination from other values in a network flow, or in a sequence of flows (see, e.g., [3], [8], [9], [10]). King et al. [2] propose an extensive taxonomy of attacks against network flow sanitization methods; the attacks fall into two main categories:
• Fingerprinting: re-identification is performed by matching flow fields' values to the characteristics of the target environment (such as knowledge of the network topology and its settings, the types of OS and services of target hosts, etc.). Typical re-identifying values for network flows are: Type of Service (tos), TCP flags, number of bytes, and number of packets per flow.
• Injection: the adversary injects into the network to be logged a sequence of flows that are easily recognized due to their specific characteristics; e.g., marked with uncommon TCP flags, or following particular patterns.
Additional techniques can be used to exploit the results of the above attacks to decrypt the IP addresses of new network flows. In particular, if the IP address encryption is performed with the same key across the whole set of flows (as in most existing defense techniques), and the adversary discovers an IP mapping in one flow, he can decrypt the same IP address in any other flow.
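The practical consequence of such a consistent, one-to-one mapping can be illustrated with a few lines of Python. The sketch below is an illustration written for this discussion, not a reproduction of any specific tool: it pseudonymizes addresses with a keyed hash reused across the whole trace, so that once the adversary recognizes a single injected flow in the release, the learned pseudonym identifies that host in every other flow. The helper name pseudonymize and the toy flow records are assumptions of the example.

    import hashlib
    import hmac

    SECRET_KEY = b"trace-release-key"  # one key reused for the whole released trace

    def pseudonymize(ip: str) -> str:
        """One-to-one, trace-wide consistent pseudonym for an IP address."""
        return hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).hexdigest()[:8]

    # Released trace: every occurrence of an address gets the same pseudonym.
    flows = [("10.0.0.7", "192.0.2.1"), ("10.0.0.7", "192.0.2.9"), ("10.0.0.8", "192.0.2.1")]
    released = [(pseudonymize(s), pseudonymize(d)) for s, d in flows]

    # Injection attack: the adversary sends a recognizable flow from a host he
    # controls (here 10.0.0.7), spots it in the release, and learns its pseudonym.
    learned = {pseudonymize("10.0.0.7"): "10.0.0.7"}

    # Because the mapping is consistent, that single pseudonym de-anonymizes
    # the host in every other released flow as well.
    for src, dst in released:
        if src in learned:
            print(f"flow {src} -> {dst} was originated by {learned[src]}")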
A. Defenses tailored to network flows

Several efforts have been devoted to the implementation of frameworks (e.g., [5]) or configurable tools (e.g., [6]) through which the network administrator can define ad-hoc, per-field obfuscation policies. Roughly speaking, the defenses that can be found in the literature (e.g., [3], [4], among many others) take a "reactionary" approach: typically, in those works, a new kind of attack is identified, and a defense technique is proposed for that attack, generally based on the permutation or generalization of some fields' values. However, the proposed techniques do not provide a general solution. Indeed, as theoretically proved by Brekne and Årnes in [9], and empirically shown by Burkhart et al. in [7], those techniques can be easily defeated by the injection of flows following complex patterns over sufficiently long periods of time.

As a case study, we consider the well-known Crypto-PAn [1] technique, which is currently incorporated within several network flow collector tools. Crypto-PAn is a sanitization tool for network flows that encrypts IP addresses in a prefix-preserving manner. A malicious user who acts inside the monitored network can inject bogus and easily detectable flows in order to understand how an IP address is mapped to its encrypted value inside the obfuscated flow set. Since each octet of an IP address (an 8-bit integer value) is always mapped to the same encrypted value, an adversary can obtain the encrypted version of each of the 256 possible octet values by injecting a small number of bogus flows.

The defense technique we propose adopts cryptographic primitives to hide the real IP addresses, together with obfuscation of flow fields' values. However, differently from previous works, it provides strong confidentiality protection even when the adversary can reconstruct the mapping between an IP address and its encrypted value, possibly as a result of injection attacks.

B. Microdata anonymization techniques

Techniques proposed in the database area for microdata anonymization have the advantage of providing formal privacy guarantees under specific assumptions. Hence, it is natural to investigate the application of these techniques to network flows. At first glance, network flow logs seem very similar in nature to any other recordset stored in a relational database (census data, medical records, etc.). However, as we explain below, those techniques are infeasible for network flows, due to the peculiar characteristics of these data.

The simplest microdata anonymity principle is k-anonymity [11], which consists in making any record indistinguishable within a group of at least k records based on Quasi Identifier (QI) values; i.e., values that, joined with external information, may reduce the candidate set of record respondents. Any group of records having the same values for the QI attributes is called a QI-group. The main criticism found in the literature about the application of this principle to network flow logs (see, e.g., [12]) regards loss of information: indeed, since many fields of network flows may act as QI, data quality would be degraded to an unacceptable extent. The same argument holds for more sophisticated privacy principles that guarantee not only anonymity but also sensitive value diversity, such as l-diversity [13] and t-closeness [14]. However, we observe that the above mentioned principles are not even applicable to the anonymization of network flows.
Indeed, if the private value of each individual does not change in the released microdata (which is the case for network flow logs, if IP address encryption is consistent across the whole set of flows), the works in [11], [13], [14] are effective only under the assumption that each individual is the respondent of at most one record in the released microdata. Indeed, if the adversary knows that the same individual (in our case, an IP address I) is the respondent of one tuple in more than one QI-group, he may be able to derive the confidential information (the encryption of I) by simply intersecting the private values of the tuples in those QI-groups. Since the same IP address typically appears in multiple network flows, an appropriate privacy principle to be considered is m-invariance [15], which has been proposed to enforce both anonymity and diversity for incremental releases of microdata. This principle ensures that i) all the QI-groups in which an individual's records appear have the same set of private values, and ii) no QI-group contains records having the same private value. However, the application of m-invariance to network flow logs is infeasible, since the cardinality of IP addresses is very large; this would result in a very coarse generalization of QI values, and in the introduction of a large number of counterfeit flows to enforce property i). In order to overcome the above problems, our technique adopts a many-to-one mapping between IP addresses and encrypted values; this mapping is consistent across the whole set of network flows.

III. PROBLEM DEFINITION AND ADVERSARY MODEL

In general, the fact that two specific hosts A and B exchanged some message may be considered confidential information. Hence, since an IP address uniquely identifies its host, we assume that the confidential information in a network flow is the set of attributes {src addr, dst addr}. Indeed, if we could remove those fields from network logs, no confidentiality violation could reasonably be perpetrated. Unfortunately, removing IP addresses from logs would completely disrupt the utility of the data. When joined with external information, fields of the network flow other than IP addresses may restrict the candidate set of source and/or destination hosts. For instance, as shown in [8], based on network flow data such as packet and byte counts, it is possible to identify the Web server originating the request. We say that a network flow is obfuscated if it cannot be associated with high confidence to its source and destination IP addresses.

A. Network flow obfuscation

We denote by L an original set of network flows, and by L∗ the obfuscated version of L released by the data publisher. The fields of the flows include a confidential multi-value attribute Ap = {src addr, dst addr}, and a set of other fields Ai = {A1, A2, . . . , Am} that may be used to infer Ap. In particular, as explained in Section II, some flow fields may be exploited to identify Ap based on the hosts' characteristics. In order to characterize those fields, we introduce the notion of fingerprint Quasi Identifier (fp-QI).

Definition 1 (Fingerprint Quasi Identifier (fp-QI)): A field of a network flow is denoted as a fingerprint Quasi Identifier (fp-QI) if its value, possibly combined with external knowledge about the characteristics of the network hosts, can reduce the cardinality of the candidate set for the source or destination IP addresses of the flow in L∗.
Clearly, which flow fields act as fp-QI strongly depends on the external knowledge available to the adversary. In order to state that two flows are indistinguishable based on the network hosts' fingerprint, we introduce the following notion.

Definition 2 (Fingerprint indistinguishability): Two network flows are fp-indistinguishable if their fp-QI values are identical.

Given a flow f and a set of fields A, f[A] is the projection of f onto A; for instance, f[src addr, source port] is the pair ⟨source IP address, source port⟩ of f. Flows are obfuscated by a defense function D() before being released.
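As a minimal illustration of these notions, the following Python sketch represents a flow as a dictionary, computes the projection f[A] onto an assumed set of fp-QI fields, and tests fp-indistinguishability (Definition 2) as equality of the two projections. The field names and the choice of fp-QI fields (borrowed from the experimental setup of Section VI) are assumptions of the example, not part of the formal model.

    # Assumed fp-QI fields (cf. Section VI: tos, protocol, TCP flags, packets, bytes).
    FP_QI = ("tos", "protocol", "tcp_flags", "packets", "bytes")

    def project(flow: dict, fields=FP_QI) -> tuple:
        """Projection f[A]: the tuple of values of flow f on the fields in A."""
        return tuple(flow[a] for a in fields)

    def fp_indistinguishable(f: dict, g: dict) -> bool:
        """Definition 2: two flows are fp-indistinguishable iff their fp-QI values coincide."""
        return project(f) == project(g)

    f1 = {"src_addr": "10.0.0.7", "dst_addr": "192.0.2.1", "tos": 0,
          "protocol": 6, "tcp_flags": 0x18, "packets": 12, "bytes": 4096}
    f2 = dict(f1, src_addr="10.0.0.9")  # same fingerprint, different source host

    assert fp_indistinguishable(f1, f2)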
B. Adversary model

At each release of a set L∗, the goal of an adversary is to reconstruct, with a certain degree of confidence, the source and destination IP addresses of the flows in L∗. The considered adversary model is based on the following assumptions:
1) The adversary may observe L∗.
2) The obfuscation function D() is publicly known.
3) The adversary may have external information about the characteristics of the target environment, including the fingerprint of network hosts. For example, the adversary may know the topology of the network to be logged, and the set of services offered by its hosts. This knowledge determines which fields act as fp-QI.
4) The adversary may know in advance where and when the flows will be collected, and may inject flows into the network.
Note that we assume a powerful adversary, who may perform both fingerprinting and injection attacks. However, we claim that these assumptions are reasonable. Indeed, some information about the logged network and hosts can be acquired after the release of the flows; for instance, by scanning the network to locate services. Moreover, in some cases (e.g., if logs are periodically collected and released from a given network), an adversary may also know in advance the network location and the schedule of flow collection, and send bogus messages to target hosts in the network.

IV. (k, j)-OBFUSCATION DEFENSE

In this section we present our defense technique, and we illustrate the confidentiality guarantees it achieves.

A. Defense strategy

As anticipated in Section II, techniques based on a one-to-one mapping of each IP address to an encrypted value that is consistent across the whole set of flows are ineffective when an adversary gets to know the mapping between some IP addresses and their encrypted values. On the other hand, mapping the same IP address to different encrypted values in different flows would effectively counteract those attacks, but would result in an excessive loss of information; i.e., it would be tantamount to suppressing the IP address fields from the released flows. Hence, we propose a novel technique, named (k, j)-obfuscation, that enforces a many-to-one mapping between IP addresses and pseudo-random group-ID values, which are substituted for the real IP addresses in the obfuscated flows. With this solution, each IP address in the released flows is blurred within a set of at least k possible IP addresses.

Note that, with the above solution, an adversary may still be able to identify the real IP address within the group of possible addresses based on the fingerprint of its host. For this reason, our technique includes a defense against fingerprinting attacks. At first, IP addresses are grouped based on the fingerprint of their corresponding hosts: IP addresses whose hosts have a similar fingerprint are grouped together. Then, the fp-QI values of flows are obfuscated, such that, for each obfuscated flow f∗ whose source IP s belongs to an IP-group α, there exist j (j ≤ k) other obfuscated flows whose source IP belongs to α but differs from s, and that are fp-indistinguishable from f∗. This way, even if the adversary knows the hosts' fingerprint, as well as the mapping between IP addresses and group-IDs, he cannot associate a flow with fewer than j different IP addresses.

B. Formal definition of (k, j)-obfuscation

In the following, we define the properties that a (k, j)-obfuscation function must guarantee.

Definition 3 ((k, j)-obfuscation function): We denote by D : L × N × N → L∗ a partial function that transforms a set of network flows by substituting each IP address with a group-ID, and by possibly obfuscating the values of the other fields of the flows. Each IP address is mapped to its IP-group by a function group-ID; this mapping is consistent across the whole set of flows. We denote by f∗ ∈ L∗ the transformation of f ∈ L obtained by the application of function D. We state that D is a (k, j)-obfuscation function if, for each set L of network flows, L∗ = D(L, k, j) satisfies the following properties:
• p1: Each IP-group contains at least k different IP addresses. Formally, for each group-ID g appearing in a flow f∗ ∈ L∗, there exists a set A of at least k IP addresses appearing in a flow in L such that, for each a ∈ A, group-ID(a) = g.
• p2: Each flow f∗ is fp-indistinguishable within a set of at least j flows in L∗ originated by distinct IP addresses belonging to the same IP-group.
D(L, k, j) is undefined if the above properties cannot be satisfied; i.e., if L involves fewer than k different IP addresses (it is impossible to enforce p1), or if L contains fewer than j flows (it is impossible to enforce p2). Table I reports a summary of the notation used in the paper.

TABLE I
SUMMARY OF NOTATION USED IN THE PAPER
  fp-QI                fingerprint Quasi Identifier (Definition 1)
  src addr, dst addr   fields for source and destination IP
  f[A]                 projection of flow f onto field A
  L (resp. L∗)         original (resp. obfuscated) set of network flows
  D(L, k, j)           (k, j)-obfuscation function (Definition 3)
  k                    minimum number of IP addresses in a group
  j                    minimum number of fp-indistinguishable flows
  τ                    time granularity used in Algorithm 3
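Both properties of Definition 3 can be verified mechanically by the data publisher, who retains the original flows and the group-ID mapping. The Python sketch below performs this sanity check on the dictionary-based flow records used in the previous examples; it assumes that, in an obfuscated flow, src addr and dst addr hold group-IDs, and that bucketized fp-QI values are represented canonically (e.g., as sorted tuples) so that equal buckets compare equal. It is a check written for this exposition, not part of the proposed algorithms.

    from collections import defaultdict

    def satisfies_kj(pairs, group_id, fp_qi, k, j):
        """Check properties p1 and p2 of Definition 3.
        pairs:    list of (original flow, obfuscated flow) kept by the publisher
        group_id: mapping from a real IP address to its group-ID"""
        # p1: every group-ID used in the release covers at least k distinct addresses.
        members = defaultdict(set)
        for orig, _ in pairs:
            for field in ("src_addr", "dst_addr"):
                members[group_id[orig[field]]].add(orig[field])
        p1 = all(len(ips) >= k for ips in members.values())

        # p2: flows sharing the source group-ID and identical fp-QI values must
        # originate from at least j distinct source IP addresses.
        classes = defaultdict(set)
        for orig, obf in pairs:
            key = (obf["src_addr"], tuple(obf[a] for a in fp_qi))
            classes[key].add(orig["src_addr"])
        p2 = all(len(srcs) >= j for srcs in classes.values())
        return p1 and p2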
C. Confidentiality guarantees

In the following, we present the confidentiality guarantees enforced by our technique, based on different assumptions about the external knowledge available to an adversary.

1) Defense against knowledge of the IP mapping function: As explained in Section II, if an adversary discovers the mapping between the real and obfuscated IP address in one flow, and the IP address encryption is consistent across the whole set of flows, he can decrypt the same IP in any other flow in which it appears; such mappings can be easily discovered through injection. As demonstrated by the following theorem, (k, j)-obfuscation counteracts this attack by ensuring that no less than k different IP addresses are mapped to the same IP-group.

Theorem 1: Consider a (k, j)-obfuscation function D, a set L of original network flows, its obfuscated version L∗ = D(L, k, j), and an obfuscated flow f∗ ∈ L∗. Suppose that the (obfuscated) source and destination IP addresses of f∗ are α and β, respectively. Suppose also that the adversary has learned the function that maps original IP addresses to their group-IDs. Then, based on the knowledge of that function, he can associate the source and destination IP addresses of f with no less than k(k − 1) different pairs of possible addresses.

2) Defense against fingerprinting attacks: If an adversary knows the fingerprint of the network hosts, he can decrease his uncertainty about the source IP of a flow. The following theorem demonstrates that the (k, j)-obfuscation technique protects against fingerprinting attacks.

Theorem 2: Consider a (k, j)-obfuscation function D, a set L of original network flows, its obfuscated version L∗ = D(L, k, j), and an obfuscated flow f∗ ∈ L∗. Suppose that the adversary has accurate information about the hosts' fingerprint. Then, based on this knowledge, he can associate the source IP address of f with no less than j possible addresses.

3) Defense against combined attacks: A more powerful threat to consider is an adversary who knows both the IP mapping function and the hosts' fingerprint. The following theorem demonstrates that (k, j)-obfuscation provides strong protection even under this assumption.

Theorem 3: Consider a (k, j)-obfuscation function D, a set L of original network flows, its obfuscated version L∗ = D(L, k, j), and an obfuscated flow f∗ ∈ L∗. Suppose that the (obfuscated) source and destination IP addresses of f∗ are α and β, respectively. Suppose also that the adversary has learned the function that maps original IP addresses to their group-IDs, and has accurate information about the fingerprint of the network hosts. Then, based on this information, he can associate the pair ⟨src addr, dst addr⟩ of f with no less than j(k − 1) different pairs of possible addresses.

4) Defense against linking attacks: In some cases, an adversary may be able to understand that a flow g∗ is the response to a flow f∗. Different inferences may be used to link requests and responses; for instance, by observing that a flow from α to β is immediately followed by a flow from β to α. The following theorem demonstrates that (k, j)-obfuscation is effective even against this kind of inference.

Theorem 4: Consider a (k, j)-obfuscation function D, a set L of original network flows, its obfuscated version L∗ = D(L, k, j), and two obfuscated flows f∗ ∈ L∗ and g∗ ∈ L∗. Suppose that the adversary has learned that g∗ is the response to f∗. Suppose also that he knows the function that maps original IP addresses to their group-IDs, and has accurate information about the fingerprint of the network hosts. Then, based on this information, he can associate the source/destination IP addresses of f and g with no less than j(j − 1) different pairs of possible addresses.
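Under the simple reading in which the adversary considers every candidate association of Theorems 1-4 equally likely, these bounds translate into upper bounds on his confidence for a single flow, which is the quantity reported later in Fig. 5. The helper below merely evaluates the four expressions; it assumes k ≥ 2 and j ≥ 2 and is an illustrative calculation, not part of the defense.

    def confidence_bounds(k: int, j: int) -> dict:
        """Upper bounds on the adversary's confidence in guessing the real source
        (Theorem 2) or source/destination pair (Theorems 1, 3, 4) of one flow,
        assuming all candidate associations are equally likely; requires k, j >= 2."""
        return {
            "knowledge of the IP mapping (Th. 1)": 1.0 / (k * (k - 1)),
            "fingerprinting (Th. 2)":              1.0 / j,
            "combined attack (Th. 3)":             1.0 / (j * (k - 1)),
            "linking attack (Th. 4)":              1.0 / (j * (j - 1)),
        }

    # Example with the parameters adopted in Section VI (k = 10, j = 4).
    for attack, c in confidence_bounds(10, 4).items():
        print(f"{attack}: {c:.3f}")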
V. ENFORCING (k, j)-OBFUSCATION

In this section, we present our technique and the algorithms to enforce (k, j)-obfuscation of network flows.

A. Overall obfuscation algorithm

Finding the optimal transformation of flows that satisfies (k, j)-obfuscation (i.e., the one that minimizes the generalization of fp-QI values and the suppression of flows) is an NP-hard problem; indeed, it is well known that even the basic problem of optimal k-anonymous generalization is NP-hard [16]. For this reason, we devised an approximate algorithm; its pseudocode is shown in Algorithm 1.

Algorithm 1: Network flow obfuscation algorithm
Input: L: original set of network flows; fp-QI: set of fingerprint Quasi Identifiers; k: minimum group size; j: minimum number of fp-indistinguishable flows; τ: time granularity for enforcing fp-indistinguishability.
Output: L∗: set of obfuscated network flows.
 1  Obfuscate(L, fp-QI, k, j, τ) begin
 2      IP-groups G, IP-group identifiers GID := GroupCreation(L, fp-QI, k)
 3      L := SubstituteIPs(L, G, GID)
 4      L∗ := ∅
 5      foreach IP-group Gα ∈ G do
 6          Lα := GetFlows(L, Gα)
 7          L∗α := Bucketize(Lα, fp-QI, j, τ)
 8          L∗ := L∗ ∪ L∗α
 9      end
10      return L∗
11  end

At first (line 2), IP-groups are created by executing Algorithm 2 (Section V-B). Then (line 3), the real IPs in the network flows are substituted by the identifier of the IP-group they belong to. After initializing the set of obfuscated flows L∗ (line 4), for each IP-group we take the flows generated by the hosts of its IPs, we enforce fp-indistinguishability by executing Algorithm 3 (Section V-C), and we add the obfuscated flows to L∗ (lines 5 to 9). Finally (line 10), we return the set of obfuscated flows.

B. Fingerprint-based IP-group creation

The goal of our fingerprint-based IP-group creation method is to enforce property p1 of (k, j)-obfuscation while preserving the quality of the obfuscated data. In order to reach this goal, IP-groups are created by grouping together IPs whose hosts have a similar fingerprint (i.e., they originate similar flows), so that fp-indistinguishability can be more easily enforced. The algorithm to group IPs takes as input the original set L of network flows, the set fp-QI of fingerprint Quasi Identifiers, and the minimum group size k. It returns the IP-groups and their identifiers. Its pseudocode is shown in Algorithm 2.

Algorithm 2: Fingerprint-based group creation
Input: L: original set of network flows; fp-QI: set of fingerprint Quasi Identifiers; k: minimum group size.
Output: IP-groups G1, . . . , Gj; IP-group identifiers GID1, . . . , GIDj.
 1  GroupCreation(L, fp-QI, k) begin
 2      set of IP addresses A := {f[src addr], f ∈ L} ∪ {f[dst addr], f ∈ L}
 3      if |A| < k then return null
 4      foreach IP address a ∈ A do
 5          foreach feature ∈ fp-QI do
 6              feature value vf := GetFeatureValue(L, a, feature)
 7              fingerprint vector vec(a) := AddFeature(vec(a), vf)
 8          end
 9          Hilbert index aH := ComputeHilbertIndex(vec(a))
10      end
11      sorted list of IP addresses A′ := SortOnHilbertIndex(A)
12      group index j := 0
13      for i := 1 to |A| do
14          if (i % k) = 1 then
15              if (i + k) < |A| then
16                  j := j + 1
17                  IP-group Gj := ∅
18                  IP-group identifier GIDj := CSPRNG()
19              end
20          end
21          Gj := Gj ∪ {A′[i]}
22      end
23      return G1, . . . , Gj; GID1, . . . , GIDj
24  end

Its main operations are the following:
1) If fewer than k IPs appear as the source of a network flow in the original set L, it is impossible to create an IP-group of size greater than or equal to k. Hence, in this case, (k, j)-obfuscation cannot be enforced, and the algorithm terminates (line 3 in Algorithm 2).
2) Otherwise, for each IP, we build a fingerprint vector (lines 5 to 8), in which each dimension corresponds to a statistic (mean, standard deviation, . . . ) about the values of an fp-QI field of its flows. This vector represents the fingerprint of the host having that source IP.
3) We map each fingerprint vector into an integer value by exploiting Hilbert space-filling curves [17] (line 9). A Hilbert space-filling curve is a function that maps a point of a multi-dimensional space into an integer. With this technique, two points that are close in the multi-dimensional space are also close, with high probability, in the one-dimensional space obtained by the Hilbert transformation. In our case, IPs whose hosts have a similar fingerprint are thus associated with close Hilbert indices.
4) We sort the IPs based on their Hilbert index (line 11), and we create groups by partitioning the IPs into groups of size k following that order (lines 12 to 22); if the last group contains fewer than k IPs, it is merged with the previous one. Each group is identified by a value generated by a cryptographically secure pseudorandom number generator (CSPRNG) (line 18). Finally (line 23), we return the IP-groups G1, . . . , Gj, as well as their identifiers GID1, . . . , GIDj.
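A compact Python sketch of the grouping step is given below. It follows the structure of Algorithm 2 but, to stay self-contained, it considers source addresses only, builds the fingerprint from a single statistic (the mean) per fp-QI field, orders fingerprints with a Morton (Z-order) index as a stand-in for the Hilbert index of line 9, and draws group identifiers from Python's secrets module as the CSPRNG. These simplifications, as well as the field names, are assumptions of the sketch, not of Algorithm 2.

    import secrets
    from statistics import mean

    FP_QI = ("tos", "protocol", "tcp_flags", "packets", "bytes")

    def morton_index(vector, bits=8):
        """Stand-in for the Hilbert index: interleave the bits of the (quantized)
        features so that similar vectors tend to receive close indices."""
        idx = 0
        for b in reversed(range(bits)):          # most significant bits first
            for v in vector:
                idx = (idx << 1) | ((int(v) >> b) & 1)
        return idx

    def group_creation(flows, k):
        """Partition source IP addresses with similar fingerprints into IP-groups of
        size >= k and give each group a pseudo-random identifier (cf. Algorithm 2)."""
        per_ip = {}
        for f in flows:
            per_ip.setdefault(f["src_addr"], []).append(f)
        if len(per_ip) < k:                      # cf. line 3 of Algorithm 2
            return None
        # Fingerprint vector: one aggregate statistic per fp-QI field (quantized values assumed).
        fingerprints = {ip: [mean(fl[a] for fl in fs) for a in FP_QI]
                        for ip, fs in per_ip.items()}
        ordered = sorted(per_ip, key=lambda ip: morton_index(fingerprints[ip]))
        groups, gids = [], []
        for i in range(0, len(ordered), k):
            chunk = ordered[i:i + k]
            if len(chunk) < k and groups:        # merge a short tail into the previous group
                groups[-1].extend(chunk)
            else:
                groups.append(list(chunk))
                gids.append(secrets.token_hex(8))  # CSPRNG group identifier
        return groups, gids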
C. Enforcing fp-indistinguishability

We enforce fp-indistinguishability by bucketizing the values of the fp-QI fields. Bucketization consists in substituting the real value of a field with a multiset of possible values. For instance, to make fp-indistinguishable a set of three flows whose byte counts are 250, 400, and 250, respectively, we substitute the real values in these flows with the multiset {250, 250, 400}.

The bucketization algorithm considers one group of IPs at a time. It takes as input a set Lα of original flows (whose source IPs belong to the same group α), the set of fp-QI fields, the minimum number j of fp-indistinguishable flows, and a time granule τ (for instance, one minute). The algorithm considers flows in slots of one time granule at a time, in order to reduce computational and memory costs. This solution is needed when very large sets of flows are considered, as in our experimental evaluation. The algorithm returns the set of obfuscated flows L∗α. Its pseudocode is shown in Algorithm 3.

Algorithm 3: Bucketization of fp-QI fields
Input: Lα: original set of network flows whose source IP belongs to IP-group α; fp-QI: set of fp-QI fields; j: minimum number of fp-indistinguishable flows; τ: time granularity.
Output: L∗α: obfuscated flows.
 1  Bucketize(Lα, fp-QI, j, τ) begin
 2      tstart := lowest timestamp of flows in Lα
 3      tend := highest timestamp of flows in Lα
 4      repeat
 5          foreach flow f ∈ Lα s.t. f[timestamp] ∈ [tstart, tstart + τ) do
 6              fp-QI feature vector vec(f) := f[fp-QI]
 7              Hilbert index fH := ComputeHilbertIndex(vec(f))
 8              sorted list of flows F′ := SortOnHilbertIndex(Lα,τ)
 9          end
10          group index i := 1
11          group of flows Gi := ∅
12          number of distinct source IPs in Gi: n := 0
13          for c := 1 to |Lα,τ| do
14              if ∄f ∈ Gi s.t. f[src addr] = F′[c][src addr] then
15                  n := n + 1
16              end
17              Gi := Gi ∪ {F′[c]}
18              if n = j then i := i + 1; Gi := ∅; n := 0 end
19          end
20          if 0 < n < j then
21              if i > 1 then Gi−1 := Gi−1 ∪ Gi; Gi := ∅
22              else SuppressFlows(Gi)
23          end
24          L∗α,τ := ∅
25          foreach group of flows G do
26              G∗ := BucketizeFp-QI-values(G)
27              L∗α,τ := L∗α,τ ∪ G∗
28          end
29          tstart := tstart + τ
30      until tstart > tend
31      return L∗α, i.e., the union of the sets L∗α,τ over all time granules
32  end

(Here Lα,τ denotes the flows of Lα whose timestamp falls in the current granule [tstart, tstart + τ).) The main steps are the following:
1) At first, we initialize two variables tstart and tend with the lowest and highest timestamps of the flows in Lα, respectively (lines 2 and 3 in Algorithm 3).
2) For each flow f in the original set having source IP belonging to group α and timestamp in the interval [tstart, tstart + τ), we build an fp-QI feature vector vec(f), in which each dimension corresponds to the value of an fp-QI field of the flow (line 6). We map each fp-QI feature vector into an integer value fH by exploiting the Hilbert space-filling curves (line 7), and we sort the flows based on their Hilbert index (line 8).
3) We partition the flows into groups based on their Hilbert order, ensuring that each group contains at least j flows having distinct source IPs (lines 10 to 19). If the diversity of source IPs in the last group is less than j, we merge it with the second-last group (line 21). If diversity cannot be enforced, those flows are suppressed (line 22).
4) For each group, we substitute the fp-QI values of its flows with buckets including the fp-QI values of every flow in the group (lines 25 to 28).
We repeat the above steps for the subsequent time intervals, until tend is reached. Finally, we return the set L∗α of obfuscated flows (line 31).
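The bucketization of step 4 is easily reproduced in Python on the example given above (three flows whose byte counts are 250, 400 and 250): every flow of a group receives, for each fp-QI field, the multiset of the values that the group's flows take on that field. The sketch operates on the illustrative dictionary-based records used earlier and is not the implementation evaluated in Section VI.

    from collections import Counter

    def bucketize_group(group, fp_qi):
        """Replace each fp-QI value of every flow in `group` with the multiset of the
        values taken by the group's flows on that field (cf. lines 25-28 of Algorithm 3)."""
        obfuscated = []
        for f in group:
            g = dict(f)
            for field in fp_qi:
                g[field] = Counter(flow[field] for flow in group)  # the bucket, as a multiset
            obfuscated.append(g)
        return obfuscated

    # Example from the text: byte counts 250, 400, 250 become the bucket {250, 250, 400}.
    group = [{"bytes": 250}, {"bytes": 400}, {"bytes": 250}]
    print(bucketize_group(group, fp_qi=("bytes",))[0]["bytes"])  # Counter({250: 2, 400: 1})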
D. Correctness and computational complexity

Theorem 5: Algorithm 1 correctly computes a (k, j)-obfuscation function.

Due to the typically large size of network flow datasets, the defense algorithm needs to have low computational complexity. The most computationally expensive task of our algorithm is sorting, which is performed both for grouping IP addresses and for bucketizing fp-QI fields (the average-case complexity is O(n log n), where n is the maximum between the number of IP addresses appearing in the flows and the number of flows to be made fp-indistinguishable). The calculation of the Hilbert index is an O(1) operation; it is executed (n + m) times, where n is the number of flows, and m is the number of distinct IP addresses appearing in the flows.

VI. EXPERIMENTAL EVALUATION

The goal of our experiments was to evaluate the impact of the (k, j)-obfuscation parameters on the utility of the obfuscated NetFlows. Of course, higher values of k and j determine stronger confidentiality protection but lower utility of the obfuscated flows. In the following, we describe the dataset used in our experiments and the achieved results.

A. Experimental setup

In order to carry out our experiments, we collected real traffic packets flowing through an important Italian transit tier II Autonomous System located in the city of Rome, which hosts several sensitive corporate and governmental sites. During a day, an average of about 2 billion packets, corresponding to about 5 TBytes of data, flow through that system. In order to handle such large amounts of data, it is common practice to summarize packets into network flows. To do so, we used the widely adopted Cisco NetFlow technology (http://www.cisco.com/web/go/netflow), which aggregates data obtained from layers 2-4 of the TCP/IP stack. The use of the NetFlow technology has several advantages with respect to raw packet sniffing, since it gives a lightweight and informative picture of the monitored network. NetFlow records have proved to be useful for several applications, including intrusion detection systems, traffic classifiers, traffic accounting, and DoS monitoring. The typical configuration to leverage the NetFlow protocol consists of a router with NetFlow capabilities and a probe able to summarize and store the received data. A NetFlow record (graphically illustrated in Fig. 1) is defined as a unidirectional sequence of packets, all sharing source and destination IP address and port, IP protocol, ingress interface, and IP type of service. Other valuable data associated with the flow, such as timestamp, duration, number of packets, and number of transmitted bytes, are also recorded; the packet payload is not recorded. In our monitored network, during a typical working day, we collect an average of about 110 million NetFlows.

[Fig. 1. NetFlow record]
We implemented Algorithms 1, 2, and 3 using the C and Python programming languages. Experiments were carried out on a workstation with an Intel Core i7 930 2.80 GHz CPU (4 cores, 8 threads) and 12 GBytes of DDR3 1066 MHz RAM, running a GNU/Linux OS with kernel 2.6.32. With this experimental setup, in a few hours we were able to obfuscate the NetFlows collected during an entire working day. This is an acceptable time since, for most applications, NetFlow obfuscation can be performed offline. An extension of our algorithms to support larger sets of NetFlows will be investigated in future work. For these experiments, we considered a model in which the adversary may have in-depth knowledge of the network hosts' fingerprint. In particular, we assumed that the fp-QI fields of the NetFlows are: type of service (tos), protocol, TCP flags, number of packets, and number of bytes.

B. Impact of parameter k: IP address grouping

The first set of experiments was aimed at evaluating the role of parameter k of (k, j)-obfuscation, i.e., the minimum size of IP address groups. In order to study the effect of k in isolation, we applied Algorithm 2 to the original set of NetFlows in order to partition IP addresses into groups of size greater than or equal to k. Then, we substituted each real IP address in the original NetFlows with its corresponding group-ID. The value of k determines the level of obfuscation of IP addresses. Due to the high complexity of the optimal algorithm for (k, j)-obfuscation, and the very large size of the dataset, we could not compare our defense algorithm with the optimal one. Instead, we studied the impact of our defense algorithm on the obfuscated NetFlows from an information theory perspective; in particular, we measured the entropy of the network flows. Indeed, the evaluation of network flow entropy is widespread in traffic analysis, for instance for traffic anomaly detection [18] and traffic classification [19]. We modeled the distribution of IP addresses in flows collected during one-minute time windows, in order to evaluate its temporal trend, both in the original and in the obfuscated flows. High values of entropy correspond to high diversity of the IP address distribution of the flows. Hence, this measure is important for network analysis: for instance, a distributed denial of service attack would determine low entropy on destination IP addresses (many flows are directed to the same host) and high entropy on source IP addresses (many different hosts are performing the attack).

Fig. 2 shows the results during a representative peak hour of Internet traffic (from noon to 1 PM), using a dataset of about 7.5 million NetFlows collected by our system. We obtained similar results for other time intervals; in Fig. 2 we plot a one-hour sample for the sake of readability. We executed the IP address grouping algorithm with values of k ranging from 5 to 20. As expected, the average value of the entropy is inversely correlated with the value of k: indeed, the more IP addresses are grouped together, the less diverse the traffic and, consequently, the lower the entropy. However, algorithms for traffic analysis rely on fluctuations of the entropy value, not on its absolute value. For instance, in [18], traffic anomalies are detected by comparing the entropy in a fine-grained time window (e.g., from 12:00 to 12:01 of the current day) with its expected value (e.g., the entropy calculated on the same minute during multiple days).

[Fig. 2. Entropy of the source IP address distribution during one hour; curves for the original flows and for k = 5, 10, 20.]
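For reference, the entropy measure used in this comparison can be computed per time window as the Shannon entropy of the source address (or, in the obfuscated trace, group-ID) distribution. The sketch below assumes that each flow carries a timestamp field in seconds and a src_addr field; it illustrates the metric and is not the evaluation code used for the experiments.

    import math
    from collections import Counter, defaultdict

    def windowed_source_entropy(flows, window=60):
        """Shannon entropy (in bits) of the source address distribution, computed
        over consecutive `window`-second time windows."""
        per_window = defaultdict(Counter)
        for f in flows:
            per_window[int(f["timestamp"]) // window][f["src_addr"]] += 1
        entropy = {}
        for w, counts in sorted(per_window.items()):
            total = sum(counts.values())
            entropy[w] = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return entropy

    # Usage: compare the per-minute trend on the original and on the obfuscated flows.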
As can be seen from Fig. 2, trends and temporal patterns are preserved by the transformation of the real source IP addresses into group-IDs; we obtained analogous results for destination IP addresses. These results were confirmed when we considered the distribution of IP addresses in the dataset of about 790 million NetFlows collected during 8 consecutive days; the results are illustrated in Fig. 3. The above results indicate that our technique for IP address grouping preserves both traffic diversity and data utility for algorithms based on information theory measures. For the following experiments, we fixed the value of k to 10, since it provides a good tradeoff between confidentiality protection and data utility.

[Fig. 3. Entropy of the source IP address distribution during 8 days; curves for the original flows and for k = 5, 10, 20.]

C. Impact of parameter j: fp-indistinguishability

The second set of experiments was aimed at evaluating the impact of parameter j on the data quality of the obfuscated NetFlows. As explained in Section V, in order to reduce computational and memory costs, our algorithm to enforce fp-indistinguishability takes a temporal granularity τ as an additional parameter: NetFlows are processed by Algorithm 3 in slots of one time granule at a time. Using shorter time granules requires fewer computational and memory resources. However, in some cases fp-indistinguishability cannot be enforced, since there is not sufficient diversity of IP addresses in the flows generated during a single time granule. In those unfortunate cases, our algorithm suppresses the flows that cannot be made fp-indistinguishable. We performed experiments to evaluate the number of suppressed flows, using different values of j (from 2 to 7) and τ (from one minute to 32 minutes). With our experimental setup, we were unable to use values of τ of one hour or more without incurring significant delays, due to swapping on the specific hardware of our setup. As can be seen in Fig. 4, with low values of j (less than 5), the percentage of suppressed flows rapidly decreases; it is close to zero with τ equal to 32 minutes. On the contrary, with higher values of j, a relevant fraction of flows (> 20%) is suppressed. However, with k = 10, small values of j are sufficient to provide confidentiality protection. In Fig. 5 we report the adversary's confidence based on the attacks considered in Section IV-C. As can be observed, with j = 4, the confidence of the adversary about the association between a flow and its source and destination IP addresses is below 10%.

[Fig. 4. Suppressed flows (%) as a function of the time granule τ, for j = 2, . . . , 7 (k = 10).]
[Fig. 5. Adversary's confidence for the different attacks (knowledge of the IP mapping function, combined attacks, linking attacks) as a function of j (k = 10).]

We evaluated the utility of the obfuscated NetFlows in terms of the precision in answering aggregate queries. For the fields having numerical domains (number of bytes and number of packets), we executed queries considering ranges of different selectivity, e.g., "count the number of NetFlows at minute t whose number of packets is between 200 and 300". We performed these queries considering each interval of size 100 (for the number of bytes) and 5 (for the number of packets), starting from 0, up to the maximum number of bytes and packets in our dataset of NetFlows. For the fields having non-numerical domains (tos, protocol, and TCP flags), we executed queries about their specific values, e.g., "count the number of NetFlows at minute t whose protocol is TCP". We executed queries for each possible value/range, and for each minute in a one-hour time window, for a total of about 120,000 queries. For each query, we calculated the error rate with the following formula:

    e = ( r/t − r′/t′ ) / ( r/t )

where r (resp. r′) is the result of the query on the original (resp. obfuscated) flows, and t (resp. t′) is the total number of original (resp. obfuscated) flows.
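A direct transcription of this error measure is given below; taking the absolute value, so that over- and under-estimates count equally when averaging, is an assumption of this illustration. The query itself (any aggregate count evaluated on both versions of the trace) is left abstract.

    def error_rate(r: int, t: int, r_obf: int, t_obf: int) -> float:
        """Relative error of an aggregate query: the fraction of matching flows in the
        original trace (r/t) against the fraction in the obfuscated trace (r'/t')."""
        original_fraction = r / t
        obfuscated_fraction = r_obf / t_obf
        return abs(original_fraction - obfuscated_fraction) / original_fraction

    # Example: 1200 of 100000 original flows match a range predicate,
    # while 1150 of 98000 obfuscated flows match the same predicate.
    print(f"{error_rate(1200, 100000, 1150, 98000):.2%}")  # about 2.2%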
Fig. 6 shows the average error rate for different values of j and τ, and for the different fp-QI fields having non-numerical domains. Considering the flag and protocol fields, with j equal to 4 or less, and τ equal to 16 minutes or more, the average error is below 10%. The average error rate was even smaller when we considered the numerical fields (results are shown in Fig. 7). We obtained a larger average error with the tos field; however, even for that field, the error becomes low using τ = 32 minutes and j ≤ 4. The average error increases considerably when larger values of j are used; this is due to the large number of flows that must be suppressed to achieve fp-indistinguishability.

[Fig. 6. Average error rate for aggregate queries on obfuscated NetFlows (k = 10): (a) protocol field, (b) flag field, (c) tos field; curves for j = 2, . . . , 7 as a function of the time granule τ.]
[Fig. 7. Average error rate for aggregate queries on obfuscated NetFlows (k = 10): (a) byte field, (b) packet field; curves for j = 2, . . . , 7 as a function of the time granule τ.]

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we addressed the challenging research issue of sensitive data obfuscation in network flows. We formally modeled this issue and proposed a novel defense technique. Differently from previous proposals, our technique provides formal protection guarantees under realistic assumptions about the adversary's knowledge. An extensive experimental evaluation with a large set of real network flows showed that our technique preserves the utility of network flows. Future research work includes the investigation of algorithms for enforcing (k, j)-obfuscation on arbitrarily large sets of network flows, and the execution of new experiments on obfuscated flows using state-of-the-art attack-detection algorithms to evaluate data utility. We are also investigating an extension of our defense to different adversary models; in particular, one in which the hosts' fingerprint may change over time.

ACKNOWLEDGMENTS

We would like to give special thanks to the CASPUR Consortium and to the staff of its Network Group for their significant help in collecting the data and in running our experiments.
This paper has been financially supported by the Prevention, Preparedness and Consequence Management of Terrorism and other Security-related Risks Programme, European Commission - Directorate General Home Affairs, under the ExTraBIRE project, HOME/2009/CIPS/AG/C2-065.

REFERENCES

[1] J. Fan, J. Xu, M. H. Ammar, and S. B. Moon, "Prefix-preserving IP address anonymization: measurement-based security evaluation and a new cryptography-based scheme," Comput. Netw., vol. 46, no. 2, pp. 253-272, 2004.
[2] J. King, K. Lakkaraju, and A. J. Slagell, "A taxonomy and adversarial model for attacks against network log anonymization," in Proc. of ACM SAC. ACM, 2009, pp. 1286-1293.
[3] T. Brekne, A. Årnes, and A. Øslebø, "Anonymization of IP traffic monitoring data: Attacks on two prefix-preserving anonymization schemes and some proposed remedies," in 5th Workshop on Privacy Enhancing Technologies, vol. 3856. Springer, 2006, pp. 179-196.
[4] R. Pang, M. Allman, V. Paxson, and J. Lee, "The devil and packet trace anonymization," Computer Communication Review, vol. 36, no. 1, pp. 29-38, 2006.
[5] A. J. Slagell, K. Lakkaraju, and K. Luo, "FLAIM: A multi-level anonymization framework for computer and network logs," in Proc. of the Large Installation System Administration Conference. USENIX, 2006, pp. 63-77.
[6] M. Foukarakis, D. Antoniades, and M. Polychronakis, "Deep packet anonymization," in Proc. of EUROSEC. ACM, 2009, pp. 16-21.
[7] M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner, "The role of network trace anonymization under attack," Computer Communication Review, vol. 40, no. 1, pp. 5-11, 2010.
[8] T.-F. Yen, X. Huang, F. Monrose, and M. K. Reiter, "Browser fingerprinting from coarse traffic summaries: Techniques and implications," in Proc. of Detection of Intrusions and Malware & Vulnerability Assessment, vol. 5587. Springer, 2009, pp. 157-175.
[9] T. Brekne and A. Årnes, "Circumventing IP-address pseudonymization," in Proc. of the International Conference on Computer Communications and Networks. IASTED/ACTA Press, 2005, pp. 43-48.
[10] S. E. Coull, C. V. Wright, F. Monrose, M. P. Collins, and M. K. Reiter, "Playing devil's advocate: Inferring sensitive information from anonymized network traces," in Proc. of NDSS. The Internet Society, 2007.
[11] P. Samarati, "Protecting respondents' identities in microdata release," IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010-1027, 2001.
[12] S. E. Coull, F. Monrose, M. K. Reiter, and M. Bailey, "The challenges of effectively anonymizing network data," in Proc. of the Conference For Homeland Security. IEEE Comp. Soc., 2009, pp. 230-236.
[13] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-Diversity: Privacy beyond k-anonymity," in Proc. of the International Conference on Data Engineering. IEEE Comp. Soc., 2006.
[14] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy beyond k-anonymity and l-diversity," in Proc. of the International Conference on Data Engineering. IEEE Comp. Soc., 2007, pp. 106-115.
[15] X. Xiao and Y. Tao, "m-Invariance: Towards privacy preserving re-publication of dynamic datasets," in Proc. of SIGMOD. ACM, 2007, pp. 689-700.
[16] A. Meyerson and R. Williams, "On the complexity of optimal k-anonymity," in Proc. of SIGMOD/PODS'04. ACM, 2004, pp. 223-228.
[17] A. R. Butz, "Alternative algorithm for Hilbert's space-filling curve," IEEE Trans. Comput., vol. 20, pp. 424-426, 1971.
[18] Y. Gu, A. McCallum, and D. F. Towsley, "Detecting anomalies in network traffic using maximum entropy estimation," in Proc. of the ACM SIGCOMM Internet Measurement Conference. USENIX Association, 2005, pp. 345-350.
[19] J. Yuan, Z. Li, and R. Yuan, "Information entropy based clustering method for unsupervised internet traffic classification," in Proc. of the IEEE International Conference on Communications. IEEE Comp. Soc., 2008, pp. 1588-1592.