Fast Detection of Transformed Data Leaks
Fast Detection of Transformed Data Leaks
Fast Detection of Transformed Data Leaks
3, MARCH 2016
Abstract— The leak of sensitive data on computer systems tool searches for the occurrences of plaintext sensitive data
poses a serious threat to organizational security. Statistics show in the content of files or network traffic. It alerts users and
that the lack of proper encryption on files and communications administrators of the identified data exposure vulnerabilities.
due to human errors is one of the leading causes of data loss.
Organizations need tools to identify the exposure of sensitive For example, an organization’s mail server can inspect the
data by screening the content in storage and transmission, content of outbound email messages searching for sensitive
i.e., to detect sensitive information being stored or transmit- data appearing in unencrypted messages.
ted in the clear. However, detecting the exposure of sensitive Data leak detection differs from the anti-virus (AV) scanning
information is challenging due to data transformation in the (e.g., scanning file systems for malware signatures) or the
content. Transformations (such as insertion and deletion) result
in highly unpredictable leak patterns. In this paper, we utilize network intrusion detection systems (NIDS) (e.g., scanning
sequence alignment techniques for detecting complex data-leak traffic payload for malicious patterns) [5]. AV and NIDS
patterns. Our algorithm is designed for detecting long and typically employ automata-based string matching (e.g.,
inexact sensitive data patterns. This detection is paired with a Aho-Corasick [6], Boyer-Moore [7]), which match static or
comparable sampling algorithm, which allows one to compare regular patterns. However, data leak detection imposes new
the similarity of two separately sampled sequences. Our system
achieves good detection accuracy in recognizing transformed security requirements and algorithmic challenges:
leaks. We implement a parallelized version of our algorithms in • Data Transformation: The exposed data in the content
graphics processing unit that achieves high analysis throughput. may be unpredictably transformed or modified by users
We demonstrate the high multithreading scalability of our data or applications, and it may no longer be identical to
leak detection method required by a sizable organization. the original sensitive data, e.g., insertions of metadata
Index Terms— Data leak detection, content inspection, or formatting tags, substitutions of characters, and data
sampling, alignment, dynamic programming, parallelism. truncation (partial data leak). Thus, the detection algo-
rithm needs to recognize different kinds of sensitive data
I. I NTRODUCTION variations.
• Scalability: The heavy workload of data leak screening
R EPORTS show that the number of leaked sensitive data
records has grown 10 times in the last 4 years, and it
reached a record high of 1.1 billion in 2014 [3]. A significant
is due to two reasons.
a) Long Sensitive Data Patterns: The sensitive data
portion of the data leak incidents are due to human errors, (e.g., customer information, documents, source
for example, a lost or stolen laptop containing unencrypted code) can be of arbitrary length (e.g., megabytes).
sensitive files, or transmitting sensitive data without using end- b) Large Amount of Content: The detection needs to
to-end encryption such as PGP. A recent Kaspersky Lab survey rapidly screen content (e.g., gigabytes to terabytes).
shows that accidental leak by staff is the leading cause for Traffic scanning is more time sensitive than storage
internal data leaks in corporates [4]. The data-leak risks posed scanning, because the leak needs to be discovered
by accidents exceed the risks posed by vulnerable software. before the message is transmitted.
In order to minimize the exposure of sensitive data and Directly applying automata-based string matching
documents, an organization needs to prevent cleartext sensitive (e.g., [6], [8], [9]) to data leak detection is inappropriate
data from appearing in the storage or communication. and inefficient, because automata are not designed to support
A screening tool can be deployed to scan computer file sys- unpredictable and arbitrary pattern variations. In data leak
tems, server storage, and inspect outbound network traffic. The detection scenarios, the transformation of leaked data (in the
description of regular expression) is unknown to the detection
Manuscript received June 28, 2015; revised October 9, 2015; accepted method. Creating comprehensive automata models covering
November 4, 2015. Date of publication November 24, 2015; date of current
version December 24, 2015. This work was supported in part by Security all possible variations of a pattern is infeasible, which leads
and Software Engineering Research Center (S2ERC), an NSF sponsored to O(2n ) space complexity (for deterministic finite automata)
multi-university Industry/University Cooperative Research Center (I/UCRC). or O(2n ) time complexity (for nondeterministic finite
Preliminary versions of this work appeared in Proceedings of the Third Inter-
national Workshop on Security and Privacy in Big Data (BigSecurity 2015) automata) where n is the number of automaton states.
and Proceedings of the 5th ACM Conference on Data and Application Security Therefore, automata approaches cannot be used for detecting
and Privacy (CODASPY). The associate editor coordinating the review of this long and transformed data leaks.
manuscript and approving it for publication was Dr. Vrizlynn L. L. Thing.
The authors are with the Department of Computer Science, Virginia Tech, Existing data leak detection approaches are based on set
Blacksburg, VA 24060 USA (e-mail: subx@cs.vt.edu; zjing14@cs.vt.edu; intersection. Set intersection is performed on two sets of
danfeng@cs.vt.edu; feng@cs.vt.edu). n-grams, one from the content and one from sensitive data.
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. The set intersection gives the amount of sensitive n-grams
Digital Object Identifier 10.1109/TIFS.2015.2503271 appearing in the content. The method has been used to detect
1556-6013 © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted,
but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
SHU et al.: FAST DETECTION OF TRANSFORMED DATA LEAKS 529
similar documents on the web [10], shared malicious traffic window. The same sampled items are selected in spite of
patterns [11], malware [12], as well as email spam [13]. The different starting/ending points of sampling procedures.
advantage of n-grams is the extraction of local features of a Both of our algorithms are designed to be efficiently paral-
string, enabling the comparison to tolerate discrepancies. Some lelized. We parallelize our prototype on a multicore CPU and a
advanced versions of the set intersection method utilize Bloom GPU. We demonstrate the strong scalability of our design and
filter, e.g., [14], which trades accuracy for space complexity the high performance of our prototypes. Our GPU-accelerated
and speed. Shu and Yao extended the standard use of n-grams implementation achieves nearly 50 times of speedup over
and introduced data-leak detection as a service. They proposed the CPU version. Our prototype reaches 400Mbps analysis
the first solution for detecting accidental data leak with throughput. This performance potentially supports the rapid
semi-honest providers [15]. security scanning of storage and communication required by
However, set intersection is orderless, i.e., the ordering of a sizable organization.
shared n-grams is not analyzed. Thus, set-based detection We have presented the basic idea and preliminary evaluation
generates undesirable false alerts, especially when n is set results in our workshop paper [1]. In this paper, we formalize
to a small value to tolerant data transformation. In addition, and expand the description and analysis of our comparable
set intersection cannot effectively characterize the scenario sampling algorithm and sampling-oblivious alignment algo-
when partial data is leaked, which results in false negatives. rithm. We conduct new experiments in Section VI to system-
Therefore, none of the existing techniques is adequate for atically understand how sensitive our system is in response
detecting transformed data leaks. to data transformation in various degrees. We also include the
Our solution to the detection of transformed data leaks is a effectiveness evaluation of our sampling design in Section VII.
sequence alignment algorithm, executed on the sampled sen- Our solution detects inadvertent data leaks, where sensitive
sitive data sequence and the sampled content being inspected. data may be accidentally exposed. It is not designed for
The alignment produces scores indicating the amount of detecting data leaks caused by malicious insiders or attackers.
sensitive data contained in the content. Our alignment-based The detection of data leaks due to malicious insiders remains
solution measures the order of n-grams. It also handles arbi- a challenging open research problem.
trary variations of patterns without an explicit specification
of all possible variation patterns. Experiments show that our
alignment method substantially outperforms the set intersec- II. R ELATED W ORK
tion method in terms of detection accuracy in a multitude of Existing commercial data leak detection/prevention
transformed data leak scenarios. solutions include Symantec DLP [14], IdentityFinder [18],
We solve the scalability issue by sampling both the sensitive GlobalVelocity [19], and GoCloudDLP [20]. GlobalVelocity
data and content sequences before aligning them. We enable uses FPGA to accelerate the system. All solutions are likely
this procedure by providing the pair of a comparable sam- based on n-gram set intersection. IdentityFinder searches file
pling algorithm and a sampling-oblivious alignment algorithm. systems for short patterns of numbers that may be sensitive
The comparable sampling algorithm yields constant samples (e.g., 16-digit numbers that might be credit card numbers).
of a sequence wherever the sampling starts and ends. The It does not provide any in-depth similarity tests. Symantec
sampling-oblivious alignment algorithm infers the similarity DLP is based on n-grams and Bloom filters. The advantage
between the original unsampled sequences with sophisti- of Bloom filter is space saving. However, as explained in
cated traceback techniques through dynamic programming. the introduction, Bloom filter membership testing is based
The algorithm infers the lost information (i.e., sampled-out on unordered n-grams, which generates coincidental matches
elements) based on the matching results of their neighboring and false alarms. Bloom filter configured with a small number
elements. Evaluation results show that our design boosts of hash functions has collisions, which introduce additional
the performance, yet only incurs a very small amount of unwanted false positives.
mismatches. Network intrusion detection systems (NIDS) such as
Existing network traffic sampling techniques, e.g., [16], only Snort [21] and Bro [22] use regular expression to perform
sample the content. Our problem differs from existing sam- string matching in deep packet inspection [23]. Nondetermin-
pling problems that both sensitive data and content sequences istic finite automaton (NFA) with backtracking requires O(2n )
are sampled. The alignment is performed on the sampled time and O(n) space, where n is the number of automaton
sequences. Therefore, the samples of similar sequences should states. Deterministic finite automaton (DFA) has a time com-
be similar so that they can be aligned. We define a comparable plexity of O(n) and a space complexity of O(2n ) when used
sampling property, where the similarity of two sequences is with quantification. Quantification is for expressing optional
preserved. For example, if x is a substring of y, then x should characters and multiple occurrences in a pattern. DFA’s space
be a substring of y , where x and y are sampled sequences complexity can be reduced by grouping similar patterns into
of x and y, respectively. None of the existing sampling solu- one automaton [24], reducing the number of edges [25], [26].
tions satisfies this comparable sampling requirement. Deter- These improvements provide a coefficient level of speedup.
ministic sampling, e.g., [17], does not imply comparable However, existing string matching approaches based on
sampling, either. The key to our comparable sampling is to DFA or NFA cannot automatically match arbitrary and unpre-
consider the local context of a sequence while selecting items. dictable pattern variations. Modified data leak instances cannot
Sample items are selected deterministically within a sliding be matched or captured if the variation is not manually
530 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 11, NO. 3, MARCH 2016
specified as a matching pattern. Enumerating all potential vari- detection capability is restricted by the underlying keyword
ation patterns takes exponential time and space with respect searching utility, which is not designed for detecting either
to the length of the pattern. Therefore, it is impractical. transformed data leaks or partial data leaks.
In comparison, our sequence alignment solution covers Bertino and Ghinita addressed the issue of data leaks in
all possible pattern variations in long sensitive data without database from the perspective of anomaly detection [41].
explicitly specifying them. Another drawback of automata is Normal user behaviors are monitored and modeled in DBMS,
that it yields binary results. In comparison, alignment pro- and anomalous activities are identified with respect to potential
vides precise matching scores and allows customized weight data leak activities. Bertino also discussed watermarking and
functions. Our alignment gives more accurate detection than provenance techniques used in data leak prevention and foren-
approximate string matching (e.g., [27], [28]). sics [41], which is investigated in details by Papadimitriou and
Alignment algorithms have been widely used in Garcia-Molina in [42].
computational biology applications, and features such as Privacy is a well-known issue in the cloud. Lin and
privacy-preserving sequence matching have been studied [29]. Squicciarini proposed a generic data protection framework
In security literature, NetDialign based on the well-known to enforce data security in the cloud [43]. A three-tier data
Dialign algorithms is proposed for network privacy [30]. protection framework was proposed by Squicciarini et al. to
It performs differential testing among multiple traffic flows. deal with the data leak caused by indexing in the cloud [44].
Kreibich and Crowcroft presented an alignment algorithm Privacy-preserving data leak detection was proposed and fur-
for traffic intrusion detection systems such as Bro [31]. ther developed in [45] and [46], where data leak detection
It is a variant of Jacobson-Vo alignment that calculates the operations are outsourced to a semi-honest third-party. The
longest common subsequence with the minimum number solution is a specialized set intersection method. Therefore,
of gaps. Researchers in [32] reported the use of dynamic it is different from this paper.
programming for computing the similarity of network
behaviors and presented a technique to handle behavioral III. M ODELS AND OVERVIEW
sequences with differing sampling rates. Masquerade attacks In our data leak detection model, we analyze two types of
in the context of user command sequences can be detected sequences: sensitive data sequence and content sequence.
with semi-global sequence alignment techniques [33], [34]. • Content sequence is the sequence to be examined for
Our data leak detection differs from the above network leaks. The content may be data extracted from file sys-
privacy and IDS problems, and it has new requirements as we tems on personal computers, workstations, and servers;
have explained in the introduction. Our alignment performs or payloads extracted from supervised network channels
complex inferences needed for aligning sampled sequences, (details are discussed below).
and our solution is also different from fast non-sample align- • Sensitive data sequence contains the information (e.g.,
ment in bioinformatics, e.g., BLAST [35]. customers’ records, proprietary documents) that needs
Another approach to the detection of sensitive data leak to be protected and cannot be exposed to unauthorized
is to track the data/metadata movement. Several tools are parties. The sensitive data sequences are known to the
developed for securing sensitive information on mobile analysis system.
platforms [36]–[38]. Nadkarni and Enck described an approach In this paper, we focus on detecting inadvertent data leaks,
to control the sharing of sensitive files among mobile appli- and we assume the content in file system or network traffic
cations [37]. File descriptors (not the content) are stored, (over supervised network channels) is available to the inspec-
tracked and managed. The access control on files is enforced tion system. A supervised network channel could be an unen-
through policies. Yang et al. presented a method aiming at crypted channel or an encrypted channel where the content
detecting the transmission of sensitive data that is not intended in it can be extracted and checked by an authority. Such
by smartphone users via symbolic execution analysis [38]. a channel is widely used for advanced NIDS where MITM
Hoyle et al. described a visualization method for informing (man-in-the-middle) SSL sessions are established instead of
mobile users of information exposure [36]. The information normal SSL sessions [47]. We do not aim at detecting stealthy
exposure may be caused by improper setting or configuration data leaks that an attacker encrypts the sensitive data secretly
of access policies. The visualization is through an avatar before leaking it. Preventing intentional or malicious data leak,
apparel approach. Croft and Caesar expand the data tracking especially encrypted leaks, requires different approaches and
from a single host to a network and use shadow packets to remains an active research problem [48].
distinguish normal traffic from leaks [39]. The security goals In our current security model, we assume that the analysis
and requirements in all these studies are very different from system is secure and trustworthy. Privacy-preserving data-leak
ours, leading to different techniques developed and used. detection can be achieved by leveraging special protocols
iLeak is a system for preventing inadvertent information and computation steps [49]. It is another functionality of a
leaks on a personal computer [40]. It takes advantages of the detection system, and the discussion is not within the scope
keyword searching utility present in many modern operating of this paper.
systems. iLeak monitors the file access activities of processes
and searches for system call inputs that involve sensitive data. A. Technical Challenges
Unlike our general data leak detection approach, iLeak is 1) High Detection Specificity: In our data-leak detection
designed to secure personal data on a single machine, and its model, high specificity refers to the ability to distinguish true
SHU et al.: FAST DETECTION OF TRANSFORMED DATA LEAKS 531
leaks from coincidental matches. Coincidental matches are IV. C OMPARABLE S AMPLING
false positives, which may lead to false alarms. Existing set- In this section, we define the sampling requirement needed
based detection is orderless, where the order of matched shin- in data leak detection. Then we present our solution and its
gles (n-grams) is ignored. Orderless detection may generate analysis.
coincidental matches, and thus having a lower accuracy of
the detection. In comparison, our alignment-based method has
high specificity. For example, a detection system can use A. Definitions
3-grams to represent the sensitive data. One great challenge in aligning sampled sequences is that
the sensitive data segment can be exposed at an arbitrary
Sensitive data abcdefg
position in a network traffic stream or a file system. The
3-grams abc, bcd, cde, def, efg sampled sequence should be deterministic despite the starting
Then, consider the content streams 1 and 2 below. and ending points of the sequence to be sampled. More-
Stream 1 contains a true leak, whereas stream 2 does not. over, the leaked sensitive data could be inexact but similar
to the original string due to unpredictable transformations.
Content stream 1 ....abcdefg... We first define substring and subsequence relations in
Content stream 2 ....efg...cde...abc... Definition 1 and Definition 2. Then we define the capa-
bility of giving comparable results from similar strings in
However, set intersection between 3-grams of the sensitive Definition 3.
data and the 3-grams of content stream 2 results in a significant Definition 1 (Substring): A substring is a consecutive
number of matching 3-grams (efg, cde, and abc), even segment of the original string.
though they are out of order compared to the sensitive data If x is a substring of y, one can find a prefix string
pattern. This problem is eliminated in alignment, i.e., the (denoted by y p ) and a suffix string (denoted by ys ) of y,
content stream 2 receives a low sensitivity score when aligned so that y equals to the concatenation of y p , x, and ys .
against the sensitive data. y p and ys could be empty.
2) Pervasive and Localized Modification: Sensitive data Definition 2 (Subsequence): Subsequence is a generaliza-
could be modified before it is leaked out. The modification tion of substring that a subsequence does not require its items
can occur throughout a sequence (pervasive modifica- to be consecutive in the original string.
tion). The modification can also only affect a local One can generate a subsequence of a string by removing
region (local modification). We describe some modification items from the original string and keeping the order of the
examples: remaining items. The removed items can be denoted as gaps
• Character replacement, e.g., WordPress replaces every
in the subsequence, e.g., lo-e is a subsequence of flower
space character with a + in HTTP POST requests.
(- indicates a gap).
• String insertion, e.g., HTML tags inserted throughout a
Definition 3 (Comparable Sampling): Given a string x and
document for formatting or embedding objects.
another string y that x is similar to a substring of y accord-
• Data truncation or partial data leak, e.g., one page of a
ing to a similarity measure M, a comparable sampling on
two-page sensitive document is transmitted.
x and y yields two subsequences x (the sample of x) and y
(the sample of y), so that x is similar to a substring
B. Overview of Our Approach of y according to M.
Our work presents an efficient sequence comparison If we restrict the similarity measure M in Definition 3 to
technique needed for analyzing a large amount of content identical relation, we get a specific instance of comparable
for sensitive data exposure. Our detection approach consists sampling in Definition 4.
of a comparable sampling algorithm and a sampling obliv- Definition 4 (Subsequence-Preserving Sampling): Given
ious alignment algorithm. The pair of algorithms computes x as a substring of y, a subsequence-preserving sampling on
a quantitative similarity score between the sensitive data x and y yields two subsequences x (the sample of x) and y
and the content. Local alignment – as opposed to global (the sample of y), so that x is a substring of y .
alignment [50] – is used to identify similar sequence segments. Because a subsequence-preserving sampling procedure is a
The design enables the detection of partial data leaks. restricted comparable sampling, so the subsequence-preserving
Our detection runs on continuous sequences of n bytes sampling is deterministic, i.e., the same input always yields the
(n-grams). n-grams are obtained from the content and sensitive same output. The vice versa may not be true.
data, respectively. Local alignment is performed between the In Example 1 with two sequences of integers, we illustrate
two (sampled) sequences to compute their similarity. The the differences between a comparable sampling algorithm and
purpose of our comparable sampling operation is to enhance a random sampling method, where a biased coin flipping at
the analysis throughput. We discuss the tradeoff between each position decides whether to sample or not. The input is
security and performance related to sampling in our evaluation a pair of two similar sequences. There is one modification
sections. Finally, we report the content that bears higher-than- (9 to 8), two deletions (7) and (3), and suffix padding (1, 4)
threshold similarity with respect to sensitive patterns. Given in the second sequence. Local patterns are preserved in a
a threshold T , content with a greater-than-T sensitivity is comparable sampling method, whereas the random sampling
reported as a leak. does not. The local patterns can then be digested by our
532 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 11, NO. 3, MARCH 2016
sampling-oblivious alignment algorithm to infer the similarity Algorithm 1 A Subsequence-Preserving Sampling Algorithm
between the two original input sequences. Input: an array S of items, a size |w| for a sliding window w, a
Example 1: Comparable sampling. selection function f (w, N) that selects N smallest items from a
window w, i.e., f = min(w, N)
Inputs: Output: a sampled array T
1: initialize T as an empty array of size |S|
1 1 9 4 5 7 3 5 9 7 6 6 3 3 7 1 6 2: w ← read(S, |w|)
1 1 9 4 5 7 3 5 8 6 6 3 7 1 6 1 4 3: let w.head and w.tail be indices in S corresponding to the
higher-indexed end and lower-indexed end of w, respectively
Comparable sampling may give: 4: collection m c ← min(w, N)
1 1 - 4 - - 3 5 - - - - 3 3 - 1 - 5: while w is within the boundary of S do
6: m p ← mc
1 1 - 4 - - 3 - - 6 - 3 - 1 - 1 4 7: move w toward high index by 1
8: m c ← min(w, N)
Random sampling may give: 9: if m c = m p then
1 - - 4 - - 3 5 - 7 - 6 - - 7 1 - 10: item en ← collectionDiff(m c , m p )
- 1 9 - 5 - - 5 - 6 - 3 7 - 6 1 - 11: item eo ← collectionDiff(m p , m c )
12: if en < eo then
13: write value en to T at w.head’s position
B. Our Sampling Algorithm 14: else
15: write value eo to T at w.tail’s position
We present our comparable sampling algorithm. The advan- 16: end if
tage of our algorithm is its context-aware selection, i.e., the 17: end if
selection decision of an item depends on how it compares 18: end while
with its surrounding items according to a selection function.
As a result, the sampling algorithm is deterministic and collection B. The operation is similar to the set difference,
subsequence-preserving. except that it works on collections and does not eliminate
Our comparable sampling algorithm takes in S, an input list duplicates.
of items (preprocessed n-grams of sensitive data or content),1 T output by Algorithm 1 takes the same space as S does.
and outputs T , a sampled list of the same length; the sampled Null items can be combined, and T is turned into a
list contains null values, which correspond to items that are compact representation L, which is consumed by our
not selected. The null regions in T can be aggregated, and sampling-oblivious alignment algorithm in the next phase.
T can be turned into a compact representation L. Each item We show how our sampling algorithm works in Table I.
in L contains the value of the sampled item and the length We set our sampling procedure with a sliding window of
of the null region between the current sampled item and the size 6 (i.e., |w| = 6) and N = 3. The input sequence
preceding one. is 1,5,1,9,8,5,3,2,4,8. The initial sliding window
T is initialized as an empty list, i.e., a list of null items. w = [1,5,1,9,8,5] and collection m c = {1,1,5}.
The algorithm runs a small sliding window w on S. w is Sampling Algorithm Analysis: Our sampling algorithm is
initialized with the first |w| items in S (line 2 in Algorithm 1). deterministic, i.e., given a fixed selection function f : same
The algorithm then utilizes a selection function to decide inputs yield the same sampled string. However, deterministic
what items in w should be selected for T . The selection sampling (e.g., [17]) does not necessarily imply subsequence
decision is made based on not only the value of that item, preserving. One can prove using a counterexample. Consider
but also the values of its neighboring items in w. Therefore, a sampling method that selects the first of every 10 items
unlike a random sampling method where a selection decision from a sequence, e.g., 1-st, 11-th, 21-st, … It is deterministic,
is stochastic, our method satisfies the subsequence-preserving but it does not satisfy the subsequence-preserving requirement.
and comparable sampling requirements. Some sampling methods such as coresets [52], [53] do not
In Algorithm 1, without loss of generality, we describe imply determinism.
our sampling method with a specific selection function Our sampling algorithm is not only deterministic, but also
f = min(w, N). f takes in an array w and returns the subsequence-preserving as presented in Theorem 1.
N smallest items (integers) in w. f is deterministic, and Theorem 1: Algorithm 1 (denoted by ) is subsequence-
it unbiasedly selects items when items (n-grams) are pre- preserving. Given two strings x and y, where x is a substring
processed with the min-wise independent Rabin’s finger- of y, then (x) is a substring of (y).
print [51]. f can be replaced by other functions that are also Proof: Let L[m : n] denote the substring of L starting
min-wise independent. The selection results at each sliding from the m-th item and ending at the n-th item. Consider
window position determine what items are chosen for the sam- strings L 1 and L 2 and their sampled sequences S1 and S2 ,
pled list. The parameters N and |w| determine the sampling respectively. We prove that the theorem holds in four cases.
rate. collectionDiff(A,B) in lines 10 and 11 outputs Case 1: L 2 equals to L 1 . Because our comparable sampling
the collection of all items of collection A that are not in algorithm is deterministic, the same string yields the same
1 We preprocess n-grams with Rabin’s fingerprint to meet the min-wise sampled sequence. Thus, S2 is a substring of S1 .
independent requirement of selection function f described next. Each item in Case 2: L 2 is a prefix of L 1 . The sampling of L 1 can be
S is a fingerprint/hash value (integer) of an n-gram. split into two phases.
SHU et al.: FAST DETECTION OF TRANSFORMED DATA LEAKS 533
TABLE I
I LLUSTRATION O UR S AMPLING P ROCEDURE
Phase 1 The head of the sliding moves within and S2 [k + 1 : si ze(L 2 )] is a substring of S1 [m +
L 1 [si ze(wi n) : si ze(L 2 )], i.e., from the start of L 1 k + 1 : n]. Thus, S2 is a substring of S1 .
to the exact position in L 1 where L 2 ends. Since L 2 is In summary, Theorem 1 holds in every case.
a prefix of L 1 , and the window only moves within the Our algorithm is unbiased, meaning that it gives an equal
scope of the prefix, the sample of L 1 generated in this probability for every unit in the string to be selected.
subprocess is the same as S2 , the final sample of L 2 . To achieve bias-free property, we hash inputs using a min-
Phase 2 The head of the sliding window moves within wise independent function, namely Rabin’s fingerprint [54].
L 1 [si ze(L 2 ) + 1 : si ze(L 2 ) + si ze(wi n)]. The tail of It guarantees that the smallest N items come equally from
the sample window sweeps L 1 [si ze(L 2 )−si ze(wi n)+ any items in the original string. This hashing is performed in
1 : si ze(L 2 )] and yields zero or more sampled items P REPROCESSING operation in our prototypes.
on S1 [si ze(L 2 ) − si ze(wi n) + 1 : si ze(L 2 )]. The complexity of sampling using the min(w, N) selection
S1 [1 : si ze(L 2 ) − si ze(wi n)] is solely generated in function is O(n log |w|), or O(n) where n is the size of the
Phase 1. Thus, it is the same as S2 [1 : si ze(L 2 ) − input, |w| is the size of the window, The factor O(log |w|)
si ze(wi n)]. In Phase 2, we know that S1 [si ze(L 2 ) − comes from maintaining the smallest N items within the
si ze(wi n) + 1 : si ze(L 2 )] contains zero or more sample window w.
items than S2 [si ze(L 2 )−si ze(wi n)+1 : si ze(L 2 )]. Thus, Sampling rate α ∈ [ |w| N
, 1] approximates |w|
N
for random
S2 [si ze(L 2 ) − si ze(wi n) + 1 : si ze(L 2 )] is a substring inputs, where |w| is the size of the sliding window, and
of S1 [si ze(L 2 ) − si ze(wi n) + 1 : si ze(L 2 )]. Thus, S2 is N is the number of items selected within the sliding window.
a substring of S1 . For arbitrary inputs, the actual sampling rate depends on the
Case 3: L 2 is a suffix of L 1 . The proof is similar to Case 2. characteristics of the input space and the selection function
The sampling of L 1 can be split into two phases. used. The sampling rate in our evaluations on common Internet
N
Phase 1 The tail of the sliding window moves within traffic is around 1.2 |w| .
L 1 [si ze(L 1 ) − si ze(L 2 ) + 1 : si ze(L 1 ) − si ze(wi n)]. Sufficient number of items need to be sampled from
The generated sampled sequence is the same as S2 , sequences in order to warrant an accurate detection. Our
which is the final sample of L 2 . empirical result in Section VI-B shows that with 0.25 sampling
Phase 2 The tail of the sliding window moves rate our alignment method can detect as short as 32-byte-long
within L 1 [si ze(L 1 ) − si ze(L 2 ) − si ze(wi n) + 1 : sensitive data segments.
si ze(L 1 ) − si ze(L 2)]. The head of the window sweeps
L 1 [si ze(L 1 ) − si ze(L 2 ) + 1 : si ze(L 1 ) − si ze(L 2 ) + V. A LIGNMENT A LGORITHM
si ze(wi n)] and yields zero or more sampled items on
In this section, we describe the requirements for a sample-
L 1 [si ze(L 1 ) − si ze(L 2 ) + 1 : si ze(L 1 ) − si ze(L 2 ) +
based alignment algorithm and present our solution.
si ze(wi n)].
S1 [si ze(L 1 ) − si ze(L 2 ) + si ze(wi n) + 1 : si ze(L 1 ) −
si ze(L 2 )] is the same as S2 [si ze(L 1 ) − si ze(L 2 ) + A. Requirements and Overview
si ze(wi n) + 1 : si ze(L 1 ) − si ze(L 2 )]. In addition, We design a specialized alignment algorithm that runs on
S2 [si ze(L 1) − si ze(L 2 ) + 1 : si ze(L 1 ) − si ze(L 2 ) + compact sampled sequences La and Lb to infer the similarity
si ze(wi n)] is a substring of S1 [si ze(L 1 ) − si ze(L 2 ) + 1 : between the original sensitive data sequence S a and the
si ze(L 1 )−si ze(L 2 )+si ze(wi n)]. Thus, S2 is a substring original content sequence S b . It needs to satisfy the require-
of S1 . ment of sampling oblivion, i.e., the result of a sampling-
Case 4: All others. This case is when L 2 is a substring oblivious alignment on sampled sequences La and Lb
of L 1 , but not a prefix or suffix, i.e., L 2 [1 : si ze(L 2 )] = should be consistent with the alignment result on the
L 1 [m : n]. We align L 1 and L 2 and cut the two strings at original S a and S b .
a position where they are aligned. Denote the position in Conventional alignment may underestimate the similarity
L 2 by k. We obtain L 2 [1 : k] as a suffix of L 1 [m : m + k] between two substrings of the sampled lists, causing misalign-
and L 2 [k + 1 : si ze(L 2 )] as a prefix of L 1 [m + k + 1 : n]. ment. Regular local alignment without the sampling oblivion
Based on the proofs in Case 2 and Case 3, we con- property may give inaccurate alignment on sampled sequences
clude that S2 [1 : k] is a substring of S1 [m : m + k], as illustrated in Example 2.
534 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 11, NO. 3, MARCH 2016
Example 2: Sampling-oblivious alignment vs. regular local Algorithm 2 Recurrence Relation in Dynamic Programming
alignment Input: A weight function fw , visited cells in H matrix that are
Original lists: adjacent to H (i, j ): H (i − 1, j − 1), H (i, j − 1), and H (i − 1, j ),
5627983857432546397824366 and the i-th and j -th items Lai , Lbj in two sampled sequences La
5627983966432546395 and Lb , respectively.
Sampled sequences need to be aligned as: Output: H (i, j )
--2---3-5---2---3-7-2-3-- 1: h up .scor e ← f w (Lai , -, H (i − 1, j ))
Fig. 1. Illustration of the notation used in the weight function f w () for the match case (i. e., x.value = y.value) in the alignment view (a) and matrix
view (b). The milestone cell in (b) is for inference due to sampling.
with a gap, and iii) aligning the current element in Lb with a 1) (Gap) h up
gap. A cell candidate h is generated for each situation; its
score h.score is computed via the weight function f w f w (x, −, c) = c.scor e + m × p + g × q
(lines 1 to 3 in Algorithm 2). The other two fields, nullrow where
and nullcol , are updated for each candidate cell (lines 4 to 9).
This update may utilize the null region value stored in the p = min(c.n r + x.span + 1, c.n c )
span field of an item. All three cell candidates h up , h le f t , q = diff(c.nr + x.span + 1, c.n c )
and h dia are prepared. The cell candidate having the high-
est score is chosen as H (i, j ), and the score is stored 2) (Gap) h le f t
in H (i, j ).score. f w (−, y, c) = c.scor e + m × p + g × q
TABLE II
D ATASETS IN A CCURACY & S CALABILITY E XPERIMENTS
Fig. 4. The detection success rate of AlignDLD in partial data leaks under
various detection thresholds. Each content sequence contains a consecutive
portion of a 1KB sensitive text, ranging from 32 bytes to 1KB. AlignDLD
achieves 100% detection rates when the threshold is equal or smaller than 0.6.
TABLE IV
S AMPLING R ATES OF AlianDLD ON A.E NRON AND
B.S OURCE -C ODE D ATA S ETS |w| = 100
Fig. 5. Capability of differentiating real leak from coincidental matches of our AlignDLD approach with different sampling rates and Coll-Inter. (a) Enron.
(b) Source-code.
Fig. 6. Parallel realization of our alignment algorithm. Lai and Lbj are the Fig. 7. High scalability of parallel sampling and alignment algorithms.
current items to be aligned. All cells on the concurrent diagonal of (Lai , Lbj )
can be computed simultaneously. GPU platforms. In this section, we aim to answer the following
questions:
capability of our approach is generally stable with respect to 1) How well does our detection scale?
different sampling rates. We observe that sampling rates have (Sections VII-B and VII-C)
a noticeable but insignificant impact on the results. SNRdB 2) What is the speedup of sampling? (Section VII-D)
slightly increases when the sampling rate is small, e.g., 3%.
Our previous experiments in Section VI-B show that thresh- A. Parallel Detection Realization
olds ≥ 0.2 give a strong separation between true leaks We implement two parallel versions of our proto-
and coincidental matches. Thus, the evidence shows that our type on a hybrid CPU-GPU machine equipped with an
method achieves high recall with zero or low false positive Intel Core i5 2400 (Sandy-Bridge micro-architecture) and
rate. In comparison, the collection intersection method reports an NVIDIA Tesla C2050 GPU (Fermi architecture with
higher sensitivity scores for the content without any leak, e.g., 448 GPU cores):
62% for Enron emails. High sensitivity scores in coincidental 1) a multithreading AlignDLD program on CPU,6
matches lead to a high false positive rate for the collection 2) a parallel AlignDLD program on GPU.7
intersection method as illustrated in Figure 2. Smith-Waterman alignment was parallelized in OpenGL [58]
Summary: The experimental results provide strong and CUDA [59]. Our parallel alignment algorithms differ from
evidences supporting that our method is resilience against the existing ones, as we address several implementation issues
various types of modifications evaluated. Our alignment in parallel computing due to our complex weight function.
algorithm provides a high specificity (i.e., low number of In the multithreading CPU version, we parallelize both the
coincidental matches), compared to the collection intersection S AMPLING and A LIGNMENT procedures with the pthread
method. Our approach is capable of detecting leaks of various library. We parallelize the S AMPLING operation by loading
sizes, ranging from tens of bytes to megabytes. different strings onto different threads. Long streams are split
6 The multithreaded CPU version is written in C, compiled using
VII. PARALLELIZATION AND E VALUATION
gcc 4.4.5 with flag -O2.
In order to achieve high analysis throughput, we parallelize 7 The GPU version is written in CUDA compiled using CUDA 4.2 with
our algorithms on CPU as well as on general-purpose flag -O2 -arch sm 20 and NVIDIA driver v295.41.
540 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 11, NO. 3, MARCH 2016
Fig. 8. Speedup of multithreading alignment and GPU-accelerated alignment over the single thread version on different combinations of sensitive data and
traffic data. (a) Enron. (b) MiscNet.
[36] R. Hoyle, S. Patil, D. White, J. Dawson, P. Whalen, and A. Kapadia, [58] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-Wittig, “Bio-
“Attire: Conveying information exposure through avatar apparel,” in sequence database scanning on a GPU,” in Proc. 20th Int. Parallel
Proc. Conf. Comput. Supported Cooperat. Work Companion (CSCW), Distrib. Process. Symp., Apr. 2006, pp. 1–8.
2013, pp. 19–22. [59] S. A. Manavski and G. Valle, “CUDA compatible GPU cards as efficient
[37] A. Nadkarni and W. Enck, “Preventing accidental data disclosure in hardware accelerators for Smith–Waterman sequence alignment,” BMC
modern operating systems,” in Proc. 20th ACM Conf. Comput. Commun. Bioinformat., vol. 9, no. 2, p. 10, 2008.
Secur., 2013, pp. 1029–1042. [60] Y. Liu, D. L. Maskell, and B. Schmidt, “CUDASW++: Optimizing
[38] Z. Yang, M. Yang, Y. Zhang, G. Gu, P. Ning, and X. S. Wang, Smith–Waterman sequence database searches for CUDA-enabled graph-
“AppIntent: Analyzing sensitive data transmission in Android for privacy ics processing units,” BMC Res. Notes, vol. 2, no. 1, p. 73, May 2009.
leakage detection,” in Proc. 20th ACM Conf. Comput. Commun. Secur., [61] M. A. Jamshed et al., “Kargus: A highly-scalable software-based intru-
2013, pp. 1043–1054. sion detection system,” in Proc. ACM Conf. Comput. Commun. Secur.,
[39] J. Croft and M. Caesar, “Towards practical avoidance of information 2012, pp. 317–328.
leakage in enterprise networks,” in Proc. 6th USENIX Conf. Hot Topics [62] K. Lee, H. Lin, and W.-C. Feng, “Performance characterization of
Secur. (HotSec), 2011, p. 7. data-intensive kernels on AMD fusion architectures,” Comput. Sci.-Res.
[40] V. P. Kemerlis, V. Pappas, G. Portokalidis, and A. D. Keromytis, Develop., vol. 28, no. 2, pp. 175–184, May 2013.
“iLeak: A lightweight system for detecting inadvertent information
leaks,” in Proc. 6th Eur. Conf. Comput. Netw. Defense, Oct. 2010, Xiaokui Shu received the bachelor’s degree in infor-
pp. 21–28. mation security from the University of Science and
[41] E. Bertino and G. Ghinita, “Towards mechanisms for detection and Technology of China (USTC), in 2010. He is cur-
prevention of data exfiltration by insiders: Keynote talk paper,” in rently pursuing the Ph.D. degree in computer science
Proc. 6th ACM Symp. Inf., Comput. Commun. Secur. (ASIACCS), 2011, with Virginia Tech. His research focuses on system
pp. 10–19. and network security. Being the top student in the
[42] P. Papadimitriou and H. Garcia-Molina, “Data leakage detection,” IEEE class, he graduated with the Guo Moruo Award
Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 51–63, Jan. 2011. and was awarded the SIMIT Chinese Academy of
[43] D. Lin and A. Squicciarini, “Data protection models for service provi- Sciences Scholarship in 2008. He succeeded at his
sioning in the cloud,” in Proc. 15th ACM Symp. Access Control Models first penetration test at USTC and won the first prize
Technol., 2010, pp. 183–192. in Virginia Tech Inaugural Cyber Security Summit
[44] A. Squicciarini, S. Sundareswaran, and D. Lin, “Preventing information Competition in 2011.
leakage from indexing in the cloud,” in Proc. 3rd IEEE Int. Conf. Cloud
Comput., Jul. 2010, pp. 188–195. Jing Zhang received the B.S. degree from the
[45] X. Shu, D. Yao, and E. Bertino, “Privacy-preserving detection of College of Computer, National University of
sensitive data exposure,” IEEE Trans. Inf. Forensics Security, vol. 10, Defense Technology (NUDT), China. He is currently
no. 5, pp. 1092–1103, May 2015. pursuing the Ph.D. degree with the Department of
[46] F. Liu, X. Shu, D. Yao, and A. R. Butt, “Privacy-preserving scanning Computer Science, Virginia Tech. His research inter-
of big content for sensitive data exposure with MapReduce,” in ests are high performance computing, bioinformat-
Proc. 5th ACM Conf. Data Appl. Secur. Privacy (CODASPY), 2015, ics, and cloud computing. His major research topics
pp. 195–206. focus on the optimization of irregular applications
[47] Y. Jang, S. P. Chung, B. D. Payne, and W. Lee, “Gyrus: A framework for on heterogeneous platforms.
user-intent monitoring of text-based networked applications,” in Proc.
23rd USENIX Secur. Symp., 2014, pp. 79–93.
[48] K. Borders and A. Prakash, “Quantifying information leaks in outbound Danfeng (Daphne) Yao received the Ph.D. degree
Web traffic,” in Proc. 30th IEEE Symp. Secur. Privacy (SP), May 2009, in computer science from Brown University, in
pp. 129–140. 2007. She is an Associate Professor and an
[49] S. Jha, L. Kruger, and V. Shmatikov, “Towards practical privacy for L-3 Faculty Fellow with the Department of Com-
genomic computation,” in Proc. IEEE Symp. Secur. Privacy, May 2008, puter Science, Virginia Tech, Blacksburg. She
pp. 216–230. received the NSF CAREER Award in 2010 for her
[50] V. Polyanovsky, M. A. Roytberg, and V. G. Tumanyan, “Comparative work on human-behavior driven malware detection,
analysis of the quality of a global algorithm and a local algorithm for and most recently ARO Young Investigator Award
alignment of two sequences,” Algorithms Molecular Biol., vol. 6, no. 1, for her semantic reasoning for mission-oriented
p. 25, 2011. security work in 2014. She received the Outstanding
[51] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min- New Assistant Professor Award from the Virginia
wise independent permutations,” J. Comput. Syst. Sci., vol. 60, no. 3, Tech College of Engineering in 2012. She has received several best paper
pp. 630–659, Jun. 2000. awards (ICICS ’06, CollaborateCom’09, and ICNP’12). She was given the
[52] P. K. Agarwal and R. Sharathkumar, “Streaming algorithms for extent Award for Technological Innovation from Brown University in 2006. She
problems in high dimensions,” in Proc. 21st Annu. ACM-SIAM Symp. held a U.S. patent for her anomaly detection technologies. She is an Asso-
Discrete Algorithms (SODA), 2010, pp. 1481–1489. ciate Editor of the IEEE T RANSACTIONS ON D EPENDABLE AND S ECURE
[53] D. Feldman, M. Monemizadeh, C. Sohler, and D. P. Woodruff, “Coresets C OMPUTING. She serves as a PC Member in numerous computer security
and sketches for high dimensional subspace approximation problems,” in conferences, including ACM CCS.
Proc. 21st Annu. ACM-SIAM Symp. Discrete Algorithms (SODA), 2010,
pp. 630–649.
[54] M. O. Rabin, “Fingerprinting by random polynomials,” Center Wu-Chun Feng received B.S. (Hons.) degrees
Res. Comput. Technol., Harvard Univ., Cambridge, MA, USA, in computer engineering and music and the
Tech. Rep. 15-81, 1981. M.S. degree in computer engineering from Penn
[55] T. F. Smith and M. S. Waterman, “Identification of common molec- State University, in 1988 and 1990, respectively,
ular subsequences,” J. Molecular Biol., vol. 147, no. 1, pp. 195–197, and the Ph.D. degree in computer science from
Mar. 1981. the University of Illinois at Urbana–Champaign, in
[56] C. Kalyan and K. Chandrasekaran, “Information leak detection in 1996. He is a Professor with the Department of
financial e-mails using mail pattern analysis under partial information,” Computer Science, Virginia Tech (VT). He leads
in Proc. 7th WSEAS Int. Conf. Appl. Informat. Commun. (AIC), vol. 7. the Synergy Lab and serves as a Site Codirector
2007, pp. 104–109. of the NSF Center for High-Performance Reconfig-
[57] C. Wüest and E. Florio, “Firefox and malware: When browsers attack,” urable Computing with VT. His research interests
Symantec Corp., Mountain View, CA, USA, White Paper, Oct. 2009. encompass a broad range of topics in efficient parallel computing, including
[Online]. Available: https://www.symantec.com/content/en/us/enterprise/ high-performance computing and networking, energy-efficient (or green)
media/security_response/whitepapers/firefox_and_malware.pdf, accessed supercomputing, accelerator-based computing, cloud computing, grid
Feb. 2015. computing, bioinformatics, and computer science pedagogy for K-12.