Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking

Feifei Li‡   Jimeng Sun§   Spiros Papadimitriou†   George A. Mihaila†   Ioana Stanoi†
‡ Boston University, § Carnegie Mellon University, † IBM T.J. Watson Research Center
lifeifei@cs.bu.edu, jimeng@cs.cmu.edu, {spapadim, mihaila, irs}@us.ibm.com

Abstract

We address the problem of preserving privacy in streams, which has received surprisingly limited attention. For static data, a well-studied and widely used approach is based on random perturbation of the data values. However, streams pose additional challenges. First, analysis of the data has to be performed incrementally, using limited processing time and buffer space, making batch approaches unsuitable. Second, the characteristics of streams evolve over time. Consequently, approaches based on global analysis of the data are not adequate. We show that it is possible to efficiently and effectively track the correlation and autocorrelation structure of multivariate streams and leverage it to add noise which maximally preserves privacy, in the sense that it is very hard to remove. Our techniques achieve much better results than previous static, global approaches, while requiring limited processing time and memory. We provide both a mathematical analysis and experimental evaluation on real data to validate the correctness, efficiency, and effectiveness of our algorithms.

1. Introduction

Recently, there has been an increasing concern regarding privacy breaches, especially those involving sensitive personal data of individuals [14]. As a result, restrictions and regulations in publishing sensitive personal data have been tightened [38]; these address data owned by government organizations [29] as well as corporations [7]. It is therefore not surprising that the data management community has become increasingly focused on ways to guarantee the privacy of sensitive data.

Meanwhile, unprecedented massive data from various sources are providing us with great opportunities for data mining and information integration. Unfortunately, the privacy requirement and data mining applications pose exactly opposite expectations from data publishing [14, 36]. The utility of the published data w.r.t. the mining application decreases with increasing levels of privacy guarantees [23]. Previous work has noticed this important trade-off between privacy and utility, and various techniques have been proposed to achieve a desired balance between the two [3, 22, 26, 10, 12, 27, 35, 15].

Prior related work [3, 2, 22, 19] consists of additive random perturbation for the offline, conventional relational data model, where the noise is distributed along the principal components of the original data in order to achieve maximum privacy, given a fixed utility. We show that these offline algorithms are no longer optimal when applied to numerical, non-stationary (or, time-evolving) data streams. The dynamic correlations and autocorrelations, if not carefully considered, may allow for the reconstruction of the original streams.

Guaranteeing data privacy is especially challenging in the case of stream data [6, 28], mainly for two reasons: 1) Performance requirements: the continuous arrival of new tuples prohibits storage of the entire stream for analysis, rendering current offline algorithms inapplicable. 2) Time evolution: data streams are usually evolving, and correlations and autocorrelations [33, 24] can change over time. These characteristics make most offline algorithms for static data inappropriate, as we show later.

To the best of our knowledge, privacy preservation on streams has not yet been addressed in the literature, despite the wide use of data streams in a large range of sensitive applications such as finance, retail, defense, and health care. For example, consider two financial firms that would like to collaboratively monitor clusters over their streaming real-time transactions [39], even though neither is willing to publish its original data streams. The best resolution is to find ways to guarantee both the utility and privacy of the data in an online fashion. The scheme should be general and not restricted to any specific mining operation.

In this paper we fill the gap in the area of data stream privacy by proposing efficient online streaming algorithms that guarantee the privacy of single or multiple non-stationary data streams. This work focuses on numerical data streams, such as environmental sensor data, performance measurements, or stock trading prices. Our goal is to insert random perturbation that "mirrors" the streams' statistical properties, in an online fashion. A number of important mining operations can still be performed by controlling the perturbation magnitude; however, the original data streams cannot be reconstructed with high confidence.
To the best of our knowledge, our work is the first to provide the basic building blocks sufficient for a general solution for privacy of numerical streams. More specifically, we focus on the fundamental cases of correlation across multiple streams and of autocorrelation within one stream.

Our contributions are: 1) we define the notions of utility and privacy for perturbed data streams; 2) we explore the effect of evolving correlations and autocorrelations in data streams, and their implications for designing additive random perturbation techniques; 3) we design efficient online algorithms under the additive random perturbation framework which maximally preserve the privacy of data streams given a fixed utility while, additionally, better preserving the statistical properties of the data; and 4) we provide both theoretical arguments and experimental evaluation to validate our ideas.

The rest of the paper is organized as follows: Section 2 introduces definitions and the problem formulation, and Section 3 discusses related work. Section 4 studies privacy preservation for multiple streams through correlation tracking. Section 5 further exploits the autocorrelation property to preserve privacy. Finally, the experimental evaluation on real data streams is presented in Section 6.

2. Preliminaries

  Symbol       Description
  v            a vector (lowercase bold)
  v(i)         the i-th element of vector v
  X            a matrix (uppercase bold)
  X^T          the transpose of X
  X^i or X_j   the i-th row or j-th column of X
  X^i_j        the entry (i, j) of X
  T            the number of timestamps up to now
  N            the number of streams
  A            original stream collection in R^{T×N}
  A*           the perturbed stream collection
  Ã            the reconstructed stream collection
  A_n          the n-th stream
  A_t          the values from all streams at time t
  E            the random noise in R^{T×N}
  D(A, A*)     the discrepancy between original and perturbed streams

  Table 1. Description of notation.

2.1 Data Stream Model

Our model assumes that the input consists of multiple continuous streams. Without loss of generality, we may assume that each tuple consists of a single attribute. Furthermore, we assume that all streams are resampled to a common rate, which lies between the arrival rates of the fastest and the slowest stream. The common sampling rate can be chosen based on arrival rate, data characteristics and available processing capacity; details are beyond the scope of this paper. Subsequently, any standard resampling technique [30, 18] can be applied, such as, for example, linear interpolation (for upsampling) or decimated moving average (aka. tumbling average, for downsampling). We will thus assume a time granularity such that, during each time interval, there is exactly one recorded incoming value from each stream.

Therefore, for the purposes of our analysis and without loss of generality, the input consists of N data streams, denoted A_1, ..., A_N. For any i-th data stream A_i, its value at time t is A_it. The stream collection is written as A = [A_i for 1 ≤ i ≤ N]. Formally, the stream collection A can be considered as a T × N matrix, where N is the number of streams and T is the current timestamp, which grows indefinitely. The values from all streams at time t are A_t ∈ R^N, i.e., the t-th row of A. Table 1 summarizes our notation.

2.2 Discrepancy, Utility and Privacy

To ensure privacy of streaming data, the values of incoming tuples are modified by adding noise. We denote the random noise as E ∈ R^{T×N}, where each entry E_it is the noise added to the i-th stream at time t. Therefore, the perturbed streams are A* = A + E. Without loss of generality, we assume the noise has zero mean.

Discrepancy: To facilitate the discussion of utility and privacy, we define the discrepancy D between two versions of the data, A and B, as the normalized squared Frobenius norm (the squared Frobenius norm is defined as ||A||²_F := Σ_{i,j} (A^i_j)²):

  D(A, B) := (1/T) ||A − B||²_F,  where A, B ∈ R^{T×N}.
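For concreteness, the discrepancy is straightforward to compute. The following minimal Python/NumPy sketch is our illustration only (the paper's prototype is in Matlab); the function name is an assumption of ours:

    import numpy as np

    def discrepancy(A, B):
        # D(A, B) = (1/T) * ||A - B||_F^2 for T x N stream collections
        T = A.shape[0]
        return np.linalg.norm(A - B, 'fro') ** 2 / T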
Utility: Considering the perturbed versus the original data, the larger the amplitude of the perturbation (i.e., the variance of the added noise), the larger the distortion of the original values. However, as the distortion increases, the usefulness of the data decreases: a larger distortion hides the original values better, but it also hides more information about their relationships. The discrepancy D(A, A*) between original and perturbed data measures precisely the squared distortion. We naturally define the utility to be the inverse of this discrepancy. However, throughout the paper we typically use discrepancy, since the two are essentially interchangeable.

Privacy: Distorting the original values is only part of the story. We also have to make sure that this distortion cannot be filtered out. Thus, to measure the privacy, we have to consider the power of an adversary in reconstructing the original data. Specifically, suppose that Ã are the reconstructed data streams obtained by the adversary, in a way that will be formalized shortly. Then the privacy is the discrepancy between the original and the reconstructed streams, i.e., D(A, Ã).
2.3 Problem Formulation

We formulate two problems: data reconstruction and data perturbation. From his side, the adversary wants to recover the original streams from the perturbed data.

Problem 1 (Reconstruction). Given the perturbed streams A*, how can the adversary compute reconstructed streams Ã such that D(A, Ã) is minimized?

In this paper, we focus on linear reconstruction methods, which have been used by many existing works [22, 19, 26, 10]. Intuitively, the adversary can only use linear transformations on the perturbed data, such as projections and rotations, in the reconstruction step.

Definition 1 (Linear reconstruction). Given the perturbed streams A*, the linear reconstruction is Ã = A*R, such that D(A, Ã) is minimized.

If both the perturbed streams A* and the original streams A are available, the solution Ã can easily be identified using linear regression. However, A is not available. Therefore, in order to estimate Ã, some additional constraints or assumptions must be imposed to make the problem solvable. A widely adopted assumption [21] is that the data lie in a static low-dimensional subspace (i.e., global correlation exists). This is reasonable, since if no correlations are present, then i.i.d. perturbations are already sufficient to effectively hide the data. However, real data typically exhibit such correlations. In this paper, as we will formally show later, we rely on the dynamic (rather than static) correlations among streams, as well as on dynamic autocorrelations.
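To make the adversary's side concrete, here is a hedged Python/NumPy sketch of the static flavor of such a linear reconstruction, in the spirit of the spectral filtering / PCA attacks of [22, 19]: estimate the top-k principal directions of the perturbed data and project onto them. The function name, centering choice and parameter k are our assumptions, not part of the paper:

    import numpy as np

    def pca_reconstruct(A_star, k):
        # Linear reconstruction A~ = A* R with R = U U^T, where U holds the
        # top-k principal directions estimated from the perturbed data.
        mean = A_star.mean(axis=0)
        X = A_star - mean                     # center the perturbed streams
        C = X.T @ X / X.shape[0]              # sample covariance (N x N)
        eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
        U = eigvecs[:, -k:]                   # top-k principal directions
        return mean + X @ U @ U.T             # project and shift back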
From their side, data owners want to prevent the reconstruction from happening.

Problem 2 (Perturbation). Given the original streams A and the desired discrepancy threshold σ², how can we obtain the perturbed streams A* such that 1) D(A, A*) = σ² and 2) for any linear reconstruction Ã, D(A, Ã) ≥ σ²?

Perturbation has exactly the opposite goal from reconstruction. However, the correlation and autocorrelation properties of the streams are the keys to the solution of both problems, as shown later.

3. Related Work

Privacy preserving data mining was first proposed in [3] and [2]. This work paved the road for an expanding field, and various privacy preservation techniques have been proposed since. These methods apply to the traditional relational data model, and can be classified as data perturbation [3, 2, 26, 10, 15, 4], k-anonymity [23, 35, 1, 27, 38] and secure multiparty computation [25, 37]. Our work focuses on privacy preservation in the context of the randomized data perturbation approach, and we will focus on discussing related work in this area.

Data perturbation can be further classified into two groups: retention replacement perturbation [4, 15] and data value perturbation [3, 26, 10]. For each element in a column j, retention replacement perturbation retains this element with probability p_j and, with probability 1 − p_j, replaces it with an item generated from the replacing p.d.f. on this column. This approach works for categorical data as well, and it has been applied to privacy preserving association mining [15]. Our work focuses on numerical data value perturbation. Initial solutions in this category, [3, 2], proposed adding random i.i.d. noise to the original data and showed that, with knowledge of the noise distribution, the distribution of the original data can be estimated from the perturbed data, and aggregate values are preserved. In [22, 19] the authors pointed out that adding random i.i.d. noise is not optimal for privacy preservation. They showed how to reconstruct the original data (individual data values) using Spectral Filtering (SF) or the equivalent PCA method. The main conclusion is that random noise should be distributed along the principal components of the original data, so that linear reconstruction methods cannot separate the noise from the original data. Motivated by this observation and in a similar spirit, [10] proposed the random rotation technique for privacy preserving classification and [26] proposed data perturbation based on random projection. The work of [14] discussed a method to quantify the privacy breach for privacy preserving algorithms, namely α-β analysis or γ-amplification. The basic idea is that, on the perturbed data, the adversaries' knowledge, measured by their confidence about a given property of the original data, should not increase by more than a certain amount. The work in [5] considered the problem of setting the perturbation parameters while maintaining γ-amplification.

All these techniques have been developed for the traditional relational data model. There is no prior work on privacy preservation for data streams, except the work on private search over data streams [31, 8]. However, the goal there is to protect the privacy of the query over the data stream, not of the data stream itself. Finally, our data perturbation techniques rely on PCA for data streams w.r.t. both correlations and autocorrelations. Streaming PCA and eigenspace tracking of correlations (but not autocorrelations) among multiple data streams has been studied in [32, 17].

4. Privacy with Dynamic Correlations

We first give the insight behind our data perturbation and reconstruction methods, then present methods for correlation-based noise perturbation and reconstruction.

Insight and intuition: Let us first illustrate how the perturbation and reconstruction work in detail (see figure 1(a) for the visualization).
[Figure 1. Impact of Correlation on Perturbing the Data: (a) noise decomposition, showing how the distances between A_t, A*_t and Ã_t split into removed noise, remaining noise, projection error and privacy, relative to the principal direction; (b) i.i.d. random noise; (c) correlated noise. Panels (b) and (c) plot stream A2 against stream A1 with the principal direction w1 (blue: original data, green: perturbed data).]
For the perturbation process, the stream measurements A_t at time t, represented as an N-dimensional point, are mapped to the perturbed measurements A*_t with discrepancy σ². For any reconstruction effort, the goal is to transform the perturbed measurements A*_t into Ã_t so that, hopefully, D(A_t, Ã_t) is small.

A principled way of reconstruction is to project the data onto the principal component subspace [19] such that most noise is removed, while the original data are maximally preserved, i.e., not much additional error is included. The idea is illustrated in figure 1(a). When A*_t is projected onto the principal direction, the projection is exactly the reconstruction Ã_t. Note that the distance between A*_t and Ã_t consists of two parts: 1) removed noise, i.e., the perturbation that is removed by the reconstruction, and 2) projection error, i.e., the new error introduced by the reconstruction. Finally, the distance between A_t and Ã_t, i.e., the privacy, comes from two sides: 1) remaining noise, i.e., the perturbation noise that has not been removed, and 2) projection error.

When the noise is added exactly along the principal direction, the removed noise becomes zero. However, additional projection error is included. In this case, the perturbation is robust towards this reconstruction attempt, in the sense that D(A, Ã) = D(A, A*). In general, a good practice is to add correlated noise following the trends present in the streams. Consider the example shown in Figure 1, where blue points represent the original data and green points represent the perturbed data with the same amount of noise. Figures 1(b) and 1(c) show i.i.d. noise and correlated noise on the same data, respectively. Clearly, the correlated noise has been successfully "hidden" in the original data and, therefore, is hard to remove.

Data streams often present strong correlations, and these correlations change dynamically [39, 33, 32]. Consider the examples in figure 2, where the principal components are changing over time. In such cases, online PCA is necessary to better characterize the evolving, local trends. Global, offline PCA will fail to identify these important properties, as we will show later in the experiments. Next, we will show how to dynamically insert noise using online correlation tracking [32].

[Figure 2. Dynamic correlations in data streams: two examples in which the principal directions (w1, w2, w3) change over time; in example two, A1 = sin(t) + noise and A2 = cos(t).]
Algorithm 1: SCAN
  Input : Original tuple A_t, utility threshold σ²;
          old subspace U ∈ R^{N×k}, Λ ∈ R^{k×k}
  Output: Perturbed tuple A*_t; new subspace U, Λ
  1  Update eigenvectors U and eigenvalues Λ based on A_t
  2  Initialize δ, η to 0_k
     // add noise in the top-k principal component subspace
  3  for 1 ≤ i ≤ k do
  4      δ(i) = σ² × Λ(i)/||Λ||
  5      η(i) = Gaussian noise with variance δ(i)
     // rotate back to the original space
  6  E_t = η × U^T and A*_t = A_t + E_t

Streaming Correlated Additive Noise (SCAN): SCAN does two things whenever new tuples arrive from the N input streams: 1) it updates the estimate of the local principal components, and 2) it distributes noise along the principal components in proportion to their eigenvalues.

An important property of the SCAN algorithm is that when the noise is rotated back to the data space (line 6), its variance will be equal to the specified discrepancy threshold σ². Intuitively, SCAN tracks the covariance matrix and adds noise with essentially the same covariance as the data streams; proof details are omitted for space.
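To ground the pseudocode, the following Python/NumPy sketch implements one SCAN step under simplifying assumptions of ours: the eigenspace is re-estimated from an exponentially weighted (uncentered) covariance instead of the incremental tracking of [32], and the forgetting factor beta and all names are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)

    def scan_step(a_t, C, sigma2, k, beta=0.96):
        # Track the (uncentered) covariance, then add noise along the top-k
        # principal directions with eigenvalue-proportional variances.
        # C should be initialized, e.g., to a small multiple of the identity.
        C = beta * C + np.outer(a_t, a_t)        # update covariance estimate
        eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
        U, lam = eigvecs[:, -k:], eigvals[-k:]   # top-k subspace and spectrum
        delta = sigma2 * lam / lam.sum()         # delta(i) = sigma^2 Λ(i)/||Λ||
        eta = rng.normal(0.0, np.sqrt(delta))    # subspace noise, variance delta
        return a_t + U @ eta, C                  # rotate back: E_t = eta U^T

Calling scan_step once per arriving tuple publishes the perturbed stream; in expectation the added noise has total variance sigma2 and the same top-k covariance structure as the data.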
Theorem 1. At any time instant T, the perturbed data streams A* from SCAN satisfy D(A, A*) = σ². Additionally, SCAN preserves the eigenvectors of the (uncentered) covariance matrix of A.

Therefore, the SCAN perturbation will not affect any mining algorithms that rely on the second moments of the data (i.e., linear correlations).
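Theorem 1 can also be sanity-checked numerically; the synthetic experiment below is our illustration only, not from the paper. Noise added along the top-k eigenvectors of the uncentered covariance, with eigenvalue-proportional variance, leaves the dominant eigenspace essentially unchanged:

    import numpy as np

    rng = np.random.default_rng(1)
    T, N, k, sigma2 = 5000, 6, 2, 0.1
    A = rng.normal(size=(T, k)) @ rng.normal(size=(k, N))  # correlated streams

    def top_subspace(X, k):
        w, V = np.linalg.eigh(X.T @ X / X.shape[0])        # uncentered covariance
        return w[-k:], V[:, -k:]

    lam, U = top_subspace(A, k)
    delta = sigma2 * lam / lam.sum()
    E = rng.normal(0.0, np.sqrt(delta), size=(T, k)) @ U.T  # SCAN-style noise
    _, U_star = top_subspace(A + E, k)
    # cosines of the principal angles between the subspaces: all close to 1
    print(np.linalg.svd(U.T @ U_star, compute_uv=False))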

Streaming Correlation Online Reconstruction (SCOR): The privacy achieved by SCAN is determined by the best linear reconstruction an adversary could perform on A* (see Section 2.2). For evolving data streams, as illustrated in figure 2, the best choice for the adversary is to utilize online estimation of the local principal components for reconstruction. The ability to estimate the local principal components of the original data streams depends on how the noise has been added. For SCAN, we know that the principal component directions are preserved, since the noise is added along their direction (Theorem 1). In general, we may assume the noise is small compared to the data; otherwise, the utility of the perturbed data is too low to be useful. Then, tracking the principal components of the perturbed streams A* can give a good estimate of the principal components of the original streams A. Formally, cov(A*) ≈ cov(A).

Algorithm 2: SCOR
  Input : Perturbed tuple A*_t;
          old subspace U ∈ R^{N×k}, Λ ∈ R^{k×k}
  Output: Reconstructed tuple Ã_t; new subspace U, Λ
  1  Update eigenvectors U and eigenvalues Λ based on A*_t
     // project onto the estimated online principal components
  2  Ã_t = A*_t × U × U^T

Intuitively, SCOR reconstruction removes all the noise orthogonal to the local principal components and inserts little additional projection error, since local PCA can usually track the data accurately. In other words, i.i.d. noise can usually be successfully removed, provided that the streams are correlated. However, the perturbation from SCAN cannot be removed at all, since the noise is distributed along the "instantaneous" correlation in the streams.

Theorem 2. The reconstruction error of SCOR on the perturbation from SCAN is ≈ σ².

Proof. Formally, given a linear reconstruction Ã = A*R, the privacy can be decomposed as

  D(A, Ã) = ||A − A*R||²_F
          = ||A − (A + E)R||²_F
          = ||A(I − R) + ER||²_F
          = ||A(I − UU^T) + E UU^T||²_F,

where the first term, A(I − UU^T), is the projection error, the second term, E UU^T, is the remaining noise, and R is a projection matrix, meaning that R = UU^T with U ∈ R^{N×k} orthonormal. Since the subspaces tracked by both SCOR and SCAN are the same, the remaining noise is σ², i.e., no noise is removed. Therefore, D(A, Ã) ≥ σ² by the triangle inequality.

Note that the projection error for SCOR is small, provided that the data are locally correlated. Therefore, the reconstruction error (i.e., privacy, as defined in Section 2.2) of SCOR is approximately σ², i.e., equal to the original discrepancy. Moreover, when σ² is small compared to the original data, other reconstruction methods will result in higher error, due to the large projection error.
5. Dynamic Autocorrelation

So far, we have presented methods based on correlation across many streams. Now we exploit another important property, autocorrelation within a single stream, and propose the corresponding perturbation and reconstruction methods.

Intuition: The noise added should mirror the dominant trends in the series. Consider the following simple examples. If the stream always has a constant value, the right way to hide this value is to add the same noise throughout time; any other noise can easily be filtered out by simple averaging. The situation is similar for a linear trend (this is also an example that cannot be captured by Fourier analysis). If the stream is a sine wave, the right way to hide it is by adding noise with the same frequency (but potentially a different phase); anything else can be filtered out. Our algorithm is the generalization, in a principled manner, of these notions.

For example, the green and blue curves in figure 3(b) are the autocorrelated noise and the original stream, respectively, where the noise follows the same trends as the stream, over time. In comparison, figure 3(a) shows i.i.d. noise, which can easily be filtered out. The goal is to find a principled way to automatically determine the "right" noise, which is "most similar" to the stream.

Connection to correlation: In the previous section, we showed how to track the local statistical properties of the N-dimensional sequence of vectors A_t, indexed over time t. More specifically, we track the principal subspace of this matrix, thereby focusing on the most dominant (in a least-squares sense) of these relationships. We subsequently add noise that "mirrors" those relationships, making it indistinguishable from the original data.

Next, we will show that the same principles used to capture relationships across many attributes can be used to capture relationships of one attribute across time. In fact, there is a natural way to move between the original time domain and a high-dimensional sequence space, which is formalized next. The t-th window of the time series stream a(t) is an h-dimensional point,

  W_t := [a(t), a(t + 1), . . . , a(t + h − 1)]^T ∈ R^h.

[Figure 3. Dynamic Autocorrelation: (a) i.i.d. noise and (b) autocorrelated noise on the same stream (blue: data, green: noise); (c) streaming autocorrelation additive noise, illustrated with a = [1 2 3 4 5 6]^T and h = 3: the window matrix W has rows [1 2 3], [2 3 4], [3 4 5], [4 5 6]; the perturbed windows are W* = W + E for a Hankel noise matrix E with noise sequence e = [.1 −.1 .2 .3 .1 −.2], yielding a* = [1.1 1.9 3.2 4.3 5.1 5.8]^T; the bottom right highlights the known noise block e_l and the unknown block e_r.]
The window matrix W has the windows W_t as rows; thus, W^i_j = a(i + j − 1) by construction. The space spanned by the sequence of windows W_t is known as the h-th order phase space of the series a(t) [16]. Subsequently, we can essentially apply the same technique as before, using W in place of A. All of the previous discussion and properties of our algorithms can be directly transferred to the autocorrelation case. An example is shown at the top of figure 3(c). However, there are some additional properties and issues that need to be resolved.

Hankel constraint: Notice that the window matrix W is a Hankel matrix, i.e., its anti-diagonals are constant: W^i_j = W^{i−1}_{j+1}. Under the assumption that the series is stationary, the autocovariance matrix W^T W is, in expectation, circulant, i.e., it is symmetric with constant diagonals. Additionally, if we perform a batch eigen-analysis on the global window matrix of a static series, the sample autocovariance matrix computed from the actual data (i.e., W^T W above) is also circulant. In this case, the eigenvectors of W^T W essentially provide the same information as the Fourier coefficients of the series a. In that sense, our approach includes traditional Fourier analysis. If these assumptions do not hold, the technique we employ is more robust and effective. A detailed discussion is beyond the scope of this paper; interested readers may consult [16, 34, 18] for more details.
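For intuition, the window matrix is easy to materialize in a few lines of Python (our illustration; the streaming algorithms below never store W explicitly). Using the example of figure 3(c):

    import numpy as np

    def window_matrix(a, h):
        # Hankel window matrix: row t is [a(t), ..., a(t+h-1)], i.e.,
        # W[i, j] = a[i + j] in 0-based indexing (W^i_j = a(i + j - 1) above).
        return np.array([a[t:t + h] for t in range(len(a) - h + 1)])

    W = window_matrix(np.array([1, 2, 3, 4, 5, 6]), h=3)
    # rows: [1 2 3], [2 3 4], [3 4 5], [4 5 6]; anti-diagonals are constant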
Constraint on autocorrelated noise: Next, we address the issues that arise from the fact that W is a Hankel matrix. Similarly, the noise matrix E has to be a Hankel matrix (see figure 3(c) for an example). Analogous to the correspondence between a and W, the noise matrix E has a corresponding noise sequence e, such that

  E_t ≡ [e(t), e(t + 1), . . . , e(t + h − 1)]^T ∈ R^h.

We will essentially use the same insight, that E_t has to lie in the subspace of U, but in a different way. Formally stated, the residual E_t − UU^T E_t must be zero, or

  (I − UU^T) E_t ≡ Q E_t = 0,    (1)

where P = UU^T is the projection operator onto the subspace of U and Q = I − P = I − UU^T is the projector onto the orthogonal complement.

Assume that we have chosen the noise values up to time t − k. Based on these and on the current estimate of U, we will determine the next k noise values (where k is the principal subspace dimension); the reason for determining them simultaneously will become clear soon. Let

  E_{t−h+1} ≡ [e(t−h+1), . . . , e(t−k) | e(t−k+1), . . . , e(t)]^T ≡ [e_l^T | e_r^T]^T,

where | denotes elementwise concatenation (for example, [1, 2 | 3, 4] denotes the two blocks [1 2] and [3 4]). The first block e_l ∈ R^{h−k} consists of the h − k known values, whereas the second block e_r ∈ R^k consists of the k unknown noise values we wish to determine. Similarly decomposing Q ≡ [Q_l | Q_r] into blocks Q_l ∈ R^{h×(h−k)} and Q_r ∈ R^{h×k}, we can rewrite equation 1 as

  Q_l e_l + Q_r e_r = 0,  or  Q_r e_r = −Q_l e_l.    (2)

This is a linear system in the k unknowns e_r. Since the principal subspace has dimension k by construction, the system is full-rank and can always be solved. The bottom right of figure 3(c) highlights the known e_l and unknown e_r (with one principal component, k = 1).

The above equation cannot be applied for the initial values of the noise; we will use i.i.d. noise for those. Initially, we do not know anything about the patterns present in the signal, therefore i.i.d. noise is the best choice, since there are no correlations yet. However, the adversary has also not observed any correlations that can be leveraged to remove that noise. The important point is that, as soon as correlations become present, our method will learn them and use them to intelligently add the noise, before the adversary can exploit this information.
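The heart of the noise generation is thus a small linear solve per block of k values. A hedged Python sketch (names are ours; we use least squares, which also covers the full-rank case):

    import numpy as np

    def next_noise_block(U, e_l):
        # Given the tracked basis U (h x k) and the h-k most recent noise
        # values e_l, solve Q_r e_r = -Q_l e_l for the next k noise values,
        # where Q = I - U U^T is split column-wise into [Q_l | Q_r].
        h, k = U.shape
        Q = np.eye(h) - U @ U.T
        Q_l, Q_r = Q[:, :h - k], Q[:, h - k:]
        e_r, *_ = np.linalg.lstsq(Q_r, -Q_l @ e_l, rcond=None)
        return e_r   # rescale afterwards to meet the discrepancy target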
Algorithm 3: SACAN
  Input : Original value a(t), utility threshold σ²;
          old subspace U ∈ R^{h×k}, Λ ∈ R^{k×k}
  Output: Perturbed value a*(t); new subspace U, Λ
  1  Construct window W_{t−h+1} = [a(t−h+1), . . . , a(t)]^T
  2  Update U, Λ using W_{t−h+1}
  3  every k arriving values do
  4      Let [w_l^T | w_r^T]^T ≡ W_{t−h+1}
  5      Solve equation 2 to obtain e_r
  6      Rescale e_r based on σ²
  7      Perturbed values w*_r = w_r + e_r
  8      Publish values a*(t−k+i−1) = w*_r(i), 1 ≤ i ≤ k

Figures 3(b) and 6 clearly show that our approach accurately tracks the dominant local trends, over a wide range of stream characteristics. Algorithm 3 shows the pseudocode.

The algorithm for reconstructing the original data is simpler; we only need to project each window W_t onto the current estimate of U, exactly as we did for the correlation case. The pseudocode is shown in Algorithm 4.

Algorithm 4: SACOR
  Input : Perturbed value a*(t);
          old subspace U ∈ R^{h×k}, Λ ∈ R^{k×k}
  Output: Reconstruction ã(t); new subspace U, Λ
  1  Construct window W_{t−h+1} = [a*(t−h+1), . . . , a*(t)]^T
  2  Update U, Λ using W_{t−h+1}
  3  Project onto the estimated eigenspace: W̃ = UU^T W_{t−h+1}
  4  The reconstruction is the last element of W̃: ã(t) = W̃(h)
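In the same hedged style as our earlier sketches, the projection at the core of SACOR is a one-liner; maintaining U is as in the sketches above:

    import numpy as np

    def sacor_step(window, U):
        # Project the latest perturbed window (length h) onto the tracked
        # top-k eigenspace U (h x k); the reconstruction of a(t) is the
        # last element of the projected window.
        return (U @ (U.T @ window))[-1]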
The analogues of Theorems 1 and 2 are summarized in the following theorem (proof omitted for space).

Theorem 3. The perturbed stream from SACAN satisfies D(A, A*) = σ² and preserves the eigenvectors of the autocovariance matrix. The squared reconstruction error of SACOR on this perturbed stream is approximately σ².

Preserving the autocorrelation properties, in addition to the privacy, is desirable, since several fundamental mining operations, such as autoregressive modelling and forecasting as well as periodicity detection [9], rely on them.

Multi-dimensional extension: If we wish to capture both correlations and autocorrelations on multi-dimensional streams, we can decompose the problem in a fashion very similar to [32]. Details are beyond the scope of this work, but we briefly present the main idea, with a sketch below. We track the eigenspace of the covariance matrix. However, instead of using it only for adding noise, we also perform PCA on the stream collection, to obtain k ≪ N streams of "hidden variables." Subsequently, we can apply our autocorrelation tracking scheme independently on each of these uncorrelated (across dimensions) streams. SPIRIT performs precisely the same decomposition of the problem (while controlling the PCA approximation error) [32], except it does so for multi-dimensional autoregression, rather than autocorrelation tracking.
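A hedged sketch of this decomposition, in batch form for clarity (the helper name is ours; SPIRIT itself [32] is incremental): project the N streams onto the tracked top-k eigenspace to obtain k uncorrelated hidden-variable streams, then run the autocorrelation scheme on each one independently.

    import numpy as np

    def hidden_variables(A, k):
        # Map N correlated streams (T x N) to k uncorrelated "hidden
        # variable" streams via the top-k eigenspace of the covariance.
        _, eigvecs = np.linalg.eigh(A.T @ A / A.shape[0])
        U = eigvecs[:, -k:]
        return A @ U   # T x k; feed each column to SACAN/SACOR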
variables.” Subsequently, we can apply our autocorrelation components k to 10. Although the offline algorithms may
tracking scheme independently on each of these uncorre- not be applicable in a streaming setting due to large stor-
lated (across dimension) streams. SPIRIT performs pre- age requirements, they are included for comparison and we
cisely the same decomposition of the problem (while con-
trolling the PCA approximation error) [32], except it does
Perturbation i.i.d-N offline-N online-N:SCAN
so for multi-dimensional autoregression, rather than auto-
Reconstruction baseline offline-R offline-R:SCOR
correlation tracking.
Table 3. Perturbation/Reconstruction Method
[Figure 4. Privacy Preservation for Streams with Dynamic Correlations: (a) reconstruction error, SCAN vs. offline-R; (b) reconstruction error as the number of principal components k varies (Lab data); (c) privacy vs. discrepancy with online-R, Lab data set; (d) privacy vs. discrepancy with online-R, Chlorine data set.]

Reconstruction error: Figure 4(a) shows the reconstruction error of online and offline reconstruction w.r.t. all types of noise. The figure presents results from the Lab data with the discrepancy set to 10%. In all cases, SCOR clearly outperforms the offline method. The main reason is that offline-R has considerable projection error for streams with dynamically evolving correlation, whereas SCOR has almost negligible projection error and its reconstruction error is dominated by the remaining noise. Similar phenomena were observed for other discrepancy values. Therefore, online reconstruction should be the method of choice for measuring the privacy of the perturbed data.

Effect of k: The number of principal components k affects both the projection error and the remaining noise, which in turn have an impact on the overall reconstruction error. Figure 4(b) studies the effect of k on both offline-R and online-R on the Lab data with the discrepancy fixed at 10%. For both approaches, reconstruction errors decrease as k increases. Two interesting facts are reflected. First, online-R requires a smaller k to reach a "flat," stable reconstruction error. This is beneficial, since both the computation and memory costs increase in proportion to k, as we will see in Section 6.4. Second, online-R achieves smaller reconstruction error than offline-R, for all types of noise. This reinforces our decision to measure the privacy of the perturbed data using online-R (SCOR).

Perturbation performance: Next, we measure the ability of the different perturbation methods to preserve the privacy of data streams with dynamic correlations. Results on the Lab and Chlorine data are presented in figures 4(c) and 4(d). Clearly, for both data streams, SCAN effectively achieves the best privacy w.r.t. the allowed discrepancy, over all discrepancy values. Compared to the baseline method, online-R removes no noise at all from SCAN.

[Figure 5. Privacy vs. Discrepancy: Online Reconstruction, for (a) Chlorine and (b) Stock; curves show i.i.d-N, SACAN and the baseline.]

6.3 Dynamic Autocorrelation

This section presents the results that demonstrate the correctness and effectiveness of our algorithms for data perturbation in streams with autocorrelation. In all experiments, except where otherwise specified, the window size h is set to 300 and the number of principal components k is 10. Since there is no previous work that explores autocorrelation for privacy preservation in the offline case, we compare our method against i.i.d. noise, and we use the online reconstruction algorithm proposed in this paper, SACOR, in order to measure privacy.

Effectiveness of SACAN: The key idea of SACAN is to track the data stream's autocorrelation in an online fashion and produce noise with similar autocorrelation. To illustrate the effectiveness of SACAN, we apply SACAN and i.i.d. noise to different types of streams. We show the results from the Chlorine and Stock data sets in figure 6. The discrepancy is set to 10% for all experiments. We observe from figures 6(a) and 6(c) that SACAN initially produces i.i.d. noise but is quickly able to estimate and track the autocorrelation of the input stream. Hence, SACAN adds random noise that closely follows the estimated autocorrelation. Intuitively, the noise generated by SACAN exhibits: 1) the same frequency as the input data stream, and 2) an amplitude that is determined by the discrepancy. Thus, since the SACAN perturbation follows the same trend as the input stream, it is hardly distinguishable or separable once added.
[Figure 6. Online Random Noise for Streams with Autocorrelation (blue: data, green: noise): (a) Chlorine with SACAN; (b) Chlorine with i.i.d. noise; (c) Stock with SACAN; (d) Stock with i.i.d. noise.]
The advantage of SACAN becomes clear when comparing the results shown in figures 6(a) and 6(c) (SACAN) against figures 6(b) and 6(d) (i.i.d. noise).

Privacy of SACAN and effectiveness of SACOR: Using SACOR for online reconstruction, more than 1/3 of the i.i.d. noise added to the Chlorine and Stock streams can be removed. However, the noise produced by SACAN could not be removed at all. Figure 5 demonstrates these results, for varying discrepancy. The reconstruction error from SACOR is used to measure privacy. The baseline is the discrepancy between perturbed and input streams (i.e., no attempt at reconstruction). Since SACAN produces noise that follows the same trend as the input data stream, it is hard to remove any such noise from the perturbed data.

6.4 Cost Analysis

The cost metrics include the computational requirements, measured in seconds, and the memory consumption, measured in bytes. Figure 7 shows the running time of SCAN on the Lab data set. Figure 7(a) investigates the effect of increasing the number of input streams N while fixing k = 10, whereas the second experiment studies the impact of keeping different numbers of principal components with N = 100 input streams. In both cases, we take the average processing time per tuple, where a tuple consists of N elements, one from each input stream. The running time of SCAN is linear w.r.t. the number of input streams and almost linear w.r.t. the number of principal components. This indicates that it has good scalability. On average, SCAN is able to process input from hundreds of streams in a few milliseconds. The same experiments performed for the other online algorithms (SCOR, SACAN, SACOR) lead to similar observations.

[Figure 7. Running Time of SCAN on the Lab data set: (a) time per tuple (ms) vs. number of streams; (b) time per tuple (ms) vs. number of principal components (100 streams).]

Memory consumption for all of our algorithms is dominated by two factors: 1) the online estimation of the local principal components, either for correlation or for autocorrelation, and 2) the number of tuples that need to be buffered. For SCAN and SCOR no buffering is required; hence, memory is needed only for the k local principal component directions, each of which is represented by an N-dimensional vector, where N is the number of input streams. Therefore, the total memory required is Nk|R|, where |R| is the floating point representation size. For SACAN/SACOR, each principal direction is an h-dimensional vector. Additionally, the last h values need to be buffered, since the entire window is needed to update the principal direction estimates. Hence, the total memory consumption is hk|R| + h|R|.
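For the configurations used in our experiments, these formulas give footprints of a few tens of kilobytes; the back-of-the-envelope check below assumes 8-byte doubles, i.e., |R| = 8 (the representation size is our assumption):

    # Memory estimates in bytes, assuming |R| = 8 (double precision)
    R = 8
    N, k, h = 198, 10, 300      # Lab streams; default k; default window
    print(N * k * R)            # SCAN/SCOR:   N*k*|R|         = 15840
    print(h * k * R + h * R)    # SACAN/SACOR: h*k*|R| + h*|R| = 26400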
6.5 Discussion

The experimental evaluation validates the superiority of our approaches. Essentially, for both correlations and autocorrelation, our algorithms are able to produce random noise that follows the same trends as the input streams, in an online fashion. In addition, our algorithms have small computation and memory overhead. Finally, we should point out that, with a relatively small amount of noise injected into the original data streams, regardless of the type of the noise, the principal components of the perturbed data will be a good approximation of those of the input data streams. Of course, there are ways to mathematically infer the principal components of the input data streams, given the perturbed data streams and some knowledge of the noise properties, such as its distribution and variance. However, in this paper, we do not consider releasing any knowledge about the noise. In fact, we believe this is the right choice, as the goal is to maximally preserve the privacy while maintaining a certain utility; releasing information about the noise might seriously compromise privacy. The assumption that the injected noise cannot be too large is automatically guaranteed by the utility requirement of the target mining application. Hence, the estimates of the number of dominant principal components at reconstruction will be similar to those at perturbation. This justifies using the same number of principal components for both reconstruction and perturbation. However, using a different k for perturbation and reconstruction does not affect the trends and phenomena observed in our experiments.
7. Conclusion

Data streams in the real world typically exhibit both significant correlations as well as autocorrelation, thereby providing ample opportunities for adversaries to breach privacy. In this paper we develop the basic building blocks for privacy preservation on numerical streams. In particular, we focus on the fundamental cases of correlation across multiple streams and of autocorrelation within one stream. We present methods to dynamically track both and subsequently add noise that "mirrors" these statistical properties, making it indistinguishable from the original data. Thus, our techniques prevent adversaries from leveraging these properties to remove the noise and thereby breach privacy. We provide both a mathematical analysis and experimental evaluation on real data to validate the correctness, efficiency, and effectiveness of our algorithms. Our techniques track the evolving nature of these relationships and achieve much better results than previous static, global approaches. Furthermore, to the best of our knowledge, autocorrelation-based attacks have not been previously addressed.

References

[1] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.
[2] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In PODS, 2001.
[3] R. Agrawal and R. Srikant. Privacy preserving data mining. In SIGMOD, 2000.
[4] R. Agrawal, R. Srikant, and D. Thomas. Privacy preserving OLAP. In SIGMOD, 2005.
[5] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, 2005.
[6] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
[7] E. Bertino, B. Ooi, Y. Yang, and R. Deng. Privacy and ownership preserving of outsourced medical data. In ICDE, 2005.
[8] J. Bethencourt, D. Song, and B. Waters. Constructions and practical applications for private stream searching. In IEEE Symposium on Security and Privacy, 2006.
[9] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting. Springer, 2nd edition, 2003.
[10] K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In ICDM, 2005.
[11] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2005.
[12] W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In SIGKDD, 2003.
[13] EPANET, 2002. http://www.epa.gov/ORD/NRMRL/wswrd/epanet.html.
[14] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[15] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In SIGKDD, 2002.
[16] M. Ghil, M. Allen, M. Dettinger, K. Ide, D. Kondrashov, M. Mann, A. Robertson, A. Saunders, Y. Tian, F. Varadi, and P. Yiou. Advanced spectral methods for climatic time series. Rev. Geophys., 40(1), 2002.
[17] S. Guha, D. Gunopulos, and N. Koudas. Correlating synchronous and asynchronous data streams. In KDD, 2003.
[18] S. Haykin. Adaptive Filter Theory. Prentice-Hall, 4th edition, 2002.
[19] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In SIGMOD, 2005.
[20] INET ATS, Inc. http://www.inetats.com/.
[21] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[22] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In ICDM, 2003.
[23] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In SIGMOD, 2006.
[24] F. Li, C. Chang, G. Kollios, and A. Bestavros. Characterizing and exploiting reference locality for data stream applications. In ICDE, 2006.
[25] Y. Lindell and B. Pinkas. Privacy preserving data mining. In CRYPTO, 2000.
[26] K. Liu, H. Kargupta, and J. Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE TKDE, 18(1), 2006.
[27] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006.
[28] S. Muthukrishnan. Data streams: Algorithms and applications. Technical report, Computer Science Department, Rutgers University, 2003.
[29] European Union Directive on Privacy Protection, 2002. http://europa.eu.int/eur-lex/pri/en/oj/dat/2002/l_201/l_20120020731en00370047.pdf.
[30] A. V. Oppenheim and A. S. Willsky. Signals and Systems. Prentice-Hall, 1983.
[31] R. Ostrovsky and W. Skeith. Private searching on streaming data. In CRYPTO, 2005.
[32] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. In VLDB, 2005.
[33] S. Papadimitriou and P. Yu. Optimal multi-scale patterns in time series streams. In SIGMOD, 2006.
[34] R. O. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Trans. Ant. Prop., 34(3), 1986.
[35] L. Sweeney. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5), 2002.
[36] K. Thearling. Data mining and privacy: A conflict in making. In DS*, 1998.
[37] J. Vaidya and C. W. Clifton. Privacy preserving association rule mining in vertically partitioned data. In SIGKDD, 2002.
[38] X. Xiao and Y. Tao. Personalized privacy preservation. In SIGMOD, 2006.
[39] Y. Zhu and D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In VLDB, 2002.
