Hiding in The Crowd: Privacy Preservation On Evolving Streams Through Correlation Tracking
Feifei Li‡ Jimeng Sun§ Spiros Papadimitriou† George A. Mihaila† Ioana Stanoi†
‡Boston University, §Carnegie Mellon University, †IBM T.J. Watson Research Center
lifeifei@cs.bu.edu, jimeng@cs.cmu.edu, {spapadim, mihaila, irs}@us.ibm.com
Abstract

We address the problem of preserving privacy in streams, which has received surprisingly limited attention. For static data, a well-studied and widely used approach is based on random perturbation of the data values. However, streams pose additional challenges. First, analysis of the data has to be performed incrementally, using limited processing time and buffer space, making batch approaches unsuitable. Second, the characteristics of streams evolve over time. Consequently, approaches based on global analysis of the data are not adequate. We show that it is possible to efficiently and effectively track the correlation and autocorrelation structure of multivariate streams and leverage it to add noise which maximally preserves privacy, in the sense that it is very hard to remove. Our techniques achieve much better results than previous static, global approaches, while requiring limited processing time and memory. We provide both a mathematical analysis and experimental evaluation on real data to validate the correctness, efficiency, and effectiveness of our algorithms.

1. Introduction

Recently, there has been an increasing concern regarding privacy breaches, especially those involving sensitive personal data of individuals [14]. As a result, restrictions and regulations in publishing sensitive personal data have been tightened [38]; these address data owned by government organizations [29] as well as corporations [7]. It is therefore not surprising that the data management community has become increasingly focused on ways to guarantee the privacy of sensitive data.

Meanwhile, unprecedented volumes of data from various sources provide great opportunities for data mining and information integration. Unfortunately, the privacy requirement and data mining applications pose exactly opposite expectations from data publishing [14, 36]. The utility of the published data w.r.t. the mining application decreases with increasing levels of privacy guarantees [23]. Previous work has noticed this important trade-off between privacy and utility, and various techniques have been proposed to achieve a desired balance between the two [3, 22, 26, 10, 12, 27, 35, 15].

Prior related work [3, 2, 22, 19] consists of additive random perturbation for the offline, conventional relational data model, where the noise is distributed along the principal components of the original data in order to achieve maximum privacy, given a fixed utility. We show that these offline algorithms are no longer optimal when applied to numerical, non-stationary (i.e., time-evolving) data streams. The dynamic correlations and autocorrelations, if not carefully considered, may allow for the reconstruction of the original streams.

Guaranteeing data privacy is especially challenging in the case of stream data [6, 28], mainly for two reasons: 1) Performance requirement: the continuous arrival of new tuples prohibits storage of the entire stream for analysis, rendering current offline algorithms inapplicable. 2) Time evolution: data streams usually evolve, and correlations and autocorrelations [33, 24] can change over time. These characteristics make most offline algorithms for static data inappropriate, as we show later.

To the best of our knowledge, privacy preservation on streams has not yet been addressed in the literature, despite the wide use of data streams in a large range of sensitive applications such as finance, retail, defense, and health care. For example, consider two financial firms that would like to collaboratively monitor clusters over their streaming real-time transactions [39]. However, neither of them is willing to publish its original data streams. The best resolution is to find ways to guarantee both the utility and the privacy of the data in an online fashion. The scheme should be general and not restricted to any specific mining operation.

In this paper we fill the gap in the area of data stream privacy by proposing efficient online streaming algorithms that guarantee the privacy of single or multiple non-stationary data streams. This work focuses on numerical data streams, such as environmental sensor data, performance measurements, or stock trading prices. Our goal is to insert random perturbation that "mirrors" the streams' statistical properties, in an online fashion. A number of important …
    Symbol      Description
    v           a vector (lowercase bold)
    v(i)        the i-th element of vector v
    X           a matrix (uppercase bold)
    X^T         the transpose of X
    X_i / X^j   the i-th row / j-th column of X
    X^j_i       the entry (i, j) of X
    T           the number of timestamps up to now
    N           the number of streams
    A           the original stream collection, in R^{T×N}
    A*          the perturbed stream collection
    Ã           the reconstructed stream collection
    A_n         the n-th stream
    A_t         the values from all streams at time t
    E           the random noise, in R^{T×N}
    D(A, A*)    the discrepancy between the original and perturbed streams

Table 1. Description of notation.

… [30, 18] can be applied such as, for example, linear interpolation (for upsampling) or decimated moving average (a.k.a. tumbling average, for downsampling). We will thus assume a time granularity such that, during each time interval, there is exactly one recorded incoming value from each stream.

Therefore, for the purposes of our analysis and without loss of generality, the input consists of N data streams, denoted A1, . . . , AN. For the i-th data stream Ai, its value at time t is Ait. The stream collection is written as A = [Ai for 1 ≤ i ≤ N]. Formally, the stream collection A can be considered as a T × N matrix, where N is the number of streams and T is the current timestamp, which grows indefinitely. The values from all streams at time t are A_t ∈ R^N, i.e., the t-th row of A.
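To make the resampling step concrete, here is a minimal numpy sketch (ours, not code from the paper) of the two techniques mentioned above; the function names and the integer factor parameter are our own illustrative choices.

    import numpy as np

    def upsample_linear(x, factor):
        """Linear interpolation: insert factor-1 evenly spaced values
        between consecutive samples."""
        t_old = np.arange(len(x))
        t_new = np.linspace(0, len(x) - 1, factor * (len(x) - 1) + 1)
        return np.interp(t_new, t_old, x)

    def downsample_tumbling(x, factor):
        """Decimated (tumbling) moving average: replace each block of
        `factor` consecutive samples by its mean."""
        n = (len(x) // factor) * factor
        return np.asarray(x[:n]).reshape(-1, factor).mean(axis=1)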
[Figure 1. Impact of Correlation on Perturbing the Data: (a) noise decomposition, showing the original tuple A_t, the perturbed tuple A*_t, its projection Ã_t onto the principal direction w1, the removed noise, and the privacy; (b) i.i.d. random noise; (c) correlated noise. Panels (b) and (c) plot Stream A2 against Stream A1 for the same total noise variance σ².]
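As a concrete illustration of the difference between panels (b) and (c), the following small numpy sketch (our example, not code from the paper) perturbs two correlated streams once with i.i.d. noise and once with noise of the same total variance aligned with the principal direction.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two strongly correlated streams, as in Figure 1.
    a1 = rng.normal(0, 10, 1000)
    data = np.column_stack([a1, 0.5 * a1 + rng.normal(0, 1, 1000)])

    # (b) i.i.d. noise: unit variance in every direction (total variance 2).
    iid_perturbed = data + rng.normal(0, 1, data.shape)

    # (c) correlated noise: same total variance, but concentrated along the
    # principal direction of the data, so it hides in the existing trend.
    _, _, Vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
    corr_perturbed = data + np.outer(rng.normal(0, np.sqrt(2), 1000), Vt[0])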
…effort, the goal is to transform the perturbed measurements A*_t into Ã_t so that, hopefully, D(A_t, Ã_t) is small.

A principled way of reconstruction is to project the data onto the estimated principal components, so that most of the noise is removed, while the original data are maximally preserved, i.e., not much additional error is included. The idea is illustrated in figure 1(a). When A*_t is projected onto the principal direction, the projection is exactly the reconstruction Ã_t. Note that the distance between A*_t and Ã_t consists of two parts: 1) removed noise, i.e., the perturbation that is removed by the reconstruction, and 2) projection error, i.e., the new error introduced by the reconstruction. Finally, the distance between A_t and Ã_t, i.e., the privacy, comes from two sides: 1) remaining noise, i.e., the perturbation noise that has not been removed, and 2) projection error.

When the noise is added exactly along the principal direction, the removed noise becomes zero. However, additional projection error is included. In this case, the perturbation is robust towards this reconstruction attempt, in the sense that D(A, Ã) = D(A, A*). In general, a good practice is to add correlated noise following the trends present in the streams. Consider the example shown in Figure 1, where blue points represent the original data and green points represent the perturbed data with the same amount of noise. Figures 1(b) and 1(c) show i.i.d. noise and correlated noise on the same data, respectively. Clearly, the correlated noise has been successfully "hidden" in the original data and, therefore, is hard to remove.

Data streams often present strong correlations, and these correlations change dynamically [39, 33, 32]. Consider the examples in figure 2, where the principal components change over time. In such cases, online PCA is necessary to better characterize the evolving, local trends. Global, offline PCA will fail to identify these important properties, as we will show later in the experiments. Next, we will show how to dynamically insert noise using online correlation tracking [32].

[Figure 2. Dynamic correlations in data streams: (a) example one; (b) example two. The principal directions (w1, w2, w3) of (Stream A1, Stream A2) change over time t; one panel shows A2 = cos(t).]

Algorithm 1: SCAN
    Input : original tuple A_t, utility threshold σ²;
            old subspace U ∈ R^{N×k}, Λ ∈ R^{k×k}
    Output: perturbed tuple A*_t; new subspace U, Λ
    1  update eigenvectors U and eigenvalues Λ based on A_t
    2  initialize δ, η to the k-dimensional zero vector
       // add noise in the top-k principal component subspace
    3  for 1 ≤ i ≤ k do
    4      δ(i) = σ² × Λ(i) / ‖Λ‖
    5      η(i) = Gaussian noise with variance δ(i)
       // rotate back to the original space
    6  E_t = η × U^T and A*_t = A_t + E_t

Streaming Correlated Additive Noise (SCAN): SCAN does two things whenever new tuples arrive from the N input streams: 1) update the estimation of the local principal components; and 2) distribute noise along the principal components in proportion to their eigenvalues.

An important property of the SCAN algorithm is that when the noise is rotated back to the data space (line 6), its variance will be equal to the specified discrepancy threshold σ². Intuitively, SCAN tracks the covariance matrix and adds noise with essentially the same covariance as the data streams (proof details are omitted for space).
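A minimal sketch of the SCAN perturbation step in numpy, assuming the subspace (U, Λ) is maintained by an external online tracker such as [32], and reading ‖Λ‖ as the sum of the eigenvalues; scan_perturb is our illustrative name, not the paper's code.

    import numpy as np

    def scan_perturb(a_t, U, lam, sigma2, rng):
        """One SCAN step (sketch): add Gaussian noise in the tracked top-k
        principal subspace, with the noise budget sigma2 split across the
        k directions in proportion to their eigenvalues.

        a_t: (N,) incoming tuple; U: (N, k) orthonormal eigenvector
        estimates; lam: (k,) eigenvalue estimates; sigma2: threshold.
        """
        delta = sigma2 * lam / lam.sum()       # per-direction noise variance
        eta = rng.normal(0.0, np.sqrt(delta))  # noise in the k-dim subspace
        e_t = U @ eta                          # rotate back to data space
        return a_t + e_t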
Theorem 1. At any time instant T, the perturbed data streams A* from SCAN satisfy D(A, A*) = σ². Additionally, SCAN preserves the eigenvectors of the (uncentered) covariance matrix of A.

Therefore, the SCAN perturbation will not affect any mining algorithms that rely on the second moments of the data (i.e., linear correlations).

Streaming correlation online reconstruction (SCOR): The privacy achieved by SCAN is determined by the best linear reconstruction an adversary could perform on A* (see Section 2.2). For evolving data streams, as illustrated in figure 2, the best choice for the adversary is to utilize online estimation of the local principal components for reconstruction. The ability to estimate the local principal components of the original data streams depends on how the noise has been added. For SCAN, we know that the principal component directions are preserved, since the noise is added along them (Theorem 1). In general, we may assume the noise is small compared to the data; otherwise, the utility of the perturbed data is too low to be useful. Then, tracking the principal components of the perturbed streams A* gives a good estimate of the principal components of the original streams A. Formally, cov(A*) ≈ cov(A).

Algorithm 2: SCOR
    Input : perturbed tuple A*_t, utility threshold σ²;
            old subspace U ∈ R^{N×k}, Λ ∈ R^{k×k}
    Output: reconstructed tuple Ã_t; new subspace U, Λ
    1  update eigenvectors U and eigenvalues Λ based on A*_t
       // project onto the estimated online principal components
    2  Ã_t = A*_t × U_{N×k} × U^T_{N×k}

Intuitively, SCOR reconstruction removes all the noise orthogonal to the local principal components and inserts little additional projection error, since local PCA can usually track the data accurately. In other words, i.i.d. noise can usually be successfully removed, provided that the streams are correlated. However, the perturbation from SCAN cannot be removed at all, since the noise is distributed along the "instantaneous" correlation in the streams.

Theorem 2. The reconstruction error of SCOR on the perturbation from SCAN is ≈ σ².

Proof. Formally, given a linear reconstruction Ã = A*R, the privacy can be decomposed as

    D(A, Ã) = ||A − A*R||_F²
            = ||A − (A + E)R||_F²
            = ||A(I − R) + ER||_F²
            = ||A(I − UU^T) + E UU^T||_F²,

where the first term, A(I − UU^T), is the projection error and the second term, E UU^T, is the remaining noise; here R is a projection matrix, meaning that R = UU^T with U ∈ R^{N×k} orthonormal. Since the subspaces tracked by both SCOR and SCAN are the same, the remaining noise is σ², i.e., no noise is removed. Therefore, D(A, Ã) ≥ σ² by the triangle inequality. ∎

Note that the projection error for SCOR is small, provided that the data are locally correlated. Therefore, the reconstruction error (i.e., the privacy, as defined in Section 2.2) of SCOR is approximately σ², i.e., equal to the original discrepancy. Moreover, when σ² is small compared to the original data, other reconstruction methods will result in higher error, due to their larger projection error.
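The corresponding SCOR step is a single projection; a sketch under the same assumptions (an external tracker updates U from the perturbed tuples). Applied to SCAN output it leaves the noise essentially intact, matching Theorem 2.

    def scor_reconstruct(a_star_t, U):
        """One SCOR step (sketch): keep only the component of the perturbed
        tuple that lies in the tracked principal subspace; i.i.d. noise in
        the orthogonal complement is discarded, SCAN noise is not."""
        return U @ (U.T @ a_star_t)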
5. Dynamic Autocorrelation

So far, we have presented methods based on the correlation across many streams. Now we exploit another important property, autocorrelation on a single stream, and propose the corresponding perturbation and reconstruction.

Intuition: The added noise should mirror the dominant trends in the series. Consider the following simple examples. If the stream always has a constant value, the right way to hide this value is to add the same noise throughout time; any other noise can easily be filtered out by simple averaging. The situation is similar for a linear trend (this is also an example that cannot be captured by Fourier analysis). If the stream is a sine wave, the right way to hide it is by adding noise with the same frequency (but potentially a different phase); anything else can be filtered out. Our algorithm is the generalization, in a principled manner, of these notions.

For example, the green and blue curves in figure 3(b) are the autocorrelated noise and the original stream, respectively, where the noise follows the same trends as the stream over time. In comparison, figure 3(a) shows i.i.d. noise, which can easily be filtered out. The goal is to find a principled way to automatically determine the "right" noise, which is "most similar" to the stream.

Connection to correlation: In the previous section, we showed how to track the local statistical properties of the N-dimensional sequence of vectors A_t, indexed over time t. More specifically, we track the principal subspace of this matrix, thereby focusing on the most dominant (in a least-squares sense) of these relationships. We subsequently add noise that "mirrors" those relationships, making it indistinguishable from the original data.

Next, we show that the same principles used to capture relationships across many attributes can be used to capture relationships of one attribute across time. In fact, there is a natural way to move between the original time domain and a high-dimensional sequence space, which is formalized next. The t-th window of the time series stream a(t) is an h-dimensional point,

    W_t := [a(t), a(t + 1), . . . , a(t + h − 1)]^T ∈ R^h.
[Figure 3. Original streams (Data) and added noise (Noise) over time; see the text for panels (a)-(c).]

For example, for a = [1 2 3 4 5 6]^T and h = 3,

        [ W_1 ]   [ 1 2 3 ]
    W = [ W_2 ] = [ 2 3 4 ]
        [ W_3 ]   [ 3 4 5 ]
        [ W_4 ]   [ 4 5 6 ]
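A one-liner makes this construction precise; a sketch (our helper name), using 0-based indexing so that W[i, j] = a(i + j):

    import numpy as np

    def window_matrix(a, h):
        """h-th order window matrix: row t is [a(t), ..., a(t+h-1)],
        so every anti-diagonal is constant (Hankel structure)."""
        a = np.asarray(a)
        return np.array([a[t:t + h] for t in range(len(a) - h + 1)])

    print(window_matrix([1, 2, 3, 4, 5, 6], 3))
    # [[1 2 3]
    #  [2 3 4]
    #  [3 4 5]
    #  [4 5 6]]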
The window matrix W has the windows W_t as its rows; thus W_ij = a(i + j − 1) by construction. The space spanned by the sequence of windows W_t is known as the h-th order phase space of the series a(t) [16]. Subsequently, we can essentially apply the same technique as before, using W in place of A. All of the previous discussion and properties of our algorithm can be directly transferred to the autocorrelation case. An example is shown in the top of figure 3(c). However, there are some additional properties and issues that need to be resolved.

Hankel constraint: Notice that the window matrix W is a Hankel matrix, i.e., its anti-diagonals are constant: W_ij = W_{i−1, j+1}. Under the assumption that the series is stationary, the autocovariance matrix W^T W is, in expectation, circulant, i.e., it is symmetric with constant diagonals. Additionally, if we perform a batch eigen-analysis on the global window matrix of a static series, the sample autocovariance matrix computed from the actual data (i.e., W^T W above) is also circulant. In this case, the eigenvectors of W^T W essentially provide the same information as the Fourier coefficients of the series a; in that sense, our approach subsumes traditional Fourier analysis. If these assumptions do not hold, the technique we employ is more robust and effective. A detailed discussion is beyond the scope of this paper; interested readers may consult [16, 34, 18] for more details.
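The Fourier connection is easy to check empirically; a small sketch, assuming numpy and a pure sinusoid (for which W^T W has rank two and its dominant eigenvectors form a sine/cosine pair at the stream's own frequency):

    import numpy as np

    t = np.arange(2000)
    a = np.sin(2 * np.pi * t / 50)               # stationary, period 50
    h = 100
    W = np.array([a[i:i + h] for i in range(len(a) - h + 1)])

    evals, evecs = np.linalg.eigh(W.T @ W)       # ascending eigenvalues
    u = evecs[:, -1]                             # dominant eigenvector

    # Its spectrum is concentrated in a single Fourier bin: 2 cycles per
    # h = 100 samples, i.e., exactly the period-50 component of the series.
    print(np.argmax(np.abs(np.fft.rfft(u))))     # -> 2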
Constraint on autocorrelated noise: Next, we address the issues that arise from the fact that W is a Hankel matrix. Similarly, the noise matrix E has to be a Hankel matrix (see figure 3(c) for an example). Similar to the correspondence between a and W, the noise matrix E has a corresponding noise sequence e, such that …

Algorithm 3: SACAN
    Input : original value a(t), utility threshold σ²;
            old subspace U ∈ R^{h×k}, Λ ∈ R^{k×k}
    Output: perturbed value a*(t); new subspace U, Λ
    1  construct window W_{t−h+1} = [a(t−h+1), . . . , a(t)]^T
    2  update U, Λ using W_{t−h+1}
    3  every k arriving values do
    4      let [w_l^T | w_r^T]^T ≡ W_{t−h+1}
    5      solve equation 2 to obtain e_r
    6      rescale e_r based on σ²
    7      perturbed values w*_r = w_r + e_r
    8      publish values a*(t−k+i−1) = w*_r(i), 1 ≤ i ≤ k

… onto the orthogonal complement.

Assume that we have chosen the noise values up to time t − k. Based on these and on the current estimate of U, we will determine the next k noise values (where k is the principal subspace dimension); the reason for determining them simultaneously will become clear soon. Let

    E_{t−h+1} ≡ [e(t−h+1), . . . , e(t−k) | e(t−k+1), . . . , e(t)]^T
             ≡ [e_l^T | e_r^T]^T,

where | denotes elementwise concatenation (for example, [1, 2 | 3, 4] results in the two row vectors [1 2] and [3 4]). The first block e_l ∈ R^{h−k} consists of the h − k known values, whereas the second block e_r ∈ R^k consists of the k unknown noise values we wish to determine. Similarly decomposing Q ≡ [Q_l | Q_r] into blocks Q_l ∈ R^{h×(h−k)} and Q_r ∈ R^{h×k}, we can rewrite equation 1 as …
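Since equations 1 and 2 are not reproduced in this excerpt, the following sketch only illustrates one plausible reading of lines 5-6 of SACAN: assuming Q = I − UU^T is the projector onto the orthogonal complement of the tracked subspace and that the noise window is required to (approximately) vanish under Q, the unknown block e_r is obtained by least squares and then rescaled to the budget σ². The helper name and the rescaling rule are our own.

    import numpy as np

    def sacan_noise_block(U, e_l, sigma2):
        """Choose the k unknown noise values e_r (sketch; see the
        assumptions stated above, not the paper's exact equations).

        U: (h, k) tracked principal directions of the window space;
        e_l: (h-k,) previously published noise values; sigma2: budget."""
        h, k = U.shape
        Q = np.eye(h) - U @ U.T                  # projector onto complement
        Q_l, Q_r = Q[:, :h - k], Q[:, h - k:]
        # least-squares solution of Q_l e_l + Q_r e_r = 0 for e_r
        e_r, *_ = np.linalg.lstsq(Q_r, -Q_l @ e_l, rcond=None)
        # one plausible rescaling so the noise energy matches sigma2
        e_r *= np.sqrt(sigma2) / (np.linalg.norm(e_r) + 1e-12)
        return e_r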
[Figure 4. (a) Reconstruction error of SCAN vs. offline-R (dashed: offline-R; solid: SCAN) for each noise type (i.i.d-N, offline-N, online-N); (b) reconstruction error vs. the number of principal components k; (c) privacy vs. discrepancy, online-R, Lab data set; (d) privacy vs. discrepancy, online-R, Chlorine data set.]

[Figure 5. Privacy vs. Discrepancy: Online Reconstruction. (a) Chlorine; (b) Stock.]

… show that, besides high overheads, their performance is suboptimal due to the time-evolving nature of streams. Finally, baseline reconstruction is simply the perturbed data themselves (i.e., no attempt to recover the original data).

Reconstruction Error: Figure 4(a) shows the reconstruction error of online and offline reconstruction w.r.t. all types of noise. The figure presents results from the Lab data with the discrepancy set to 10%. In all cases, SCOR clearly outperforms the offline method. The main reason is that offline-R has considerable projection error for streams with dynamically evolving correlation, whereas SCOR has almost negligible projection error and its reconstruction error is dominated by the remaining noise. Similar phenomena were observed for other discrepancy values. Therefore, online reconstruction should be the method of choice for measuring the privacy of the perturbed data.

Effect of k: The number of principal components k affects both the projection error and the remaining noise, which in turn affect the overall reconstruction error. Figure 4(b) studies the effect of k on both offline-R and online-R on the Lab data, with the discrepancy fixed at 10%. For both approaches, the reconstruction error decreases as k increases. Two interesting facts emerge. First, online-R requires a smaller k to reach a "flat," stable reconstruction error. This is beneficial, since both the computation and memory cost increase in proportion to k, as we will see in Section 6.4. Second, online-R achieves a smaller reconstruction error …

Perturbation Performance: Next, we measure the ability of different perturbation methods to preserve the privacy of data streams with dynamic correlations. Results on the Lab and Chlorine data are presented in figures 4(c) and 4(d). Clearly, for both data streams, SCAN achieves the best privacy w.r.t. the allowed discrepancy, over all discrepancy values. Compared to the baseline method, online-R removes no noise at all from SCAN.

6.3 Dynamic Autocorrelation

This section presents results that demonstrate the correctness and effectiveness of our algorithms for data perturbation in streams with autocorrelation. In all experiments, except where otherwise specified, the window size h is set to 300 and the number of principal components k is 10. Since there is no previous work that explores autocorrelation in the offline case for privacy preservation, we compare our method against i.i.d. noise, and we use the online reconstruction algorithm proposed in this paper, SACOR, in order to measure privacy.

Effectiveness of SACAN: The key idea of SACAN is to track the data stream's autocorrelation in an online fashion and produce noise with similar autocorrelation. To illustrate its effectiveness, we apply SACAN and i.i.d. noise on different types of streams. We show the results from the Chlorine and Stock data sets in figure 6. The discrepancy is set to 10% for all experiments. We observe from figures 6(a) and 6(c) that SACAN initially produces i.i.d. noise but is quickly able to estimate and track the autocorrelation of the input stream. Hence, SACAN adds random noise that closely follows the estimated autocorrelation. Intuitively, the noise generated by SACAN exhibits: 1) the same frequency as the input data stream; and 2) an amplitude determined by the discrepancy. Thus, since the SACAN perturbation follows the same trend as the input stream, it is hardly distinguishable or separable once added.
[Figure 6. Online Random Noise for Streams with Autocorrelation: (a) Chlorine: SACAN; (b) Chlorine: i.i.d.; (c) Stock: SACAN; (d) Stock: i.i.d. Each panel plots the data and the added noise over time.]
[Figure 7. Running Time of SCAN: (a) time per tuple (ms) vs. number of streams (SCAN on the Lab data set); (b) time per tuple (ms) vs. number of principal components (100 streams).]

The advantage of SACAN becomes clear when comparing the results shown in figures 6(a) and 6(c) (SACAN) against figures 6(b) and 6(d) (i.i.d. noise).

Privacy of SACAN and effectiveness of SACOR: Using SACOR for online reconstruction, more than 1/3 of the i.i.d. noise added to the Chlorine and Stock streams can be removed. However, the noise produced by SACAN could not be removed at all. Figure 5 demonstrates these results for varying discrepancy. The reconstruction error from SACOR is used to measure privacy; the baseline is the discrepancy between the perturbed and input streams (i.e., no attempt at reconstruction). Since SACAN produces noise that follows the same trend as the input data stream, it is hard to remove any such noise from the perturbed data.

6.4 Cost Analysis

The cost metrics include the computational requirements, measured in seconds, and the memory consumption, measured in bytes. Figure 7 shows the running time of SCAN on the Lab data set. Figure 7(a) investigates the effect of increasing the number of input streams N while fixing k = 10, whereas the second experiment studies the impact of keeping different numbers of principal components with N = 100 input streams. In both cases, we measure the average processing time per tuple, where a tuple consists of N elements, one from each input stream. The running time of SCAN is linear w.r.t. the number of input streams and almost linear w.r.t. the number of principal components, which indicates good scalability. On average, SCAN is able to process input from hundreds of streams in a few milliseconds. The same experiments performed for the other online algorithms (SCOR, SACAN, SACOR) lead to similar observations.

Memory consumption for all of our algorithms is dominated by two factors: 1) the online estimation of the local principal components, either for correlation or for autocorrelation; and 2) the number of tuples that need to be buffered. For SCAN and SCOR, no buffering is required; hence, memory is needed only for the k local principal component directions, each of which is represented by an N-dimensional vector, where N is the number of input streams. Therefore, the total memory required is Nk|R|, where |R| is the size of the floating point representation. For SACAN/SACOR, each principal direction is an h-dimensional vector. Additionally, the last h values need to be buffered, since the entire window is needed to update the principal direction estimates. Hence, the total memory consumption is hk|R| + h|R|.
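Plugging in the experimental defaults (N = 100 streams, k = 10, h = 300) and assuming 8-byte floats for |R| gives a feel for these formulas:

    R = 8                     # |R|: bytes per float (our assumption)
    N, k, h = 100, 10, 300    # defaults used in the experiments

    print(N * k * R)          # SCAN/SCOR:   N k |R|         =  8000 bytes
    print(h * k * R + h * R)  # SACAN/SACOR: h k |R| + h |R| = 26400 bytes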
6.5 Discussion

The experimental evaluation validates the superiority of our approaches. Essentially, for both correlations and autocorrelation, our algorithms are able to produce random noise that follows the same trends as the input streams, in an online fashion. In addition, our algorithms have small computation and memory overhead. Finally, we should point out that, with a relatively small amount of noise injected into the original data streams, regardless of the type of the noise, the principal components of the perturbed data will be a good approximation of those of the input data streams. Of course, there are ways to mathematically infer the principal components of the input data streams, given the perturbed data streams and some knowledge of the noise properties, such as its distribution and variance. However, in this paper we do not consider releasing any knowledge about the noise. In fact, we believe this is the right choice, as the goal is to maximally preserve privacy while maintaining a certain utility; releasing information about the noise might seriously compromise privacy. The assumption that the injected noise cannot be too large is automatically guaranteed by the utility requirement of the target mining application.
Hence, the estimates for the number of dominant principal components at reconstruction will be similar to that at perturbation. This justifies using the same number of principal components for both reconstruction and perturbation. In any case, using a different k for perturbation and reconstruction does not affect the trends and phenomena observed in our experiments.

7. Conclusion

Data streams in the real world typically exhibit both significant correlations as well as autocorrelation, thereby providing ample opportunities for adversaries to breach privacy. In this paper we develop the basic building blocks for privacy preservation on numerical streams. In particular, we focus on the fundamental cases of correlation across multiple streams and of autocorrelation within one stream. We present methods to dynamically track both and subsequently add noise that "mirrors" these statistical properties, making it indistinguishable from the original data. Thus, our techniques prevent adversaries from leveraging these properties to remove the noise and thereby breach privacy. We provide both a mathematical analysis and an experimental evaluation on real data to validate the correctness, efficiency, and effectiveness of our algorithms. Our techniques track the evolving nature of these relationships and achieve much better results than previous static, global approaches. Furthermore, to the best of our knowledge, autocorrelation-based attacks have not been previously addressed.

References

[1] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.
[2] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In PODS, 2001.
[3] R. Agrawal and R. Srikant. Privacy preserving data mining. In SIGMOD, 2000.
[4] R. Agrawal, R. Srikant, and D. Thomas. Privacy preserving OLAP. In SIGMOD, 2005.
[5] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, 2005.
[6] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
[7] E. Bertino, B. Ooi, Y. Yang, and R. Deng. Privacy and ownership preserving of outsourced medical data. In ICDE, 2005.
[8] J. Bethencourt, D. Song, and B. Waters. Constructions and practical applications for private stream searching. In IEEE Symposium on Security and Privacy, 2006.
[9] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting. Springer, 2nd edition, 2003.
[10] K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In ICDM, 2005.
[11] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2005.
[12] W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In SIGKDD, 2003.
[13] EPANET, 2002. http://www.epa.gov/ORD/NRMRL/wswrd/epanet.html.
[14] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[15] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In SIGKDD, 2002.
[16] M. Ghil, M. Allen, M. Dettinger, K. Ide, D. Kondrashov, M. Mann, A. Robertson, A. Saunders, Y. Tian, F. Varadi, and P. Yiou. Advanced spectral methods for climatic time series. Rev. Geophys., 40(1), 2002.
[17] S. Guha, D. Gunopulos, and N. Koudas. Correlating synchronous and asynchronous data streams. In KDD, 2003.
[18] S. Haykin. Adaptive Filter Theory. Prentice-Hall, 4th edition, 2002.
[19] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In SIGMOD, 2005.
[20] INET ATS, Inc. http://www.inetats.com/.
[21] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[22] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In ICDM, 2003.
[23] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In SIGMOD, 2006.
[24] F. Li, C. Chang, G. Kollios, and A. Bestavros. Characterizing and exploiting reference locality for data stream applications. In ICDE, 2006.
[25] Y. Lindell and B. Pinkas. Privacy preserving data mining. In CRYPTO, 2000.
[26] K. Liu, H. Kargupta, and J. Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE TKDE, 18(1), 2006.
[27] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006.
[28] S. Muthukrishnan. Data streams: Algorithms and applications. Technical report, Computer Science Department, Rutgers University, 2003.
[29] European Union directive on privacy protection, 2002. http://europa.eu.int/eur-lex/pri/en/oj/dat/2002/l_201/l_20120020731en00370047.pdf.
[30] A. V. Oppenheim and A. S. Willsky. Signals and Systems. Prentice-Hall, 1983.
[31] R. Ostrovsky and W. Skeith. Private searching on streaming data. In CRYPTO, 2005.
[32] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. In VLDB, 2005.
[33] S. Papadimitriou and P. Yu. Optimal multi-scale patterns in time series streams. In SIGMOD, 2006.
[34] R. O. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Trans. Ant. Prop., 34(3), 1986.
[35] L. Sweeney. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5), 2002.
[36] K. Thearling. Data mining and privacy: A conflict in making. In DS*, 1998.
[37] J. Vaidya and C. W. Clifton. Privacy preserving association rule mining in vertically partitioned data. In SIGKDD, 2002.
[38] X. Xiao and Y. Tao. Personalized privacy preservation. In SIGMOD, 2006.
[39] Y. Zhu and D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In VLDB, 2002.