
Streaming Algorithms, etc.

MIT
Piotr Indyk
Data Streams

• A data stream is a sequence of data that is too
  large to be stored in available storage
  (disk, memory, cache, etc.)
• Examples:
– Network traffic
– Database transactions
– Sensor networks
– Satellite data feed
Example Application: Monitoring Network Traffic
• Router routes packets
  (many packets)
  – Where do they come from?
  – Where do they go to?
• Ideally, would like to maintain a traffic matrix x[·,·]
  – For each (src, dst) packet, increment x_{src,dst}
  – Requires way too much space! (2^32 × 2^32 entries)
  – Need to maintain a compressed version of the matrix

[Figure: the matrix x, rows indexed by source, columns by destination]
Data Streams

• A data stream is a (massive) sequence of data
  – Too large to store (on disk, in memory, in cache, etc.)
• Examples:
– Network traffic (source/destination)
– Database transactions
– Sensor networks
– Satellite data feed
– …
• Approaches:
– Ignore it
– Develop algorithms for dealing with such data
This course
• Systematic introduction to the area
– Emphasis on common themes
– Connections between streaming, sketching,
compressed sensing, communication complexity, …
– Second of its kind
  (previous edition from Fall’07: see my web page at MIT)

• Style: algorithmic/theoretical…
– Background in linear algebra and probability
Topics
• Streaming model. Estimating distinct elements (L0 norm)

• Estimating L2 norm (AMS), Johnson-Lindenstrauss

• Lp norm (p<2), other norms, entropy

• Heavy hitters: L1 norm, L2 norm, sparse approximations

• Sparse recovery via LP decoding

• Lower bounds: communication complexity, indexing, L2 norm

• Options: MST, bi-chromatic matching, insertions-only streams,
  Fourier sampling, …
Plan For This Lecture
• Introduce the data stream model(s)
• Basic algorithms
– Estimating number of distinct elements in a
stream
– Intro to frequency moments and norms
Basic Data Stream Model
• Single pass over the data: i_1, i_2, …, i_n
  – Typically, we assume n is known
• Bounded storage (typically n^α or log^c n)
  – Units of storage: bits, words, or "elements"
    (e.g., points, nodes/edges)
• Fast processing time per element
– Randomness OK (in fact, almost always necessary)

8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 ...
Counting Distinct Elements
• Stream elements: numbers from {1...m}
• Goal: estimate the number of distinct elements DE in
the stream
– Up to 1±ε
– With probability 1-P
• Simpler goal: for a given T > 0, provide an algorithm
  which, with probability 1-P:
  – Answers YES, if DE > (1+ε)T
  – Answers NO, if DE < (1-ε)T
• Run, in parallel, the algorithm with
  T = 1, (1+ε), (1+ε)^2, …, n  (sketched below)
  – Total space multiplied by log_{1+ε} n ≈ log(n)/ε
  – Probability of failure multiplied by the same factor
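
To make the reduction concrete, a minimal Python sketch of the threshold grid, where threshold_test is a hypothetical stand-in for any tester with the YES/NO guarantee above:

import math

def estimate_de(stream, n, eps, threshold_test):
    # Run the tester for T = 1, (1+eps), (1+eps)^2, ..., n and return
    # the first T whose answer is NO: a 1 +- O(eps) estimate of DE.
    # ("In parallel" on the slide means one pass with all testers;
    # shown sequentially here for clarity.)
    T = 1.0
    while T <= n:
        if not threshold_test(stream, T):
            return T
        T *= 1 + eps
    return float(n)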
Vector Interpretation
Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5

[Figure: vector x with one counter per coordinate, indexed 1..9]

• Initially, x = 0
• Insertion of i is interpreted as x_i = x_i + 1
• Want to estimate DE(x) = ||x||_0
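
A minimal Python sketch of the vector interpretation, using the stream above with m = 9:

def stream_to_vector(stream, m):
    x = [0] * (m + 1)      # x[1..m]; index 0 unused
    for i in stream:
        x[i] += 1          # insertion of i: x_i = x_i + 1
    return x

stream = [8, 2, 1, 9, 1, 9, 2, 4, 4, 9, 4, 2, 5, 4, 2, 5, 8, 5, 2, 5]
x = stream_to_vector(stream, 9)
print(sum(1 for xi in x if xi != 0))   # DE(x) = ||x||_0 = 6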
Estimating DE(x)
[Figure: vector x indexed 1..9, with the random set S marking a few
coordinates (T = 4)]

• Choose a random set S of coordinates
  – For each i, we have Pr[i∈S] = 1/T
• Maintain Sum_S(x) = Σ_{i∈S} x_i
• Estimation algorithm A:
  – YES, if Sum_S(x) > 0
  – NO, if Sum_S(x) = 0
• Analysis:
  – Pr = Pr[Sum_S(x) = 0] = (1 - 1/T)^DE
  – For T "large enough": (1 - 1/T)^DE ≈ e^{-DE/T}
  – Using calculus, for ε small enough:
    • If DE > (1+ε)T, then Pr ≈ e^{-(1+ε)} < 1/e - ε/3
    • If DE < (1-ε)T, then Pr ≈ e^{-(1-ε)} > 1/e + ε/3

[Plot: Pr[Sum_S(x) = 0] as a function of DE = 1..19, decaying roughly
like e^{-DE/T}]
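
A minimal Python sketch of algorithm A for a single threshold T; the set S is stored explicitly for clarity (the Comments slide later replaces it with a hash function):

import random

def algorithm_a(stream, m, T):
    # Random set S with Pr[i in S] = 1/T, independently per coordinate.
    S = {i for i in range(1, m + 1) if random.random() < 1.0 / T}
    sum_s = sum(1 for i in stream if i in S)   # Sum_S(x) = sum over i in S of x_i
    return sum_s > 0                           # YES iff Sum_S(x) > 0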
Estimating DE(x) ctd.
• We have Algorithm A:
  – If DE > (1+ε)T, then Pr < 1/e - ε/3
  – If DE < (1-ε)T, then Pr > 1/e + ε/3
• Algorithm B:
  – Select sets S_1 … S_k, k = O(log(1/P)/ε^2)
  – Let Z = number of the Sum_{S_j}(x) that are equal to 0
  – By the Chernoff bound (defined on the next slide), with probability > 1-P:
    • If DE > (1+ε)T, then Z < k/e
    • If DE < (1-ε)T, then Z > k/e
• Total space: O( log(n)/ε · log(1/P)/ε^2 ) numbers in the range 0…n
• Can remove the log(n)/ε factor
• Bibliographic note: [Flajolet-Martin’85]
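
A minimal Python sketch of algorithm B, again with explicit sets for clarity:

import math
import random

def algorithm_b(stream, m, T, eps, P):
    k = math.ceil(math.log(1.0 / P) / eps ** 2)   # k = O(log(1/P)/eps^2)
    sets = [{i for i in range(1, m + 1) if random.random() < 1.0 / T}
            for _ in range(k)]
    sums = [0] * k
    for i in stream:                     # single pass over the stream
        for j in range(k):
            if i in sets[j]:
                sums[j] += 1
    Z = sum(1 for s in sums if s == 0)   # number of Sum_{S_j}(x) equal to 0
    return Z < k / math.e                # YES (DE > (1+eps)T) iff Z < k/e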
Interlude – Chernoff bound
• Let Z_1, …, Z_k be i.i.d. Bernoulli variables, with Pr[Z_j = 1] = p
• Let Z = Σ_j Z_j
• For any 1 > ε > 0, we have
  Pr[ |Z - E[Z]| > ε E[Z] ] ≤ 2 exp( -ε^2 E[Z] / 3 )
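
A quick simulated sanity check of the bound (illustrative, not a proof):

import math
import random

def chernoff_check(k=1000, p=0.3, eps=0.2, trials=2000):
    mean = k * p                                   # E[Z] = kp
    bound = 2 * math.exp(-eps ** 2 * mean / 3)
    bad = sum(abs(sum(random.random() < p for _ in range(k)) - mean)
              > eps * mean
              for _ in range(trials))
    return bad / trials, bound   # observed failure rate vs. the bound

print(chernoff_check())          # observed rate should not exceed the bound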
Comments
• Implementing S:
– Choose a hash function h: {1..m} -> {1..T}
– Define S={i: h(i)=1}
• Implementing h
– Pseudorandom generators. More later.
• Better algorithms known:
  – Theory: O( log(1/ε)/ε^2 + log n ) bits
    [Bar-Yossef-Jayram-Kumar-Sivakumar-Trevisan’02]
  – Practice: 128 bytes suffice for all works of Shakespeare, ε ≈ 10%
    [Durand-Flajolet’03]
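
A minimal Python sketch of the hash-based S; the modular hash below is an ad hoc stand-in for the pseudorandom generators mentioned above:

import random

class HashedSet:
    """S = {i : h(i) = 1} for a random h: {1..m} -> {1..T},
    so Pr[i in S] is about 1/T without storing S explicitly."""
    def __init__(self, T, p=2**61 - 1):
        self.T, self.p = T, p
        self.a = random.randrange(1, p)
        self.b = random.randrange(p)

    def contains(self, i):
        h = ((self.a * i + self.b) % self.p) % self.T + 1   # h(i) in {1..T}
        return h == 1

# Maintaining Sum_S(x) now takes O(1) words: on arrival of element i,
# increment a single counter iff HashedSet.contains(i).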
More comments

[Figure: vector x with coordinates indexed 1..9]

• The algorithm uses "linear sketches"
  Sum_{S_j}(x) = Σ_{i∈S_j} x_i
• Can implement decrements x_i = x_i - 1
  – I.e., the stream can contain deletions of elements
    (as long as x ≥ 0 throughout)
– Other names: dynamic model, turnstile model
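
A minimal Python sketch of the turnstile version: each Sum_{S_j}(x) is linear in x, so a deletion of i is just a decrement of the same counters:

class LinearSketch:
    def __init__(self, sets):
        self.sets = sets                # index sets S_1 .. S_k
        self.sums = [0] * len(sets)     # Sum_{S_j}(x) for each j

    def update(self, i, delta):
        # delta = +1 for an insertion of i, -1 for a deletion
        # (valid as long as x >= 0 throughout the stream)
        for j, S in enumerate(self.sets):
            if i in S:
                self.sums[j] += delta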
More General Problem
• What other functions of a vector x can we maintain in small space?
• L_p norms:
  ||x||_p = ( Σ_i |x_i|^p )^{1/p}
  – We also have ||x||_∞ = max_i |x_i|
  – … and ||x||_0 = DE(x), since ||x||_p^p = Σ_i |x_i|^p → DE(x) as p → 0
• Alternatively: frequency moments F_p = p-th power of the L_p norm
  (exception: F_0 = L_0)
• How much space do we need to estimate ||x||_p (for constant ε)?
• Theorem:
  – For p ∈ [0,2]: polylog n space suffices
  – For p > 2: n^{1-2/p} polylog n space suffices and is necessary

[Alon-Matias-Szegedy’96, Feigenbaum-Kannan-Strauss-Viswanathan’99,
Indyk’00, Coppersmith-Kumar’04, Ganguly’04, Bar-Yossef-Jayram-
Kumar-Sivakumar’02’03, Saks-Sun’03, Indyk-Woodruff’05]
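
For p = 2, the AMS approach from the Topics slide uses random signs; a minimal Python sketch, assuming fully independent signs for simplicity (the actual construction needs only 4-wise independence):

import random

def ams_l2_squared(stream, m, k=100):
    # Each z_j = sum_i s_{j,i} * x_i with random signs s in {-1,+1};
    # E[z_j^2] = ||x||_2^2, so averaging k copies concentrates.
    signs = [[random.choice((-1, 1)) for _ in range(m + 1)] for _ in range(k)]
    z = [0] * k
    for i in stream:              # single pass; the sketch is linear in x,
        for j in range(k):        # so a deletion would just subtract signs[j][i]
            z[j] += signs[j][i]
    return sum(zj * zj for zj in z) / k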
