
Streaming Algorithms, etc.

MIT
Piotr Indyk
Data Streams

• A data stream is a sequence of data that is too
  large to be stored in available storage
  (disk, memory, cache, etc.)
• Examples:
– Network traffic
– Database transactions
– Sensor networks
– Satellite data feed
Example Application: Monitoring Network Traffic
• Router routes packets
  (many packets)
  – Where do they come from?
  – Where do they go to?
• Ideally, would like to maintain a traffic matrix x[·,·]
  – For each (src, dst) packet, increment x_{src,dst}
  – Requires way too much space! (2^32 × 2^32 entries)
  – Need to maintain a compressed version of the matrix

[Figure: the matrix x, rows indexed by source, columns by destination]
Data Streams

• A data stream is a (massive) sequence of data
  – Too large to store (on disk, in memory, in cache, etc.)
• Examples:
– Network traffic (source/destination)
– Database transactions
– Sensor networks
– Satellite data feed
– …
• Approaches:
– Ignore it
– Develop algorithms for dealing with such data
This course
• Systematic introduction to the area
– Emphasis on common themes
– Connections between streaming, sketching,
compressed sensing, communication complexity, …
– Second of its kind
  (previous edition from Fall’07: see my web page at MIT)

• Style: algorithmic/theoretical…
– Background in linear algebra and probability
Topics
• Streaming model. Estimating distinct elements (L0 norm)

• Estimating L2 norm (AMS), Johnson-Lindenstrauss

• Lp norm (p<2), other norms, entropy

• Heavy hitters: L1 norm, L2 norm, sparse approximations

• Sparse recovery via LP decoding

• Lower bounds: communication complexity, indexing, L2 norm

• Options: MST, bi-chromatic matching, insertions-only streams,
  Fourier sampling, …
Plan For This Lecture
• Introduce the data stream model(s)
• Basic algorithms
– Estimating number of distinct elements in a
stream
– Intro to frequency moments and norms
Basic Data Stream Model
• Single pass over the data: i_1, i_2, …, i_n
  – Typically, we assume n is known
• Bounded storage (typically n^α or log^c n)
  – Units of storage: bits, words, or "elements"
    (e.g., points, nodes/edges)
• Fast processing time per element
– Randomness OK (in fact, almost always necessary)

8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 ...
Counting Distinct Elements
• Stream elements: numbers from {1...m}
• Goal: estimate the number of distinct elements DE in
the stream
– Up to 1±ε
– With probability 1-P
• Simpler goal: for a given T > 0, provide an algorithm
  which, with probability 1-P:
  – Answers YES, if DE > (1+ε)T
  – Answers NO, if DE < (1-ε)T
• Run, in parallel, the algorithm with
  T = 1, (1+ε), (1+ε)^2, …, n  (sketched below)
  – Total space multiplied by log_{1+ε} n ≈ log(n)/ε
  – Probability of failure multiplied by the same factor
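
To make the reduction concrete, a minimal Python sketch of the threshold grid, where threshold_test is a hypothetical stand-in for any tester with the YES/NO guarantee above:

import math

def estimate_de(stream, n, eps, threshold_test):
    # Run the tester for T = 1, (1+eps), (1+eps)^2, ..., n and return
    # the first T whose answer is NO: a 1 +- O(eps) estimate of DE.
    # ("In parallel" on the slide means one pass with all testers;
    # shown sequentially here for clarity.)
    T = 1.0
    while T <= n:
        if not threshold_test(stream, T):
            return T
        T *= 1 + eps
    return float(n)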
Vector Interpretation
Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5

[Figure: vector x with one counter per coordinate, indexed 1..9]

• Initially, x = 0
• Insertion of i is interpreted as x_i = x_i + 1
• Want to estimate DE(x) = ||x||_0
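
A minimal Python sketch of the vector interpretation, using the stream above with m = 9:

def stream_to_vector(stream, m):
    x = [0] * (m + 1)      # x[1..m]; index 0 unused
    for i in stream:
        x[i] += 1          # insertion of i: x_i = x_i + 1
    return x

stream = [8, 2, 1, 9, 1, 9, 2, 4, 4, 9, 4, 2, 5, 4, 2, 5, 8, 5, 2, 5]
x = stream_to_vector(stream, 9)
print(sum(1 for xi in x if xi != 0))   # DE(x) = ||x||_0 = 6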
Estimating DE(x)
[Figure: vector x indexed 1..9, with the random set S marking a few
coordinates (T = 4)]

• Choose a random set S of coordinates
  – For each i, we have Pr[i∈S] = 1/T
• Maintain Sum_S(x) = Σ_{i∈S} x_i
• Estimation algorithm A:
  – YES, if Sum_S(x) > 0
  – NO, if Sum_S(x) = 0
• Analysis:
  – Pr = Pr[Sum_S(x) = 0] = (1 - 1/T)^DE
  – For T "large enough": (1 - 1/T)^DE ≈ e^{-DE/T}
  – Using calculus, for ε small enough:
    • If DE > (1+ε)T, then Pr ≈ e^{-(1+ε)} < 1/e - ε/3
    • If DE < (1-ε)T, then Pr ≈ e^{-(1-ε)} > 1/e + ε/3

[Plot: Pr[Sum_S(x) = 0] as a function of DE = 1..19, decaying roughly
like e^{-DE/T}]
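
A minimal Python sketch of algorithm A for a single threshold T; the set S is stored explicitly for clarity (the Comments slide later replaces it with a hash function):

import random

def algorithm_a(stream, m, T):
    # Random set S with Pr[i in S] = 1/T, independently per coordinate.
    S = {i for i in range(1, m + 1) if random.random() < 1.0 / T}
    sum_s = sum(1 for i in stream if i in S)   # Sum_S(x) = sum over i in S of x_i
    return sum_s > 0                           # YES iff Sum_S(x) > 0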
Estimating DE(x) ctd.
• We have Algorithm A:
  – If DE > (1+ε)T, then Pr < 1/e - ε/3
  – If DE < (1-ε)T, then Pr > 1/e + ε/3
• Algorithm B:
  – Select sets S_1 … S_k, k = O(log(1/P)/ε^2)
  – Let Z = number of the Sum_{S_j}(x) that are equal to 0
  – By the Chernoff bound (defined on the next slide), with probability > 1-P:
    • If DE > (1+ε)T, then Z < k/e
    • If DE < (1-ε)T, then Z > k/e
• Total space: O( log(n)/ε · log(1/P)/ε^2 ) numbers in the range 0…n
• Can remove the log(n)/ε factor
• Bibliographic note: [Flajolet-Martin’85]
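
A minimal Python sketch of algorithm B, again with explicit sets for clarity:

import math
import random

def algorithm_b(stream, m, T, eps, P):
    k = math.ceil(math.log(1.0 / P) / eps ** 2)   # k = O(log(1/P)/eps^2)
    sets = [{i for i in range(1, m + 1) if random.random() < 1.0 / T}
            for _ in range(k)]
    sums = [0] * k
    for i in stream:                     # single pass over the stream
        for j in range(k):
            if i in sets[j]:
                sums[j] += 1
    Z = sum(1 for s in sums if s == 0)   # number of Sum_{S_j}(x) equal to 0
    return Z < k / math.e                # YES (DE > (1+eps)T) iff Z < k/e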
Interlude – Chernoff bound
• Let Z_1, …, Z_k be i.i.d. Bernoulli variables, with Pr[Z_j = 1] = p
• Let Z = Σ_j Z_j
• For any 1 > ε > 0, we have
  Pr[ |Z - E[Z]| > ε E[Z] ] ≤ 2 exp( -ε^2 E[Z] / 3 )
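
A quick simulated sanity check of the bound (illustrative, not a proof):

import math
import random

def chernoff_check(k=1000, p=0.3, eps=0.2, trials=2000):
    mean = k * p                                   # E[Z] = kp
    bound = 2 * math.exp(-eps ** 2 * mean / 3)
    bad = sum(abs(sum(random.random() < p for _ in range(k)) - mean)
              > eps * mean
              for _ in range(trials))
    return bad / trials, bound   # observed failure rate vs. the bound

print(chernoff_check())          # observed rate should not exceed the bound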
Comments
• Implementing S:
– Choose a hash function h: {1..m} -> {1..T}
– Define S={i: h(i)=1}
• Implementing h
– Pseudorandom generators. More later.
• Better algorithms known:
  – Theory: O( log(1/ε)/ε^2 + log n ) bits
    [Bar-Yossef-Jayram-Kumar-Sivakumar-Trevisan’02]
  – Practice: 128 bytes suffice for all works of Shakespeare, ε ≈ 10%
    [Durand-Flajolet’03]
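
A minimal Python sketch of the hash-based S; the modular hash below is an ad hoc stand-in for the pseudorandom generators mentioned above:

import random

class HashedSet:
    """S = {i : h(i) = 1} for a random h: {1..m} -> {1..T},
    so Pr[i in S] is about 1/T without storing S explicitly."""
    def __init__(self, T, p=2**61 - 1):
        self.T, self.p = T, p
        self.a = random.randrange(1, p)
        self.b = random.randrange(p)

    def contains(self, i):
        h = ((self.a * i + self.b) % self.p) % self.T + 1   # h(i) in {1..T}
        return h == 1

# Maintaining Sum_S(x) now takes O(1) words: on arrival of element i,
# increment a single counter iff HashedSet.contains(i).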
More comments

[Figure: vector x with coordinates indexed 1..9]

• The algorithm uses "linear sketches"
  Sum_{S_j}(x) = Σ_{i∈S_j} x_i
• Can implement decrements x_i = x_i - 1
  – I.e., the stream can contain deletions of elements
    (as long as x ≥ 0 throughout)
– Other names: dynamic model, turnstile model
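
A minimal Python sketch of the turnstile version: each Sum_{S_j}(x) is linear in x, so a deletion of i is just a decrement of the same counters:

class LinearSketch:
    def __init__(self, sets):
        self.sets = sets                # index sets S_1 .. S_k
        self.sums = [0] * len(sets)     # Sum_{S_j}(x) for each j

    def update(self, i, delta):
        # delta = +1 for an insertion of i, -1 for a deletion
        # (valid as long as x >= 0 throughout the stream)
        for j, S in enumerate(self.sets):
            if i in S:
                self.sums[j] += delta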
More General Problem
• What other functions of a vector x can we maintain in small space?
• L_p norms:
  ||x||_p = ( Σ_i |x_i|^p )^{1/p}
  – We also have ||x||_∞ = max_i |x_i|
  – … and ||x||_0 = DE(x), since ||x||_p^p = Σ_i |x_i|^p → DE(x) as p → 0
• Alternatively: frequency moments F_p = p-th power of the L_p norm
  (exception: F_0 = L_0)
• How much space do we need to estimate ||x||_p (for constant ε)?
• Theorem:
  – For p ∈ [0,2]: polylog n space suffices
  – For p > 2: n^{1-2/p} polylog n space suffices and is necessary

[Alon-Matias-Szegedy’96, Feigenbaum-Kannan-Strauss-Viswanathan’99,
Indyk’00, Coppersmith-Kumar’04, Ganguly’04, Bar-Yossef-Jayram-
Kumar-Sivakumar’02’03, Saks-Sun’03, Indyk-Woodruff’05]
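
For p = 2, the AMS approach from the Topics slide uses random signs; a minimal Python sketch, assuming fully independent signs for simplicity (the actual construction needs only 4-wise independence):

import random

def ams_l2_squared(stream, m, k=100):
    # Each z_j = sum_i s_{j,i} * x_i with random signs s in {-1,+1};
    # E[z_j^2] = ||x||_2^2, so averaging k copies concentrates.
    signs = [[random.choice((-1, 1)) for _ in range(m + 1)] for _ in range(k)]
    z = [0] * k
    for i in stream:              # single pass; the sketch is linear in x,
        for j in range(k):        # so a deletion would just subtract signs[j][i]
            z[j] += signs[j][i]
    return sum(zj * zj for zj in z) / k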
