Mining Data Streams (Part 1)
New Topic: Infinite Data
[Course topic map: locality-sensitive hashing, dimensionality reduction; PageRank, SimRank, spam detection; recommender systems, duplicate document detection; SVM, Perceptron, kNN; filtering data streams, web advertising — this lecture covers mining infinite data streams]
Data Streams
In many data mining situations, we do not
know the entire data set in advance
[Figure: streams entering a processor over time, e.g.
. . . 1, 5, 2, 7, 0, 9, 3
. . . a, r, v, t, y, h, b
. . . 0, 0, 1, 0, 1, 1, 0
Each stream is composed of elements/tuples. The processor answers standing queries and produces output, using limited working storage backed by archival storage.]
Problems on Data Streams
Types of queries one wants to answer on
a data stream: (we’ll do these today)
Sampling data from a stream
Construct a random sample
Queries over sliding windows
Number of items of type x in the last k elements
of the stream
Problems on Data Streams
Types of queries one wants to answer on
a data stream: (we’ll do these next time)
Filtering a data stream
Select elements with property x from the stream
Counting distinct elements
Number of distinct elements in the last k elements
of the stream
Estimating moments
Estimate avg./std. dev. of last k elements
Finding frequent elements
Applications (1)
Mining query streams
Google wants to know what queries are
more frequent today than yesterday
Applications (2)
Sensor Networks
Many sensors feeding into a central controller
Telephone call records
Data feeds into customer bills as well as
settlements between telephone companies
IP packets monitored at a switch
Gather information for optimal routing
Detect denial-of-service attacks
Sampling from a Data Stream:
Sampling a fixed proportion
As the stream grows, the sample also gets bigger
Sampling from a Data Stream
Since we cannot store the entire stream,
one obvious approach is to store a sample
Two different problems:
(1) Sample a fixed proportion of elements
in the stream (say 1 in 10)
(2) Maintain a random sample of fixed size
over a potentially infinite stream
At any “time” k we would like a random sample
of s elements
What is the property of the sample we want to maintain?
For all time steps k, each of the k elements seen so far has
equal prob. of being sampled
Sampling a Fixed Proportion
Problem 1: Sampling fixed proportion
Scenario: Search engine query stream
Stream of tuples: (user, query, time)
Answer questions such as: How often did a user
run the same query in a single day?
Have space to store 1/10th of query stream
Naïve solution:
Generate a random integer in [0..9] for each query
Store the query if the integer is 0, otherwise
discard
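A minimal sketch of this naive approach (Python; `stream` stands for any iterable of (user, query, time) tuples and is an assumption of the sketch):

```python
import random

def sample_tenth(stream):
    """Naive approach: keep each query independently with prob. 1/10."""
    for tup in stream:                 # tup = (user, query, time)
        if random.randint(0, 9) == 0:  # random integer in [0..9]
            yield tup                  # store the query only if it is 0
```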
Problem with Naïve Approach
Simple question: What fraction of queries by an
average search engine user are duplicates?
Suppose each user issues x queries once and d queries
twice (total of x+2d queries)
Correct answer: d/(x+d)
Proposed solution: We keep 10% of the queries
Sample will contain x/10 of the singleton queries and
2d/10 of the duplicate queries at least once
But only d/100 pairs of duplicates
d/100 = 1/10 ∙ 1/10 ∙ d
Of d “duplicates” 18d/100 appear exactly once
18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d
So the sample-based answer is (d/100) / (x/10 + d/100 + 18d/100) = d/(10x + 19d), an underestimate of the correct d/(x + d)
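The bias is easy to see empirically; here is an illustrative simulation (the values of x and d are arbitrary):

```python
import random
from collections import Counter

x, d = 100_000, 10_000        # x singleton queries, d queries issued twice
queries = [f"s{i}" for i in range(x)] + 2 * [f"d{i}" for i in range(d)]
random.shuffle(queries)

# Keep each query independently with probability 1/10
sample = [q for q in queries if random.randint(0, 9) == 0]

counts = Counter(sample)
dup_fraction = sum(c >= 2 for c in counts.values()) / len(counts)
print(f"true answer:     {d / (x + d):.4f}")    # ~0.0909
print(f"sample estimate: {dup_fraction:.4f}")   # ~ d/(10x+19d) ~ 0.0084
```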
Solution: Sample Users
Solution:
Pick 1/10th of users and take all their
searches in the sample
Generalized Solution
Stream of tuples with keys:
Key is some subset of each tuple’s components
e.g., tuple is (user, search, time); key is user
Choice of key depends on application
Hash each tuple’s key into b buckets; pick the tuple if its hash value is at most a.
How to generate a 30% sample?
Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
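A sketch of this generalized sampler, assuming Python and a cryptographic hash so the decision is deterministic per key (all names are illustrative):

```python
import hashlib

def in_sample(key, a, b):
    """Keep a tuple iff its key hashes into the first a of b buckets."""
    h = int(hashlib.sha1(repr(key).encode()).hexdigest(), 16)
    return h % b < a

# A 30% sample keyed by user: b = 10 buckets, keep buckets 0, 1, 2
stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "dogs", 3)]
sample = [t for t in stream if in_sample(t[0], a=3, b=10)]
```

Because the decision depends only on the key, all of a user's searches enter or leave the sample together, which is exactly the sample-users fix above.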
Sampling from a Data Stream:
Sampling a fixed-size sample
As the stream grows, the sample is of
fixed size
Maintaining a fixed-size sample
Problem 2: Fixed-size sample
Suppose we need to maintain a random
sample S of size exactly s tuples
E.g., main memory size constraint
Why? Don’t know length of stream in advance
Suppose at time n we have seen n items
Each item is in the sample S with equal prob. s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g…
At n= 5, each of the first 5 tuples is included in the sample S with equal prob.
At n= 7, each of the first 7 tuples is included in the sample S with equal prob.
An impractical solution would be to store all n tuples seen
so far and pick s of them at random
Solution: Fixed Size Sample
Algorithm (a.k.a. Reservoir Sampling)
Store the first s elements of the stream in S
Suppose we have seen n-1 elements, and now
the nth element arrives (n > s)
With probability s/n, keep the nth element, else discard it
If we picked the nth element, then it replaces one of the
s elements in the sample S, picked uniformly at random
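A minimal Python sketch of reservoir sampling as just described:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of exactly s stream elements."""
    S = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            S.append(item)                  # store the first s elements
        elif random.random() < s / n:       # keep the nth element w.p. s/n
            S[random.randrange(s)] = item   # it replaces a uniform victim
    return S
```

By induction over n, every element seen so far sits in S with probability exactly s/n.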
Counting Bits (1)
Problem:
Given a stream of 0s and 1s
Be prepared to answer queries of the form
How many 1s are in the last k bits? where k ≤ N
Obvious solution:
Store the most recent N bits
When a new bit comes in, discard the (N+1)st bit
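The obvious solution is a few lines with a fixed-length buffer (a sketch; `collections.deque` with `maxlen` discards the oldest bit on append):

```python
from collections import deque

class ExactBitCounter:
    """Obvious solution: store the most recent N bits explicitly."""
    def __init__(self, N):
        self.window = deque(maxlen=N)   # appending drops the oldest bit

    def add(self, bit):
        self.window.append(bit)

    def count(self, k):
        """Exact number of 1s among the last k bits (k <= N)."""
        return sum(list(self.window)[-k:])
```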
Example, N = 6: 010011011101010110110110 (time runs left to right: past → future)
Counting Bits (2)
You cannot get an exact answer without
storing the entire window
Real Problem:
What if we cannot afford to store N bits?
E.g., we’re processing 1 billion streams and
N = 1 billion
Maintain 2 counters:
S: number of 1s from the beginning of the stream
Z: number of 0s from the beginning of the stream
How many 1s are in the last N bits? Estimate: N ∙ S/(S+Z)
But, what if stream is non-uniform?
What if distribution changes over time?
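Under that uniformity assumption the query needs only the two counters (a sketch):

```python
def estimate_ones(N, S, Z):
    """Assume the 1s are spread uniformly over the whole stream so far."""
    return N * S / (S + Z)   # expected number of 1s among the last N bits
```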
[Datar, Gionis, Indyk, Motwani]
DGIM Method
DGIM solution that does not assume
uniformity
Idea: Exponential Windows
Solution that doesn’t (quite) work:
Summarize exponentially increasing regions
of the stream, looking backward
Drop small regions if they begin at the same point as a larger region
[Figure: the stream 010011100010100100010110110111001010110011010 summarized by exponentially increasing regions, looking backward, with 1-counts 0, 1, 1, 2, 2, 3, 4, 6, 10, …; a ‘?’ marks the oldest, only partially known region; the window of width 16 has 6 1s; N marks the window boundary]
We can reconstruct the count of the last N bits, except we
are not sure how many of the oldest region’s 6 1s are included in the window
What’s Good?
Stores only O(log² N) bits
O(log N) counts of log₂ N bits each
What’s Not So Good?
As long as the 1s are fairly evenly distributed,
the error due to the unknown region is small
– no more than 50%
But it could be that all the 1s are in the
unknown area at the end
In that case, the error is unbounded!
DGIM: Timestamps
Each bit in the stream has a timestamp,
starting from 1, 2, …
DGIM: Buckets
A bucket in the DGIM method is a record
consisting of:
(A) The timestamp of its end [O(log N) bits]
(B) The number of 1s between its beginning and
end [O(log log N) bits]
Constraint on buckets:
Number of 1s must be a power of 2
That explains the O(log log N) in (B) above
Example stream (last N bits): 1001010110001011010101010101011010101010101110101010111010100010110010
Representing a Stream by Buckets
Either one or two buckets with the same
power-of-2 number of 1s
At least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1:
1001010110001011010101010101011010101010101110101010111010100010110010
Updating Buckets (2)
If the current bit is 1:
(1) Create a new bucket of size 1, for just this bit
End timestamp = current time
(2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
(3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
(4) And so on …
Example: Updating Buckets
Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
The next bit, a 1, arrives and a new bucket of size 1 is created; then a 0 arrives (no change), then another 1:
0101100010110101010101010110101010101011101010101110101000101100101101
How to Query?
To estimate the number of 1s in the most
recent N bits:
1. Sum the sizes of all buckets but the last, i.e., the earliest bucket that overlaps the window
(note “size” means the number of 1s in the bucket)
2. Add half the size of the last bucket
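A compact sketch putting the update rule from the previous slide together with this query rule. Buckets are held as (end timestamp, size) pairs, newest first; this is illustrative Python, not an optimized implementation:

```python
import random

class DGIM:
    """Sketch of the DGIM method for counting 1s in a sliding window."""

    def __init__(self, N):
        self.N, self.t = N, 0
        self.buckets = []          # (end_timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end falls outside the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return                 # a 0 needs no further changes
        self.buckets.insert(0, (self.t, 1))    # new bucket of size 1
        i = 0
        # Cascade: whenever three buckets share a size, merge the oldest
        # two (merged end time = end time of the more recent of the two).
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            end_newer, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(end_newer, 2 * size)]
            i += 1

    def count(self, k=None):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        k = self.N if k is None else k
        total = last = 0
        for end, size in self.buckets:   # newest to oldest
            if end <= self.t - k:
                break                    # bucket lies outside the window
            total, last = total + size, size
        return total - last // 2         # count only half the earliest bucket

# Quick check on a random bit stream:
dgim = DGIM(N=1000)
for _ in range(10_000):
    dgim.add(random.randint(0, 1))
print(dgim.count())   # roughly 500 ones expected in the last 1000 bits
```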
Example: Bucketized Stream
At least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1:
1001010110001011010101010101011010101010101110101010111010100010110010
Error Bound: Proof
Why is error 50%? Let’s prove it!
Suppose the last bucket has size 2^r
Then by assuming 2^(r−1) (i.e., half) of its 1s are
still within the window, we make an error of
at most 2^(r−1)
Since there is at least one bucket of each of
the sizes less than 2^r, the true sum is at least
1 + 2 + 4 + … + 2^(r−1) = 2^r − 1
Thus, the error is at most 50%
[Figure: 111111110000000011101010101011010101010101110101010111010100010110010 — the window contains at least 16 1s]
Further Reducing the Error
Instead of maintaining 1 or 2 of each size
bucket, we allow either r-1 or r buckets (r > 2)
Except for the largest size buckets; we can have
any number between 1 and r of those
Error is at most O(1/r)
By picking r appropriately, we can trade off
between the number of bits we store and the
error
Extensions
Can we use the same trick to answer queries of the form:
How many 1s are in the last k bits? where k < N?
A: Find the earliest bucket B that overlaps with the last k bits.
Number of 1s is the sum of the sizes of all more recent
buckets + ½ the size of B
1001010110001011010101010101011010101010101110101010111010100010110010 (last k bits)
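With the DGIM sketch shown earlier, this extension is the same estimate restricted to a smaller window, e.g.:

```python
print(dgim.count(k=100))   # 1s among the last k = 100 bits (k < N)
```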