Mining Data Streams
Some of these slides are based on the Stanford Mining Massive Data Sets course slides at http://www.stanford.edu/class/cs246/
Infinite Data: Data Streams
• Millions of sensors
• Each sending data every (fraction of a) second
• Analyze trends over time
LOBO Observatory: http://www.mbari.org/lobo/Intro.html
Image Data
• Satellites send terabytes of data a day
• Cameras have lower resolution, but there are more of them…
http://www.defenceweb.co.za/images/stories/AIR/Air_new/satellite.jpg
http://en.wikipedia.org/wiki/File:Three_Surveillance_cameras.jpg
Internet and Web Traffic
• Twitter
• Facebook
• Instagram
• Flickr
• …
http://blog.socialmaximizer.com/wp-content/uploads/2013/02/social_media.jpg
The Stream Model
The Stream Model: More details
[Diagram] Several streams enter the processor over time, e.g.:
  . . . 1, 5, 2, 7, 0, 9, 3
  . . . a, r, v, t, y, h, b
  . . . 0, 0, 1, 0, 1, 1, 0
Each stream is composed of tuples/elements. The processor answers standing queries and ad-hoc queries, producing output; it has only limited working storage, backed by archival storage.
Stream Queries
• Standing queries versus ad-hoc queries
  – Usually store a sliding window to support ad-hoc queries
• Examples (are these easy or hard to answer?)
  – Output an alert when an element > 25 appears
  – Compute a running average of the last 20 elements (see the sketch below)
  – Compute a running average of the elements from the past 48 hours
  – Maximum value ever seen
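A minimal Python sketch of the "last 20 elements" example, assuming the window itself fits in memory (the class and variable names are illustrative, not from the slides):

from collections import deque

class RunningAverage:
    """Running average over the last k elements of a stream (sliding window)."""

    def __init__(self, k=20):
        self.k = k
        self.window = deque()   # the last <= k elements
        self.total = 0.0        # sum of the elements currently in the window

    def add(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.k:        # evict the oldest element
            self.total -= self.window.popleft()
        return self.total / len(self.window)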
Goal
• Maintain a uniform random sample S of fixed size s over a stream of unknown length
• Ideas?
Solution Attempt
Reservoir Sampling
1. Store the first s elements of the stream in S
2. Suppose we have seen n-1 elements and now the nth arrives (n > s)
   – With probability s/n, keep the nth element; otherwise discard it
   – If we keep the nth element, randomly choose one of the elements already in S to discard
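A minimal Python sketch of the algorithm as described above (the function name is illustrative):

import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of size s over a stream."""
    S = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            S.append(element)                 # store the first s elements
        elif random.random() < s / n:         # keep the nth element w.p. s/n
            S[random.randrange(s)] = element  # evict a uniformly random victim
    return S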
Proof By Induction (cont)
• Inductive Step: For elements already in the sample, the probability that they remain after element n+1 arrives is:

$$\underbrace{\left(1-\frac{s}{n+1}\right)}_{\substack{\text{discard}\\\text{element }n+1}} \;+\; \underbrace{\frac{s}{n+1}}_{\substack{\text{keep}\\\text{element }n+1}}\cdot\underbrace{\frac{s-1}{s}}_{\substack{\text{keep element}\\\text{in sample}}} \;=\; \frac{n}{n+1}$$
The Problem
• We would like to filter the stream according to some criterion
  – If it is a simple property of the tuple (e.g., < 10): Easy!
  – If it requires a lookup in a large set that does not fit in memory: Hard!
Simple Filtering Solution
• Let h be a hash function from email addresses to 8 billion buckets, with one bit per bucket
• Hash each allowed email address, and set the corresponding bit to 1
• There are 1 billion allowed email addresses, so about 1/8 of the bits will be 1 (perhaps fewer, due to hash collisions!)
  – Given a new email, compute its hash. If the bit is 1, let the email through. If it is 0, consider the email spam
  – There will be false positives! (about 1/8th of spam gets through)
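A minimal sketch of this single-hash scheme; the slides do not prescribe a hash function, so SHA-256 is used here for illustration (shrink N to experiment, since 8 billion bits is 1 GB):

import hashlib

N = 8_000_000_000            # 8 billion bits
bits = bytearray(N // 8)     # the bit array, all zeros initially

def h(email):
    """Hash an email address to a bit position in [0, N)."""
    d = hashlib.sha256(email.encode()).digest()
    return int.from_bytes(d[:8], "big") % N

def allow(email):
    """Set the bit for an allowed address."""
    i = h(email)
    bits[i // 8] |= 1 << (i % 8)

def check(email):
    """True if the email *might* be allowed; False means definitely not allowed."""
    i = h(email)
    return bool(bits[i // 8] & (1 << (i % 8)))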
Bloom Filter
Using a Bloom Filter
• To set up the filter:
  – Take each key K in S and set bit hi(K) to 1, for all hash functions h1,…,hk
• For the analysis, think of hashing as throwing darts at targets: the probability that a given dart will not hit a given target is (x-1)/x
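A minimal Bloom filter sketch generalizing the previous code to k hash functions; deriving h1,…,hk by salting a single base hash is an implementation choice for illustration, not something the slides prescribe:

import hashlib

class BloomFilter:
    """Bloom filter with n bits and k hash functions."""

    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8 + 1)

    def _hashes(self, key):
        # Derive k hash functions h1..hk by salting one base hash.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.n

    def add(self, key):
        for i in self._hashes(key):          # set h1(K),...,hk(K) to 1
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key):
        # True only if all k bits are set: false positives are possible,
        # false negatives are not.
        return all(self.bits[i // 8] & (1 << (i % 8))
                   for i in self._hashes(key))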
Back to the Running Example
Problem
• The stream elements are drawn from some universal set
• Estimate the number of distinct elements we have seen, either from the beginning of the stream or from a specific point in time
Applications
Flajolet-Martin Algorithm: Intuition
• We will hash elements of the stream to bit-strings
  – Must choose the bit-string length so that there are more possible bit-strings than members of the universal set
  – Example: IP addresses form a universal set of size 2^(4*8) = 2^32
  – So we need strings of at least 32 bits
Flajolet-Martin Algorithm: Intuition
• Pick a hash function h that maps the universe of N elements to at least log2 N bits
• For each stream element a, let r(a) be the number of trailing 0s in h(a)
  – r(a) = position of the first 1, counting from the right (starting at 0)
  – Example: h(a) = 12; 12 is 1100 in binary, so r(a) = 2
• Estimate the number of distinct elements as 2^R, where R = max r(a) over all stream elements seen
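A minimal Python sketch of the algorithm with a single hash function (names illustrative); in practice many hash functions are combined to control the variance:

import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0 bits in x (by convention, 0 when x == 0)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1   # isolate lowest set bit, take its index

def flajolet_martin(stream, n_bits=32):
    """Estimate the number of distinct elements as 2^R,
    where R is the maximum number of trailing zeros seen."""
    R = 0
    for a in stream:
        d = hashlib.sha256(str(a).encode()).digest()
        h = int.from_bytes(d[:8], "big") % (1 << n_bits)
        R = max(R, trailing_zeros(h))
    return 2 ** R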
Why it Works: More formally
• The probability that h(a) ends in at least r zeros is 2^{-r}, so with m distinct elements, the probability that no element hashes to at least r trailing zeros is:

$$(1-2^{-r})^{m} = \left((1-2^{-r})^{2^{r}}\right)^{m2^{-r}} \approx e^{-m2^{-r}}$$

  – If 2^r ≫ m this probability approaches 1 (we almost surely see no such element), and if 2^r ≪ m it approaches 0 (we almost surely do), so 2^R is a plausible estimate of m
• E[2^R] is infinite!
  – The probability halves when R → R+1, but the value doubles
Counting Ones in a Window
Problem
• We have a window of length N over a binary stream, and want to count the number of 1s in the window
  – N is too big to store the window explicitly
DGIM Algorithm
…1011011000101110110010110
Space Requirements
• Each bucket requires O(lg N) bits
– We saw this earlier
Example
…1011011000101110110010110
How Good an Estimate is This?
• Suppose that the leftmost bucket b included has size 2^j. Let c be the correct answer.
  – Suppose the estimate is less than c: In the worst case, all the 1s in the leftmost bucket are in the query range, so the estimate misses half of bucket b, i.e., 2^{j-1}.
    • Then c is at least 2^j.
    • Actually, c is at least 2^{j+1} - 1, since there is at least one bucket of each of the sizes 2^{j-1}, 2^{j-2}, …, 1. So our estimate is at least 50% of c.
How Good an Estimate is This?
• Suppose that the leftmost bucket b included has size 2^j. Let c be the correct answer.
  – Suppose the estimate is more than c: In the worst case, only the rightmost 1 in the leftmost bucket is in the query range, and there is only one bucket of each size less than 2^j.
    • Then c = 1 + 2^{j-1} + 2^{j-2} + … + 1 = 2^j
    • Our estimate was 2^{j-1} + 2^{j-1} + 2^{j-2} + … + 1 = 2^j + 2^{j-1} - 1
    • So our estimate is at most 50% more than c
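A quick numeric check of the overestimate bound, with j = 3 as an illustrative value:

$$c = 1 + 2^{2} + 2^{1} + 2^{0} = 8 = 2^{3}, \qquad \text{estimate} = 2^{2} + (2^{2} + 2^{1} + 2^{0}) = 11 \le 1.5\,c = 12$$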
Maintaining DGIM Conditions
• Suppose we have a window of length N represented by buckets satisfying the DGIM conditions. Then a new bit comes in:
  – Check the leftmost bucket. If its timestamp is before currentTimestamp - N, remove the bucket
  – If the new bit is 0, do nothing
  – If the new bit is 1, create a new bucket of size 1
    • If there are now only 2 buckets of size 1, stop
    • Otherwise, merge the previous two buckets of size 1 into a bucket of size 2
      – If there are now only 2 buckets of size 2, stop
      – Otherwise, merge the previous two buckets of size 2 into a bucket of size 4
      – Etc.
• Time: at most O(log N) per arriving bit, since there are only log N different bucket sizes (see the sketch below)
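A minimal Python sketch of this maintenance procedure. Representing each bucket as (timestamp of its rightmost 1, size), stored newest first, is one reasonable choice, not something the slides prescribe:

class DGIM:
    """Approximate count of 1s in the last N bits of a binary stream.
    Buckets have power-of-2 sizes, with 1 or 2 buckets of each size."""

    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []   # newest first: (timestamp of rightmost 1, size)

    def add(self, bit):
        self.t += 1
        # Drop the leftmost (oldest) bucket if it fell out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, (self.t, 1))   # new bucket of size 1
        # While three buckets share a size, merge the two *older* ones,
        # keeping the newer of their timestamps.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            ts, size = self.buckets[i + 1]
            self.buckets[i + 1 : i + 3] = [(ts, 2 * size)]
            i += 1

    def count(self):
        """Estimate: full sizes of all buckets, but only half of the oldest."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2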
Example: Updating Buckets
1001010110001011010101010101011010101010101110101010111010100010110010
0010101100010110101010101010110101010101011101010101110101000101100101
0101100010110101010101010110101010101011101010101110101000101100101101
(Successive states of the window as new bits arrive; the bucket boundaries are shown on the original slide.)
Slide by Jure Leskovec: Mining Massive Datasets
Example
…1011011000101110110010110
Reducing the Error
• Instead of allowing 1 or 2 buckets of each size, allow r-1 or r buckets of each size 1, 2, 4, 8, … (for some integer r > 2)
• For the smallest and largest sizes present, we allow any number of buckets, from 1 to r
  – Use a propagation algorithm similar to the one above
Problem
• Given a stream, which items are currently popular?
  – E.g., popular movie tickets, popular items being sold on Amazon, etc.
  – "Popular" here means: appears more than s times in the window
• Possible solution
  – Keep a stream per item: at each time point, emit "1" if the item appears in the original stream and "0" otherwise
  – Use DGIM to estimate the count of 1s for each item
Example
• Original stream: 1, 2, 1, 1, 3, 2, 4
  Stream for 1: 1, 0, 1, 1, 0, 0, 0
  Stream for 2: 0, 1, 0, 0, 0, 1, 0
  Stream for 3: 0, 0, 0, 0, 1, 0, 0
  Stream for 4: 0, 0, 0, 0, 0, 0, 1
• Problem with this approach? Too many streams!
Exponentially Decaying Windows
• A heuristic for selecting frequent items
  – Gives more weight to more recent popular items
  – Instead of computing the count over the last N elements,

$$\sum_{i=t-N+1}^{t} a_i,$$

    compute a smooth aggregation over the entire stream:

$$\sum_{i=1}^{t} a_i (1-c)^{t-i},$$

    where a_1, …, a_t are the stream elements and c is a small constant
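A minimal sketch of maintaining the decayed sum incrementally; the update rule used here (age the running total by a factor of 1-c, then add the new element) is spelled out a few slides below:

class DecayedSum:
    """Maintains sum_{i<=t} a_i * (1-c)^(t-i) over a numeric stream."""

    def __init__(self, c=1e-6):
        self.c = c
        self.total = 0.0

    def add(self, a):
        # Each arrival ages every previous contribution by a factor of (1-c).
        self.total = self.total * (1 - self.c) + a
        return self.total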
Example
Stream for 1: 1, 0, 1, 1, 0, 0, 0
Stream for 4: 0, 0, 0, 0, 0, 0, 1
• Suppose c = 10^{-6}
• What is the running score $\sum_{i=1}^{t} a_i (1-c)^{t-i}$ for each stream?
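Working out the answer the slide asks for: at t = 7, stream 1 has 1s at positions 1, 3, and 4, and stream 4 has a single 1 at position 7, so

$$\text{stream 1: } (1-c)^{6} + (1-c)^{4} + (1-c)^{3} \approx 3 - 13c \approx 2.999987, \qquad \text{stream 4: } (1-c)^{0} = 1$$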
Back to Our Problem
• For each item x, maintain the decayed score

$$\sum_{i=1}^{t} x_i (1-c)^{t-i},$$

  where x_i = 1 if the ith stream element is x, and x_i = 0 otherwise
• When a new stream element arrives:
  – Multiply the current score of every item by (1-c)
  – Add 1 to the score of the item that just arrived
Memory Requirements
• Suppose we want to find items with weight greater than 1/2
• Since the sum of all scores is at most 1/c, there can't be more than 2/c items with weight 1/2 or more!
• So 2/c is a limit on the number of scores being maintained at any time
  – For other weight thresholds, we would get a different bound in a similar manner
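A minimal sketch putting the pieces together: decayed per-item scores, with any score that drops below the 1/2 threshold forgotten. Eager decay over all tracked items is used here for clarity; a real system would decay lazily using timestamps. The class and parameter names are illustrative:

class DecayingCounts:
    """Approximate decayed counts per item; at most ~2/c items are tracked."""

    def __init__(self, c=1e-6, threshold=0.5):
        self.c = c
        self.threshold = threshold
        self.scores = {}   # item -> decayed count

    def add(self, item):
        decay = 1 - self.c
        # Age every tracked score by (1-c), dropping those below the threshold.
        self.scores = {x: s * decay for x, s in self.scores.items()
                       if s * decay >= self.threshold}
        # Add 1 for the arriving item (it restarts from 0 if it was dropped,
        # which is why the scores are approximate).
        self.scores[item] = self.scores.get(item, 0.0) + 1.0

    def popular(self):
        """Items whose current decayed score exceeds the threshold."""
        return [x for x, s in self.scores.items() if s > self.threshold]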