Mining Data Streams


Mining Data Streams

Some of these slides are based on Stanford Mining Massive Data Sets
course slides at http://www.stanford.edu/class/cs246/
Infinite Data: Data Streams

• Until now, we assumed that the data is finite
– Data is crawled and stored
– Queries are issued against an index

• However, sometimes data is infinite (it is constantly being created) and therefore becomes too big to be stored
– Examples of data streams?
– Later: How can we query such data?
Sensor Data

• Millions of sensors
• Each sending data every (fraction of a) second
• Analyze trends over time

LOBO Observatory: http://www.mbari.org/lobo/Intro.html
Image Data

• Satellites send terabytes of data a day
• Cameras have lower resolution, but there are more of them…

http://www.defenceweb.co.za/images/stories/AIR/Air_new/satellite.jpg
http://en.wikipedia.org/wiki/File:Three_Surveillance_cameras.jpg
Internet and Web Traffic

• Streams of HTTP requests, IP addresses, etc., can be used to monitor traffic for suspicious activity
• Streams of Google queries can be used to determine query trends, or to understand disease spread
– Can the spread of the flu be predicted accurately by Google queries?
Social Updates

• Twitter
• Facebook
• Instagram
• Flickr
• …

http://blog.socialmaximizer.com/wp-content/uploads/2013/02/social_media.jpg
The Stream Model

• Tuples in the stream enter at a rapid rate
• Have a fixed working memory size
• Have a fixed (but larger) archival storage

• How do you make critical calculations quickly, i.e., using only working memory?
The Stream Model: More details
[Figure: A stream processor receives several streams of tuples/elements entering over time (e.g., …1,5,2,7,0,9,3; …a,r,v,t,y,h,b; …0,0,1,0,1,1,0). It answers standing queries and ad-hoc queries, produces output, and is backed by limited working storage plus larger archival storage.]
Stream Queries
• Standing queries versus ad-hoc queries
– Usually store a sliding window for ad-hoc queries

• Examples (are these easy or hard to answer?)
– Output an alert when an element > 25 appears
– Compute a running average of the last 20 elements
– Compute a running average of the elements from the past 48 hours
– Maximum value ever seen
– Compute the number of different elements seen
Constraints and Issues

• Data appears rapidly
– Elements must be processed in real time (= in main memory), or they are lost

• We may be willing to get an approximate answer, instead of an exact answer, to save time
• Hashing will be useful, as it adds randomness to algorithms!
Sampling Data in a Stream

Goal

• Create a sample of the stream
• Answer queries over the subset and have it be representative of the entire stream

• Two different problems:
– Fixed proportion of elements (e.g., 1 in 10)
– Random sample of fixed size (e.g., at any time k, we should have a random sample of s elements)
Motivating Example

• Search engine stream of tuples
– (user, query, time)

• We have room to store 1/10 of the stream

• Query: What fraction of the typical user's queries were repeated over the past month?

• Ideas?
Solution Attempt

• For each tuple, choose a random number from 0 to 9
• Store the tuples for which the random value was 0
• On average, per user, we store 1/10 of the queries
• Can this be used to answer our question on repeated queries?
Wrong!
• Suppose in past month, a user issued x queries
once and d queries twice (in total x+2d queries)
• Correct answer: d/(x+d)
• We have a 10% sample, so we have:
– x/10 of the singleton queries
– d/100 of the duplicate queries sampled as pairs (both copies kept)
– 18d/100 of the duplicate queries appearing exactly once, since
  18d/100 = ((1/10 · 9/10) + (9/10 · 1/10)) · d
• We will therefore give the following wrong answer:

  (d/100) / (x/10 + 18d/100 + d/100) = d / (10x + 19d)
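To see the bias concretely, here is a small Python simulation (our own illustration, not from the slides; the values of x and d are arbitrary) that samples 1 in 10 tuples and compares the resulting fraction with both d/(10x + 19d) and the correct answer d/(x + d):

```python
import random
from collections import Counter

random.seed(0)
x, d = 100_000, 50_000  # x queries issued once, d queries issued twice
queries = [("single", i) for i in range(x)] + 2 * [("dup", i) for i in range(d)]

kept = [q for q in queries if random.randrange(10) == 0]  # 1-in-10 sample

counts = Counter(kept)
pairs = sum(1 for c in counts.values() if c == 2)    # duplicates kept twice
singles = sum(1 for c in counts.values() if c == 1)  # queries kept once

print(pairs / (pairs + singles))  # empirically close to d / (10x + 19d)
print(d / (10 * x + 19 * d))      # the biased answer derived above, ~0.026
print(d / (x + d))                # the correct answer, ~0.333
```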
Solution

• Pick 1/10 of the users, and include all of their searches in the sample
• Use a hash function that hashes the user name (or user id) into 10 buckets
– Store the data if the user is hashed into bucket 0

• How would you store a fixed fraction a/b of the stream?
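A minimal Python sketch of this idea (our own illustration; the function name `in_sample` and the use of MD5 are arbitrary choices): hash the user into b buckets and keep the tuple iff the bucket index is below a, which stores a fraction a/b of the users.

```python
import hashlib

def in_sample(user: str, a: int = 1, b: int = 10) -> bool:
    """Keep a user iff their hash falls into one of the first a of b buckets."""
    bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % b
    return bucket < a

stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
sample = [t for t in stream if in_sample(t[0])]  # all of a user's tuples, or none
```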
Variant of the Problem

• Store a random sample of fixed size (e.g., at any time k, we should have a random sample of s elements)
– Each of the k elements seen so far should have the same probability of being in the sample
– That is, each of the k elements seen so far should have probability s/k of being in the sample

• Ideas?
Reservoir Sampling
1. Store all of the first s elements of the stream in S
2. Suppose we have seen n-1 elements, and now the nth arrives (n > s)
– With probability s/n, keep the nth element; otherwise discard it
– If we keep the nth element, then randomly choose one of the elements already in S to discard

• Claim: After n elements, the sample contains each element seen so far with probability s/n
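A straightforward Python sketch of this procedure (our own illustration):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of s elements from a stream."""
    sample = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            sample.append(element)                 # store the first s elements
        elif random.random() < s / n:              # keep the nth with prob s/n
            sample[random.randrange(s)] = element  # evict a uniform victim
    return sample

print(reservoir_sample(range(1_000_000), s=5))
```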
Proof By Induction
• Inductive Claim: If, after n elements, the
sample S contains each element seen so far
with probability s/n, then, after n+1 elements,
the sample S contains each element with
probability s/(n+1)
• Base Case: n=s, each element of the first n
has probability s/n = 1 to be in sample

Proof By Induction (cont)
• Inductive Step: For elements already in the sample, the probability that they remain is:

  (1 - s/(n+1)) + (s/(n+1)) · ((s-1)/s) = n/(n+1)

  where the first term is the probability that element n+1 is discarded, and the second is the probability that element n+1 is kept but this element survives the random eviction

• At time n, a tuple is in the sample with probability s/n, so at time n+1 its probability to still be in the sample is (s/n) · (n/(n+1)) = s/(n+1)
Filtering Stream Data

The Problem
• We would like to filter the stream according to some criterion
– If it is a simple property of the tuple (e.g., < 10): Easy!
– If it requires a lookup in a large set that does not fit in memory: Hard!

• Example: Stream of URLs (for crawling). Filter out a URL if it has been "seen" already
• Example: Stream of emails. We have addresses believed to be non-spam. Filter out "spam" (perhaps for further processing)
Filtering Example
• Suppose we have 1 billion "allowed" email addresses
• Each email address takes ~20 bytes
• We have 1 GB of available main memory
– We cannot store a hash table of allowed emails in main memory!

• View main memory as a bit array
– We have 8 billion bits available
Simple Filtering Solution
• Let h be a hash function from email addresses to 8 billion bit positions
• Hash each allowed email, and set the corresponding bit to 1
• There are 1 billion email addresses, so about 1/8 of the bits will be 1 (perhaps fewer, due to hash collisions!)
– Given a new email, compute its hash. If the bit is 1, let the email through. If it is 0, consider it spam
– There will be false positives! (about 1/8 of spam gets through)
Bloom Filter

• A Bloom filter consists of:
– An array of n bits, all initialized to 0
– A collection of hash functions h1,…,hk. Each function maps a key to one of the n bit positions
– A set S of m key values

• Goal: Allow all stream elements whose keys are in S. Reject all others.
Using a Bloom Filter
• To set up the filter:
– Take each key K in S and set bit hi(K) to 1, for all hash functions h1,…,hk

• To use the filter:
– Given a new key K in the stream, compute hi(K) for all hash functions h1,…,hk
– If all of these bits are 1, allow the element. If at least one of these bits is 0, reject the element

• Can we have false negatives? Can we have false positives?
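A toy Python sketch of a Bloom filter (our own illustration; deriving the k hash functions from salted MD5 digests is just one convenient choice):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int, k: int):
        self.n, self.k = n_bits, k
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key: str):
        for i in range(self.k):  # k salted hashes simulate h1,...,hk
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # No false negatives; false positives are possible.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

bf = BloomFilter(n_bits=8_000, k=4)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True
print(bf.might_contain("eve@example.com"))    # False (with high probability)
```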
Probability of a False Positive

• Model: Throwing darts at targets
– Suppose we have x targets and y darts
– Any dart is equally likely to hit any target
– How many targets will be hit at least once?

• Probability that a given dart will not hit a given target: (x-1)/x
• Probability that none of the y darts will hit a given target:

  ((x-1)/x)^y = (1 - 1/x)^y = ((1 - 1/x)^x)^(y/x)
Probability of a False Positive (cont)

• As x goes to infinity, it is well known that

  lim_{x→∞} (1 - 1/x)^x = 1/e

• So, if x is large, we have

  ((x-1)/x)^y = ((1 - 1/x)^x)^(y/x) ≈ e^(-y/x)
Back to the Running Example

• There are 1 billion email addresses (= 1 billion darts)
• There are 8 billion bits (= 8 billion targets)
• For now, assume a single hash function
• Probability that a given bit is not hit:

  e^(-1,000,000,000 / 8,000,000,000) = e^(-1/8) ≈ 0.8825

• Probability that a given bit is hit is approximately 0.1175
Now, multiple hash functions

• Suppose we use k hash functions
• The number of targets is n
• The number of darts is km
• Probability that a bit remains 0: e^(-km/n)

• Choose k so as to have enough 0 bits
• The probability of a false positive is now

  (1 - e^(-km/n))^k
Counting Distinct Stream
Elements

Problem
• The stream elements come from some universal set
• Estimate the number of different elements we have seen, either from the beginning or from a specific point in time

• Why is this hard?
Applications

• How many different words appear in the Web pages at a site?
– Why is this interesting?

• How many different Web pages does a customer request in a week?
– Why is this interesting?

• How many distinct products have we sold in the last month?
Obvious Solution
• Keep all elements seen so far in a hash table
• When an element appears, check the hash table
• But what if there are too many elements to store in the hash table?
• Or too many streams to store each one in a hash table?
• Solution: Use a small amount of memory, and estimate the correct value
– Limit the probability that the error is large
Flajolet-Martin Algorithm:
Intuition
• We will hash the elements of the stream to bit-strings
– Must choose the bit-string length so that there are more possible bit-strings than members of the universal set
– Example: IP addresses form a universal set of size 2^32 (4 bytes = 32 bits)
– So we need bit-strings of length at least 32

• The more elements we see, the more likely we are to see "unusual" hash values
Flajolet-Martin Algorithm:
Intuition
• Pick a hash function h that maps the universe of N elements to at least log2(N) bits
• For each stream element a, let r(a) be the number of trailing 0-s in h(a)
– r(a) = the number of 0-s before the first 1, counting from the right
– Example: h(a) = 12, which is 1100 in binary, so r(a) = 2

• Use R to denote the maximum tail length seen so far
• Estimate the number of distinct elements as 2^R
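A compact Python sketch of the single-hash-function version (our own illustration; MD5 truncated to 32 bits stands in for h):

```python
import hashlib

def tail_length(v: int) -> int:
    """Number of trailing 0 bits; e.g., 12 = 0b1100 -> 2."""
    return (v & -v).bit_length() - 1 if v else 0

def flajolet_martin(stream, bits=32):
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16) % (1 << bits)
        R = max(R, tail_length(h))
    return 2 ** R  # estimate of the number of distinct elements

# True answer is 1000; the estimate will be a nearby power of 2.
print(flajolet_martin(i % 1000 for i in range(100_000)))
```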
Why it Works: Very rough and heuristic
• h(a) hashes a with equal probability to any of N values
• Then h(a) is a sequence of log2(N) bits, where a 2^(-r) fraction of all a-s have a tail of r zeros
– About 50% of a-s hash to ***0
– About 25% of a-s hash to **00
– So, if the longest tail is R = 2, then we have probably seen about 4 distinct items so far

• In other words, we need to hash about 2^r items before we see one with a zero-suffix of length r
Why it Works: More formally

• Let m be the number of distinct items
• We will show that the probability of finding a tail of r zeros:
– Goes to 1 if m is much greater than 2^r
– Goes to 0 if m is much smaller than 2^r

• So, 2^R will be around m!
Why it Works: More formally

• What is the probability that a given h(a) ends in at least r 0-s?
– h(a) hashes elements uniformly at random
– The probability that a random number ends in at least r 0-s is 2^(-r)

• The probability of not seeing a tail of length r among m elements: (1 - 2^(-r))^m
Why it Works: More formally

• The probability of not finding a tail of length r is:

  (1 - 2^(-r))^m = ((1 - 2^(-r))^(2^r))^(m·2^(-r)) ≈ e^(-m·2^(-r))

– If m << 2^r, this probability tends to 1
  • So the probability of finding a tail of length r tends to 0
– If m >> 2^r, this probability tends to 0
  • So the probability of finding a tail of length r tends to 1

• So, 2^R will be around m!
Why it Doesn’t Work

• E[2^R] is infinite!
– The probability halves when R → R+1, but the value doubles

• The workaround uses many hash functions hi and many samples of Ri
– Details omitted
Counting Ones in a Window

Problem
• We have a window of length N on a binary stream
– N is too big to store

• At all times, we want to be able to answer: "How many 1-s are there in the last k bits?" (k ≤ N)
• Example: For each spam mail seen, we emitted a 1. Now we want to always know how many of the last million emails were spam
• Example: For each tweet seen, we emitted a 1 if it is anti-Semitic. Now we want to always know how many of the last billion tweets were anti-Semitic
Cost of Exact Counts
• For a precise count, we must store all N bits of the window
– This is easy to show, even if we can only ask about the number of 1-s in the entire window
– Suppose we use j < N bits to store information
– Then there are two different bit sequences that are represented in the same manner
– Suppose the sequences agree on their last k-1 bits, but differ on the kth
– Follow each window with N-k further 1-s
– We are left with the same representation for the two different strings, but they must have a different number of 1-s!
The Datar-Gionis-Indyk-Motwani
Algorithm (DGIM Algorithm)
• Uses O(log²(N)) bits and gets an estimate whose error is no more than 50%
– Later, we improve this to get a better estimate

• Assume each bit has a timestamp (i.e., the position in which it arrives)
– Represent timestamps modulo N – this requires log(N) bits each

• Store the total number of bits ever seen, modulo N
DGIM Algorithm

• Divide the window into buckets, each consisting of:
– The timestamp of its right (most recent) end
– The size of the bucket = the number of 1-s in the bucket. This number must be a power of 2

• Bucket representation: log(N) bits for the timestamp + log(log(N)) bits for the size
– We know that the size is some 2^j, so we only store j. Since j is at most log(N), it needs log(log(N)) bits
Rules for Representing a
Stream by Buckets
1) Right end of a bucket always has a 1
2) Every position with 1 is in some bucket
3) No position is in more than one bucket
4) There are one or two buckets of any given
size, up to some maximum size
5) All sizes must be a power of 2
6) Buckets cannot decrease in size as we move to the left
Example: Bucketized Stream

…1011011000101110110010110

(From left to right: one bucket of size 8, two of size 4, one of size 2, two of size 1)

• Observe that all of the DGIM rules hold
Space Requirements
• Each bucket requires O(log N) bits
– We saw this earlier

• For a window of length N, there can be at most N 1-s
– If the largest bucket has size 2^j, then j cannot be more than log(N)
– There are at most 2 buckets of each of the log(N) possible sizes, i.e., O(log(N)) buckets
– Total space for the buckets: O(log²(N))
Query Answering

• Given k ≤ N, we want to know how many of the last k bits were 1-s
– Find the bucket b with the earliest timestamp that includes at least some of the last k bits
– Estimate the number of 1-s as the sum of the sizes of the buckets to the right of b, plus half the size of b
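In code, the estimate might look like the following Python sketch (our own illustration), where buckets are kept as (right-end timestamp, size) pairs, newest first:

```python
def estimate_ones(buckets, k, t):
    """DGIM estimate of the 1-s among the last k bits, given the current time t."""
    in_range = [(ts, size) for ts, size in buckets if ts > t - k]
    if not in_range:
        return 0
    *rest, (_, b_size) = in_range  # the last entry is the leftmost bucket b
    return sum(size for _, size in rest) + b_size // 2  # half of b, rounded down

# Buckets resembling the example above (timestamps are illustrative):
buckets = [(25, 1), (23, 1), (21, 2), (18, 4), (14, 4), (8, 8)]
print(estimate_ones(buckets, k=10, t=25))
```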
Example

…1011011000101110110010110

(From left to right: one bucket of size 8, two of size 4, one of size 2, two of size 1)

• What would be your estimate if k = 10?
• What if k = 15?
How Good an Estimate is This?
• Suppose that the leftmost bucket b included has size 2^j. Let c be the correct answer.
– Suppose the estimate is less than c: In the worst case, all the 1-s in the leftmost bucket are in the query range, so the estimate misses half of bucket b, i.e., 2^(j-1)
• Then c is at least 2^j
• Actually, c is at least 2^(j+1) - 1, since there is at least one bucket of each size 2^(j-1), 2^(j-2), …, 1. So our estimate is at least 50% of c
How Good an Estimate is This?
• Suppose that the leftmost bucket b included has size 2^j. Let c be the correct answer.
– Suppose the estimate is more than c: In the worst case, only the rightmost 1 in the leftmost bucket is in the query range, and there is only one bucket of each size less than 2^j
• Then c = 1 + 2^(j-1) + 2^(j-2) + … + 1 = 2^j
• Our estimate was 2^(j-1) + 2^(j-1) + 2^(j-2) + … + 1 = 2^j + 2^(j-1) - 1
• So our estimate is at most 50% more than c
Maintaining DGIM Conditions
• Suppose we have a window of length N represented by buckets satisfying the DGIM conditions. Then a new bit comes in:
– Check the leftmost bucket. If its timestamp is earlier than currentTimestamp - N, remove the bucket
– If the new bit is 0, do nothing
– If the new bit is 1, create a new bucket of size 1
• If there are now at most 2 buckets of size 1, stop
• Otherwise, merge the two earlier buckets of size 1 into a bucket of size 2
• If there are now at most 2 buckets of size 2, stop
• Otherwise, merge the two earlier buckets of size 2 into a bucket of size 4
• Etc.
(Time: at most log(N) merge steps, since there are log(N) different sizes)
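The following Python sketch (our own illustration, using the same (timestamp, size) representation as before) implements this update, cascading merges whenever three buckets of one size appear:

```python
def add_bit(buckets, bit, t, N):
    """Process one arriving bit. `buckets` holds (timestamp, size), newest first."""
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()              # leftmost bucket fell out of the window
    if bit == 1:
        buckets.insert(0, (t, 1))  # new bucket of size 1
        i = 0
        # Three buckets of one size? Merge the two oldest into one of double size.
        while i + 2 < len(buckets) and buckets[i][1] == buckets[i + 2][1]:
            ts = buckets[i + 1][0]  # right end of the more recent of the two
            buckets[i + 1:i + 3] = [(ts, 2 * buckets[i][1])]
            i += 1
    return buckets
```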
Example: Updating Buckets

1001010110001011010101010101011010101010101110101010111010100010110010

0010101100010110101010101010110101010101011101010101110101000101100101

0101100010110101010101010110101010101011101010101110101000101100101101

(Successive snapshots of the window as new bits arrive; the bucket boundaries and merge steps were shown graphically on the original slide)
Slide by Jure Leskovec: Mining Massive Datasets
Example

…1011011000101110110010110

(From left to right: one bucket of size 8, two of size 4, one of size 2, two of size 1)

• What happens if the next 3 bits are 1, 1, 1?
Reducing the Error
• Instead of allowing 1 or 2 buckets of each size, allow r-1 or r buckets of each size 1, 2, 4, 8, … (for some integer r > 2)
• For the smallest and largest sizes present, we allow any number of buckets, from 1 to r
– Use a propagation algorithm similar to the one above

• Buckets are smaller, so there is a tighter bound on the error
– One can prove that the error is at most 1/r
Counting Frequent Items

Problem
• Given a stream, which items are currently popular?
– E.g., popular movie tickets, popular items being sold on Amazon, etc.
– I.e., items that appear more than s times in the window

• Possible solution (see the example below)
– One stream per item: at each timepoint, emit "1" if the item appears in the original stream and "0" otherwise
– Use DGIM to estimate the count of 1-s for each item
Example
• Original stream: 1, 2, 1, 1, 3, 2, 4

  Stream for 1: 1, 0, 1, 1, 0, 0, 0
  Stream for 2: 0, 1, 0, 0, 0, 1, 0
  Stream for 3: 0, 0, 0, 0, 1, 0, 0
  Stream for 4: 0, 0, 0, 0, 0, 0, 1

• Problem with this approach? Too many streams!
Exponentially Decaying
Windows
• A heuristic for selecting frequent items
– Gives more weight to more recent popular items
– Instead of computing a count over the last N elements, compute a smooth aggregation over the entire stream

• If the stream is a1, a2, … and we are taking the sum over the stream, the answer at time t is defined as

  Σ_{i=1..t} a_i·(1-c)^(t-i)

  where c is a tiny constant, ~10^(-6)


Exponentially Decaying
Windows (cont)
• If the stream is a1, a2, … and we are taking the sum over the stream, the answer at time t is defined as

  Σ_{i=1..t} a_i·(1-c)^(t-i)

  where c is a tiny constant, ~10^(-6)

• When a new element a_{t+1} arrives: (1) multiply the current sum by (1-c), and (2) add a_{t+1}
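This update rule is one line of code per element. A minimal Python sketch (our own illustration):

```python
def decayed_sum(stream, c=1e-6):
    """Running exponentially decayed sum of a_i * (1-c)^(t-i)."""
    total = 0.0
    for a in stream:
        total = total * (1 - c) + a  # decay the past, then add the new element
    return total

print(decayed_sum([1, 0, 1, 1, 0, 0, 0]))  # score of "Stream for 1" below
```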
Example
Stream for 1: 1, 0, 1, 1, 0, 0, 0

Stream for 4: 0, 0, 0, 0, 0, 0, 1

• Suppose c = 10^(-6)
• What is the running score Σ_{i=1..t} a_i·(1-c)^(t-i) for each stream?
Back to Our Problem

• For each different item x, we compute the running score of the stream defined by the characteristic function of the item
– The stream in which there is a 1 when the item appears and a 0 otherwise:

  Σ_{i=1..t} δ_i^x·(1-c)^(t-i)

  where δ_i^x = 1 if a_i = x, and 0 otherwise
Retaining the Running Scores

• Each time we see some item x in the stream:
1) Multiply all running scores by (1-c)
2) Add 1 to the running score corresponding to x (create a new running score with value 1 if there is none)

• How much memory do we need for the running scores???
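A Python sketch of this bookkeeping (our own illustration; the pruning threshold of 1/2 anticipates the memory argument on the next slide):

```python
def update_scores(scores, x, c=1e-6, threshold=0.5):
    """Decay all running scores, bump item x, and drop negligible scores."""
    for item in list(scores):
        scores[item] *= (1 - c)
        if scores[item] < threshold:
            del scores[item]  # at most 2/c items can keep a score >= 1/2
    scores[x] = scores.get(x, 0.0) + 1.0
    return scores

scores = {}
for x in [1, 2, 1, 1, 3, 2, 4]:
    update_scores(scores, x)
print(scores)  # item 1 has the highest recency-weighted count
```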
Property of Decaying Windows

• Remember, for each item x, we have a running score

  Σ_{i=1..t} δ_i^x·(1-c)^(t-i)

• Summing over all running scores, we get

  Σ_x Σ_{i=1..t} δ_i^x·(1-c)^(t-i) = Σ_{i=1..t} (1-c)^(t-i) → 1/c  as t → ∞
Memory Requirements
• Suppose we want to find items with weight greater than ½
• Since the sum of all scores is at most 1/c, there can't be more than 2/c items with weight ½ or more!
• So 2/c is a limit on the number of scores being counted at any time
– For other weight requirements, we would get a different bound in a similar manner

• Think about it: How would you choose c?
