Unit II BD
MS.MEENAKSHI.P,AP/INT
OBJECTIVES & OUTCOMES
OBJECTIVES
To understand the concept of big data.
To learn about various practical data analytics with R and Hadoop.
To learn about big data frameworks.
OUTCOMES:
Upon completion of the course, students will be able to
Work with big data tools and their analysis techniques
Design efficient algorithms for mining the data from large volumes
Design an efficient recommendation system
Design the tools for visualization
Learn NoSQL databases and management.
• Data streams are continuous flows of data
• Data streams include network traffic, sensor data, call center
records and so on.
Stream Data Applications
• Telecommunication: calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial markets: stock exchange data
• Engineering & industrial processes: power supply & manufacturing
• Sensors, monitoring & surveillance: video streams, RFIDs, security monitoring
• Web logs and Web page click streams
• Massive data sets (even if saved, random access is too expensive)
Characteristics of Data Stream
Huge volumes of continuous data
Infinite
Fast changing
Requires fast, real-time response
Captures nicely our data processing needs of today
Random access is expensive
– Single-scan algorithms: we get only one look at the data
– Store only a summary of the data seen thus far
Most stream data are low-level or multi-dimensional in nature, and need multi-level, multi-dimensional processing
Any number of streams can enter the system
– Each stream may have a different data type
– Each stream can provide elements at its own schedule
– The time between elements of one stream need not be uniform
Streams can be archived in a large archival store, but it is not feasible to answer queries from the archive
Working store – summaries or parts of the streams may be stored here and used for answering queries
– It can be main memory or disk, depending on the speed required
– It cannot store all the data from all streams
Two Forms of Query
1. Ad-hoc queries: normal queries, asked once about the current state of the stream.
2. Standing queries: queries that are, in principle, asked about the stream at all times.
– Example: report each new maximum value ever seen in stream S.
• Standing queries
– permanently executing; produce outputs at appropriate times
• Ad-hoc queries
– asked once; answered from whatever is currently stored
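The running-maximum standing query above can be sketched in a few lines (Python; a hypothetical illustration, not from the slides):

```python
def new_maxima(stream):
    """Standing query: report each new maximum value ever seen in the stream."""
    current_max = None
    for x in stream:
        if current_max is None or x > current_max:
            current_max = x
            yield x

# The query is registered once and evaluated as each element arrives,
# producing output exactly when a new maximum appears.
print(list(new_maxima([3, 1, 4, 1, 5, 9, 2, 6])))  # [3, 4, 5, 9]
```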
AD-HOC QUERY
• A question asked once about the current state of the stream.
• We do not store entire streams => we cannot answer arbitrary ad-hoc queries.
Ad-hoc Queries vs. Standing Queries
Ad-hoc queries:
– An ad hoc query is a loosely typed command/query whose value depends upon some variable.
– Each time the command is executed, the result is different, depending on the value of the variable.
– The result cannot be predetermined; such a query usually takes the form of a dynamically constructed SQL query.
– An ad hoc query is short-lived and is created at runtime.
Standing queries:
– Permanently executing; produce output at appropriate times.
– A standing query executes continuously over streaming data.
– Streaming queries are similar to database queries in how they analyze data; they differ by operating continuously on data as they arrive and by updating results incrementally in real time.
Sampling data in a stream
• Obvious solution:
– Generate a random integer in [0…9] for each query
– Store the query if the integer is 0; otherwise discard
– This solution gives the wrong answer for queries such as the average number of duplicate queries per user
Suppose each user issues x queries once and d queries twice (a total of x + 2d queries).
Correct answer: the fraction of duplicate queries is d/(x + d).
Proposed solution: we keep 10% of the query occurrences.
– The sample will contain, in expectation, x/10 of the singleton queries and 2d/10 of the duplicate-query occurrences.
– But only d/100 of the duplicated queries appear twice in the sample: d/100 = 1/10 · 1/10 · d.
– Of the d duplicated queries, 18d/100 appear exactly once in the sample: 18d/100 = ((1/10 · 9/10) + (9/10 · 1/10)) · d. Note that 18/100 is the probability that one of the two occurrences is in the 1/10 of the stream that is selected, while the other is in the 9/10 that is not selected.
So the sample-based answer is
(d/100) / (x/10 + 18d/100 + d/100) = d/(10x + 19d) ≠ d/(x + d)
Obtaining a Representative Sample
• Pick 1/10th of the users and take all of their searches.
1. Store a list of users:
– When a user first appears, generate a random integer between 0 and 9
– 0 => the user is "in" (keep all their queries); any other value => "out"
2. Avoid the list by using a hash function: hash each user ID to one of 10 buckets, 0 through 9, and keep the user's queries if the hash value is below a threshold t (t = 1 gives a 1/10 sample; t = 4 would give a 4/10 sample).
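A minimal sketch of the hash-based version (Python; MD5 is an assumed choice of uniform hash, and `in_sample` is a hypothetical helper name):

```python
import hashlib

def in_sample(user_id: str, t: int = 1, buckets: int = 10) -> bool:
    """Keep a user iff their ID hashes into one of the first t of `buckets` buckets."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
    return h < t

# Every query from a sampled user is kept, so per-user statistics
# (such as the fraction of duplicate queries) stay unbiased.
queries = [("alice", "weather"), ("bob", "news"), ("alice", "weather")]
sample = [(u, q) for u, q in queries if in_sample(u)]
```

Because the decision is a deterministic function of the user ID, no list of users needs to be stored.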
• Suppose we have a stream of tuples with the schema:
  Grade (university, courseID, studentID, grade)
• Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., "CS101"); likewise, studentIDs are unique only within a university (different universities may assign the same ID to different students). Suppose we want to answer certain queries approximately from a 1/20 sample of the data. For each of the queries below, indicate how you would construct the sample; that is, tell what the key attributes should be.
– For each university, estimate the average number of students in a course.
  » We need an answer for every university, so we cannot sample universities; construct the sample with the pair (university, courseID) as the key.
– Estimate the fraction of students who have a GPA of 3.5 or more.
  » Construct the sample with (university, studentID) as the key, so all of a sampled student's grades are kept together.
– Estimate the fraction of courses where at least half the students got "A."
  » Construct the sample with (university, courseID) as the key, so all tuples for a sampled course are kept together.
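The key-based construction can be sketched generically (Python; `sample_by_key` and the example tuples are hypothetical illustrations):

```python
import hashlib

def sample_by_key(tuples, key_fn, a=1, b=20):
    """Keep a tuple iff its key hashes into the first a of b buckets (an a/b sample).
    All tuples sharing a key are kept or dropped together."""
    def keep(t):
        digest = hashlib.md5(repr(key_fn(t)).encode()).hexdigest()
        return int(digest, 16) % b < a
    return [t for t in tuples if keep(t)]

grades = [("MIT", "CS101", "s1", "A"),
          ("MIT", "CS101", "s2", "B"),
          ("CMU", "CS101", "s1", "A")]
# 1/20 sample keyed on (university, studentID) for the GPA query:
gpa_sample = sample_by_key(grades, key_fn=lambda t: (t[0], t[2]))
```

Changing `key_fn` is all that distinguishes the three queries above.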
ALGORITHMS
– (1) Filtering a data stream: Bloom filters
• Select elements with property x from stream
– (2) Counting distinct elements: Flajolet-Martin
• Number of distinct elements in the last k elements
of the stream
– (3) Estimating moments: AMS method
• Estimate std. dev. of last k elements
– (4) Counting frequent items
First Cut Solution
• Use k independent hash functions; to insert an element, set the bit at each of the k hashed positions of the bit array. For example, with a bit array of size 10 and k = 3:
  h1("geeks") % 10 = 1
  h2("geeks") % 10 = 4
  h3("geeks") % 10 = 7
[Figure: false-positive probability vs. the number of hash functions k (x-axis 0 to 20); the probability first falls and then rises again if we keep increasing k]
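A minimal Bloom-filter sketch along these lines (Python; the MD5-based hash family and the class shape are assumptions for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1000, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k hash positions by salting one base hash.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means "definitely not in the stream"; True means "probably".
        return all(self.bits[p] for p in self._positions(item))
```

An element that was added is always reported present (no false negatives); an element never added can still report present with small probability, which is exactly the false-positive rate the figure plots against k.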
• Obvious approach:
Maintain the set of elements seen so far
– That is, keep a hash table of all the distinct elements seen so far
Applications
• How many different words are found among the Web pages being
crawled at a site?
– Unusually low or high numbers could indicate artificial pages (spam?)
A common choice is h(x) = (a·x + b) mod c, where a and b are odd numbers and c is the size of the hash range. This hash function hashes the elements roughly uniformly into a hash range of size c.
• Stream: 4, 2, 5, 9, 1, 6, 3, 7
• h(x) = (x + 6) mod 32
h(4) = (4 + 6) mod 32 = 10 = (01010)
h(2) = (2 + 6) mod 32 = 8 = (01000)
h(5) = (5 + 6) mod 32 = 11 = (01011)
h(9) = (9 + 6) mod 32 = 15 = (01111)
h(1) = (1 + 6) mod 32 = 7 = (00111)
h(6) = (6 + 6) mod 32 = 12 = (01100)
h(3) = (3 + 6) mod 32 = 9 = (01001)
h(7) = (7 + 6) mod 32 = 13 = (01101)
Trailing zeros: {1, 3, 0, 0, 0, 2, 0, 0}
R = max[trailing zeros] = 3
Estimate = 2^R = 2^3 = 8
• h(x) = (3x + 7) mod 32
h(4) = (3·4 + 7) mod 32 = 19 = (10011)
h(2) = (3·2 + 7) mod 32 = 13 = (01101)
h(5) = (3·5 + 7) mod 32 = 22 = (10110)
h(9) = (3·9 + 7) mod 32 = 34 mod 32 = 2 = (00010)
h(1) = (3·1 + 7) mod 32 = 10 = (01010)
h(6) = (3·6 + 7) mod 32 = 25 = (11001)
h(3) = (3·3 + 7) mod 32 = 16 = (10000)
h(7) = (3·7 + 7) mod 32 = 28 = (11100)
Trailing zeros: {0, 0, 1, 1, 1, 0, 4, 2}
R = max[trailing zeros] = 4
Estimate = 2^R = 2^4 = 16
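The Flajolet-Martin estimates above can be reproduced with a short sketch (Python; the two hash functions are the slide's own examples):

```python
def trailing_zeros(n: int) -> int:
    """Count trailing zero bits; we arbitrarily return 0 for n == 0."""
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    """Estimate the number of distinct elements as 2^R, where R is the
    maximum number of trailing zeros over all hash values seen."""
    R = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** R

stream = [4, 2, 5, 9, 1, 6, 3, 7]
est1 = flajolet_martin(stream, lambda x: (x + 6) % 32)      # 2^3 = 8
est2 = flajolet_martin(stream, lambda x: (3 * x + 7) % 32)  # 2^4 = 16
```

In practice many hash functions are used and their estimates are combined (e.g., the median of group averages) to reduce the variance of the single-hash estimate.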
• Generalization of the problem of counting distinct elements in a stream: frequency moments.
  The kth moment is F_k = Σ_{i∈A} (m_i)^k, where A is the set of distinct elements and m_i is the number of occurrences of element i.
• 0th moment = number of distinct elements
– The problem just considered
• 1st moment = count of the number of elements = length of the stream
– Easy to compute
• 2nd moment = surprise number S = Σ_{i∈A} (m_i)², a measure of how uneven the distribution is
• Example: a stream of length 100 with 11 distinct values.
– Most even case (ten values appearing 9 times, one appearing 10 times): S = 10·9² + 10² = 910.
– Most uneven case (one value appearing 90 times, ten appearing once each): S = 90² + 10·1² = 8110.
AMS algorithm (Alon-Matias-Szegedy) for 2nd-order moments
• Example stream of length n = 15: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
• (1) Pick random starting positions for the variables; here three variables are picked, at positions 3 (element c), 8 (element d), and 13 (element a).
• (2) Process the stream one element at a time, incrementing X.value each time X.element reappears:
  X1 = (c, 1) → (c, 2) → (c, 3)
  X2 = (d, 1) → (d, 2)
  X3 = (a, 1) → (a, 2)
• Estimate of the 2nd-order moment from any X = (element, value): n × (2 × X.value − 1) (the argument will follow).
• Applied to our example:
  estimate from X1: 15 × (2 × 3 − 1) = 75
  estimate from X2: 15 × (2 × 2 − 1) = 45
  estimate from X3: 15 × (2 × 2 − 1) = 45
  Average(X1, X2, X3) = (75 + 45 + 45)/3 = 55
  (The true second moment of this stream is 5² + 4² + 3² + 3² = 59.)
AMS algorithm (Alon-Matias-Szegedy) for 2nd-order moments
• To show: the expected value of any X constructed this way is the second moment of the stream.
• Notation:
  e(i): the stream element at position i in the stream
  c(i): the number of times e(i) appears at position i or later
• In the example stream, e(6) = a with c(6) = 4, and e(9) = a with c(9) = 3 (occurrences before the picked starting position are ignored).
AMS algorithm (Alon-Matias-Szegedy) for 2nd-order moments
Given: the estimate n × (2 × X.value − 1).
Expectation over the uniformly random starting position:
  E(n(2·X.value − 1)) = (1/n) Σ_{i=1}^{n} n × (2 × c(i) − 1)
which simplifies to
  = Σ_{i=1}^{n} (2 × c(i) − 1)
To show: this expression is the 2nd-order moment.
We know: (m_a)² = 1 + 3 + 5 + … + (2·m_a − 1); for example, 16 = 4² = 1 + 3 + 5 + 7.
Grouping the positions i by their element a, the terms 2·c(i) − 1 over the occurrences of a are exactly 2·m_a − 1, …, 5, 3, 1, which sum to (m_a)². Hence
  E(n(2·X.value − 1)) = Σ_a (m_a)²
where e(i) is the stream element at position i and c(i) is the number of times e(i) appears at position i or later.
HIGHER ORDER MOMENTS
• For estimating the kth moment we essentially use the same algorithm but change the estimate:
– For k = 2 we used n(2·c − 1)
– For k = 3 we use n(3·c² − 3·c + 1) (where c = X.value)
• Why?
– For k = 2: remember we had 2·c − 1 = c² − (c − 1)², and we showed that the terms 2·c − 1, for c = 1, …, m, sum to m².
• So:
– For k = 3: c³ − (c − 1)³ = 3·c² − 3·c + 1
• Generally: estimate the kth moment with n(c^k − (c − 1)^k).
Problem
Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?
The frequency moment of a stream is calculated using the formula F_m = Σ_{i=1}^{N} (f_i)^m, where m is the order of the moment and f_i is the number of occurrences of the ith element.
Element  Occurrences  1st moment  2nd moment  3rd moment
1        3            3           9           27
2        2            2           4           8
3        2            2           4           8
4        2            2           4           8
Total                 F1 = 9      F2 = 21     F3 = 51
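The table's totals can be checked with a two-line helper (Python; `frequency_moment` is a hypothetical name):

```python
from collections import Counter

def frequency_moment(stream, m):
    """F_m = sum over distinct elements of (occurrence count) ** m."""
    return sum(f ** m for f in Counter(stream).values())

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
# F1 = 9 (the stream length), F2 = 21 (the surprise number), F3 = 51.
moments = [frequency_moment(stream, m) for m in (1, 2, 3)]
```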
01/04/2024 54
Stream: 3, 1, 4, 1, 3, 4, 2, 1, 2
Starting position i   Xi.element   Xi.value
1st 3 2
2nd 1 3
3rd 4 2
4th 1 2
5th 3 1
6th 4 1
7th 2 2
8th 1 1
9th 2 1
Therefore, with n = 9 and one variable at each of the nine starting positions, the estimates n·(2·Xi.value − 1) are:
9·(2·2 − 1) = 27
9·(2·3 − 1) = 45
9·(2·2 − 1) = 27
9·(2·2 − 1) = 27
9·(2·1 − 1) = 9
9·(2·1 − 1) = 9
9·(2·2 − 1) = 27
9·(2·1 − 1) = 9
9·(2·1 − 1) = 9
Sum = 189; average = 189/9 = 21
Source: Mining of Massive Datasets. Leskovec, Rajaraman and Ullman. Stanford University.
This result is exactly the same as the result calculated using the formula. That is because we used all 9 possible starting positions for the variables. If we use fewer than 9 variables to save computational cost, the result will differ somewhat from the true value, but with acceptable error.
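The nine-variable computation above can be reproduced directly (Python; a sketch, with `ams_estimate` as a hypothetical name):

```python
def ams_estimate(stream, positions):
    """Average of n * (2 * c(i) - 1) over the chosen starting positions
    (0-based), where c(i) counts occurrences of stream[i] at i or later."""
    n = len(stream)
    estimates = [n * (2 * stream[i:].count(stream[i]) - 1) for i in positions]
    return sum(estimates) / len(estimates)

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
# Using every position as a variable recovers the exact second moment, 21.
est = ams_estimate(stream, range(len(stream)))
```

Passing a random subset of positions instead of `range(len(stream))` gives the memory-saving approximation discussed above.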
DEALING WITH INFINITE STREAMS
• So far we assumed n is constant, but n grows with time.
• Count the number of elements seen and store this value; this requires only O(log n) bits.
• The problem is how to select the positions of the variables:
– If they are all selected early, then as the stream gets longer the estimate is biased toward the early part of the stream.
– If we wait and select them late, the variables see too little of the stream and the estimates are unreliable.
• What we want: after seeing n stream elements, the probability of any particular position being the position of a variable is uniform, that is, s/n (for s variables).
DEALING WITH INFINITE STREAMS
• When the (n+1)st element arrives, pick that position with probability s/(n+1).
• If it is not picked, the s variables keep their current positions.
• However, if the (n+1)st position is picked, throw out one of the current s variables, chosen with equal probability.
• Replace the discarded one by a new variable whose element is the one at position n+1 and whose value is 1.
• By induction on the stream length n, every position has equal probability s/n of being chosen as the position of a variable.
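This replacement rule can be sketched as follows (Python; a hypothetical illustration with a fixed random seed):

```python
import random

def ams_infinite(stream, s, rng=None):
    """Maintain s AMS variables over a growing stream with the
    reservoir-style replacement rule, then return the 2nd-moment estimate."""
    rng = rng or random.Random(42)
    variables = []  # [element, value] pairs
    n = 0
    for x in stream:
        n += 1
        # Every existing variable tracking x sees one more occurrence.
        for var in variables:
            if var[0] == x:
                var[1] += 1
        if len(variables) < s:
            variables.append([x, 1])        # the first s positions are always taken
        elif rng.random() < s / n:          # position n is picked with prob. s/n
            variables[rng.randrange(s)] = [x, 1]
    return n * sum(2 * v[1] - 1 for v in variables) / len(variables)
```

With s at least the stream length, every position gets a variable and the estimate is exact; smaller s trades accuracy for memory.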
Counting Bits (1)
Problem:
• Given a stream of 0s and 1s.
• Be prepared to answer queries of the form: how many 1s are in the last k bits? (where k ≤ N)
Obvious solution:
• Store the most recent N bits.
• When a new bit comes in, discard the (N+1)st most recent bit.
Example: the window over the stream 010011011101010110110110, with N = 6, holds the last six bits.
DGIM METHOD
• The name refers to the inventors: Datar, Gionis, Indyk, and Motwani.
• Store only O(log² N) bits per stream (N = window size).
• The DGIM solution does not assume uniformity of the 1s in the window.
• It gives an approximate answer, never off by more than 50%.
– The error factor can be reduced to any fraction ε > 0, with a more complicated algorithm and proportionally more stored bits.
Represent the window as a set of exponentially growing, non-overlapping buckets.
• Each bit in the stream has a timestamp: its position in the stream from the beginning.
• Record timestamps modulo N (the window size): this uses O(log N) bits.
• Store the most recent timestamp, to identify the position of any other bit in the window.
A bucket is a segment of the window; it is represented by a record consisting of two components:
– The timestamp of its most recent end; recorded modulo N, this needs O(log N) bits.
– The size of the bucket: the number of 1s in it.
• The size is always a power of 2, say 2^j.
• To store j we need only O(log log N) bits.
– So each bucket needs O(log N) bits.
Representing the Buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Bucket sizes cannot decrease as we move to the left (back in time).
Example: 1001010110001011010101010101011010101010101110101010111010100010110010
– Reading from the left: at least one bucket of size 16 (partially beyond the window), two of size 8, two of size 4, one of size 2, and two of size 1.
Estimation of 1’s within the last k bits
1. Determine the bucket b with the earliest timestamp that includes at least some of the k most recent bits.
2. Sum the sizes (number of 1s) of all buckets to the right of b.
3. Final estimate: add size(b)/2 to that sum.
Example (stream flows left to right, most recent bit at the right):
..1011011000101110110010110, with k = 10
The exact number of 1s in the last 10 bits is 5; the estimate replaces the unknown contribution of the partially covered bucket b by size(b)/2.
Query Answering: what is the maximum error for a bucket b of size 2^j? We estimated b's contribution as half of b, i.e., 2^(j−1), and we have the correct count for all buckets to the right of b, so the absolute error is at most 2^(j−1); this is what bounds the relative error by 50%.
Updating the buckets with increasing stream length
When a new bit arrives (e.g., on the stream ..10110110001011101100101101):
• Drop the oldest bucket if its timestamp has fallen out of the window.
• If the new bit is 0, nothing more needs to be done.
• If the new bit is 1, create a new bucket of size 1 for it; if that makes three buckets of size 1, merge the two oldest into a bucket of size 2, and cascade: whenever three buckets of size 2^j exist, merge the two oldest into one of size 2^(j+1).
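A compact sketch of the whole DGIM structure (Python; an illustrative implementation of the update and query rules above, not the authors' code):

```python
from collections import deque

class DGIM:
    def __init__(self, window):
        self.window = window
        self.buckets = deque()  # (right-end timestamp, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # Drop a bucket once its right end leaves the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit:
            self.buckets.appendleft((self.t, 1))
            size = 1
            while True:  # merge whenever three buckets share a size
                same = [i for i, b in enumerate(self.buckets) if b[1] == size]
                if len(same) < 3:
                    break
                i, j = same[-2], same[-1]  # the two oldest of this size
                merged = (self.buckets[i][0], 2 * size)  # keep the newer right end
                del self.buckets[j]
                self.buckets[i] = merged
                size *= 2

    def count_ones(self, k):
        """Estimate the number of 1s among the last k bits."""
        total, last_size = 0, 0
        for ts, size in self.buckets:  # newest to oldest
            if ts > self.t - k:
                total += size
                last_size = size       # ends as the size of the earliest bucket b
            else:
                break
        return total - last_size // 2  # count only half of bucket b
```

Only the bucket records are stored, never the window's bits, which is what brings the space down to O(log² N).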