Unit II BD
MS.MEENAKSHI.P,AP/INT
OBJECTIVES & OUTCOMES
OBJECTIVES
To understand the concept of big data.
To learn about various practical data analytics with R and Hadoop.
To learn about big data frameworks.
OUTCOMES:
Upon completion of the course, students will be able to
Work with big data tools and their analysis techniques
Design efficient algorithms for mining the data from large volumes
Design an efficient recommendation system
Design the tools for visualization
Learn NoSQL databases and management.
• Data streams are continuous flows of data
• Data streams include network traffic, sensor data, call center
records and so on.
Stream Data Applications
• Telecommunication: calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial markets: stock exchange data
• Engineering & industrial processes: power supply & manufacturing
• Sensors, monitoring & surveillance: video streams, RFIDs, security monitoring
• Web logs and Web page click streams
• Massive data sets (even if saved, random access is too expensive)
Characteristics of Data Stream
Huge volumes of continuous data
Infinite
Fast changing
Requires fast, real-time response
Captures nicely our data processing needs of today
Random access is expensive
– Single-scan algorithms: we get only one look at the data
– Store only a summary of the data seen thus far
Most stream data are low-level or multi-dimensional in nature, and need multi-level, multi-dimensional processing
Any number of streams can enter the system
– Each stream may have a different data type
– Each stream can provide elements at its own schedule
– The time between elements of one stream need not be uniform
Streams can be archived in a large archival store, but it is not feasible to answer queries from the archive
Working store – summaries or parts of the streams may be stored here and used for answering queries
– It can be main memory or disk, depending on the speed required
– It cannot store all the data from all streams
Two Forms of Query
1. Ad-hoc queries: normal queries, asked once about the current state of the stream.
2. Standing queries: queries that are, in principle, asked about the stream at all times.
– Example: report each new maximum value ever seen in stream S.
• Standing queries
– permanently executing; produce outputs at appropriate times
• Ad-hoc queries
– asked once; answered from whatever is currently stored
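The running-maximum standing query above can be sketched in a few lines (Python; a hypothetical illustration, not from the slides):

```python
def new_maxima(stream):
    """Standing query: report each new maximum value ever seen in the stream."""
    current_max = None
    for x in stream:
        if current_max is None or x > current_max:
            current_max = x
            yield x

# The query is registered once and evaluated as each element arrives,
# producing output exactly when a new maximum appears.
print(list(new_maxima([3, 1, 4, 1, 5, 9, 2, 6])))  # [3, 4, 5, 9]
```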
AD-HOC QUERY
• A question asked once about the current state of the stream.
• We do not store entire streams => we cannot answer arbitrary ad-hoc queries.
Ad-hoc Queries vs. Standing Queries
Ad-hoc queries:
– An ad hoc query is a loosely typed command/query whose value depends upon some variable.
– Each time the command is executed, the result is different, depending on the value of the variable.
– The result cannot be predetermined; such a query usually takes the form of a dynamically constructed SQL query.
– An ad hoc query is short-lived and is created at runtime.
Standing queries:
– Permanently executing; produce output at appropriate times.
– A standing query executes continuously over streaming data.
– Streaming queries are similar to database queries in how they analyze data; they differ by operating continuously on data as they arrive and by updating results incrementally in real time.
Sampling data in a stream
• Obvious solution:
– Generate a random integer in [0…9] for each query
– Store the query if the integer is 0; otherwise discard
– This solution gives the wrong answer for queries such as the average number of duplicate queries per user
Suppose each user issues x queries once and d queries twice (a total of x + 2d queries).
Correct answer: the fraction of duplicate queries is d/(x + d).
Proposed solution: we keep 10% of the query occurrences.
– The sample will contain, in expectation, x/10 of the singleton queries and 2d/10 of the duplicate-query occurrences.
– But only d/100 of the duplicated queries appear twice in the sample: d/100 = 1/10 · 1/10 · d.
– Of the d duplicated queries, 18d/100 appear exactly once in the sample: 18d/100 = ((1/10 · 9/10) + (9/10 · 1/10)) · d. Note that 18/100 is the probability that one of the two occurrences is in the 1/10 of the stream that is selected, while the other is in the 9/10 that is not selected.
So the sample-based answer is
(d/100) / (x/10 + 18d/100 + d/100) = d/(10x + 19d) ≠ d/(x + d)
Obtaining a Representative Sample
• Pick 1/10th of the users and take all of their searches.
1. Store a list of users:
– When a user first appears, generate a random integer between 0 and 9
– 0 => the user is "in" (keep all their queries); any other value => "out"
2. Avoid the list by using a hash function: hash each user ID to one of 10 buckets, 0 through 9, and keep the user's queries if the hash value is below a threshold t (t = 1 gives a 1/10 sample; t = 4 would give a 4/10 sample).
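A minimal sketch of the hash-based version (Python; MD5 is an assumed choice of uniform hash, and `in_sample` is a hypothetical helper name):

```python
import hashlib

def in_sample(user_id: str, t: int = 1, buckets: int = 10) -> bool:
    """Keep a user iff their ID hashes into one of the first t of `buckets` buckets."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
    return h < t

# Every query from a sampled user is kept, so per-user statistics
# (such as the fraction of duplicate queries) stay unbiased.
queries = [("alice", "weather"), ("bob", "news"), ("alice", "weather")]
sample = [(u, q) for u, q in queries if in_sample(u)]
```

Because the decision is a deterministic function of the user ID, no list of users needs to be stored.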
• Suppose we have a stream of tuples with the schema:
  Grade (university, courseID, studentID, grade)
• Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., "CS101"); likewise, studentIDs are unique only within a university (different universities may assign the same ID to different students). Suppose we want to answer certain queries approximately from a 1/20 sample of the data. For each of the queries below, indicate how you would construct the sample; that is, tell what the key attributes should be.
– For each university, estimate the average number of students in a course.
  » We need an answer for every university, so we cannot sample universities; construct the sample with the pair (university, courseID) as the key.
– Estimate the fraction of students who have a GPA of 3.5 or more.
  » Construct the sample with (university, studentID) as the key, so all of a sampled student's grades are kept together.
– Estimate the fraction of courses where at least half the students got "A."
  » Construct the sample with (university, courseID) as the key, so all tuples for a sampled course are kept together.
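The key-based construction can be sketched generically (Python; `sample_by_key` and the example tuples are hypothetical illustrations):

```python
import hashlib

def sample_by_key(tuples, key_fn, a=1, b=20):
    """Keep a tuple iff its key hashes into the first a of b buckets (an a/b sample).
    All tuples sharing a key are kept or dropped together."""
    def keep(t):
        digest = hashlib.md5(repr(key_fn(t)).encode()).hexdigest()
        return int(digest, 16) % b < a
    return [t for t in tuples if keep(t)]

grades = [("MIT", "CS101", "s1", "A"),
          ("MIT", "CS101", "s2", "B"),
          ("CMU", "CS101", "s1", "A")]
# 1/20 sample keyed on (university, studentID) for the GPA query:
gpa_sample = sample_by_key(grades, key_fn=lambda t: (t[0], t[2]))
```

Changing `key_fn` is all that distinguishes the three queries above.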
ALGORITHMS
– (1) Filtering a data stream: Bloom filters
• Select elements with property x from stream
– (2) Counting distinct elements: Flajolet-Martin
• Number of distinct elements in the last k elements
of the stream
– (3) Estimating moments: AMS method
• Estimate std. dev. of last k elements
– (4) Counting frequent items
First Cut Solution
• Use k independent hash functions; to insert an element, set the bit at each of the k hashed positions of the bit array. For example, with a bit array of size 10 and k = 3:
  h1("geeks") % 10 = 1
  h2("geeks") % 10 = 4
  h3("geeks") % 10 = 7
[Figure: false-positive probability vs. the number of hash functions k (x-axis 0 to 20); the probability first falls and then rises again if we keep increasing k]
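A minimal Bloom-filter sketch along these lines (Python; the MD5-based hash family and the class shape are assumptions for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1000, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k hash positions by salting one base hash.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means "definitely not in the stream"; True means "probably".
        return all(self.bits[p] for p in self._positions(item))
```

An element that was added is always reported present (no false negatives); an element never added can still report present with small probability, which is exactly the false-positive rate the figure plots against k.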
• Obvious approach:
Maintain the set of elements seen so far
– That is, keep a hash table of all the distinct elements seen so far
Applications
• How many different words are found among the Web pages being
crawled at a site?
– Unusually low or high numbers could indicate artificial pages (spam?)
A common choice is h(x) = (a·x + b) mod c, where a and b are odd numbers and c is the size of the hash range. This hash function hashes the elements roughly uniformly into a hash range of size c.
• Stream: 4, 2, 5, 9, 1, 6, 3, 7
• h(x) = (x + 6) mod 32
h(4) = (4 + 6) mod 32 = 10 = (01010)
h(2) = (2 + 6) mod 32 = 8 = (01000)
h(5) = (5 + 6) mod 32 = 11 = (01011)
h(9) = (9 + 6) mod 32 = 15 = (01111)
h(1) = (1 + 6) mod 32 = 7 = (00111)
h(6) = (6 + 6) mod 32 = 12 = (01100)
h(3) = (3 + 6) mod 32 = 9 = (01001)
h(7) = (7 + 6) mod 32 = 13 = (01101)
Trailing zeros: {1, 3, 0, 0, 0, 2, 0, 0}
R = max[trailing zeros] = 3
Estimate = 2^R = 2^3 = 8
• h(x) = (3x + 7) mod 32
h(4) = (3·4 + 7) mod 32 = 19 = (10011)
h(2) = (3·2 + 7) mod 32 = 13 = (01101)
h(5) = (3·5 + 7) mod 32 = 22 = (10110)
h(9) = (3·9 + 7) mod 32 = 34 mod 32 = 2 = (00010)
h(1) = (3·1 + 7) mod 32 = 10 = (01010)
h(6) = (3·6 + 7) mod 32 = 25 = (11001)
h(3) = (3·3 + 7) mod 32 = 16 = (10000)
h(7) = (3·7 + 7) mod 32 = 28 = (11100)
Trailing zeros: {0, 0, 1, 1, 1, 0, 4, 2}
R = max[trailing zeros] = 4
Estimate = 2^R = 2^4 = 16
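The Flajolet-Martin estimates above can be reproduced with a short sketch (Python; the two hash functions are the slide's own examples):

```python
def trailing_zeros(n: int) -> int:
    """Count trailing zero bits; we arbitrarily return 0 for n == 0."""
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    """Estimate the number of distinct elements as 2^R, where R is the
    maximum number of trailing zeros over all hash values seen."""
    R = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** R

stream = [4, 2, 5, 9, 1, 6, 3, 7]
est1 = flajolet_martin(stream, lambda x: (x + 6) % 32)      # 2^3 = 8
est2 = flajolet_martin(stream, lambda x: (3 * x + 7) % 32)  # 2^4 = 16
```

In practice many hash functions are used and their estimates are combined (e.g., the median of group averages) to reduce the variance of the single-hash estimate.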
• Generalization of the problem of counting distinct elements in a stream: frequency moments.
  The kth moment is F_k = Σ_{i∈A} (m_i)^k, where A is the set of distinct elements and m_i is the number of occurrences of element i.
• 0th moment = number of distinct elements
– The problem just considered
• 1st moment = count of the number of elements = length of the stream
– Easy to compute
• 2nd moment = surprise number S = Σ_{i∈A} (m_i)², a measure of how uneven the distribution is
• Example: a stream of length 100 with 11 distinct values.
– Most even case (ten values appearing 9 times, one appearing 10 times): S = 10·9² + 10² = 910.
– Most uneven case (one value appearing 90 times, ten appearing once each): S = 90² + 10·1² = 8110.
AMS algorithm (Alon-Matias-Szegedy) for 2nd-order moments
• Example stream of length n = 15: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
• (1) Pick random starting positions for the variables; here three variables are picked, at positions 3 (element c), 8 (element d), and 13 (element a).
• (2) Process the stream one element at a time, incrementing X.value each time X.element reappears:
  X1 = (c, 1) → (c, 2) → (c, 3)
  X2 = (d, 1) → (d, 2)
  X3 = (a, 1) → (a, 2)
• Estimate of the 2nd-order moment from any X = (element, value): n × (2 × X.value − 1) (the argument will follow).
• Applied to our example:
  estimate from X1: 15 × (2 × 3 − 1) = 75
  estimate from X2: 15 × (2 × 2 − 1) = 45
  estimate from X3: 15 × (2 × 2 − 1) = 45
  Average(X1, X2, X3) = (75 + 45 + 45)/3 = 55
  (The true second moment of this stream is 5² + 4² + 3² + 3² = 59.)
AMS algorithm (Alon-Matias-Szegedy) for 2nd-order moments
• To show: the expected value of any X constructed this way is the second moment of the stream.
• Notation:
  e(i): the stream element at position i in the stream
  c(i): the number of times e(i) appears at position i or later
• In the example stream, e(6) = a with c(6) = 4, and e(9) = a with c(9) = 3 (occurrences before the picked starting position are ignored).
AMS algorithm (Alon-Matias-Szegedy) for 2nd-order moments
Given: the estimate n × (2 × X.value − 1).
Expectation over the uniformly random starting position:
  E(n(2·X.value − 1)) = (1/n) Σ_{i=1}^{n} n × (2 × c(i) − 1)
which simplifies to
  = Σ_{i=1}^{n} (2 × c(i) − 1)
To show: this expression is the 2nd-order moment.
We know: (m_a)² = 1 + 3 + 5 + … + (2·m_a − 1); for example, 16 = 4² = 1 + 3 + 5 + 7.
Grouping the positions i by their element a, the terms 2·c(i) − 1 over the occurrences of a are exactly 2·m_a − 1, …, 5, 3, 1, which sum to (m_a)². Hence
  E(n(2·X.value − 1)) = Σ_a (m_a)²
where e(i) is the stream element at position i and c(i) is the number of times e(i) appears at position i or later.
HIGHER ORDER MOMENTS
• For estimating the kth moment we essentially use the same algorithm but change the estimate:
– For k = 2 we used n(2·c − 1)
– For k = 3 we use n(3·c² − 3·c + 1) (where c = X.value)
• Why?
– For k = 2: remember we had 2·c − 1 = c² − (c − 1)², and we showed that the terms 2·c − 1, for c = 1, …, m, sum to m².
• So:
– For k = 3: c³ − (c − 1)³ = 3·c² − 3·c + 1
• Generally: estimate the kth moment with n(c^k − (c − 1)^k).
Problem
Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?
The frequency moment of a stream is calculated using the formula F_m = Σ_{i=1}^{N} (f_i)^m, where m is the order of the moment and f_i is the number of occurrences of the ith element.
Element  Occurrences  1st moment  2nd moment  3rd moment
1        3            3           9           27
2        2            2           4           8
3        2            2           4           8
4        2            2           4           8
Total                 F1 = 9      F2 = 21     F3 = 51
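The table's totals can be checked with a two-line helper (Python; `frequency_moment` is a hypothetical name):

```python
from collections import Counter

def frequency_moment(stream, m):
    """F_m = sum over distinct elements of (occurrence count) ** m."""
    return sum(f ** m for f in Counter(stream).values())

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
# F1 = 9 (the stream length), F2 = 21 (the surprise number), F3 = 51.
moments = [frequency_moment(stream, m) for m in (1, 2, 3)]
```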
01/04/2024 54
Stream: 3, 1, 4, 1, 3, 4, 2, 1, 2
Starting position i   Xi.element   Xi.value
1st 3 2
2nd 1 3
3rd 4 2
4th 1 2
5th 3 1
6th 4 1
7th 2 2
8th 1 1
9th 2 1
Therefore, with n = 9 and one variable at each of the nine starting positions, the estimates n·(2·Xi.value − 1) are:
9·(2·2 − 1) = 27
9·(2·3 − 1) = 45
9·(2·2 − 1) = 27
9·(2·2 − 1) = 27
9·(2·1 − 1) = 9
9·(2·1 − 1) = 9
9·(2·2 − 1) = 27
9·(2·1 − 1) = 9
9·(2·1 − 1) = 9
Sum = 189; average = 189/9 = 21
Source: Mining of Massive Datasets. Leskovec, Rajaraman and Ullman. Stanford University.
This result is exactly the same as the result calculated using the formula. That is because we used all 9 possible starting positions for the variables. If we use fewer than 9 variables to save computational cost, the result will differ somewhat from the true value, but with acceptable error.
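The nine-variable computation above can be reproduced directly (Python; a sketch, with `ams_estimate` as a hypothetical name):

```python
def ams_estimate(stream, positions):
    """Average of n * (2 * c(i) - 1) over the chosen starting positions
    (0-based), where c(i) counts occurrences of stream[i] at i or later."""
    n = len(stream)
    estimates = [n * (2 * stream[i:].count(stream[i]) - 1) for i in positions]
    return sum(estimates) / len(estimates)

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
# Using every position as a variable recovers the exact second moment, 21.
est = ams_estimate(stream, range(len(stream)))
```

Passing a random subset of positions instead of `range(len(stream))` gives the memory-saving approximation discussed above.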
DEALING WITH INFINITE STREAMS
• So far we assumed n is constant, but n grows with time.
• Count the number of elements seen and store this value; this requires only O(log n) bits.
• The problem is how to select the positions of the variables:
– If they are all selected early, then as the stream gets longer the estimate is biased toward the early part of the stream.
– If we wait and select them late, the variables see too little of the stream and the estimates are unreliable.
• What we want: after seeing n stream elements, the probability of any particular position being the position of a variable is uniform, that is, s/n (for s variables).
DEALING WITH INFINITE STREAMS
• When the (n+1)st element arrives, pick that position with probability s/(n+1).
• If it is not picked, the s variables keep their current positions.
• However, if the (n+1)st position is picked, throw out one of the current s variables, chosen with equal probability.
• Replace the discarded one by a new variable whose element is the one at position n+1 and whose value is 1.
• By induction on the stream length n, every position has equal probability s/n of being chosen as the position of a variable.
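This replacement rule can be sketched as follows (Python; a hypothetical illustration with a fixed random seed):

```python
import random

def ams_infinite(stream, s, rng=None):
    """Maintain s AMS variables over a growing stream with the
    reservoir-style replacement rule, then return the 2nd-moment estimate."""
    rng = rng or random.Random(42)
    variables = []  # [element, value] pairs
    n = 0
    for x in stream:
        n += 1
        # Every existing variable tracking x sees one more occurrence.
        for var in variables:
            if var[0] == x:
                var[1] += 1
        if len(variables) < s:
            variables.append([x, 1])        # the first s positions are always taken
        elif rng.random() < s / n:          # position n is picked with prob. s/n
            variables[rng.randrange(s)] = [x, 1]
    return n * sum(2 * v[1] - 1 for v in variables) / len(variables)
```

With s at least the stream length, every position gets a variable and the estimate is exact; smaller s trades accuracy for memory.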
Counting Bits (1)
Problem:
• Given a stream of 0s and 1s.
• Be prepared to answer queries of the form: how many 1s are in the last k bits? (where k ≤ N)
Obvious solution:
• Store the most recent N bits.
• When a new bit comes in, discard the (N+1)st most recent bit.
Example: the window over the stream 010011011101010110110110, with N = 6, holds the last six bits.
DGIM METHOD
• The name refers to the inventors: Datar, Gionis, Indyk, and Motwani.
• Store only O(log² N) bits per stream (N = window size).
• The DGIM solution does not assume uniformity of the 1s in the window.
• It gives an approximate answer, never off by more than 50%.
– The error factor can be reduced to any fraction ε > 0, with a more complicated algorithm and proportionally more stored bits.
Represent the window as a set of exponentially growing, non-overlapping buckets.
• Each bit in the stream has a timestamp: its position in the stream from the beginning.
• Record timestamps modulo N (the window size): this uses O(log N) bits.
• Store the most recent timestamp, to identify the position of any other bit in the window.
A bucket is a segment of the window; it is represented by a record consisting of two components:
– The timestamp of its most recent end; recorded modulo N, this needs O(log N) bits.
– The size of the bucket: the number of 1s in it.
• The size is always a power of 2, say 2^j.
• To store j we need only O(log log N) bits.
– So each bucket needs O(log N) bits.
Representing the Buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Bucket sizes cannot decrease as we move to the left (back in time).
Example: 1001010110001011010101010101011010101010101110101010111010100010110010
– Reading from the left: at least one bucket of size 16 (partially beyond the window), two of size 8, two of size 4, one of size 2, and two of size 1.
Estimation of 1’s within the last k bits
1. Determine the bucket b with the earliest timestamp that includes at least some of the k most recent bits.
2. Sum the sizes (number of 1s) of all buckets to the right of b.
3. Final estimate: add size(b)/2 to that sum.
Example (stream flows left to right, most recent bit at the right):
..1011011000101110110010110, with k = 10
The exact number of 1s in the last 10 bits is 5; the estimate replaces the unknown contribution of the partially covered bucket b by size(b)/2.
Query Answering: what is the maximum error for a bucket b of size 2^j? We estimated b's contribution as half of b, i.e., 2^(j−1), and we have the correct count for all buckets to the right of b, so the absolute error is at most 2^(j−1); this is what bounds the relative error by 50%.
Updating the buckets with increasing stream length
When a new bit arrives (e.g., on the stream ..10110110001011101100101101):
• Drop the oldest bucket if its timestamp has fallen out of the window.
• If the new bit is 0, nothing more needs to be done.
• If the new bit is 1, create a new bucket of size 1 for it; if that makes three buckets of size 1, merge the two oldest into a bucket of size 2, and cascade: whenever three buckets of size 2^j exist, merge the two oldest into one of size 2^(j+1).
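A compact sketch of the whole DGIM structure (Python; an illustrative implementation of the update and query rules above, not the authors' code):

```python
from collections import deque

class DGIM:
    def __init__(self, window):
        self.window = window
        self.buckets = deque()  # (right-end timestamp, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # Drop a bucket once its right end leaves the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit:
            self.buckets.appendleft((self.t, 1))
            size = 1
            while True:  # merge whenever three buckets share a size
                same = [i for i, b in enumerate(self.buckets) if b[1] == size]
                if len(same) < 3:
                    break
                i, j = same[-2], same[-1]  # the two oldest of this size
                merged = (self.buckets[i][0], 2 * size)  # keep the newer right end
                del self.buckets[j]
                self.buckets[i] = merged
                size *= 2

    def count_ones(self, k):
        """Estimate the number of 1s among the last k bits."""
        total, last_size = 0, 0
        for ts, size in self.buckets:  # newest to oldest
            if ts > self.t - k:
                total += size
                last_size = size       # ends as the size of the earliest bucket b
            else:
                break
        return total - last_size // 2  # count only half of bucket b
```

Only the bucket records are stored, never the window's bits, which is what brings the space down to O(log² N).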