BDA Questions

Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

Q 1. What is Data Stream?

Explain Data Stream Operations in the context of


Big Data.

Data Stream Operations


1.Sampling Data

2. Filtering Streams
3.Counting Distinct Elements

4. Counting Ones
Q 2.Explain Count Distinct Problem

Q 3. What is a Bloom filter? Explain Bloom Filtering Process with a neat


diagram.
Q 4. Explain SVM and Parallel SVM
Q 5. Explain PCY algorithm with an example
PCY Algorithm

● The algorithm, PCY after its authors, works on the observation that there may be
much unused space in main memory on the first pass.
● An array is considered as an hash table, whose buckets hold integers rather than
sets of keys (as in an ordinary hash table) or bits (as in a Bloom filter).
● Pairs of items are hashed to buckets of this hash table.
● As we examine a basket during the first pass, we not only add 1 to the count for each
item in the basket, but we generate all the pairs, using a double loop.
● We hash each pair, and we add 1 to the bucket into which that pair hashes. Note that
the pair itself doesn’t go into the bucket; the pair only affects the single integer in the
bucket.
● At the end of the first pass, each bucket has a count, which is the sum of the counts
of all the pairs that hash to that bucket.
● If the count of a bucket is at least as great as the support threshold s, it is called a
frequent bucket.
● We can say nothing about the pairs that hash to a frequent bucket; they could all be
frequent pairs from the information available to us.
● But if the count of the bucket is less than s (an infrequent bucket), we know no pair
that hashes to this bucket can be frequent, even if the pair consists of two frequent
items.

An example problem solved using PCY algorithm

Question: Apply PCY algorithm on the following transaction to find the candidate
sets (frequent sets).

Given data
Threshold value or minimization value = 3
Hash function= (i*j) mod 10

T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}

Use buckets and concepts of Mapreduce to solve the above problem.

Solution:

1. To identify the length or we can say repetition of each candidate in the given
dataset.
2. Reduce the candidate set to all having length 1.
3. Map pair of candidates and find the length of each pair.
4. Apply a hash function to find bucket no.
5. Draw a candidate set table.

Step 1: Mapping all the elements in order to find their length.

Items → {1, 2, 3, 4, 5, 6}
Key 1 2 3 4 5 6
Value 4 6 8 8 6 4

Step 2: Removing all elements having value less than 1.

But here in this example there is no key having value less than 1. Hence, candidate
set = {1, 2, 3, 4, 5, 6}

Step 3: Map all the candidate set in pairs and calculate their lengths.

T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)


T2: {(2, 4) (3, 4)} = (3 4)
T3: {(3, 5) (4, 5)} = (5, 3)
T4: {(4, 5) (5, 6)} = (3, 2)
T5: {(1, 5)} = 1
T6: {(2, 6)} = 1
T7: {(1, 4)} = 2
T8: {(2, 5)} = 2
T9: {(3, 6)} = 2
T10:______
T11:______
T12:______

Note: Pairs should not get repeated avoid the pairs that are already written before.

Listing all the sets having length more than threshold value: {(1,3) (2,3) (2,4) (3,4)
(3,5) (4,5) (4,6)}
ADVERTISEMENT

Step 4: Apply Hash Functions. (It gives us bucket number)

Hash Function = ( i * j) mod 10


(1, 3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4

Now, arrange the pairs according to the ascending order of their obtained bucket
number.

Bucket no. Pair


0 (4,5)
2 (3,4)
3 (1,3)
4 (4,6)
5 (3,5)
6 (2,3)
8 (2,4)
Step 5: In this final step we will prepare the candidate set.

Bit vector Bucket no. Highest Support Count Pairs Candidate Set

1 0 3 (4,5) (4,5)

1 2 4 (3,4) (3,4)

1 3 3 (1,3) (1,3)

1 4 3 (4,6) (4,6)

1 5 5 (3,5) (3,5)

1 6 3 (2,3) (2,3)

1 8 3 (2,4) (2,4)

Note: Highest support count is the no. of repetition of that vector.

Check the pairs which have the highest support count less than 3, and write those in
the candidate set, if less than 3 then reject.

Hence, The frequent itemsets are (4, 5), (3,4)


Q 6. Explain Clustering with mapreduce along with example
Q 7. .Define Page Rank and draw detail structure of web and explain deadend

Figure 7.1: A hypothetical example of the Web


Q 8. What are clustering algorithms and explain cure clustering algorithm and
canopy clustering

Clustering is the process of examining a collection of “points,” and grouping the


points into “clusters” according to some distance measure.The clustering algorithm is an
unsupervised method, where the input is not a labeled one and problem solving is based on
the experience that the algorithm gains out of solving similar problems as a training
schedule.

Types of Clustering
1.Partition Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method.

2.Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected.
This algorithm does it by identifying different clusters in the dataset and connects the areas
of high densities into clusters.

3.Distribution Model-Based Clustering


In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is done by
assuming some distributions commonly Gaussian Distribution.

4.Hierarchical Clustering
In this technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram. The observations or any number of clusters can be selected by
cutting the tree at the correct level.

5.Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each dataset has a set of membership coefficients, which depend on the
degree of membership to be in a cluster.

Cure Algorithm
Canopy Clustering
Q 9. Explain about knn classification for Big Data
Q 10. Explain about SON algorithm

You might also like