BDA Questions
BDA Questions
BDA Questions
2. Filtering Streams
3.Counting Distinct Elements
4. Counting Ones
Q 2.Explain Count Distinct Problem
● The algorithm, PCY after its authors, works on the observation that there may be
much unused space in main memory on the first pass.
● An array is considered as an hash table, whose buckets hold integers rather than
sets of keys (as in an ordinary hash table) or bits (as in a Bloom filter).
● Pairs of items are hashed to buckets of this hash table.
● As we examine a basket during the first pass, we not only add 1 to the count for each
item in the basket, but we generate all the pairs, using a double loop.
● We hash each pair, and we add 1 to the bucket into which that pair hashes. Note that
the pair itself doesn’t go into the bucket; the pair only affects the single integer in the
bucket.
● At the end of the first pass, each bucket has a count, which is the sum of the counts
of all the pairs that hash to that bucket.
● If the count of a bucket is at least as great as the support threshold s, it is called a
frequent bucket.
● We can say nothing about the pairs that hash to a frequent bucket; they could all be
frequent pairs from the information available to us.
● But if the count of the bucket is less than s (an infrequent bucket), we know no pair
that hashes to this bucket can be frequent, even if the pair consists of two frequent
items.
Question: Apply PCY algorithm on the following transaction to find the candidate
sets (frequent sets).
Given data
Threshold value or minimization value = 3
Hash function= (i*j) mod 10
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}
Solution:
1. To identify the length or we can say repetition of each candidate in the given
dataset.
2. Reduce the candidate set to all having length 1.
3. Map pair of candidates and find the length of each pair.
4. Apply a hash function to find bucket no.
5. Draw a candidate set table.
Items → {1, 2, 3, 4, 5, 6}
Key 1 2 3 4 5 6
Value 4 6 8 8 6 4
But here in this example there is no key having value less than 1. Hence, candidate
set = {1, 2, 3, 4, 5, 6}
Step 3: Map all the candidate set in pairs and calculate their lengths.
Note: Pairs should not get repeated avoid the pairs that are already written before.
Listing all the sets having length more than threshold value: {(1,3) (2,3) (2,4) (3,4)
(3,5) (4,5) (4,6)}
ADVERTISEMENT
Now, arrange the pairs according to the ascending order of their obtained bucket
number.
Bit vector Bucket no. Highest Support Count Pairs Candidate Set
1 0 3 (4,5) (4,5)
1 2 4 (3,4) (3,4)
1 3 3 (1,3) (1,3)
1 4 3 (4,6) (4,6)
1 5 5 (3,5) (3,5)
1 6 3 (2,3) (2,3)
1 8 3 (2,4) (2,4)
Check the pairs which have the highest support count less than 3, and write those in
the candidate set, if less than 3 then reject.
Types of Clustering
1.Partition Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method.
2.Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected.
This algorithm does it by identifying different clusters in the dataset and connects the areas
of high densities into clusters.
4.Hierarchical Clustering
In this technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram. The observations or any number of clusters can be selected by
cutting the tree at the correct level.
5.Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each dataset has a set of membership coefficients, which depend on the
degree of membership to be in a cluster.
Cure Algorithm
Canopy Clustering
Q 9. Explain about knn classification for Big Data
Q 10. Explain about SON algorithm