Homework Assignment 2: Total Points 80

Download as pdf or txt
Download as pdf or txt
You are on page 1of 2

Homework Assignment 2

From the course book Mining Massive Datasets, chapter 4.


http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
Use your own words. No cut-and-paste from the web or from class mates. Copying from other
sources will be detected and result in 0 points. If assignments by multiple students seem too
similar to be independent work, all students will receive 0 points.
It is great to work on solutions in groups! Just prepare the homework report in your own words.
Show all steps of your solution and calculations and explain them briefly in your own words. If
you write just the answer to the question without solution details, it will be 0 points.
No long answers, just brief and clear explanations for each step of your solution are required.

Total points 80
1. Sampling

(20 points) Exercise 4.2.1 : Suppose we have a stream of tuples with the schema Grades(university,
courseID, studentID, grade) Assume universities are unique, but a courseID is unique only within a
university (i.e., different universities may have different courses with the same ID, e.g., “CS101”) and
likewise, studentID’s are unique only within a university (different universities may assign the same ID to
different students). Suppose we want to answer certain queries approximately from a 1/15th sample of
the data. For each of the queries below, indicate how you would construct the sample. That is, tell what
the key attributes should be.

(a) Estimate the average number of courses per university.

(b) Estimate the fraction of students who have a GPA of 3.7 or more.

Explain briefly but clearly how you will create the sample and why.

2. Bloom Filter

(15 points) Exercise 4.3.1 : For the situation of our running example (8 billion bits, 1 billion members of
the set S), calculate the false-positive rate if we use 3 and 5 hash functions. Briefly explain each step in
your solution.

3. FM Algorithm

(10 points) Exercise 4.4.1 : Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash
functions will all be of the form h(x) = ax+ b mod 32 for some a and b. You should treat the result as a 5-
bit binary integer. Determine the tail length for each stream element and the resulting estimate of the
number of distinct elements if the hash function is:

(a) h(x) = 2x + 1 mod 32.

(b) h(x) = 3x + 7 mod 32.

Briefly explain each step in your solution.


4. DGIM Algorithm

1) (15 points) Exercise 4.6.1 : Suppose the window is as shown in Fig. 4.2. Estimate the number of 1’s the
last k positions, for k =

(a) 5

(b) 15

In each case, how far off the correct value is your estimate?
2) (20 points) Study the example in section 4.6.7 Extensions to the Counting of Ones. Use the technique of
Section 4.6.6 to estimate the total error. Show that if each ci has fractional error at most e, then the
estimate of the true sum has error at most e.
Briefly explain each step in your solution.

You might also like