
COMP1942 Exploring and Visualizing Data (Spring Semester 2015)


Midterm Examination (Question Paper)
Date: 18 March, 2015 (Wed)
Time: 12:05-13:20
Duration: 1 hour 15 minutes

Student ID: __________________    Student Name: __________________

Seat No.: __________________

Instructions:
(1) Please answer all questions in Part A on the answer sheet.
(2) You may optionally answer the bonus question in Part B on the answer sheet. You can obtain additional
marks for the bonus question if you answer it correctly.
(3) You can use a calculator.


Part A (Compulsory Short Questions)


Q1 (20 Marks)

(a) Consider association rule mining on a table with items B, C and D. Suppose that we know that the lift
ratio of “C → B” is at least 2.0.
(i) Is it a must that the lift ratio of “B → C” is at least 2.0? If yes, please explain. Otherwise, please give
an example showing that it is not a must.
(ii) Is it a must that the lift ratio of “{C, D} → B” is at least 2.0? If yes, please explain. Otherwise, please give
an example showing that it is not a must.
(b) Consider association rule mining on a table with items B, C and D. A user gives a support threshold
parameter and a confidence threshold parameter. Suppose that we know that “C → B” is an interesting
association rule.
(i) Is it a must that “B → C” is an interesting rule? If yes, please explain. Otherwise, please give an
example showing that it is not a must.
(ii) Is it a must that “{C, D} → B” is an interesting rule? If yes, please explain. Otherwise, please give an
example showing that it is not a must.
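
As a study aid (a sketch added to this copy, not part of the original paper), recall that the lift ratio of a rule X → Y is support(X ∪ Y) / (support(X) × support(Y)). A minimal Python illustration over a hypothetical transaction table:

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent, transactions):
    # lift(X -> Y) = support(X u Y) / (support(X) * support(Y))
    return (support(antecedent | consequent, transactions)
            / (support(antecedent, transactions)
               * support(consequent, transactions)))

# Hypothetical transactions over items B, C and D (not from the paper):
ts = [{"B", "C"}, {"B", "C", "D"}, {"C"}, {"B", "D"}]
print(lift({"C"}, {"B"}, ts))  # lift(C -> B)
print(lift({"B"}, {"C"}, ts))  # lift(B -> C): equal, since the formula is symmetric

Note that the lift formula treats antecedent and consequent symmetrically, whereas confidence does not, which is why parts (a) and (b) may behave differently.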

Q2 (20 Marks)

(a) Given the following transactions and a support threshold of 2, we want to find all large itemsets.
A B C D E
1 0 0 1 0
1 0 0 1 1
0 0 1 0 0
1 0 1 1 1
1 0 1 0 1
Follow the steps of the Apriori Algorithm and deduce L1, C2, L2, C3, L3, C4, … until all the large
itemsets are discovered. Finally, show all large itemsets.
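
For reference, a minimal Apriori sketch in Python (an illustration added to this copy, not the course's code) run on the table above with absolute support threshold 2:

from itertools import combinations

transactions = [  # rows of the table above, as item sets
    {"A", "D"}, {"A", "D", "E"}, {"C"}, {"A", "C", "D", "E"}, {"A", "C", "E"},
]
MIN_SUP = 2

def count(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
Lk = [frozenset([i]) for i in sorted(set().union(*transactions))
      if count(frozenset([i])) >= MIN_SUP]
large, k = list(Lk), 2
while Lk:
    # Ck: join L(k-1) with itself, then prune candidates with an infrequent subset
    Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
    Ck = [c for c in Ck
          if all(frozenset(s) in set(Lk) for s in combinations(c, k - 1))]
    Lk = [c for c in Ck if count(c) >= MIN_SUP]  # Lk: frequent k-itemsets
    large += Lk
    k += 1

for s in large:
    print(sorted(s), count(s))

On this data the sketch reports L1 = {A, C, D, E}, L2 = {AC, AD, AE, CE, DE} and L3 = {ACE, ADE}, with all size-4 candidates pruned, which matches a by-hand run of Apriori.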
(b) The following shows an FP-tree which is constructed from a set of transactions. Let the support
threshold be 4.
[FP-tree figure: a header table lists items a, b and c with their head-of-node-link pointers; the tree below the root contains the nodes a:10, b:1, b:6, c:4 and c:1.]
(i) Please draw the conditional FP-tree on c.
(ii) Please draw the conditional FP-tree on b.
(iii) Please draw the conditional FP-tree on a.
(iv) Please list all frequent itemsets. You do not need to give the frequency of each frequent
itemset.
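
For intuition (a sketch with a hypothetical tree, not the figure above), an item's conditional pattern base can be collected by walking an FP-tree stored as nested dicts and recording the prefix path and count of every node labelled with that item:

def conditional_pattern_base(tree, item, prefix=()):
    # Collect (prefix-path, count) pairs for every node labelled `item`.
    base = []
    for (label, count), subtree in tree.items():
        if label == item:
            if prefix:  # a node directly under the root contributes no prefix
                base.append((prefix, count))
        else:
            base.extend(conditional_pattern_base(subtree, item, prefix + (label,)))
    return base

# Hypothetical FP-tree (NOT the one in the question):
# root -> a:10 -> b:6 -> c:1,  root -> a:10 -> c:4,  root -> b:1
tree = {("a", 10): {("b", 6): {("c", 1): {}}, ("c", 4): {}},
        ("b", 1): {}}
print(conditional_pattern_base(tree, "c"))  # [(('a', 'b'), 1), (('a',), 4)]

The conditional FP-tree on an item is then built from these (prefix, count) pairs, keeping only the items that remain frequent under the support threshold.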

2/7
COMP1942 Question Paper

Q3 (20 Marks)

(a) Consider Algorithm sequential k-means clustering.


When it reads a data point x, it will update the mean m of a cluster with the following operation:
m ← m + (1/n)(x − m)
where n is the size of the cluster including the new data point x.

(i) Please write down the steps for Algorithm sequential k-means clustering.
(ii) Please prove that, with the above operation, the mean m is calculated correctly. That is, the mean
m calculated is equal to the mean vector of all data points in the cluster.
(Hints: Let xj be the j-th example in cluster i and mi(t) be the mean vector of cluster i containing t
examples. Consider that x is the t-th example in cluster i. Note that mi(t) = (x1 + x2 + … + xt−1 + xt) / t.)
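
A small numeric sanity check of the update (an illustration with made-up 1-D points, not part of the paper):

points = [3.0, 7.0, 8.0, 2.0]  # hypothetical data points assigned to one cluster

m, n = 0.0, 0
for x in points:
    n += 1            # n = cluster size including the new point x
    m += (x - m) / n  # the sequential update m <- m + (1/n)(x - m)

print(m, sum(points) / len(points))  # both print 5.0: the update tracks the batch mean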
(b) There are 10 data points in the dataset, namely data points 1, 2, …, 10. When we use the XLMiner
software to perform “Hierarchical Clustering”, we obtain the following result. In class, we learnt how
to analyze the table in the result. Suppose that we want to find three clusters. Please give all data points
in each of these three clusters.

[XLMiner hierarchical clustering result table not shown here.]

3/7
COMP1942 Question Paper

Q4 (20 Marks)

The following shows a history of customers with their ages, incomes and credit ratings. The last column
indicates whether each customer will buy an Apple Watch.
No.  Age    Income  Credit_Rating  Buy_AppleWatch
 1   young  high    high           yes
 2   young  high    fair           yes
 3   old    medium  fair           yes
 4   old    low     fair           yes
 5   young  medium  low            no
 6   young  high    fair           no
 7   old    low     low            no
 8   old    high    low            no

(a) We want to train a C4.5 decision tree classifier to predict whether a new customer will buy an Apple
Watch. We define the value of attribute Buy_AppleWatch to be the label of a record.
(i) Please find a C4.5 decision tree according to the above examples. In the decision tree, whenever
we process (1) a node containing at least 80% of records with the same label or (2) a node
containing at most 2 records, we stop splitting this node.
(ii) Consider a new old customer whose income is high but whose credit rating is fair. Please predict
whether this new customer will buy an Apple Watch.
(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a
difference?
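
Background for both parts (a sketch added to this copy, not the official solution): C4.5 chooses splits by gain ratio, i.e. information gain divided by split information, while ID3 uses plain information gain. The computation on the table above, in Python:

from math import log2
from collections import Counter

ROWS = [  # (Age, Income, Credit_Rating, Buy_AppleWatch) from the table above
    ("young", "high",   "high", "yes"),
    ("young", "high",   "fair", "yes"),
    ("old",   "medium", "fair", "yes"),
    ("old",   "low",    "fair", "yes"),
    ("young", "medium", "low",  "no"),
    ("young", "high",   "fair", "no"),
    ("old",   "low",    "low",  "no"),
    ("old",   "high",   "low",  "no"),
]

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def gain_ratio(attr):
    labels = [r[-1] for r in ROWS]
    groups = {}
    for r in ROWS:
        groups.setdefault(r[attr], []).append(r[-1])
    n = len(ROWS)
    info = sum(len(g) / n * entropy(g) for g in groups.values())  # conditional entropy
    split_info = entropy([r[attr] for r in ROWS])                 # penalises many-valued splits
    return (entropy(labels) - info) / split_info if split_info else 0.0

for i, name in enumerate(("Age", "Income", "Credit_Rating")):
    print(name, round(gain_ratio(i), 4))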

4/7
COMP1942 Question Paper

Q5 (20 Marks)

(a) Consider eight data points.


The following matrix shows the pairwise distances between any two points.
      1    2    3    4    5    6    7    8
1     0
2    11    0
3     5   13    0
4    12    2   14    0
5     7   17    1   18    0
6    13    4   15    5   20    0
7     9   15   12   16   15   19    0
8    11   20   12   21   17   22   30    0

Consider two clusters where the first cluster contains data points 1, 2, 3 and 4, and the second cluster
contains data points 5, 6, 7 and 8.

(i) Write down the distance between the first cluster and the second cluster under the single linkage.
(ii) Write down the distance between the first cluster and the second cluster under the complete linkage.
(iii) Write down the distance between the first cluster and the second cluster under the group average
linkage.
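
The three linkage distances can be checked mechanically (a sketch added to this copy; the dictionary below transcribes the matrix above):

from itertools import product

d = {(1,2):11, (1,3):5,  (1,4):12, (1,5):7,  (1,6):13, (1,7):9,  (1,8):11,
     (2,3):13, (2,4):2,  (2,5):17, (2,6):4,  (2,7):15, (2,8):20,
     (3,4):14, (3,5):1,  (3,6):15, (3,7):12, (3,8):12,
     (4,5):18, (4,6):5,  (4,7):16, (4,8):21,
     (5,6):20, (5,7):15, (5,8):17,
     (6,7):19, (6,8):22, (7,8):30}

def dist(i, j):
    return d[(min(i, j), max(i, j))]

c1, c2 = (1, 2, 3, 4), (5, 6, 7, 8)
between = [dist(i, j) for i, j in product(c1, c2)]  # all 16 cross-cluster distances
print("single linkage  :", min(between))                 # minimum cross distance
print("complete linkage:", max(between))                 # maximum cross distance
print("group average   :", sum(between) / len(between))  # mean cross distance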

(b) We are given the following lift chart for a decision tree.

[Lift chart: y-axis “Cumulative” from 0 to 5, x-axis “# cases” from 0 to 12, showing the lift curve and the baseline.]
(i) Is it possible to know the confusion matrix for this decision tree according to this lift chart? If yes,
please give the confusion matrix. If no, please explain it.
(ii) Is it possible to know the error report for this decision tree according to this lift chart? If yes, please
give the error report. If no, please explain it.

5/7
COMP1942 Question Paper

Part B (Bonus Question)


Note: The following bonus question is OPTIONAL. You may decide whether or not to answer it.

Q6 (10 Additional Marks)

(a) We are given two customers, namely X and Y. The following shows 5 transactions for these two
customers. Each transaction contains three kinds of information: (1) customer ID (e.g., X and Y), (2) the
time that this transaction occurred, and (3) all the items involved in this transaction.

Customer X, time 1, items A, B, C


Customer Y, time 2, items A, F
Customer X, time 3, items D, E
Customer X, time 4, item G
Customer Y, time 5, items D, E, G

For example, the first transaction indicates that customer X bought item A, item B and item C at time
1, while the last transaction indicates that customer Y bought item D, item E and item G at time 5.

A sequence is defined to be a series of itemsets of the form <S1, S2, S3, …, Sm>, where Si is an itemset for i =
1, 2, …, m. The above transactions can be transformed into two sequences as follows.

X: <{A, B, C}, {D, E}, {G}>


Y: <{A, F}, {D, E, G}>

After this transformation, each customer is associated with a sequence.

Given a sequence S of the form <S1, S2, S3, …, Sm> and another sequence S’ of the form <S1’, S2’, S3’, …,
Sn’>, S is said to be a subsequence of S’ if m ≤ n and there exist m integers, namely i1, i2, …, im, such that (1)
1 ≤ i1 < i2 < … < im ≤ n, and (2) Sj ⊆ Sij’ for j = 1, 2, …, m. If S is a subsequence of S’, then S’ is defined to be a
super-sequence of S.
The support of a sequence S is defined to be the total number of customers whose sequences are super-
sequences of S.
Given a positive integer k, a sequence of the form <S1, S2, S3, …, Sm> is said to be a k-sequence if
Σ_{i=1}^{m} |Si| = k.
(Note that |Si| denotes the number of items in itemset Si.)
Can the Apriori algorithm be adapted to mine all k-sequences with support at least 2, for k = 2, 3,
4, …? If yes, please write down the proposed method using the concept of the Apriori algorithm and
illustrate your algorithm with the above example. If no, please explain the reason.
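
For concreteness (a sketch added to this copy, with hypothetical helper names), the subsequence test and the support count defined above can be written as:

def is_subsequence(s, t):
    # True if s = <S1..Sm> is a subsequence of t = <T1..Tn>: each itemset of s
    # is contained in a distinct itemset of t, matched in left-to-right order.
    j = 0
    for itemset in s:
        while j < len(t) and not itemset <= t[j]:
            j += 1
        if j == len(t):
            return False
        j += 1
    return True

customer_seqs = {  # the two sequences from the example above
    "X": [{"A", "B", "C"}, {"D", "E"}, {"G"}],
    "Y": [{"A", "F"}, {"D", "E", "G"}],
}

def support(s):
    # Number of customers whose sequence is a super-sequence of s.
    return sum(is_subsequence(s, t) for t in customer_seqs.values())

print(support([{"A"}, {"D", "E"}]))  # 2: contained in both X's and Y's sequences
print(support([{"A", "B"}]))         # 1: only X's sequence has an itemset containing {A, B}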

(b) We want to study the same problem described in (a). However, the support of a sequence is defined in
another way. Now, the support of a sequence S is re-defined to be the total number of all possible
occurrences of S in all customers divided by the total number of customers whose sequences are super-
sequences of S. An occurrence of a sequence S (of the form <S1, S2, S3, …, Sm>) in a customer who has
his/her sequence S’ (of the form <S1’, S2’, S3’, …, Sn’>) corresponds to one possible set of m integers,
namely i1, i2, …, im, such that (1) 1 ≤ i1 < i2 < … < im ≤ n, and (2) Sj ⊆ Sij’ for j = 1, 2, …, m. Note that if S
is a subsequence of S’, it is possible that there are multiple possible sets of m integers (or multiple possible
occurrences). For example, suppose that there is a sequence S’ for a customer as <{A}, {B}, {B}, {C}>.
Consider a sequence S = <{A}, {B}, {C}>. There are two possible occurrences of S in this customer (and
the corresponding two possible sets of integers are {1, 2, 4} and {1, 3, 4}).

Can the Apriori algorithm be adapted to mine all k-sequences with support at least 2, for k = 2, 3, 4, …,
with this new definition of support? If yes, please write down the proposed method using the concept of the
Apriori algorithm and illustrate your algorithm with the above example. If no, please explain the reason.
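
Counting all occurrences under this definition is a distinct-index-set count, which the following sketch (a hypothetical helper added to this copy) computes by recursion with memoisation:

from functools import lru_cache

def count_occurrences(s, t):
    # Number of index sets 1 <= i1 < ... < im <= n with S_j contained in T_{ij}.
    s = [frozenset(x) for x in s]
    t = [frozenset(x) for x in t]

    @lru_cache(maxsize=None)
    def rec(si, tj):
        if si == len(s):         # every itemset of s has been matched
            return 1
        if tj == len(t):         # t exhausted with itemsets of s unmatched
            return 0
        total = rec(si, tj + 1)  # option 1: skip t[tj]
        if s[si] <= t[tj]:       # option 2: match s[si] against t[tj]
            total += rec(si + 1, tj + 1)
        return total

    return rec(0, 0)

# The paper's example: <{A},{B},{C}> occurs twice in <{A},{B},{B},{C}>
print(count_occurrences([{"A"}, {"B"}, {"C"}],
                        [{"A"}, {"B"}, {"B"}, {"C"}]))  # prints 2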

End of Paper
