DataMining - Workbook MCQ


DATA MINING--REVISION

Choose the Correct Answer.

1. For the following association rule: Computer → Webcam (60%, 100%):

Which of the following is true?

I. 100% of customers bought both a computer and a webcam
II. 60% of customers bought both a computer and a webcam
III. 100% of customers who bought a computer also bought a webcam
IV. 60% of customers who bought a computer also bought a webcam

a. II only b. III only c. I and IV d. II and III

2. We have Market Basket data for 1,000 rental transactions at a Video Store.
There are four videos for rent -- Video A, Video B, Video C and Video D. The
probability that both Video C and Video D are rented at the same time is known
as ________ .

a. Correlation b. support c. lift d. confidence

Consider the following transaction database. Suppose that minsup is set to 40%
and minconf to 70%.

TransID  Items
T100     A, B, C, D
T200     A, B, C, E
T300     A, B, E, F, H
T400     A, C, H

3. The support of the itemset {A, B, E} is ........

a. 50% b. 40% c. 70% d. 66%

4. Based on the given minimum support, the itemset {A, B, E} is ........

a. frequent b. not frequent c. strong d. not strong

5. The confidence of the rule A, B → E is ........

a. 50% b. 40% c. 100% d. 66%

6. Based on the given minimum confidence, the rule A, B → E is ........

a. frequent b. not frequent c. strong d. not strong

7. The lift of the rule A, B → E is ........

a. 1.33 b. 1 c. 0.89 d. 0.66

8. The value of the lift in the previous question means that the items are ........

a. positively correlated b. negatively correlated c. independent d. strong
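
The quantities asked for in questions 3 to 8 can be checked with a short script. This is a minimal sketch, assuming the rule in questions 5 to 7 reads A, B → E and using the usual definitions of support, confidence, and lift; the function names are illustrative and not taken from the workbook.

```python
# Transaction database from questions 3-8.
transactions = {
    "T100": {"A", "B", "C", "D"},
    "T200": {"A", "B", "C", "E"},
    "T300": {"A", "B", "E", "F", "H"},
    "T400": {"A", "C", "H"},
}
MINSUP, MINCONF = 0.40, 0.70


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent) / support(consequent)


# Assumed rule: A, B -> E
sup_abe = support({"A", "B", "E"})
conf_rule = confidence({"A", "B"}, {"E"})
lift_rule = lift({"A", "B"}, {"E"})

is_frequent = sup_abe >= MINSUP
is_strong = is_frequent and conf_rule >= MINCONF

print(f"support={sup_abe:.2f} frequent={is_frequent}")
print(f"confidence={conf_rule:.2f} strong={is_strong}")
print(f"lift={lift_rule:.2f}")
```

A lift above 1 suggests the antecedent and consequent occur together more often than independence would predict, a lift below 1 suggests the opposite, and a lift of exactly 1 suggests independence.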
9. Identify the outlier for the given data? 23, 34, 27, 7, 30, 26, 28, 31, 34
a. 7 b. 23 c. 31 d. 34
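
One common way to flag an outlier is the 1.5 × IQR rule. The sketch below applies it to the data from question 9. The choice of this rule and of the standard library's default quartile method are assumptions; the workbook does not say which method is intended.

```python
import statistics

data = [23, 34, 27, 7, 30, 26, 28, 31, 34]

# Quartiles from the standard library (default "exclusive" method).
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(outliers)
```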
Confusion Matrix

From the given confusion matrix:

             Predicted
Actual       Yes     No      Total
Yes          6954    46      7000
No           412     2588    3000
Total        7366    2634    10000

10. Accuracy is .......
a. 0.99 b. 0.95 c. 0.86 d. 0.05

11. Error rate is .......
a. 0.99 b. 0.95 c. 0.86 d. 0.05

12. Sensitivity is .......
a. 0.99 b. 0.95 c. 0.86 d. 0.05

13. Specificity is .......
a. 0.99 b. 0.95 c. 0.86 d. 0.05
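
The four measures in questions 10 to 13 follow directly from the confusion-matrix counts. The sketch below assumes the rows of the matrix are the actual classes, the columns are the predicted classes, and "Yes" is the positive class; the variable names are illustrative.

```python
# Counts read from the confusion matrix (rows assumed to be actual classes).
tp, fn = 6954, 46     # actual Yes: predicted Yes / predicted No
fp, tn = 412, 2588    # actual No:  predicted Yes / predicted No
total = tp + fn + fp + tn

accuracy = (tp + tn) / total       # correctly classified / all tuples
error_rate = (fp + fn) / total     # misclassified / all tuples
sensitivity = tp / (tp + fn)       # true positive rate
specificity = tn / (tn + fp)       # true negative rate

for name, value in [
    ("accuracy", accuracy),
    ("error rate", error_rate),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
]:
    print(f"{name}: {value:.2f}")
```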

Assume you want to cluster 7 observations into 3 clusters using the K-Means
clustering algorithm. After the first iteration, clusters C1, C2, and C3 have the
following observations: C1: {(2, 2), (4, 4), (6, 6)} C2: {(0, 4), (4, 0)} C3: {(5, 5), (9, 9)}

14. What will be the cluster centroids if you want to proceed for second
iteration?
a. C1: (4, 4), C2: (2, 2), C3: (7, 7) b. C1: (6, 6), C2: (4, 4), C3: (9, 9)
c. C1: (2, 2), C2: (0, 0), C3: (5, 5) d. None of these
15. What will be the Manhattan distance for observation (9, 9) from cluster
centroid C1 in second iteration?
a. 10 b. 5 c. 13 d. None of these
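
The centroid update and the Manhattan distance in questions 14 and 15 can be sketched as follows. The helper names `centroid` and `manhattan` are illustrative; the cluster memberships are the ones given above.

```python
clusters = {
    "C1": [(2, 2), (4, 4), (6, 6)],
    "C2": [(0, 4), (4, 0)],
    "C3": [(5, 5), (9, 9)],
}


def centroid(points):
    """Coordinate-wise mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


centroids = {name: centroid(pts) for name, pts in clusters.items()}
dist_to_c1 = manhattan((9, 9), centroids["C1"])

print(centroids)
print(dist_to_c1)
```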

16. Consider the given data: {3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67}, Using
equal-width partitioning and four bins, how many values are there in the
first bin?
a. 3 b. 4 c. 5 d. 6

19. If smoothing by bin medians is applied to the previous bins, what is the new
value of the data in the first bin?
a. 4 b. 4.5 c. 5 d. 7.5
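
Equal-width partitioning and smoothing by bin medians can be sketched as follows for the data in question 16. The convention of placing the maximum value in the last bin is an assumption; other boundary conventions exist.

```python
import statistics

data = [3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67]
k = 4

lo, hi = min(data), max(data)
width = (hi - lo) / k  # (67 - 3) / 4 = 16

# Assign each value to a bin of equal width; the maximum goes in the last bin.
bins = [[] for _ in range(k)]
for x in data:
    idx = min(int((x - lo) // width), k - 1)
    bins[idx].append(x)

# Smoothing by bin medians: every value is replaced by its bin's median.
smoothed = [[statistics.median(b)] * len(b) for b in bins]

print(bins)
print(smoothed[0])
```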
17. Supervised learning differs from unsupervised clustering in that
supervised learning requires

a. at least one input attribute.

b. input attributes to be categorical.

c. at least one output attribute.

d. output attributes to be categorical.

18. The correlation between the number of years an employee has worked for
a company and the salary of the employee is 0.75. What can be said about
employee salary and years worked?

a. There is no relationship between salary and years worked.

b. Individuals that have worked for the company the longest have higher
salaries.

c. Individuals that have worked for the company the longest have lower
salaries.

d. The majority of employees have been with the company a long time.

e. The majority of employees have been with the company a short period of
time.

19. The correlation coefficient for two real-valued attributes is –0.85. What
does this value tell you?

A. The attributes are not linearly related.

B. As the value of one attribute increases the value of the second attribute
also increases.

C. As the value of one attribute decreases the value of the second attribute
increases.

D. The attributes show a curvilinear relationship
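
The sign of a correlation coefficient, as discussed in questions 18 and 19, can be illustrated with a small Pearson computation. The data below is hypothetical and invented purely for illustration; it does not come from the workbook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values, for illustration only.
years = [1, 3, 5, 7, 9]
salary = [30, 38, 41, 52, 60]   # tends to rise as years rise
defects = [9, 8, 5, 4, 1]       # tends to fall as years rise

r_pos = pearson(years, salary)
r_neg = pearson(years, defects)

print(f"r(years, salary)  = {r_pos:.2f}")
print(f"r(years, defects) = {r_neg:.2f}")
```

A positive coefficient means the two attributes tend to move in the same direction; a negative one means that as one increases, the other tends to decrease.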

20. ...................... is an essential process where intelligent methods are
applied to extract data patterns.

A. Data warehousing

B. Data mining

C. Text mining

D. Data selection


21. Data mining is best described as the process of

a. identifying patterns in data.

b. deducing relationships in data.

c. representing data.

d. simulating trends in data.

22. Unlike traditional production rules, association rules

a. allow the same variable to be an input attribute in one rule and an
output attribute in another rule.

b. allow more than one input attribute in a single rule.

c. require input attributes to take on numeric values.

d. require each rule to have exactly one categorical output attribute.

23. Given desired class C and population P, lift is defined as

a. the probability of class C given population P divided by the
probability of C given a sample taken from the population.

b. the probability of population P given a sample taken from P.

c. the probability of class C given a sample taken from population P.

d. the probability of class C given a sample taken from population P
divided by the probability of C within the entire population P.

24. Association rule support is defined as

a. the percentage of instances that contain the antecedent
conditional items listed in the association rule.

b. the percentage of instances that contain the consequent conditions
listed in the association rule.

c. the percentage of instances that contain all items listed in the
association rule.

d. the percentage of instances in the database that contain at least
one of the antecedent conditional items listed in the association rule.

25. The full form of KDD is ..................

A. Knowledge Database

B. Knowledge Discovery Database

C. Knowledge Data House

D. Knowledge Data Definition

26. This approach is best when we are interested in finding all possible
interactions among a set of attributes.

a. decision tree

b. association rules

c. K-Means algorithm

d. genetic learning

27. If the information gain of age, income and gender attributes are 0.42, 0.24
and 0.024 which one will you choose as splitting attribute

a. age

b. income

c. gender

d. all of them

28. Based on Apriori property all nonempty subsets of frequent itemset:

a. must also be frequent

b. may be frequent

c. can't be frequent

d. all of them
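
The Apriori property in question 28 can be checked on the four-transaction database from question 3. This is a minimal sketch, assuming a minimum support of 40% as stated there; it enumerates every nonempty proper subset of one frequent itemset and tests each against the threshold.

```python
from itertools import combinations

# Transaction database from question 3.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "E"},
    {"A", "B", "E", "F", "H"},
    {"A", "C", "H"},
]
MINSUP = 0.40


def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


frequent = frozenset({"A", "B", "E"})

# All nonempty proper subsets of the frequent itemset.
subsets = [
    frozenset(combo)
    for size in range(1, len(frequent))
    for combo in combinations(sorted(frequent), size)
]

all_frequent = all(support(s) >= MINSUP for s in subsets)
print(all_frequent)
```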

29. Reducing the number of attributes to solve the high dimensionality
problem is called

a. cleaning

b. overfitting

c. dimensionality reduction

d. Dimensionality

30. The bottleneck of the Apriori algorithm is caused by all the following
except

a. the number of association rules

b. the number of scans required

c. the computations of support for candidates

d. the number of generated candidates

31. Which of the following is the process of detecting and correcting wrong
data:

a. data cleaning

b. data selection

c. data integration

d. all of them

32. Which of the following is the process of combining data from different
sources:

a. data cleaning

b. data selection

c. data integration

d. all of them

33. Which of the following are interesting measures for association rules:

a. lift

b. Recall

c. Accuracy

d. Compactness

34. If the lift measure of the items bread and rice is equal to 0.5, this means that:

a. if clients buy bread, they are more likely to buy rice

b. if clients buy bread, they are less likely to buy rice

c. if clients buy bread, they can buy rice or not with the same probability

d. none of them

35. Nonparametric data reduction strategies include all the following except:

a. Histograms

b. Clustering

c. Sampling

d. Regression

36. If you want to give all attributes equal weight, which preprocessing task
will you use?

a. Cleaning

b. Transformation

c. Integration

d. Reduction

37. Task of inferring a model from labeled training data is called:

a. Transformation

b. Cluster analysis

c. Classification

d. Association rules

38. Regression is a method of the following except:

a. Cleaning

b. Transformation

c. Reduction

d. Integration

39. The termination conditions of decision tree induction include the following
except:

a. No tuples for a given branch

b. No noise

c. No remaining attributes

d. All tuples belong to the same class

40. Correlation analysis is used to:

a. extract association rules

b. define support and confidence values

c. eliminate misleading rules

41. If the mean is larger than the median then this might be an indication that
the data is

a. negatively skewed

b. positively skewed

c. symmetric

d. correlated

42. _______ is the result of a tuple firing more than one rule with different
class predictions.

a. Association rule

b. Strong rule

c. Rule conflict

43. ________ is the correlation analysis measure for nominal attributes.

a. Covariance

b. Chi-square

c. Lift

d. Correlation co-efficient
44. We have Market Basket data for 1,000 rental transactions at a Video
Store. There are four videos for rent -- Video A, Video B, Video C and Video
D. The probability that Video D will be rented given that Video C has been
rented is known as ________ .

A. the basic probability

B. support

C. lift

D. confidence

45. Data used to build data mining model:

a. Validation data

b. Test data

c. Training data

d. Hidden data

46. This technique uses mean and standard deviation scores to transform
real-valued attributes.

a. decimal scaling

b. min-max normalization

c. z-score normalization

d. logarithmic normalization
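
A normalization based on the mean and standard deviation, as described in question 46, can be sketched as follows. The input values are hypothetical, and the use of the population standard deviation is an assumption; some texts use the sample standard deviation instead.

```python
import statistics

values = [20, 30, 40, 50, 60]  # hypothetical attribute values

mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation (assumed)

# Each value is expressed as its distance from the mean in standard deviations.
z_scores = [(v - mean) / std for v in values]

print([round(z, 2) for z in z_scores])
```

After this transformation the attribute has mean 0 and standard deviation 1, which puts attributes measured on different scales on a comparable footing.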

47. Point out the correct statement:

a) Combining classifiers improves interpretability

b) Combining classifiers reduces accuracy

c) Combining classifiers improves accuracy

d) All of the Mentioned

48. The attribute data type of Number of telephones in your house is

a. Nominal

b. Ordinal

c. Interval

d. Ratio

49. Which of the following best describes the process of finding the
interquartile range for a set of data?

a. Add the biggest and smallest numbers
b. Place the numbers in order from least to greatest, then find the
middle.
c. Find the difference between the Maximum and the Minimum.
d. Subtract Q1 from Q3
50. What is the term for the median of the lower half of the data?
a. Lower Quartile
b. Upper Quartile
c. Median
d. Maximum
51. What is the term that means the middle data value?
a. Mean
b. Median
c. Mode
d. Range
52. What is the mode?
a. the number that occurs most often
b. the average
c. biggest - smallest
d. the middle number
53. The heights of some students are given.
158cm 172cm 164cm
164cm 167cm 159cm

What is the range of the heights?

a. 13cm
b. 14cm
c. 164cm
d. 330cm
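
The descriptive measures from questions 50 to 53 can be computed for the heights above with Python's standard library. This is a small illustrative sketch; the variable names are not from the workbook.

```python
import statistics

heights_cm = [158, 172, 164, 164, 167, 159]

height_range = max(heights_cm) - min(heights_cm)  # biggest minus smallest
height_median = statistics.median(heights_cm)     # middle value of sorted data
height_mode = statistics.mode(heights_cm)         # most frequent value
height_mean = statistics.mean(heights_cm)         # arithmetic average

print(height_range, height_median, height_mode, height_mean)
```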
54. Supervised learning and unsupervised clustering both require at least one
a. hidden attribute.
b. output attribute.
c. input attribute.
d. categorical attribute.
55. Supervised learning differs from unsupervised clustering in that supervised
learning requires
a. at least one input attribute.
b. input attributes to be categorical.
c. at least one output attribute.
d. output attributes to be categorical.
56. Which of the following is a valid production rule for the decision tree
below?

Business Appointment?
  Yes: Decision = wear slacks
  No:  Temp above 70?
         No:  Decision = wear jeans
         Yes: Decision = wear shorts

a. IF Business Appointment = No & Temp above 70 = No
THEN Decision = wear slacks
b. IF Business Appointment = Yes & Temp above 70 = Yes
THEN Decision = wear shorts
c. IF Temp above 70 = No
THEN Decision = wear shorts
d. IF Business Appointment = No & Temp above 70 = No
THEN Decision = wear jeans
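
A production rule corresponds to one root-to-leaf path in a decision tree. The sketch below encodes the tree from question 56 as a function, assuming that a business appointment leads directly to "wear slacks" and that the temperature test applies only when there is no business appointment. The function name `decide` is hypothetical.

```python
def decide(business_appointment: bool, temp_above_70: bool) -> str:
    """Follow one root-to-leaf path of the (assumed) decision tree."""
    if business_appointment:
        # Assumed: this branch is a leaf, so temperature is never tested.
        return "wear slacks"
    return "wear shorts" if temp_above_70 else "wear jeans"


# Each call below corresponds to one production rule (one path).
print(decide(business_appointment=True, temp_above_70=False))
print(decide(business_appointment=False, temp_above_70=False))
print(decide(business_appointment=False, temp_above_70=True))
```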
57. A nearest neighbor approach is best used
a. with large-sized datasets.
b. when irrelevant attributes have been removed from the
data.
c. when a generalized model of the data is desirable.
d. when an explanation of what has been found is of primary
importance.
58. If a customer is spending more than expected, the customer’s intrinsic value
is ________ their actual value.
a. greater than
b. less than
c. less than or equal to
d. equal to
59. ...................... is an essential process where intelligent methods are
applied to extract data patterns.

a. Data warehousing
b. Data mining
c. Text mining
d. Data selection

60. Data mining can also be applied to other forms of data such as ................
i) Data streams ii) Sequence data iii) Networked data iv) Text data v) Spatial data
a. i, ii, iii and v only
b. ii, iii, iv and v only
c. i, iii, iv and v only
d. All i, ii, iii, iv and v
61. Which of the following is not a data mining functionality?
a. Characterization and Discrimination
b. Classification and regression
c. Selection and interpretation
d. Clustering and Analysis
62. ............................. is a summarization of the general characteristics or
features of a target class of data.
a. Data Characterization
b. Data Classification
c. Data discrimination
d. Data selection
63. ............................. is a comparison of the general features of the target
class data objects against the general features of objects from one or
multiple contrasting classes.
a. Data Characterization
b. Data Classification
c. Data discrimination
d. Data selection
64. Strategic value of data mining is ......................
a. cost-sensitive
b. work-sensitive
c. time-sensitive
d. technical-sensitive

65. ............................. is the process of finding a model that describes and


distinguishes data classes or concepts.
a. Data Characterization
b. Data Classification
c. Data discrimination
d. Data selection
66. The various aspects of data mining methodologies is/are ...................
i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern or constraint-guided mining.
iv) Handling uncertainty, noise, or incompleteness of data

a. i, ii and iv only
b. ii, iii and iv only
c. i, ii and iii only
d. All i, ii, iii and iv
67. The output of KDD is .............
a. Data
b. Information
c. Query
d. Useful information
68. A Bayesian classifier is
a. A class of learning algorithms that tries to find an optimum
classification of a set of examples using the probabilistic theory.
b. Any mechanism employed by a learning system to constrain the
search space of a hypothesis.
c. An approach to the design of learning algorithms that is inspired by
the fact that when people encounter new situations, they often
explain them by reference to familiar experiences, adapting the
explanations to fit the new situation.
d. None of these
69. Classification is
a. A subdivision of a set of examples into a number of classes.
b. A measure of the accuracy of the classification of a concept that is
given by a certain theory.
c. The task of assigning a classification to a set of examples
d. None of these
70. If the mean, median and mode of a distribution are 5, 6, 7 respectively, then
the distribution is:
a. skewed negatively
b. not skewed
c. skewed positively
d. symmetrical
e. bimodal.
71. Which of the following measures of central tendency tends to be most
influenced by an extreme score?
a. median
b. mode
c. mean
72. Which of the following is not a measure of central tendency?
a. mean
b. median
c. mode
d. standard deviation
e. none of these
73. In a group of 12 scores, the largest score is increased by 36 points. What
effect will this have on the mean of the scores?
a. it will be increased by 12 points
b. it will remain unchanged
c. it will be increased by 3 points
d. it will increase by 36 points
e. there is no way of knowing exactly how many points the mean
will be increased.
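
The effect described in question 73 can be checked numerically: adding a fixed amount to one score changes the sum by that amount, so the mean changes by the amount divided by the number of scores. The scores below are hypothetical and invented for illustration.

```python
# Twelve hypothetical scores, for illustration only.
scores = [50, 55, 60, 62, 65, 70, 71, 75, 80, 84, 90, 95]

bumped = sorted(scores)
bumped[-1] += 36  # increase the largest score by 36 points

# The mean shifts by (change in the sum) / (number of scores).
shift = (sum(bumped) - sum(scores)) / len(scores)
print(shift)
```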

74. Non-parametric data reduction strategies include all the following except
a. Histogram b. Regression c. Clustering d. Sampling

75. If you want to give all attributes an equal weight, which preprocessing task
will you use?
a. Cleaning b. Integration c. Transformation d. Reduction

76. Regression is a method of all of the following except

a. Cleaning b. Integration c. Transformation d. Reduction

77. Which of the following lists all parts of the five-number summary?
a. Mean, Median, Mode, Range, and Total
b. Minimum, Quartile1, Median, Quartile3, and Maximum
c. Smallest, Q1, Q2, Q3, and Q4
d. Minimum, Maximum, Range, Mean, and Median
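
A five-number summary can be computed with the standard library, shown here for the data from question 9. The quartile method ("inclusive") is an assumption; different textbooks and tools use slightly different quartile conventions and may give slightly different Q1 and Q3 values.

```python
import statistics

data = [23, 34, 27, 7, 30, 26, 28, 31, 34]

# Quartile cut points; "inclusive" treats the data as the whole population.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1  # interquartile range: Q3 minus Q1

print(five_number_summary)
print(iqr)
```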
