CS412 Assignment 1 Ref Solution

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

CS412 Assignment 1 Ref Solution

Question 1: (Open question, following are some possible answers)


In the introduction chapter, we have introduced different ways to perform data mining: (1) using a data mining
language to write data mining functions in a commercial data mining system, e.g., Microsoft SQL Server Analysis
Manager or Oracle Data Mining, (2) develop your own customized data mining functions from scratch for
particular applications, and (3) write functions for invisible data mining.

1. Present examples for each way of building data mining functions.


Answer: (answers from rawlani1 and duan9)
(1) An insurance company which has its user and business data stored in IBMs’ DB2 relational database. The
insurance company can use DB2 Intelligent Miners’ Classification function to create a risk group profile in the
form of a data mining model. This profile will contain attributes common to lapsed customers, and can be
applied to new customers to see if they belong to the risk group. (Many students think SQL is a data mining
language, which incorrect. SQL is a classical database retrieval tech rather than data mining method)
(2) Many web data mining problems need to be solve from scratch because this is a new area and tasks could be
various, for example, mining link structure of web pages to decide the importance of web pages, or mining Email
repositories to discover social networks.
(3) Invisible data mining is extensively used in online shopping websites. When users select an item to purchase,
they usually see suggestions based on the popularity and similarity of the item from the website, saying
something like “Customers who viewed this item, also viewed the following”. This is a classic example of
invisible data mining where the end user does nothing to get these suggestions. This kind of usage of invisible
data mining can be found in websites selling electronic items, cars, clothes, airline ticketing agents etc like
Amazon, Ebay and EDreams.

2. Discuss the pros and cons of each way to perform data mining.
Answer: (answers from rawlani1)
1) Commercial Data Mining System
Pros: A vast majority of commonly used data mining functions are already present in these systems, so this
would save a huge amount of development time. These functions are implemented by data mining experts in
high profile companies, so they are very much reliable.
Cons: They may not necessarily suit your needs. Integration to the backend is usually not so straight forward.
2) Customized Data Mining Functions
Pros: It always meets the application specific needs. Customizable when application requirements change.
Cons: To write them correctly, a lot of experience is required.
3) Invisible Data Mining
Pros: This has made online shopping very easy.
Cons: All the available information cannot be shown to user because of data privacy issues.

3. Suppose you are hired by a web-retail company, such as Amazon.com, discuss three data mining functions you
would build for the company so that customers at home can do "invisible data mining".
Answer: (answers from rawlani1)

1
1) We can display other items having similar configuration and price to the item the user chooses to view.
2) We can make suggestions on items that go along with the product being viewed. These are not similar items,
but other accessories that go along with it (like headphones, mic when browsing laptop).
3) Suggested products can be ranked based on customer reviews and brand popularity.
4) Products recommendation based on the purchase and browsing history of other users.

Question 2:
Suppose a student collected the price and weight of 20 products in a shop with the following result:
price $5.89 $49.59 $59.98 $159 $17.99 $56.99 $82.75 $142.19 $31 $125.5
weight 1.4 1.5 2.2 2.7 3.2 3.9 4.1 4.1 4.6 4.8

price $4.5 $22 $52.9 $61 $33.5 $328 $128 $142.19 $229 $189.4
weight 4.9 5.3 5.5 5.8 6.2 8.9 11.6 18.0 22.9 38.2

1. Calculate the mean, Q1, median, Q3, and standard deviation of price and weight;
Answer:
Different textbooks and software have different definitions of Q1 and Q3. In Matlab, if Q1 is in between element
𝐸𝑙𝑒𝑚𝑒𝑛𝑡 𝑁 +𝐸𝑙𝑒𝑚𝑒𝑛𝑡 (𝑁+1)
N and N+1, it would return 2
, while in Microsoft Excel, it defines as 0.25 ∗ 𝐸𝑙𝑒𝑚𝑒𝑛𝑡 𝑁 +
0.75 ∗ 𝐸𝑙𝑒𝑚𝑒𝑛𝑡(𝑁 + 1). In this assignment, we take both definitions as correct answers.
For standard deviation, there are two definitions as well, but actually they have different semantic meanings.
1 𝑁
The first definition is 𝑖=1(𝑥𝑖 − 𝑥 )2 , denoted as 𝑆𝑛 . It’s called standard deviation of the sample, which has a
𝑁
uniformly smaller mean squared error than the "sample standard deviation" (see below), and is the maximum-
likelihood estimate when the population is normally distributed. But this estimator, when applied to a small or
moderately-sized sample, tends to be too low: it is a biased estimator.
1 𝑁
The second definition is 𝑁−1 𝑖=1(𝑥𝑖 − 𝑥 )2 , where 𝑥𝑖 is the sample and 𝑥 is the mean of the sample. This
definition is called sample standard deviation. This correction (the use of N − 1 instead of N) is known as Bessel's
correction. The reason for this correction is that 𝑆 2 is an unbiased estimator for the variance of the underlying
population, if that variance exists and the sample values are drawn independently with replacement.
In this assignment, we take both definitions as correct answers.
For price, mean: 96.0685, Q1: 32.25 or 32.875, median: 60.49, Q3: 142.19, std-dev: 84.22 or 82.09
For weight, mean: 7.99, Q1: 3.55 or 3.725; median: 4.85, Q3: 7.55 or 6.875, std-dev: 8.7139 or 8.9403

2. Draw the boxplots for price and weight;


Answer:

2
3. Draw scatter plot and Q-Q plot based on these two variables;
Answer:

4. Normalize the two variables based on the min-max normalization (min = 1, max = 10);
Answer:
Norm price = (price – 4.5) / (328 – 4.5) * (10-1) + 1
Norm weight = (weight – 1.4)/(38.2 – 1.4) * (10-1) + 1
Price 1.0387 2.2544 2.5435 5.2983 1.3753 2.4603 3.1770 4.8306 1.7372 4.3663
Weight 1.0000 1.0245 1.1957 1.3179 1.4402 1.6114 1.6603 1.6603 1.7826 1.8315
Price 1.0000 1.4869 2.3465 2.5719 1.8068 10.0000 4.4359 4.8306 7.2457 6.1440
Weight 1.8560 1.9538 2.0027 2.0761 2.1739 2.8342 3.4946 5.0598 6.2582 10.0000

3
5. Normalize the two variables based on the z-score normalization;
Answer: (from luu1)
Norm price = (price – mean)/std_dev = (price – 96.07)/84.2249
Norm weight = (weight – 7.99)/8.9403
Price -1.0707 -0.5518 -0.4285 0.7472 -0.9270 -0.4640 -0.1581 0.5476 -0.7726 0.3494
Weight -0.7371 -0.7259 -0.6476 -0.5917 -0.5358 -0.4575 -0.4351 -0.4351 -0.3792 -0.3568
Price -1.0872 -0.8794 -0.5125 -0.4164 -0.7429 2.7537 0.3791 0.5476 1.5783 1.1081
Weight -0.3456 -0.3009 -0.2785 -0.2450 -0.2002 0.1018 0.4038 1.1196 1.6677 3.3791
Or you can use 82.09 and 8.7139 as std_dev for price and weight.

6. Calculate the Pearson correlation coefficient. Are these two variables positively or negatively correlated?
Answer: Pearson correlation coefficient = 0.536307. These 2 variables are positively correlated.

7. Take the price of the above 20 products, partition them into four bins by each of the following methods: equal-
width partitioning, and equal-depth (equal-frequency) partitioning.
Answer:
a) Equal-width partitioning: the width of each bin = (328 – 4.5)/4 = 80.875
Bin 1: $4.50 $5.89 $17.99 $22 $31 $33.50 $49.59 $52.90 $56.99 $59.98 $61 $82.75
Bin 2: $125.50 $128 $142.19 $142.19 $159
Bin 3: $189.40 $229
Bin 4: $328
b) Equal-depth partitioning:
Bin 1: $4.50 $5.89 $17.99 $22 $31
Bin 2: $33.50 $49.59 $52.90 $56.99 $59.98
Bin 3: $61 $82.75 $125.50 $128 $142.19
Bin 4: $142.19 $159 $189.40 $229 $328

Question 3: (Open question, following are some possible answers)


Design a data warehouse for Walmart-like chain store to analyze the sales transactions. Suppose the data
warehouse consisting of the following dimensions: product, store, sales, and time; and a set of measures you
would like to define.
1. Draw a star-schema, based on your consideration of power and convenience of analysis of the warehouse.
Answer: (Some students take sale as measure instead of dimension, which is feasible in this scenario as well)
From nagaraj3:

4
Or answer from rawlani1:

2. Suppose you start from the top (all-summary) of the multi-dimensional hierarchy, what are the concrete OLAP
operations (drilling, slicing, etc.) you need to find the following:
* Average daily profit of each product in the Toys department in January 2009.
* Find which store has the highest monthly increase of sales of bread among all the stores in Illinois
Answer: from duan9:
* drill down on product to department
drill down on time to month
dice on product department = toys and month = Jan 2009

5
drill down on time to day
average profit on day
* drill down on product to type
drill down on store to state
dice on product type = bread and state of store = Illinois
drill down on time to month
calculate monthly increase for each month
drill down on store to street number
argmax monthly increase on store

3. Suppose we want to present the standard-deviation of sales by item category, location and week, and freely
drilling up and down in multidimensional space, describe how this measure can be computed efficiently
Answer: from chiwang1
For each cuboid, we calculate the standard deviation as well as the mean and count. Assume the number of
values to be aggregated in one dimension of one cuboid is n, and we want to calculate the aggregated measure
along this dimension, i.e., compute stdagg , meanagg and countagg from stdi , meani and counti , 1<=i<=n.

In this way the aggregated measures can be computed in a distributed manner.

4. Median and rank are two holistic measures. Discuss how to develop efficient (maybe approximate) methods to
compute these two measures in a multi-dimensional space.
Answer: from luu1
Median: We can compute the approximate median as following
Step 1: partition the data info k bins using equal-width method.
Step 2: For each bin, we store the following values: frequency, lower and upper boundaries. We also store the
bin width.
Step 3: When user queries the median of a cell, we compute the median bin ( with cost O(k)) and return the
approximate median computed as following1
N
− lower bin i freq i
2
Median = L1 + freq median
width [1]

Where L1 is the lower boundary of the median interval, N is the sum of frequencies of all bins. lower bin i freqi
is the sum of the frequencies of all bins lower than the median bin. freqmedian is the frequency of the median
bin and width is the width of the median bin.
However, the efficiency of the above algorithm might depend on various method of binning. A baseline method
is equal-width binning assuming some min and max value of the data. The best case scenario is when we have

1
Same as formula 2.3 in our textbook.

6
equal-depth binning for the query cell. In reality, the more we know about the data, the closer to the ideal case
we can get.
Moreover, space and performance also depends on the parameter k. The bigger k is, the closer the approximate
result, but the more memory and computation will be required.
For approximate rank. We can do similarly. The different is that instead of finding the median bin, we find the
bin that contains element of rank m. The approximate rank is
𝑚 − 𝑙𝑜𝑤𝑒𝑟 𝑏𝑖𝑛 𝑖 𝑓𝑟𝑒𝑞 𝑖
Rank(m) = L1 + 𝑓𝑟𝑒𝑞 𝑏𝑖𝑛 _𝑚
width

Question 4: (Open question, following are some possible answers)


Suppose a company would like to design a data warehouse that may facilitate the analysis of moving vehicles in
an online analytical processing manner. The company registers huge amounts of auto movement data in the
format of (Auto_ID, location, speed, time). Each Auto_ID represents one vehicle associated with information,
such as vehicle category, driver_category, etc., and each location could be associated with a street in a city. You
may assume a street map is available for the city.
1. Design such a data warehouse that may facilitate effective on-line analytical processing in multidimensional
space.
Answer: from rawlani1
To design a data warehouse for the analysis of moving vehicles, we can consider vehicle as a fact table that
points to four dimensions: auto, time, location and speed. The measures can vary depending on the desired
target and the power required. Here, measures considered are vehicle_sold and vehicle mileage.

2. The movement data may contain noise. Discuss how you can develop a method that may automatically
disocver some data records are likely erroneously registered in the data repository.
Answer: from duan9
To handle the noise in data, we first need to do data cleaning. Missing values may be filled or dropped entirely,
depending on the tolerance of the system. Then we can use some data smoothing techniques to remove noisy
data points, for example, regression and outlier analysis. Finally, we can also set up some rules to detect
inconsistent data and remove them based on domain knowledge.

7
3. The movement data may be sparse. Discuss how you can develop a method that may construct reliable data
warehouse despite of the sparsity of data.
Answer: from rawlani1
It may be possible to get a data in the data warehouse that may be sparse. Analyzing a sparse data is not reliable
as a single outlier may completely shift the results. Hence there are few efficient values to deal with it. We have
to evaluate the confidence interval in such case where confidence interval is a measure that defines the
reliability of the data. Smaller the confidence interval, better it is. If the confidence interval is too large this
indicates larger ambiguity in the data. Hence, for our vehicle database we can compute confidence interval.
Confidence interval can be large in an example like when data size was large enough, but the queried data cell
had only few or no values. In such a case, we have 2 ways to resolve this issue:
Intra-cuboid query: Expand the sample size by including the nearby cells in the same cuboid as the queried cell
that reduces the confidence interval.
Inter-cuboid query: It is the extreme case of intra-cuboid. In this remove the dimension by generalizing it.

Also, sparsity can be considered as a missing value. We may use the existing data to find that missing value. This
technique is most commonly used in machine learning. E.g. if speed is missing, then we can view speed as per a
function of location or time. Hence, speed recorded previously on that particular street and that particular time
may be considered instead of that missing value.
Provided the semantics of the query is not change, we can even reduce the dimensionality of the data cube.
Hence, if some cell was sparse at a query execution for a particular hour, we may generalize it to a day. This may
now give some values in that cell.

4. If one wants to drive from A to B starting at a particular time, discuss how a system may use the data in this
warehouse to work out a fast route for the driver.
Answer: from luu1
Using this warehouse, we can look up the information for those vehicles of the same vehicle category and driver
category. Then using OLAP operation (drill, dice,..) we look up for the speed of a location at a specific time (at
level of hour) and will use that as the weight for the street on the city graph. We need to find the weight for all
the possible paths from the start location to end location. Using this weighted graph we can work out the fastest
route for the driver by any famous algorithm such as A* and Dijktra. We might need to update the weights every
hour. Using this algorithm, we don’t care about the direction of the street. We can also integrated that
information and create a directed graph. Based on the graph, we can calculate the fastest route.

You might also like