Data Mining Paper Solution
Question 1:
1) Solution
Knowledge discovery as a process consists of an iterative sequence of the following steps:
1. Data cleaning:
where noise and inconsistent data are removed.
2. Data integration:
Data integration merges data from multiple sources into a coherent data store, such as a data
warehouse.
3. Data selection:
where data relevant to the analysis task are retrieved from the database.
4. Data transformation:
where data are transformed or consolidated into forms appropriate for mining by performing
summary or aggregation operations.
For example, normalization may improve the accuracy and efficiency of mining algorithms
involving distance measurements.
5. Data mining:
an essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation:
to identify the truly interesting patterns representing knowledge based on some interestingness
measures.
7. Knowledge presentation:
where visualization and knowledge representation techniques are used to present the mined
knowledge to the user.
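The remark under step 4 about normalization and distance measurements can be illustrated with a minimal sketch. The records, the attribute names (age, income) and the helper functions below are hypothetical, and min-max scaling is only one of several normalization methods that could be used here.

```python
import math

def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: age in years and income in dollars.
ages = [25, 40, 60]
incomes = [30000, 31000, 90000]

raw = list(zip(ages, incomes))
scaled = list(zip(min_max_normalize(ages), min_max_normalize(incomes)))

# Distance between the first two records before and after normalization.
raw_distance = euclidean(raw[0], raw[1])
scaled_distance = euclidean(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

Before scaling, the distance is dominated by the income attribute simply because its numeric range is larger; after scaling, both attributes contribute on a comparable footing.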
2) Solution
Snowflake schema
The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected
to multiple dimensions. In the snowflake schema, dimensions are present in a normalized form in
multiple related tables. The snowflake structure materializes when the dimensions of a star schema
are detailed and highly structured, having several levels of relationship, and the child tables have
multiple parent tables. The snowflake effect affects only the dimension tables and does not affect
the fact tables.
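As a rough sketch of the idea, the snippet below builds a tiny snowflake schema in an in-memory SQLite database. The table and column names (sales_fact, product, category, units_sold) and the sample rows are assumptions made for illustration only; the point is that the product dimension is normalized into a separate category table, while the fact table is unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized dimension: category is split out of the product dimension.
CREATE TABLE category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES category(category_id)
);
-- The fact table references only the product dimension, as in a star schema.
CREATE TABLE sales_fact (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES product(product_id),
    units_sold INTEGER
);
INSERT INTO category VALUES (1, 'Beverages'), (2, 'Snacks');
INSERT INTO product VALUES (10, 'Tea', 1), (11, 'Coffee', 1), (12, 'Chips', 2);
INSERT INTO sales_fact VALUES (1, 10, 5), (2, 11, 3), (3, 12, 7);
""")

# Reaching the category level needs an extra join through the product table.
rows = conn.execute("""
    SELECT c.category_name, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN product  AS p ON p.product_id  = f.product_id
    JOIN category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(rows)
```

The extra join is the usual price of normalization: less redundancy in the dimension tables, at the cost of more joins at query time.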
Fact constellation
A fact constellation is a schema for representing a multidimensional model. It is a collection of
multiple fact tables that share some common dimension tables. It can be viewed as a collection of
several star schemas and hence is also known as a galaxy schema. It is one of the widely used schemas
for data warehouse design, and it is much more complex than the star and snowflake schemas. For
complex systems, we require fact constellations.
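The defining property, several fact tables with some dimension tables in common, can be sketched as a small check over a schema description. The table names (sales_fact, shipping_fact, time_dim and so on) and both helper functions are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical schema: each fact table maps to the dimension tables it references.
fact_tables = {
    "sales_fact": {"time_dim", "product_dim", "store_dim"},
    "shipping_fact": {"time_dim", "product_dim", "shipper_dim"},
}

def shared_dimensions(facts):
    """Return the dimension tables referenced by more than one fact table."""
    counts = Counter(dim for dims in facts.values() for dim in dims)
    return {dim for dim, n in counts.items() if n > 1}

def is_fact_constellation(facts):
    """True when there are several fact tables and at least one shared dimension."""
    return len(facts) > 1 and len(shared_dimensions(facts)) > 0

print(sorted(shared_dimensions(fact_tables)))
print(is_fact_constellation(fact_tables))
```

Here time_dim and product_dim are shared, so the two star schemas around sales_fact and shipping_fact together form a constellation.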
Question 2: Solution
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY       X²       Y²
1         43        99                  4257     1849     9801
2         21        65                  1365     441      4225
3         25        79                  1975     625      6241
4         42        75                  3150     1764     5625
5         57        87                  4959     3249     7569
6         59        81                  4779     3481     6561
Σ         247       486                 20485    11409    40022
From our table:
Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case n = 6.
The correlation coefficient is
r = [n(Σxy) - (Σx)(Σy)] / √{[n(Σx²) - (Σx)²] × [n(Σy²) - (Σy)²]}
  = [6(20,485) - (247 × 486)] / √{[6(11,409) - 247²] × [6(40,022) - 486²]}
  = 2,868 / √(7,445 × 3,936)
  = 0.5298
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which
means the variables have a moderate positive correlation.
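The calculation can be checked with a short script. This is a minimal sketch that applies the same computational formula to the six (age, glucose level) pairs from the table; the function name pearson_r is just an illustrative choice.

```python
import math

# Data from the table: age (x) and glucose level (y) for the six subjects.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

def pearson_r(x, y):
    """Pearson correlation coefficient via the sums Σx, Σy, Σxy, Σx², Σy²."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(ages, glucose)
print(round(r, 4))
```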
Question 3: Solution
Read transaction 1: {B,P} -> Create 2 nodes, B and P. Set the path as null -> B -> P
and set the counts of B and P to 1.
Read transaction 2: {B,P} -> The path will be null -> B -> P. Transactions 1
and 2 share the same path, so set the counts of B and P to 2.
Read transaction 3: {B,P,M} -> The path will be null -> B -> P -> M.
Transactions 2 and 3 share the same path up to node P. Therefore, set the counts of B
and P to 3 and create node M with count 1.
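The three insertions above can be replayed with a minimal FP-tree sketch. The FPNode class and insert_transaction function are illustrative names, and the code assumes each transaction's items are already sorted in the order used above (B, then P, then M); it covers only tree construction, not header links or pattern mining.

```python
class FPNode:
    """One node of an FP-tree: an item, its count, and its child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # maps item -> FPNode

def insert_transaction(root, items):
    """Walk down from the root, reusing shared prefixes and incrementing counts."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            # No existing path for this item: create a new node.
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1
        node = child

# The root plays the role of the "null" node.
root = FPNode(None)
for transaction in [["B", "P"], ["B", "P"], ["B", "P", "M"]]:
    insert_transaction(root, transaction)

b = root.children["B"]
p = b.children["P"]
m = p.children["M"]
print(b.count, p.count, m.count)
```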
Question 4: Solution