Assignment 1
(b) The modes of the data are 25 and 35. Since the data has 2 values that occur with the same
highest frequency, it is bimodal.
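As a quick illustration of how a bimodal result can be detected, here is a minimal sketch using Python's standard library. The sample values are hypothetical and are not the assignment's data set.

```python
from statistics import multimode

# Hypothetical sample (an assumption for illustration, not the assignment data).
sample = [13, 20, 25, 25, 35, 35, 70]

# multimode returns every value tied for the highest frequency,
# in the order the values first appear in the data.
modes = multimode(sample)
is_bimodal = len(modes) == 2

print(modes, is_bimodal)
```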
(d) The first quartile (25th percentile) of the data is 20. The third quartile (75th percentile) of the
data is 35.
(e) The five-number summary (minimum value, first quartile, median, third quartile, maximum
value) of the data is: 13, 20, 25, 35, 70.
(f) [Figure: box plot of Age, with marked values 13, 20, 25, 35 and 70.]
(g) A quantile plot displays quantile information for all the data: each value measured for the
independent variable is plotted against its corresponding quantile. It shows the approximate
percentage of values less than or equal to a given value in a univariate distribution.
A quantile-quantile plot graphs the quantiles of one univariate distribution against the corresponding
quantiles of another univariate distribution. Each axis displays the range of values measured for
its distribution, and the plotted points correspond to the quantile values of the two
distributions. A line (y = x) can be included in the graph, along with points showing where the first,
second and third quartiles lie. Points above the line show a higher value for the distribution plotted
on the y-axis than for the distribution plotted on the x-axis at the same quantile.
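The pairing of quantiles behind a q-q plot can be sketched as follows. Both samples are hypothetical, and only the three quartiles are paired here; a real plot would normally use many more quantile levels.

```python
from statistics import quantiles

# Hypothetical samples (assumptions for illustration only).
x_sample = [13, 16, 20, 22, 25, 30, 35, 40, 52, 70]
y_sample = [v + 5 for v in x_sample]  # same shape, shifted upward by 5

# quantiles(..., n=4) returns the three quartile cut points.
x_q = quantiles(x_sample, n=4)
y_q = quantiles(y_sample, n=4)

# Each q-q point pairs the same quantile level of the two distributions.
qq_points = list(zip(x_q, y_q))

# A point above the line y = x means the y-distribution has the higher
# value at that quantile.
above_line = [y > x for x, y in qq_points]

print(qq_points, above_line)
```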
2.3
2.4
(a)
Age: Mean = $\frac{1}{N}\sum_{i=1}^{N} x_i$ = 46.44, Median = 51
[Figure: box plot of the variable age.]
[Figure: box plot of the variable %fat.]
(c)
[Figure: scatter plot, axes Age and Fat.]
[Figure: q-q plot, axes Age and Fat.]
2.5
(a) Nominal Attributes
The dissimilarity between two objects i and j can be calculated from the ratio of mismatches:
$$ d(i, j) = \frac{p - m}{p} $$
where m is the number of matches (the number of variables for which i and j are in
the same state), and p is the total number of variables. Another method would be to
create a new binary variable for each of the M nominal states, which results in a large number
of binary variables. For an object in a given state, the binary variable that represents that state is set
to 1, and the remaining binary variables are set to 0.
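The mismatch ratio d(i, j) = (p - m) / p can be sketched in a few lines. The function name and the example objects are hypothetical and chosen only for illustration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatching nominal attributes, d(i, j) = (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                   # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matches
    return (p - m) / p


# Hypothetical objects described by four nominal attributes.
obj_i = ("red", "circle", "small", "metal")
obj_j = ("red", "square", "small", "wood")

d = nominal_dissimilarity(obj_i, obj_j)  # 2 mismatches out of 4 attributes
print(d)
```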
        1       0       sum
1       q       r       q + r
0       s       t       s + t
sum     q + s   r + t   p
Using the table above, and ignoring the number of negative matches, t, which is unimportant, we
have:
$$ d(i, j) = \frac{r + s}{q + r + s} $$
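A minimal sketch of the dissimilarity (r + s) / (q + r + s), which ignores the negative matches t, is shown below. It assumes each object is a list of 0/1 values; returning 0.0 when q + r + s is zero is an assumption of this sketch, not something stated above.

```python
def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """d(i, j) = (r + s) / (q + r + s), ignoring negative matches t."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(a == 1 and b == 1 for a, b in pairs)  # both objects are 1
    r = sum(a == 1 and b == 0 for a, b in pairs)  # 1 for i, 0 for j
    s = sum(a == 0 and b == 1 for a, b in pairs)  # 0 for i, 1 for j
    if q + r + s == 0:
        return 0.0  # assumption: objects with no positive values count as identical
    return (r + s) / (q + r + s)


# Hypothetical binary objects: q = 1, r = 1, s = 1, t = 3.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 1, 0, 0, 0, 0]

d = asymmetric_binary_dissimilarity(obj_i, obj_j)
print(d)
```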
(c) Numeric Attributes
The dissimilarity can be calculated using Euclidean distance, Manhattan distance, or supremum distance.
The supremum distance is:
$$ d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^{h} \right)^{1/h} = \max_{f} |x_{if} - x_{jf}| $$
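To see the limit in the supremum distance at work, the following sketch evaluates the order-h distance for growing h and compares it with the largest per-attribute difference. The two points and the function names are hypothetical.

```python
def minkowski(x, y, h):
    """Order-h distance: (sum over f of |x_f - y_f|^h) ** (1/h)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def supremum(x, y):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical points; the per-attribute differences are 3, 4 and 0.5.
x = (1.0, 5.0, 2.0)
y = (4.0, 1.0, 2.5)

# h = 1 is the Manhattan distance and h = 2 the Euclidean distance;
# larger h moves the value toward the supremum distance.
for h in (1, 2, 10, 50):
    print(h, minkowski(x, y, h))
print("supremum", supremum(x, y))
```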
(d) Term-frequency vectors
It is easier to measure the distance between vector objects using a nonmetric similarity function. The
similarity between two vectors, x and y, can be defined as a cosine measure:
$$ s(x, y) = \frac{x^{t} \cdot y}{\|x\| \, \|y\|} $$
where $x^{t}$ is the transpose of vector x, $\|x\|$ is the Euclidean norm of x, and $\|y\|$ is the
Euclidean norm of y.
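The cosine measure, the dot product of x and y divided by the product of their Euclidean norms, can be sketched as below. The term-frequency vectors are hypothetical, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)  # assumes both norms are non-zero


# Hypothetical term-frequency vectors over the same five terms.
doc1 = [5, 0, 3, 0, 2]
doc2 = [10, 0, 6, 0, 4]   # same direction as doc1, twice the length
doc3 = [0, 4, 0, 1, 0]    # shares no terms with doc1

print(cosine_similarity(doc1, doc2))  # close to 1: same term proportions
print(cosine_similarity(doc1, doc3))  # 0: no terms in common
```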
2.6
(a) The Euclidean distance is defined as:
$$ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + \dots + (x_{in} - x_{jn})^2} $$
$$ = \sqrt{(22 - 20)^2 + (1 - 0)^2 + (42 - 36)^2 + (10 - 8)^2} = \sqrt{45} = 6.7082 $$
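With h = 3, the Minkowski distance is:
$$ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + \dots + |x_{in} - x_{jn}|^h} $$
$$ = \sqrt[3]{|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3} = \sqrt[3]{233} = 6.1534 $$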
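The arithmetic for the two tuples (22, 1, 42, 10) and (20, 0, 36, 8) can be checked with a short script. This is only a verification sketch; `math.dist` requires Python 3.8 or later.

```python
import math

# The two tuples used in the calculation above.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

# Euclidean distance: square root of the sum of squared differences (sqrt(45)).
euclidean = math.dist(x, y)

# Order-3 distance: cube root of the sum of cubed absolute differences (233 ** (1/3)).
minkowski_3 = sum(abs(a - b) ** 3 for a, b in zip(x, y)) ** (1 / 3)

print(euclidean, minkowski_3)
```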
2.7
If we use the equation provided in the textbook, the easiest way to approximate the median is to
divide all the data into k equal-length intervals and apply:
$$ \text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} $$
where $L_1$ is the lower boundary of the median interval, N is the number of values in
the entire data set, $(\sum \text{freq})_l$ is the sum of the
frequencies of all the intervals that are lower than the
median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the
median interval.
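The interpolation formula, median = L1 + ((N/2 - sum of lower frequencies) / freq_median) * width, can be sketched as follows. The function name, the interval boundaries and the frequencies are hypothetical.

```python
def approximate_median(boundaries, freqs):
    """Interpolated median of grouped data.

    boundaries: interval edges in increasing order, len(freqs) + 1 entries
    freqs:      number of values falling in each interval
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    lower_freq_sum = 0  # sum of frequencies of intervals below the current one
    for idx, freq in enumerate(freqs):
        if lower_freq_sum + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            lower = boundaries[idx]
            width = boundaries[idx + 1] - boundaries[idx]
            return lower + (n / 2 - lower_freq_sum) / freq * width
        lower_freq_sum += freq


# Hypothetical grouped data: intervals [0,10), [10,20), [20,30), [30,40].
edges = [0, 10, 20, 30, 40]
counts = [2, 5, 8, 5]

# N = 20, median interval is [20,30): 20 + (10 - 7) / 8 * 10
print(approximate_median(edges, counts))
```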
The error introduced decreases as the value of k increases, but the computation time
increases as well. We can then calculate the median for different types of distributions (e.g.
Gaussian, exponential, uniform) and find the value of k that gives the best trade-off between
the time invested and the error observed.
Another method would be to divide the data set into k intervals and find which
interval the median is in, then divide that interval into k sub-intervals and find which sub-interval
the median is in. This step is repeated until the width of the interval reaches a
defined value. We can then use a combination of forward selection and backward elimination to
approximate the median. With this method the median is confined to a small range without dividing all
the data into short intervals, which is expensive because the cost is proportional to the number of intervals
used.
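The repeated narrowing described above can be sketched roughly as follows. The function name, the default k and tolerance, and the choice to return the midpoint of the final interval are all assumptions of this sketch; it does not implement the forward selection and backward elimination step.

```python
from bisect import bisect_right


def narrow_median(data, k=10, tol=1e-3):
    """Shrink an interval around the median by repeated k-way splitting."""
    n = len(data)
    if n == 0:
        raise ValueError("no data")
    lo, hi = min(data), max(data)
    hi_closed = True   # whether values equal to hi belong to the current interval
    below = 0          # number of values known to lie below lo
    target = n / 2     # rank threshold, as in the N/2 term of the grouped formula
    while hi - lo > tol:
        width = (hi - lo) / k
        edges = [lo + t * width for t in range(k)] + [hi]
        counts = [0] * k
        for v in data:
            if lo <= v < hi or (hi_closed and v == hi):
                counts[min(bisect_right(edges, v) - 1, k - 1)] += 1
        # Find the sub-interval in which the cumulative count reaches the target.
        cumulative = below
        for idx, c in enumerate(counts):
            if cumulative + c >= target:
                break
            cumulative += c
        below = cumulative
        hi_closed = hi_closed and idx == k - 1
        lo, hi = edges[idx], edges[idx + 1]
    return (lo + hi) / 2  # assumption: report the midpoint of the final interval


# Hypothetical data with nine values; the exact median is 20.
values = [13, 15, 16, 19, 20, 25, 30, 35, 70]
estimate = narrow_median(values, k=4, tol=0.01)
print(estimate)
```

Each pass scans the data once, so the total cost grows with the number of passes rather than with the number of fine-grained intervals.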
2.8
Using the equations, I obtained the following results:
(b) The normalized query is (0.65850, 0.75258). The normalized data set is:
        A1        A2
x1      0.66162   0.74984
x2      0.72500   0.68875
x3      0.66436   0.74741
x4      0.62470   0.78087
x5      0.83205   0.55470
The Euclidean distances from the normalized query are:
x1      0.00415
x2      0.09217
x3      0.00781
x4      0.04409
x5      0.26320
I obtained the following ranking: x1, x3, x4, x2, x5
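The ranking can be reproduced from the normalized values reported above. This sketch starts from those rounded values rather than from the raw data, so the distances may differ slightly in the last decimal place.

```python
import math

# Normalized query and data points, as reported above.
query = (0.65850, 0.75258)
points = {
    "x1": (0.66162, 0.74984),
    "x2": (0.72500, 0.68875),
    "x3": (0.66436, 0.74741),
    "x4": (0.62470, 0.78087),
    "x5": (0.83205, 0.55470),
}

# Euclidean distance from the query to each normalized point.
distances = {name: math.dist(query, p) for name, p in points.items()}

# Rank from most similar (smallest distance) to least similar.
ranking = sorted(distances, key=distances.get)

for name in ranking:
    print(name, round(distances[name], 5))
```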