Assignment 1


2.2

(a) The mean of the data is x̄ = (1/N) Σ_{i=1..N} x_i = 809/27 ≈ 30.

The median is 25.

(b) The modes of the data are 25 and 35. Since two values occur with the same highest frequency, the data is bimodal.

(c) The midrange of the data is (70+13)/2 = 41.5.

(d) The first quartile (25th percentile) of the data is 20. The third quartile (75th percentile) of the data is 35.

(e) The five-number summary (minimum value, first quartile, median, third quartile, maximum value) of the data is: 13, 20, 25, 35, 70.
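The statistics in (a)–(e) can be sanity-checked in a few lines. The age values below are an assumption reconstructed from the exercise (27 values summing to 809, consistent with the answers above); treat them as illustrative if your copy of the data differs:

```python
import statistics

# Age data assumed from the exercise (27 values, sum 809) -- an assumption,
# not quoted from this document, which omits the raw list.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

q1, q2, q3 = statistics.quantiles(ages, n=4)   # default 'exclusive' method
print(round(statistics.mean(ages), 1))         # mean, approximately 30
print(statistics.median(ages))                 # 25
print((min(ages) + max(ages)) / 2)             # midrange 41.5
print(min(ages), q1, q2, q3, max(ages))        # five-number summary
```

Note that different quartile conventions give slightly different answers; the `exclusive` method used by `statistics.quantiles` matches the values 20 and 35 quoted above, while NumPy's default linear interpolation would give 20.5 for the first quartile.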

(f) [Boxplot of the data: values on the vertical axis, Age on the horizontal axis, with 13, 20, 25, 35, and 70 marked.]

(g) A quantile plot displays quantile information for all the data: each measured value of the independent variable is plotted against its corresponding quantile. It is used to show the approximate percentage of values below or equal to a given value in a univariate distribution.

A quantile-quantile (q-q) plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution. Both axes display the range of values measured for their corresponding distribution, and points are plotted at the corresponding quantile values of the two distributions. A line y = x can be included in the graph to show where the first, second, and third quartiles lie. Points above the line indicate a higher value for the distribution plotted on the y-axis than for the distribution plotted on the x-axis at the same quantile.

2.3

Median = L1 + ((N/2 − Σ freq_l) / freq_median) × width

where
L1 (lower boundary of the median interval) = 20
N (number of values in the entire data set) = 3194
Σ freq_l (sum of the frequencies of all the intervals lower than the median interval) = 950
freq_median (frequency of the median interval) = 1500
width (width of the median interval) = 30

Median = 20 + ((3194/2 − 950) / 1500) × 30 = 32.94 years
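The grouped-median formula above translates directly into a small function; plugging in the interval statistics from this exercise reproduces the answer:

```python
def grouped_median(lower, n, cum_freq_below, median_freq, width):
    """Approximate median from a frequency table:
    L1 + ((N/2 - sum(freq_l)) / freq_median) * width."""
    return lower + ((n / 2 - cum_freq_below) / median_freq) * width

# Values from this exercise: median interval starts at 20 and has width 30;
# N = 3194 total values, 950 below the interval, 1500 inside it.
print(round(grouped_median(20, 3194, 950, 1500, 30), 2))  # 32.94
```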

2.4

(a)
Age: mean = (1/N) Σ x_i = 46.44, median = 51, standard deviation = 12.85
% Fat: mean = (1/N) Σ x_i = 28.78, median = 30.7, standard deviation = 8.99
(b) [Boxplot of the variable age: values on the vertical axis, roughly 25 to 60.]

[Boxplot of the variable % fat: values on the vertical axis, roughly 10 to 40.]

(c) [Scatter plot of % fat versus age.]

[Q-q plot of % fat versus age.]

2.5
(a) Nominal Attributes
The dissimilarity between two objects i and j can be calculated based on the ratio of mismatches:

d(i, j) = (p − m) / p,

where m is the number of matches (the number of variables for which i and j are in the same state), and p is the total number of variables. Another method is to create a new binary variable for each of the M nominal states, thereby using a large number of binary variables. For a given state, the binary variable that represents the state is set to 1, and the remaining binary variables are set to 0.
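The mismatch ratio is straightforward to compute; a minimal sketch, using hypothetical attribute tuples of my own (not from the exercise):

```python
def nominal_dissimilarity(i, j):
    """Ratio of mismatches d(i, j) = (p - m) / p for two nominal tuples."""
    p = len(i)                             # total number of variables
    m = sum(a == b for a, b in zip(i, j))  # number of matching states
    return (p - m) / p

# Hypothetical objects with 4 nominal attributes; they agree on 2 of 4.
print(nominal_dissimilarity(("red", "S", "A", "x"),
                            ("red", "M", "A", "y")))  # 0.5
```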

(b) Asymmetric binary attributes

                object j
                 1     0    sum
object i   1     q     r    q+r
           0     s     t    s+t
         sum    q+s   r+t    p

Using the table above, and ignoring the number of negative matches t, which is considered unimportant, we have:

d(i, j) = (r + s) / (q + r + s)
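A minimal sketch of this coefficient, counting q, r, and s directly from two binary vectors (the example vectors are hypothetical, not from the exercise):

```python
def asym_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s), ignoring negative matches t."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # positive matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # 1 in i, 0 in j
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # 0 in i, 1 in j
    return (r + s) / (q + r + s)

# Hypothetical vectors giving q = 2, r = 1, s = 1, so d = 2/4 = 0.5.
print(asym_binary_dissimilarity([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))  # 0.5
```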
(c) Numeric Attributes

We can calculate this using Euclidean distance, Manhattan distance, or supremum distance.

Euclidean distance is defined as

d(i, j) = sqrt((x_i1 − x_j1)^2 + ... + (x_in − x_jn)^2),

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional objects.

The Manhattan (or city block) distance is defined as

d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_in − x_jn|

The supremum distance is

d(i, j) = lim_{h→∞} (Σ_{f=1..n} |x_if − x_jf|^h)^{1/h} = max_f |x_if − x_jf|
(d) Term-frequency vectors

It is easier to measure the distance between vector objects using a nonmetric similarity function. The similarity between two vectors, x and y, can be defined as a cosine measure:

s(x, y) = xᵗ y / (‖x‖ ‖y‖),

where xᵗ is the transpose of vector x, ‖x‖ is the Euclidean norm of vector x, ‖y‖ is the Euclidean norm of vector y, and s is the cosine of the angle between vectors x and y.
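The cosine measure can be sketched as follows; the term-frequency vectors in the example are hypothetical, chosen only to exercise the function:

```python
import math

def cosine_similarity(x, y):
    """Cosine measure s(x, y) = x.y / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two short documents.
print(round(cosine_similarity([3, 0, 2], [1, 1, 1]), 4))
```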

2.6
(a) The Euclidean distance is defined as:

d(i, j) = sqrt((x_i1 − x_j1)^2 + ... + (x_in − x_jn)^2)

= sqrt((22 − 20)^2 + (1 − 0)^2 + (42 − 36)^2 + (10 − 8)^2) = sqrt(45) = 6.7082.

(b) The Manhattan distance is defined as:


d(i, j) = |xi1- xj1| + |xi2 - xj2| + ... +|xin- xjn|

|22 - 20| + |1 - 0| + |42 - 36| + |10 - 8| = 11.

(c) The Minkowski distance is defined as:

d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^{1/h},

where h is a real number such that h ≥ 1.

With h = 3:

d(i, j) = (|22 − 20|^3 + |1 − 0|^3 + |42 − 36|^3 + |10 − 8|^3)^{1/3} = 233^{1/3} ≈ 6.1534

(d) Compute the supremum distance between the two objects.

The supremum distance is defined as:

d(i, j) = lim_{h→∞} (Σ_{f=1..p} |x_if − x_jf|^h)^{1/h} = max_f |x_if − x_jf|

Supremum distance = max(|22 − 20|, |1 − 0|, |42 − 36|, |10 − 8|) = 6
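All four answers in this exercise can be checked with one Minkowski function, applied to the two tuples (22, 1, 42, 10) and (20, 0, 36, 8) used in the calculations above:

```python
def minkowski(i, j, h):
    """Minkowski distance of order h; h = 1 is Manhattan, h = 2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

def supremum(i, j):
    """Supremum (L-infinity) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(i, j))

x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(round(minkowski(x, y, 2), 4))  # Euclidean: 6.7082
print(minkowski(x, y, 1))            # Manhattan: 11.0
print(round(minkowski(x, y, 3), 4))  # Minkowski with h = 3: 6.1534
print(supremum(x, y))                # supremum: 6
```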

2.7
If we use the equation provided in the textbook, the easiest way to compute the median is to divide all the data into k equal-length partitions.

Median = L1 + ((N/2 − Σ freq_l) / freq_median) × width

where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, Σ freq_l is the sum of the frequencies of all the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.

The error introduced is reduced as the value of k increases, but the time used also increases. We can then calculate the median for different types of distributions (e.g. Gaussian, exponential, uniform) and find the value of k that gives the best result when weighing the time invested against the amount of error seen.

Another method is to divide the data set into k intervals and find which interval the median is in, then divide that interval into k sub-intervals and find which sub-interval the median is in. This iteration is repeated until the width of the interval reaches a defined value. We can then use a combination of forward selection and backward elimination to approximate the median. Using this method the median is narrowed down to a small interval without dividing all the data into short intervals at once, which is expensive, as the cost is proportional to the number of intervals used.
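The interval-narrowing method above can be sketched as follows. This is a minimal illustration, not an efficient implementation (it rescans the data on each pass, and for even-sized data it converges near the lower middle value rather than interpolating):

```python
def approx_median(data, k=10, tol=1e-6):
    """Approximate the median by repeatedly splitting the interval that
    contains the N/2-th value into k sub-intervals."""
    lo, hi = min(data), max(data)
    half = len(data) / 2
    while hi - lo > tol:
        width = (hi - lo) / k
        for b in range(1, k + 1):
            # Use hi exactly on the last step to avoid floating-point
            # boundaries that fall just short of the maximum value.
            boundary = hi if b == k else lo + b * width
            # Keep the first sub-interval whose upper boundary covers
            # at least half of the values.
            if sum(v <= boundary for v in data) >= half:
                lo, hi = boundary - width, boundary
                break
    return (lo + hi) / 2

print(round(approx_median([1, 2, 3, 4, 5]), 3))  # close to 3
```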

2.8
(a) Using the equations, I obtained the following results:

     Euclidean distance  Manhattan distance  Supremum distance  Cosine similarity
x1   0.1414              0.2                 0.1                0.99999
x2   0.6708              0.9                 0.6                0.99575
x3   0.2828              0.4                 0.2                0.99997
x4   0.2236              0.3                 0.2                0.99903
x5   0.6083              0.7                 0.6                0.96536

I obtained the following rankings of the data points based on similarity:


Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Supremum distance: x1, x4, x3, x5, x2
Cosine similarity: x1, x3, x4, x2, x5
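These rankings can be reproduced in a few lines. The coordinates below (query x = (1.4, 1.6) and five data points) are an assumption reconstructed from the exercise, since this document does not restate the raw data; they are consistent with the distances tabulated above:

```python
import math

# Data assumed from the exercise -- an assumption, not quoted here.
query = (1.4, 1.6)
points = {"x1": (1.5, 1.7), "x2": (2.0, 1.9), "x3": (1.6, 1.8),
          "x4": (1.2, 1.5), "x5": (1.5, 1.0)}

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Rank by Euclidean distance (ascending) and cosine similarity (descending).
by_euclid = sorted(points, key=lambda k: euclidean(points[k], query))
by_cosine = sorted(points, key=lambda k: cosine(points[k], query), reverse=True)
print(by_euclid)  # ['x1', 'x4', 'x3', 'x5', 'x2']
print(by_cosine)  # ['x1', 'x3', 'x4', 'x2', 'x5']
```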

(b) The normalized query is (0.65850, 0.75258). The normalized data set is -

A1 A2
x1 0.66162 0.74984
x2 0.72500 0.68875
x3 0.66436 0.74741
x4 0.62470 0.78087
x5 0.83205 0.55470

Calculating the Euclidean distances-

Euclidean
distances
x1 0.00415
x2 0.09217
x3 0.00781
x4 0.04409
x5 0.26320
I obtained the following ranking: x1, x3, x4, x2, x5
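Part (b) follows the same pattern, normalizing each vector to unit length before measuring Euclidean distance. As in part (a), the raw coordinates are an assumption reconstructed from the exercise:

```python
import math

# Same assumed data as part (a) -- an assumption, not quoted here.
query = (1.4, 1.6)
points = {"x1": (1.5, 1.7), "x2": (2.0, 1.9), "x3": (1.6, 1.8),
          "x4": (1.2, 1.5), "x5": (1.5, 1.0)}

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.hypot(*v)
    return tuple(c / n for c in v)

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

nq = normalize(query)
dists = {k: euclidean(normalize(v), nq) for k, v in points.items()}
print(sorted(dists, key=dists.get))  # ['x1', 'x3', 'x4', 'x2', 'x5']
```

Note that ranking by Euclidean distance between unit-normalized vectors agrees with the cosine-similarity ranking from part (a), since both depend only on the angle between the vectors.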
