U-3 Notes
Introduction
Population in statistics means the whole of the information which comes under the purview of a statistical
investigation. A part of the population selected for study is called a sample. When the sample is drawn properly, it
closely resembles its population in almost all respects.
Inferential statistics uses measurements on samples to learn more about the behavior of populations that
are very large or inaccessible. Samples are used because we know how they are related to populations. For
example, if we want to have an idea of the average income of the people of a country, we would have to collect data on all the
earning individuals in the country, which is quite a difficult task. Hence samples are used.
Mean, median, mode and standard deviation are some examples of statistical measures. They can be evaluated from
both populations and samples. A numerical measure of a sample is called a statistic. A numerical measure of a
population is called a parameter. Population parameters are estimated by sample statistics. When a sample
statistic is used to estimate a population parameter, the statistic is called an estimator of the parameter.
Sampling Theory
The process of selecting a sample from a population/universe is called sampling. Theory of sampling is a study
of relationship existing between a population and samples drawn from the population. The aim is to get
information about the population by examining a sample of it.
A random sample is one in which each element of the population has an equal chance of inclusion in the sample, i.e.
each part of the population has some pre-assigned probability of being selected in the sample.
Sampling Distribution
Sampling distribution of a statistic is the frequency distribution which is formed with the various values of a statistic
calculated from all samples of the same size, i.e. for any sample $x_1, x_2, \ldots, x_n$ of a given finite population, we can
compute a statistic $t(x_1, x_2, \ldots, x_n)$ such as the mean, variance etc. The set of all such statistics, one for each sample, is
called the sampling distribution of the statistic.
The standard deviation of the sampling distribution of a statistic is known as standard error. It is used to
measure the variability of the values of a statistic.
The standard errors of some of the well-known statistics, for large samples, are given below, where $n$ is the
sample size and $\sigma^2$ is the population variance.
Example: A telephone tower monitored for an hour was found to have an estimated mean of 20 signals
transmitted per minute. The variance is known to be 4. Find the standard error for mean.
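The standard error of the mean for a large sample is $\sigma/\sqrt{n}$. A minimal sketch of the computation for this example follows. It assumes that one hour of monitoring gives n = 60 one-minute observations; the notes do not state n explicitly, and the function name is illustrative.

```python
import math

def standard_error_of_mean(variance: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return math.sqrt(variance / n)

# Assumption: 60 one-minute observations in the hour of monitoring.
se = standard_error_of_mean(variance=4.0, n=60)
print(round(se, 3))
```

With these assumed inputs the standard error is about 0.26 signals per minute.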
Test of Significance
Sampling theory deals with the problem of testing hypotheses. A hypothesis is a statement about the
population parameter, i.e. a conclusion tentatively drawn on a logical basis.
The method in which we select samples to learn more about characteristics in a given population is called
hypothesis testing. Hypothesis testing is a systematic way to test claims about a population, i.e. it enables us to
decide, on the basis of the results of the sample, whether (i) the deviation between the observed sample
statistic and the hypothetical parameter value or (ii) the deviation between two sample statistics is
significant.
Step 1: Setting up the Null Hypothesis $H_0$: It is a definite statement about the population parameter, set up in order to decide whether to
accept or reject it. It states that there is no difference between the sample statistic and the population parameter. To
test the statement about the population, we hypothesize that it is true.
Step 2: Setting up Alternative Hypothesis H1 : It is a complementary statement to null hypothesis. It is set in such
a way that the rejection of null hypothesis implies the acceptance of alternative hypothesis.
Types of Errors in Hypothesis Testing: There are two possible types of errors which may arise in testing a
hypothesis.
Step 4: The probability of making a Type I error is denoted by $\alpha$, the level of significance. The probability level
below which we reject a null hypothesis is called the level of significance. In other words, the level of significance is
the size of the Type I error. If the level of significance is 5%, then we say that the probability of committing a Type I
error is 0.05. This means that we are 95% confident that a correct decision is made.
Step 7: Conclusion: If $|z| < z_\alpha$, then we accept the null hypothesis. If $|z| > z_\alpha$, then we reject the null
hypothesis.
Note:
1. Compare the calculated value of $z$ with the critical value $z_\alpha$ at level of significance $\alpha$. The critical value $z_\alpha$ of
the test statistic for a two tailed test is given by $P(|z| > z_\alpha) = \alpha$. By symmetry of the normal curve
$P(z > z_\alpha) + P(z < -z_\alpha) = \alpha$
$2P(z > z_\alpha) = \alpha$
$P(z > z_\alpha) = \dfrac{\alpha}{2}$
In case of a one tailed test, $P(z > z_\alpha) = \alpha$ if it is a right tailed test; $P(z < -z_\alpha) = \alpha$ if it is left tailed.
2. The critical value of $z$ for a one tailed test at level of significance $\alpha$ is the same as the critical value of $z$ for
a two tailed test at level of significance $2\alpha$.
From the normal table, the critical values of $z$ at different levels of significance are listed below:
                    1%                    5%                    10%
Two tailed test     $|z_\alpha| = 2.58$   $|z_\alpha| = 1.96$   $|z_\alpha| = 1.645$
3. The value of the test statistic which separates the critical region and the acceptance region is called
the critical value. This value depends on the level of significance and on the alternative hypothesis.
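The tabulated critical values can be reproduced from the inverse of the standard normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_critical` is illustrative and not part of the notes.

```python
from statistics import NormalDist

def z_critical(alpha: float, two_tailed: bool = True) -> float:
    """Critical value z_alpha of the standard normal distribution.

    For a two tailed test the area alpha is split equally between both tails.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for level in (0.01, 0.05, 0.10):
    print(level, round(z_critical(level), 3), round(z_critical(level, two_tailed=False), 3))
```

The one tailed value at level $\alpha$ coincides with the two tailed value at $2\alpha$, as stated in Note 2.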
Degrees of freedom
The number of independent variates used to compute the test statistic is known as the number of degrees of
freedom. In general, the number of degrees of freedom is given by v = n − k , where n is the number of
observations in the sample and k is the number of constraints imposed on them.
To Test whether sample mean differs from the hypothetical population mean
Working Rule
• Set up null hypothesis $H_0: \mu = \bar{x}$ (the sample is drawn from the given population)
• Compute $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ (or) $z = \dfrac{\bar{x} - \mu}{S/\sqrt{n}}$.
• Choose appropriate level of significance and find table value of z (critical value)
Solved Problems
1. For the following case, specify which probability distribution to use in a hypothesis test.
(a). H 0 : = 98, H1 : 98, x = 65, s = 12, n = 42
(a) Test of significance for single mean, large sample, one tailed test
2. A standard sample of 200 tins of coconut oil gave an average weight of 4.95 kgs with a standard
deviation of 0.21 kg. Do we accept that the net weight is 5kgs per tin at 5% level of significance?
Null Hypothesis $H_0: \mu = 5$ (no significant difference between sample mean and population mean)
Test statistic $z = \dfrac{\bar{x} - \mu}{S/\sqrt{n}} = \dfrac{4.95 - 5}{0.21/\sqrt{200}} = -3.37$
Table value of z for 5% level of significance is 1.96
Since the calculated value of $|z|$ is greater than the tabulated value, we reject the null hypothesis.
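The working rule for a single mean can be checked numerically. The sketch below recomputes the coconut oil problem; the function name and the default critical value of 1.96 (two tailed, 5%) are illustrative choices.

```python
import math

def z_single_mean(sample_mean: float, mu0: float, sd: float, n: int) -> float:
    """Large-sample z statistic for testing a single mean against mu0."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))

z = z_single_mean(sample_mean=4.95, mu0=5.0, sd=0.21, n=200)
reject_h0 = abs(z) > 1.96  # two tailed test at the 5% level
print(round(z, 2), reject_h0)
```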
3. A random sample of 100 recorded deaths in India during the past year showed an average life span
of 71.8 years. Assuming a population standard deviation of 8.9 years, does this seem to indicate that
the mean life span today is greater than 70 years? Use a 0.05 level of significance.
3. $\alpha = 5\%$
Working Rule
When $n_1, \bar{x}_1, s_1$ are the sample size, mean and SD of the first sample, $n_2, \bar{x}_2, s_2$ are the sample size, mean
and SD of the second sample, and the population SD $\sigma$ is given:
• Set up null hypothesis $H_0: \mu_1 = \mu_2$ (samples are drawn from the populations with same mean)
• Set up alternative hypothesis $H_1$. This will determine whether we have to use right tailed or left tailed
or two tailed test.
• Compute the test statistic
$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$ (if samples are drawn from same population) (or)
$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_2} + \dfrac{s_2^2}{n_1}}}$ (if samples are drawn from two populations with same SD)
• Choose appropriate level of significance and find table value of z (critical value)
Solved Problems
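1. Write down the formula of the test statistic to test the significance of the difference between the means (large
samples).
Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$ or $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$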
Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{2500 - 2200}{\sqrt{\dfrac{400^2}{400} + \dfrac{550^2}{400}}} = 8.82$
Since calculated $z$ is greater than the tabulated value $z_\alpha$, we reject the null hypothesis $H_0$.
3. A Mathematics test was given to 50 girls and 75 boys. The girls made an average grade of 76 with an
SD of 6 and the boys made an average grade of 82 with an SD of 2. Test whether there is any difference
between the performance of boys and girls.
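Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{76 - 82}{\sqrt{\dfrac{6^2}{50} + \dfrac{2^2}{75}}} = \dfrac{-6}{0.879} = -6.82$
Since calculated $|z|$ is greater than the tabulated value $z_\alpha = 1.96$, we reject the null hypothesis $H_0$,
i.e. there is a significant difference between the performance of boys and girls.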
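The large-sample statistic for the difference of two means can be recomputed directly from the figures in a problem statement. The sketch below does this for the Mathematics test problem (girls 76, boys 82, as stated) and for the bulbs of companies P and Q; the function name is illustrative.

```python
import math

def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample z statistic using the two sample SDs: sqrt(s1^2/n1 + s2^2/n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Mathematics test: 50 girls (mean 76, SD 6) and 75 boys (mean 82, SD 2).
z_grades = z_two_means(76, 6, 50, 82, 2, 75)

# Bulbs: company P (mean 1300, SD 82, n 100), company Q (mean 1248, SD 93, n 100).
z_bulbs = z_two_means(1300, 82, 100, 1248, 93, 100)

print(round(z_grades, 2), round(z_bulbs, 2))
```

Both values exceed 1.96 in absolute size, so in each case the difference between the means is significant at the 5% level.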
4. In a random sample of size 500, the mean is found to be 20. In another independent sample of size
400, mean is 15. Could the samples have been drawn from the same population with S.D. 4. Use 1%
level of significance.
Null Hypothesis $H_0: \mu_1 = \mu_2$ (i.e. the samples have been taken from the same population)
Since calculated $|z|$ is greater than the tabulated value $z_\alpha$, we conclude that the difference between
the sample means is significant. Hence we reject the null hypothesis $H_0$, i.e. the samples could not have been drawn from the same
population.
5. The mean height of two samples of 1000 and 2000 members are respectively 67.5 and 68 inches. Can
they be regarded as drawn from the same population with standard deviation 2.5inches at 5% level
of significance?
Null Hypothesis $H_0: \mu_1 = \mu_2$ (i.e. the samples have been taken from the same population)
Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \dfrac{67.5 - 68}{2.5\sqrt{\dfrac{1}{1000} + \dfrac{1}{2000}}} = -5.16$
Since calculated $|z|$ is greater than the tabulated value $z_\alpha$, we reject the null hypothesis $H_0$.
i.e. the samples could not have been drawn from the same population.
6. A random sample of 100 bulbs from a company P shows a mean life 1300 hours and standard
deviation of 82 hrs. Another random sample of 100 bulbs from company Q showed a mean life of
1248 hours and standard deviation of 93 hours. Are the bulbs of company P superior to bulbs of
company Q at 5% level of significance.
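Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{1300 - 1248}{\sqrt{\dfrac{82^2}{100} + \dfrac{93^2}{100}}} = \dfrac{52}{12.39} = 4.19$
Since calculated $z$ is greater than the tabulated value $z_\alpha$, we reject the null hypothesis $H_0$.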
Since calculated z is less than the tabulated value z , we accept the null hypothesis H 0 .
i.e. there is no difference in the sample means (both come from the same population).
8. The mean height of 50 male students who showed above average participation in college athletics
was 68.2 inches with a standard deviation of 2.5 inches; while 50 male students who showed no
interest in such participation had a mean height of 67.5 inches with a standard deviation of 2.8 inches.
a. Test the hypothesis that male students who participate in college athletics are taller than other
male students.
b. By how much should the sample size of each of the two groups be increased in order that the
observed difference of 0.7 inches in the mean height be significant at the 5% level of significance?
Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{68.2 - 67.5}{\sqrt{\dfrac{2.5^2}{50} + \dfrac{2.8^2}{50}}} = 1.32$
Since calculated $z$ is less than the tabulated value $z_\alpha$, we accept the null hypothesis $H_0$.
i.e. the height of the male students who participate in college athletics and other male students are same.
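To find the sample size $n$ of each group for which the observed difference between the two means is significant
(right tailed test, $z_{0.05} = 1.645$):
$\dfrac{68.2 - 67.5}{\sqrt{\dfrac{2.5^2}{n} + \dfrac{2.8^2}{n}}} > 1.645$
$\sqrt{n} > \dfrac{1.645 \times 3.7536}{0.7}$
$n > \left(\dfrac{1.645 \times 3.7536}{0.7}\right)^2$
$n \geq 78$
i.e. the size of each sample should be increased by at least 78 − 50 = 28.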
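The inequality above can be solved for $n$ in code. The sketch below assumes equal group sizes and a right tailed critical value of 1.645, as in the worked solution; the function name is illustrative.

```python
import math

def min_equal_sample_size(diff: float, sd1: float, sd2: float, z_crit: float) -> int:
    """Smallest n per group with diff / sqrt((sd1^2 + sd2^2) / n) > z_crit."""
    bound = (z_crit ** 2) * (sd1 ** 2 + sd2 ** 2) / diff ** 2
    return math.floor(bound) + 1  # strict inequality: first integer above the bound

n_required = min_equal_sample_size(diff=0.7, sd1=2.5, sd2=2.8, z_crit=1.645)
increase = n_required - 50
print(n_required, increase)
```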
9. Test the significance of the difference between the means of the samples, drawn from two normal
populations with same SD using the following data:
            Size   Mean   SD
Sample 1    100    61     4
Sample 2    200    63     6
Test statistic $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_2} + \dfrac{s_2^2}{n_1}}} = \dfrac{61 - 63}{\sqrt{\dfrac{4^2}{200} + \dfrac{6^2}{100}}} = -3.02$
(Note the formula: samples are drawn from two populations with same SD)
Taking level of significance as 5%, the table value is $z_{0.05} = 1.96$
Since calculated $|z|$ is greater than the tabulated value $z_\alpha$, we reject the null hypothesis $H_0$.
i.e. the populations, from which samples are drawn may not have the same mean.
Test of significance of the difference between sample proportion and population proportion
Working Rule
• Set up alternative hypothesis H1 . This will determine whether we have to use right tailed or left tailed
or two tailed test.
Solved Problems
1. A coin is tossed 144 times and head appeared 80 times. Can we say that the coin is unbiased?
Probability of getting a head in a toss $P = \dfrac{1}{2}$, hence $Q = 1 - P = \dfrac{1}{2}$. Given $n = 144$
Given sample proportion $p = \dfrac{80}{144}$
Null hypothesis $H_0: P = \dfrac{1}{2}$ (the coin is unbiased)
Alternative hypothesis $H_1: P \neq \dfrac{1}{2}$ (the coin is biased)
Test statistic: $z = \dfrac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \dfrac{\dfrac{80}{144} - \dfrac{1}{2}}{\sqrt{\dfrac{\frac{1}{2} \cdot \frac{1}{2}}{144}}} = 1.333$
We choose 5% level of significance and hence the table value $z_{0.05} = 1.96$
Since calculated value of $|z| < z_\alpha$, we accept $H_0$,
i.e. the coin is unbiased.
2. A coin is tossed 800 times and head appeared 350 times. Can we say that he has made a random tossing
each time? (equivalently, can we say that the coin is unbiased?)
Probability of getting a head in a toss $P = \dfrac{1}{2}$, hence $Q = 1 - P = \dfrac{1}{2}$. Given $n = 800$
Given sample proportion $p = \dfrac{350}{800}$
Null hypothesis $H_0: P = \dfrac{1}{2}$ (random tossing is made)
Alternative hypothesis $H_1: P \neq \dfrac{1}{2}$ (coin is not randomly tossed)
Test statistic: $z = \dfrac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \dfrac{\dfrac{350}{800} - \dfrac{1}{2}}{\sqrt{\dfrac{\frac{1}{2} \cdot \frac{1}{2}}{800}}} = -3.536$
We choose 5% level of significance and hence the table value $z_{0.05} = 1.96$
Since calculated value of $|z| > z_\alpha$, we reject $H_0$,
i.e. the coin is not tossed randomly (equivalently the coin is not unbiased).
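Both coin problems use the same statistic, $z = (p - P)/\sqrt{PQ/n}$. A short sketch that evaluates it for the two sets of tosses is given below; the function name is illustrative.

```python
import math

def z_one_proportion(successes: int, n: int, p0: float) -> float:
    """z statistic for a sample proportion against the hypothesised proportion p0."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

z_144 = z_one_proportion(80, 144, 0.5)   # 80 heads in 144 tosses
z_800 = z_one_proportion(350, 800, 0.5)  # 350 heads in 800 tosses
print(round(z_144, 3), round(z_800, 3))
```

Only the second value exceeds 1.96 in absolute size, which matches the two conclusions above.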
Null hypothesis $H_0: P = \dfrac{1}{2}$ (smokers and non smokers are equal in the city)
Alternative hypothesis $H_1: P > \dfrac{1}{2}$ (right tailed test)
Test statistic $z = 2.043$
We choose 5% level of significance and hence the table value $z_{0.05} = 1.645$
Since calculated value of $z > z_\alpha$, we reject $H_0$,
i.e. majority of men in the city are smokers.
Working Rule
When $p_1, p_2$ are two sample proportions drawn from the same population or from two populations with the
same proportion $P$:
• Set up null hypothesis $H_0: P_1 = P_2$ (Population proportions are equal)
• Set up alternative hypothesis $H_1$. This will determine whether we have to use right tailed or left tailed
or two tailed test.
• Compute test statistic $z = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ where $Q = 1 - P$ and $n_1, n_2$ = sample sizes.
If $P$ is not known, then $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
Note: Suppose we want to test $H_0: P_1 - P_2 = d_0$ against $H_1: P_1 - P_2 \neq d_0$. Now $z = \dfrac{(p_1 - p_2) - d_0}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$.
1. In a large city A 20% of a random sample of 900 school boys had a slight physical defect. In another
city B 18.5% of a random sample of 1600 school boys had the same defect. Is the difference between
the proportions significant?
Null Hypothesis: $H_0: P_1 = P_2$, the difference between the two proportions is not significant
Alternative Hypothesis: $H_1: P_1 \neq P_2$
Given $p_1 = \dfrac{20}{100} = 0.2$ and $p_2 = \dfrac{18.5}{100} = 0.185$. Also $n_1 = 900, n_2 = 1600$
Hence $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{900(0.2) + 1600(0.185)}{900 + 1600} = 0.19$ and $Q = 1 - P = 0.81$
Therefore test statistic $z = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.20 - 0.185}{\sqrt{(0.19)(0.81)\left(\dfrac{1}{900} + \dfrac{1}{1600}\right)}} = 0.918$
Since the calculated value of $|z|$ is less than the tabulated value, we accept the null hypothesis. Therefore the difference between
the proportions is not significant.
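The pooled two-proportion statistic, including the optional hypothesised difference $d_0$ from the note in the working rule, can be sketched as follows. The function name and argument names are illustrative; the pooled $P$ is carried without rounding, so the result differs from 0.918 in the third decimal place.

```python
import math

def z_two_proportions(p1, n1, p2, n2, d0=0.0):
    """z statistic for H0: P1 - P2 = d0, with P pooled from both samples."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return ((p1 - p2) - d0) / standard_error

# City A: 20% of 900 boys; city B: 18.5% of 1600 boys.
z_cities = z_two_proportions(0.20, 900, 0.185, 1600)
print(round(z_cities, 3))
```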
2. 400 men and 600 women were asked whether they would like to have a flyover near their residence.
200 men and 325 women were in favour of the proposal. Test whether these two proportions are
same.
Null Hypothesis: $H_0: P_1 = P_2$, the difference between the attitude of men and women as far as the
proposal is concerned is not significant
Given $p_1 = \dfrac{200}{400} = 0.5$ and $p_2 = \dfrac{325}{600} = 0.542$. Also $n_1 = 400, n_2 = 600$
Hence $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{400(0.5) + 600(0.542)}{400 + 600} = 0.525$ and $Q = 1 - P = 0.475$
Therefore test statistic $z = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.5 - 0.542}{\sqrt{(0.525)(0.475)\left(\dfrac{1}{400} + \dfrac{1}{600}\right)}} = \dfrac{-0.042}{0.032234} = -1.30$
Since the calculated value of $|z|$ is less than the tabulated value, we accept the null hypothesis. Therefore the difference
between the attitude of men and women as far as the proposal is concerned is not significant.
3. A cigarette manufacturing firm claims that its brand A outsells its brand B by 8%. It is found that 42
out of a sample of 200 smokers prefer brand A and 18 out of another sample of 100 smokers prefer
brand B. Test whether the 8% difference is a valid claim.
Proportion of preference of brand A $p_1 = \dfrac{42}{200} = 0.21$
Proportion of preference of brand B $p_2 = \dfrac{18}{100} = 0.18$. Also $n_1 = 200, n_2 = 100$
Hence $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{200(0.21) + 100(0.18)}{200 + 100} = 0.2$ and $Q = 1 - P = 0.8$
Since the calculated value of $|z|$ is less than the tabulated value, we accept the null hypothesis. Therefore the difference of
8% in the sale of brand A and brand B is a valid claim.
The t-distribution is used when the sample is small ($n < 30$) and the population SD is unknown.
To Test whether sample mean differs from the hypothetical population mean
Working Rule
To test if the sample mean differs significantly from the population mean
To test the significance between two sample means.
2. For the following case, specify which probability distribution to use in a hypothesis test.
(a). $H_0: \mu = 27$, $H_1: \mu \neq 27$, $\bar{x} = 20.1$, $\sigma = 5$, $n = 12$
(a) Test of significance for single mean, small sample, two tailed test.
3. A company claims that a vacuum cleaner uses an average of 46 kilowatt hours per year. If a random
sample of 12 homes included in a planned study indicates that vacuum cleaners use an average 42
kilowatt hours per year with a standard deviation of 11.9 kilowatt hours, does this suggest at the 0.05
level of significance that vacuum cleaners use, on average, less than 46 kilowatt hours annually?
Assume the population of kilowatt hours to be normal.
1. $H_0: \mu = 46$
2. $H_1: \mu < 46$
3. $\alpha = 5\%$, d.f. $= n - 1 = 12 - 1 = 11$
4. The test statistic $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n - 1}} = \dfrac{42 - 46}{11.9/\sqrt{12 - 1}} = -1.1$
5. For 5% level of significance, the tabulated value at 11 degrees freedom is t = 2.2
4. Machinist is making engine parts with axle diameters of 0.7 inch. A random sample of 10 parts shows
a mean diameter of 0.742 inch with a standard deviation of 0.04 inch. Compute the statistic to test
the work is meeting the specification.
Since the calculated value of $|t|$ is less than the tabulated value, we accept the null hypothesis.
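Null Hypothesis $H_0: \mu = 70$ (sample mean is not different from the population mean)
Test statistic $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n - 1}} = \dfrac{83 - 70}{12.5/\sqrt{21}} = \dfrac{13}{2.727} = 4.77$
Since the calculated value of $|t|$ is greater than the tabulated value for 21 degrees of freedom (2.08 at the 5% level), we reject the null hypothesis.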
6. A certain pesticide is packed into bags by a machine. A random sample of 10 bags is chosen and the
contents of the bags is found to have the following weights (in kgs) 50, 49, 52, 44, 45, 48, 46, 45, 49
and 45. Test if the average quantity packed be taken as 50 kg.
$x$                 : 50    49    52     44     45    48    46    45    49    45      Total 473
$(x - \bar{x})^2$   : 7.29  2.89  22.09  10.89  5.29  0.49  1.69  5.29  2.89  5.29    Total 64.1
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{473}{10} = 47.3$ and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{64.1}{9}} = 2.67$
Null Hypothesis $H_0: \mu = 50$ (sample mean weight is not different from the expected weight)
Test statistic $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} = \dfrac{47.3 - 50}{2.67/\sqrt{10}} = \dfrac{-2.7}{0.844} = -3.2$
Since the calculated value of $|t|$ is greater than the tabulated value, we reject the null hypothesis.
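The pesticide problem can be recomputed from the raw weights. The sketch below uses `statistics.stdev`, which divides by $n - 1$ as in the solution above; the function name is illustrative.

```python
import math
from statistics import mean, stdev

def t_single_mean(sample, mu0):
    """Small-sample t statistic; s uses the n - 1 divisor, with n - 1 degrees of freedom."""
    n = len(sample)
    t_value = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    return t_value, n - 1

weights = [50, 49, 52, 44, 45, 48, 46, 45, 49, 45]
t_stat, dof = t_single_mean(weights, mu0=50)
print(round(t_stat, 2), dof)
```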
Working Rule
When $n_1, \bar{x}, s_1$ are the sample size, mean and SD of the first sample and $n_2, \bar{y}, s_2$ are the sample size, mean
and SD of the second sample:
• Set up null hypothesis $H_0: \mu_1 = \mu_2$ (samples are drawn from the populations with same mean)
• Set up alternative hypothesis H1 . This will determine whether we have to use right tailed or left tailed
value)
Note 1: If we were asked to test whether both the samples come from same normal population, we have to apply
both t and F tests.
Note 2: Instead of sample values $x_i, y_i$, sometimes the difference between them, say $X = x_i - y_i$, will be given.
In that case the test statistic is $t = \dfrac{\bar{X}}{S/\sqrt{n}}$ and we proceed as in the test of hypothesis of a single mean (small sample
problem).
Note 3: Instead of two different samples, pairs of values which are correlated will be given. Then the test
statistic is $t = \dfrac{\bar{d}}{S/\sqrt{n}}$ where $d = x_i - y_i$
Solved Problems
1. Two independent samples of sizes 8 and 7 contained the following values:
Sample I : 19 17 15 21 16 18 16 14
Sample II: 15 14 15 19 15 18 16
Is the difference between the sample means significant? Use 5% level of significance.
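$x_1$ : 19 17 15 21 16 18 16 14     $\sum x_1 = 136$
$x_2$ : 15 14 15 19 15 18 16        $\sum x_2 = 112$
Null hypothesis $H_0: \mu_1 = \mu_2$ (No significant difference between means of sample I and II)
Test statistic $t = \dfrac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \dfrac{17 - 16}{2.07\sqrt{\dfrac{1}{8} + \dfrac{1}{7}}} = \dfrac{1}{1.075} = 0.93$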
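The pooled SD $S = 2.07$ used above comes from combining the squared deviations of both samples over $n_1 + n_2 - 2$ degrees of freedom. A sketch of the whole computation from the raw values is given below; the function name is illustrative.

```python
import math
from statistics import mean

def t_two_means_pooled(sample1, sample2):
    """Two-sample t statistic with a pooled SD on n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean(sample1), mean(sample2)
    sum_sq = sum((x - m1) ** 2 for x in sample1) + sum((x - m2) ** 2 for x in sample2)
    pooled_sd = math.sqrt(sum_sq / (n1 + n2 - 2))
    t_value = (m1 - m2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t_value, pooled_sd, n1 + n2 - 2

sample_1 = [19, 17, 15, 21, 16, 18, 16, 14]
sample_2 = [15, 14, 15, 19, 15, 18, 16]
t_pooled, s_pooled, dof_pooled = t_two_means_pooled(sample_1, sample_2)
print(round(t_pooled, 2), round(s_pooled, 2), dof_pooled)
```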
Sample Size Sample Mean Sum of squares of deviation from the mean
1 10 15 90
2 12 14 108
Test whether the samples come from the same normal population at 5% level of significance
(given F0.05 ( 9,11) = 2.9, F0.05 (11,9 ) = 3.1, t0.05 ( 20 ) = 2.086, t0.05 ( 22 ) = 2.07 approximately)
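Sample variances are $s_1^2 = \dfrac{1}{n_1 - 1}\sum (x_1 - \bar{x}_1)^2 = \dfrac{1}{9}(90) = 10$ and
$s_2^2 = \dfrac{1}{n_2 - 1}\sum (x_2 - \bar{x}_2)^2 = \dfrac{1}{11}(108) = 9.82$
Pooled variance $S^2 = \dfrac{\sum (x_1 - \bar{x}_1)^2 + \sum (x_2 - \bar{x}_2)^2}{n_1 + n_2 - 2} = \dfrac{90 + 108}{10 + 12 - 2} = 9.9$, so $S = 3.146$
Null hypothesis $H_0: \mu_1 = \mu_2$ (No significant difference between means of sample 1 and 2)
Test statistic $t = \dfrac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \dfrac{15 - 14}{3.146\sqrt{\dfrac{1}{10} + \dfrac{1}{12}}} = \dfrac{1}{1.347} = 0.742$
Since the calculated $t$ is less than $t_{0.05}(20) = 2.086$, the two sample means do not differ significantly.
To test the variance
Null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ (No significant difference between variances of sample 1 and 2)
Test statistic $F = \dfrac{s_1^2}{s_2^2} = \dfrac{10}{9.82} = 1.02$
Since the calculated $F$ is less than $F_{0.05}(9, 11) = 2.9$, the variances do not differ significantly, and both the samples come from the same population.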
Null Hypothesis $H_0: \mu_1 = \mu_2$ (no significant difference in the BP before and after the medicine)
$X$                 : 8      8      7      5     4     1     0     0     −1     −1       Total 31
$(X - \bar{X})^2$   : 24.01  24.01  15.21  3.61  0.81  4.41  9.61  9.61  16.81  16.81    Total 124.9
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{31}{10} = 3.1$ and $S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\dfrac{124.9}{9}} = 3.725$
Test statistic $t = \dfrac{\bar{X}}{S/\sqrt{n}} = \dfrac{3.1}{3.725/\sqrt{10}} = \dfrac{3.1}{1.178} = 2.63$
Since the calculated value of $t$ is greater than the tabulated value, we reject the null hypothesis.
(i.e. the medicine was responsible for the increase in B.P.)
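For paired data the test reduces to a single-mean t test on the differences. The sketch below recomputes the statistic from the ten differences listed in the solution; the function name is illustrative.

```python
import math
from statistics import mean, stdev

def t_paired(differences):
    """Paired t statistic: mean difference over S / sqrt(n), with n - 1 degrees of freedom."""
    n = len(differences)
    return mean(differences) / (stdev(differences) / math.sqrt(n)), n - 1

bp_changes = [8, 8, 7, 5, 4, 1, 0, 0, -1, -1]
t_bp, dof_bp = t_paired(bp_changes)
print(round(t_bp, 2), dof_bp)
```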
4. Memory capacity of 9 students was tested before and after a meditation treatment for a month. State
whether the treatment was effective or not from the following data:
Before treatment : 10 15 9 3 7 12 16 17 4
After treatment : 12 17 8 5 6 11 18 20 3
We are given the paired values i.e. same set of students and the data are concerned.
Alternative Hypothesis H1 = 1 2
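Taking $d$ = (after) − (before), $\sum d = 7$ and $\sum d^2 = 29$, so
$\bar{d} = \dfrac{\sum d}{n} = \dfrac{7}{9} = 0.7778$ and $S = \sqrt{\dfrac{\sum (d - \bar{d})^2}{n - 1}} = \sqrt{\dfrac{\sum d^2 - n\bar{d}^2}{n - 1}} = \sqrt{\dfrac{29 - 5.44}{8}} = 1.716$
Test statistic $t = \dfrac{\bar{d}}{S/\sqrt{n}} = \dfrac{0.778}{1.716/\sqrt{9}} = 1.36$
Since the calculated value of $t$ is less than the tabulated value, we accept the null hypothesis, i.e. the treatment did not
improve the memory capacity.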
This is used to test the significance of sample estimates of population variance. Under the null hypothesis
that the population variances are equal, the test statistic is given by
$F = \dfrac{S_1^2}{S_2^2}$, assuming $S_1^2 > S_2^2$, where $S_1^2 = \dfrac{1}{n_1 - 1}\sum (x - \bar{x})^2$, $S_2^2 = \dfrac{1}{n_2 - 1}\sum (y - \bar{y})^2$
are unbiased estimates of the common population variance $\sigma^2$ obtained from two independent samples.
The test statistic follows F-distribution with degrees freedom ( n1 − 1, n2 − 1) . By comparing the calculated
value, with the tabulated value for the above degrees of freedom at specific level of significance, the null
hypothesis is either accepted or rejected.
Many of the distributions of sample statistic tend to normality for large samples and as such they can best
be studied with the help of the normal curves.
Theory of normal curves can be applied to the graduation of the curves which are not normal
• Compute test statistic $F = \dfrac{S_1^2}{S_2^2}$, assuming $S_1^2 > S_2^2$.
• Choose appropriate level of significance and degrees of freedom ( n1 − 1, n2 − 1) and find table value of
F (critical value)
• Compare calculated value of | F | with the tabulated value.
Solved Problems
x1 : 24 27 26 21 25
x2 : 27 30 32 36 28 23
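$x_1$     : 24   27   26   21   25            $\sum x_1 = 123$
$x_1^2$   : 576  729  676  441  625           $\sum x_1^2 = 3047$
$x_2$     : 27   30   32   36   28   23       $\sum x_2 = 176$
$x_2^2$   : 729  900  1024 1296 784  529      $\sum x_2^2 = 5262$
Mean of sample 1: $\bar{x}_1 = \dfrac{\sum x_1}{n_1} = \dfrac{123}{5} = 24.6$   Mean of sample 2: $\bar{x}_2 = \dfrac{\sum x_2}{n_2} = \dfrac{176}{6} = 29.33$
Variance of sample 1: $s_1^2 = \dfrac{\sum x_1^2}{n_1} - (\bar{x}_1)^2 = \dfrac{3047}{5} - 24.6^2 = 4.24$
Variance of sample 2: $s_2^2 = \dfrac{\sum x_2^2}{n_2} - (\bar{x}_2)^2 = \dfrac{5262}{6} - \left(\dfrac{176}{6}\right)^2 = 877 - 860.44 = 16.56$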
To test the variance
Null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ (No significant difference between variances of sample 1 and 2)
Degrees of freedom ( n2 − 1, n1 − 1) = ( 5, 4 )
2 Pumpkins were grown under two experimental conditions. Two random samples of 11 and 9
pumpkins show the sample standard deviations of their weights as 0.8 and 0.5 respectively.
Assuming that the weight distributions are normal, test the hypothesis that the true variances are
equal, against the alternative hypothesis that they are not at the 10% level of significance.
Test statistic $F = \dfrac{S_1^2}{S_2^2} = \dfrac{0.704}{0.28} = 2.5$
Table value of F for degrees (10, 8 ) of freedom at 10% level of significance is 5.81
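The values 0.704 and 0.28 are consistent with converting each sample SD $s$ (divisor $n$) into the unbiased estimate $S^2 = n s^2/(n - 1)$. The sketch below assumes that convention, which the notes do not state explicitly for this problem; the function names are illustrative.

```python
def unbiased_variance(sample_sd: float, n: int) -> float:
    """Convert an SD computed with divisor n into the unbiased variance estimate."""
    return n * sample_sd ** 2 / (n - 1)

def f_ratio(var_a: float, dof_a: int, var_b: float, dof_b: int):
    """F statistic with the larger variance in the numerator, plus its degrees of freedom."""
    if var_a >= var_b:
        return var_a / var_b, (dof_a, dof_b)
    return var_b / var_a, (dof_b, dof_a)

s1_sq = unbiased_variance(0.8, 11)  # first sample of pumpkins
s2_sq = unbiased_variance(0.5, 9)   # second sample of pumpkins
f_pumpkins, dof_pumpkins = f_ratio(s1_sq, 10, s2_sq, 8)
print(round(s1_sq, 3), round(s2_sq, 3), round(f_pumpkins, 2), dof_pumpkins)
```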
$x_1$     : 6   6   8   1   12   4   3   9   6   10      $\sum x_1 = 65$
$x_1^2$   : 36  36  64  1   144  16  9   81  36  100     $\sum x_1^2 = 523$
$x_2$     : 2   3   6   8   10   1   2   8               $\sum x_2 = 40$
$x_2^2$   : 4   9   36  64  100  1   4   64              $\sum x_2^2 = 282$
Variance of Diet B: $s_2^2 = \dfrac{\sum x_2^2}{n_2} - (\bar{x}_2)^2 = \dfrac{282}{8} - 5^2 = 10.25$
To test the variance
Method 1 20 16 26 27 23 22
Method 2 27 33 42 35 34 38
Test whether there is any significant difference between the variances of the time distribution at
5% level of significance.
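$x_1$     : 20   16   26   27   23   22       $\sum x_1 = 134$
$x_1^2$   : 400  256  676  729  529  484      $\sum x_1^2 = 3074$
$x_2$     : 27   33   42   35   34   38       $\sum x_2 = 209$
$x_2^2$   : 729  1089 1764 1225 1156 1444     $\sum x_2^2 = 7407$
Mean of method 1: $\bar{x}_1 = \dfrac{\sum x_1}{n_1} = \dfrac{134}{6} = 22.33$   Mean of method 2: $\bar{x}_2 = \dfrac{\sum x_2}{n_2} = \dfrac{209}{6} = 34.83$
Variance of method 1: $s_1^2 = \dfrac{\sum x_1^2}{n_1} - (\bar{x}_1)^2 = \dfrac{3074}{6} - \left(\dfrac{134}{6}\right)^2 = 512.33 - 498.78 = 13.56$
Variance of method 2: $s_2^2 = \dfrac{\sum x_2^2}{n_2} - (\bar{x}_2)^2 = \dfrac{7407}{6} - \left(\dfrac{209}{6}\right)^2 = 1234.5 - 1213.36 = 21.14$
Unbiased estimates: $S_1^2 = \dfrac{n_1 s_1^2}{n_1 - 1} = 16.27$ and $S_2^2 = \dfrac{n_2 s_2^2}{n_2 - 1} = 25.37$
Test statistic $F = \dfrac{S_2^2}{S_1^2} = \dfrac{25.37}{16.27} = 1.56$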
Degrees of freedom ( n2 − 1, n1 − 1) = ( 5, 5 )
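Rounding the sample means to one decimal before squaring changes the variances noticeably, so it is worth recomputing them from the raw times. The sketch below uses `statistics.variance`, which applies the $n - 1$ divisor directly; the variable names are illustrative.

```python
from statistics import variance

method_1 = [20, 16, 26, 27, 23, 22]
method_2 = [27, 33, 42, 35, 34, 38]

# statistics.variance divides by n - 1, i.e. it returns the unbiased estimate S^2.
var_1 = variance(method_1)
var_2 = variance(method_2)

# Larger variance in the numerator, as required by the working rule.
f_methods = max(var_1, var_2) / min(var_1, var_2)
dof_methods = (len(method_2) - 1, len(method_1) - 1)
print(round(var_1, 2), round(var_2, 2), round(f_methods, 2), dof_methods)
```

Carrying the unrounded means gives variances of about 16.27 and 25.37; their ratio is 1.56 with (5, 5) degrees of freedom.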
DESIGN OF EXPERIMENTS
The design of experiments is a logical construction of the experiment in which the degree of uncertainty with
which the inferences are drawn may be well defined. Here we consider some aspects of experimental design and
analysis of data from such experiments using ANOVA techniques.
A statistical experiment is conducted to verify the truth of a hypothesis. Consider an agricultural experiment to test whether
a particular manure increases the yield of a grain. Here the quantity of manure used and the quantity of yield are the two
experimental variables. In addition, other variables such as the nature of the soil, proper watering and the quality of
seeds also affect the yield; these are called extraneous variables.
So the main aim of our design of experiment is to control the extraneous variables and hence to minimize the
experimental error so that the results of the experiments could be attributed only to the experimental variables.
The purpose of experimental design is to obtain maximum information with the minimum cost and labour.
With respect to an agricultural experiment, we mean the factors used in this design like treatments, experimental
unit, blocks and experimental error as follows:
The basic principles of experimental design are (i) randomization (ii) replication and (iii) local control and (iv)
ANOVA.
Replication means repetition. In our example, the manure is used in more than one plot so that the effect may be
identified precisely.
Local control controls the effect of extraneous variable by using the methods such as grouping, blocking and
balancing.
ANOVA is a test of the homogeneity of a set of data. It is defined as "the separation of the variance ascribable to
one group of causes from the variance ascribable to other groups".
It enables us to find the variability due to each factor and, by comparing these variations, the homogeneity of the
observations may be tested, i.e. whether all the observations are drawn from the same normal population.
Experimental Error
The unexplained random part of the variation in any experiment is termed as experimental error. An estimate of
experimental error can be obtained by replication.
The term CRD or one way classification refers to the fact that a single variable factor of interest is controlled
and its effect on the other elementary units is observed. Suppose we wish to compare h treatments (say
manure) and there are n plots available for the experiment. Let i th treatment be replicated ni times, so
that n1 + n2 + ... + nh = n .
In this design treatments are randomly arranged over the experimental units which are divided into groups
at random as follows.
The plots are numbered from 1 to n serially. n identical cards are taken, numbered from 1 to n and shuffled
thoroughly. The numbers on the first n1 cards drawn randomly give the number of plots to which the first treatment
is to be given. The numbers on the next n2 cards drawn at random give the numbers of the plots to which the second
treatment is to be given and so on.
This design is called a Completely Randomized Design. This design is used only when the number of treatments is
small and the experimental material is homogeneous.
: :
: :
k     $x_{k1}\ \ x_{k2}\ \ \ldots\ \ x_{ki}\ \ \ldots\ \ x_{kn_k}$
Then $\sum_{i=1}^{k} n_i = N$
Here we wish to test the null hypothesis that there is no significant difference between the treatments
under consideration, i.e. $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$, and hence the alternative hypothesis is
$H_1$: not all of $\mu_1, \mu_2, \ldots, \mu_k$ are equal.
Computational formula for various sum of squares:
Total sum of squares $V = \sum_i \sum_j x_{ij}^2 - \dfrac{T^2}{N}$ where $T = \sum_i \sum_j x_{ij}$
Sum of squares between samples $V_1 = \sum_i \dfrac{T_i^2}{n_i} - \dfrac{T^2}{N}$
ANOVA table
Sources of     Sum of     Degrees of    Mean square                     Calculated
variance       squares    freedom                                       Variance F
Treatment      $V_1$      $k - 1$       $S_T^2 = \dfrac{V_1}{k - 1}$    $F = \dfrac{S_T^2}{S_E^2}$ (if $S_T^2 > S_E^2$)
Error          $V_2$      $N - k$       $S_E^2 = \dfrac{V_2}{N - k}$
Total          $V$        $N - 1$
Here the calculated ratio follows F distribution with degrees freedom ( k − 1, N − k ) . If the calculated
value of F is less than the tabulated value, then the null hypothesis is accepted. Otherwise it is rejected.
• CRD results in the maximum use of the experimental units since all the experimental materials can
be used.
• The design is very flexible and easy to layout
• Any number of replicates and treatments may be used
• It provides with the maximum number of degrees of freedom
• It is most useful for laboratory techniques and methodological studies
2. What are the basic elements of an ANOVA table for one way classification?
ANOVA table
Sources of     Sum of     Degrees of    Mean square                     Calculated
variance       squares    freedom                                       Variance F
Treatment      SST        $k - 1$       $S_T^2 = \dfrac{SST}{k - 1}$    $F = \dfrac{S_T^2}{S_E^2}$ (if $S_T^2 > S_E^2$)
Error          SSE        $N - k$       $S_E^2 = \dfrac{SSE}{N - k}$
Total                     $N - 1$
Solved Problems
3. The following table gives the yields of 15 samples of plot under three varieties of seed.
A 20 21 23 16 20
B 18 20 17 15 25
C 25 28 22 28 32
Test using analysis of variance whether there is a significant difference in the average yield of seeds.
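Varieties           Plots                       Squares
of seeds     1    2    3    4    5    Total     $x_1^2$  $x_2^2$  $x_3^2$  $x_4^2$  $x_5^2$
A            20   21   23   16   20   100       400      441      529      256      400
B            18   20   17   15   25   95        324      400      289      225      625
C            25   28   22   28   32   135       625      784      484      784      1024
Total                                 330       1349     1625     1302     1265     2049
Sum of squares between varieties of seeds $V_1 = \dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dfrac{T_3^2}{n_3} - \dfrac{T^2}{N}$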
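From the data, $T = 330$, $\dfrac{T^2}{N} = 7260$, $V = 7590 - 7260 = 330$, $V_1 = 7450 - 7260 = 190$ and $V_2 = V - V_1 = 140$.
Source of variation          Sum of squares    Degrees of freedom        Mean square              F                               Table value
Between varieties of seeds   190               $k - 1$ = 3 − 1 = 2       $\dfrac{190}{2} = 95$    $\dfrac{95}{11.67} = 8.14$      $F_{0.05}(2, 12) = 3.89$
Within varieties of seeds    140               $N - k$ = 15 − 3 = 12     $\dfrac{140}{12} = 11.67$
Total                        330               $N - 1$ = 14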
Step 7 : Conclusion : Here calculated value is greater than the tabulated value.
Therefore , Null hypothesis is rejected. i.e. There is significant difference between the varieties of seeds in respect
of growth.
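The one-way sums of squares can be recomputed directly from the yields printed in the problem statement using the computational formulas for $V$, $V_1$ and $V_2 = V - V_1$. The sketch below is one way to do that; the function name and the returned dictionary keys are illustrative.

```python
def one_way_anova(groups):
    """One-way ANOVA from raw observations using the correction factor T^2 / N."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / n_total
    ss_total = sum(x * x for g in groups for x in g) - correction    # V
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction  # V1
    ss_within = ss_total - ss_between                                # V2
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "V": ss_total, "V1": ss_between, "V2": ss_within,
        "F": ms_between / ms_within, "dof": (k - 1, n_total - k),
    }

yields = [
    [20, 21, 23, 16, 20],  # variety A
    [18, 20, 17, 15, 25],  # variety B
    [25, 28, 22, 28, 32],  # variety C
]
result = one_way_anova(yields)
print(result)
```

For these yields the ratio is about 8.14 on (2, 12) degrees of freedom, which leads to rejection of the null hypothesis at the 5% level.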
4. The accompanying data resulted from an experiment comparing the degree of soiling for fabric
copolymerized with the 3 different mixtures of methacrylic acid. Analyse the classification.
Degree of Soiling
Total x12 x22 x32 x42 x52
Mixture 1 2 3 4 5
M1 0.56 1.12 0.9 1.07 0.94 4.59 0.314 1.254 0.81 1.145 0.884
M2 0.72 0.69 0.87 0.78 0.91 3.97 0.518 0.476 0.757 0.608 0.828
M3 0.62 1.08 1.07 0.99 0.93 4.69 0.384 1.166 1.145 0.98 0.865
Total 13.25 1.216 2.896 2.712 2.733 2.577
H1 : There is significant difference between the degree of soiling with respect to the mixtures
Step 5. Sum of Squares between degree of soiling $V_1 = \dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dfrac{T_3^2}{n_3} - \dfrac{T^2}{N}$
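Source of variation          Sum of squares    Degrees of freedom       Mean square                       F                                       Table value
Between degree of soiling    0.061             $k - 1$ = 3 − 1 = 2      $\dfrac{0.061}{2} = 0.0305$       $\dfrac{0.0308}{0.0305} = 1.011$        $F_{0.05}(12, 2) = 19.41$
Within (error)               0.370             $N - k$ = 12             $\dfrac{0.370}{12} = 0.0308$
Total                        0.431             $N - 1$ = 14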
Therefore , Null hypothesis is accepted. i.e. There is no difference between the degree of soiling with
respect to the mixtures
Within each block, the k treatments are given to the k plots in a perfectly random manner, such that each
treatment occurs only once in any block. But the same k treatments are repeated from block to block. This
design is called Randomized Block Design.
Advantages of RBD
This is more accurate than completely randomized design
Any number of treatments and any number of replicates may be used
Statistical analysis is simple and fast.
Note: It is not suitable (i) for large number of treatments (ii) if blocks are not homogeneous
Here the data are classified on the basis of two criteria (blocks and treatments) as follows:
                  Treatments
Blocks      1      2      ...    i      ...    k
1           x11    x12    ...    x1i    ...    x1k
2           x21    x22    ...    x2i    ...    x2k
:           :      :             :             :
r           xr1    xr2    ...    xri    ...    xrk
Then rk = N.
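Here we wish to test the null hypotheses that there is no significant difference between the blocks and no
significant difference between the treatments under consideration, i.e.
$H_{01}: \mu_{1\cdot} = \mu_{2\cdot} = \dots = \mu_{r\cdot}$ (block means)
and $H_{02}: \mu_{\cdot 1} = \mu_{\cdot 2} = \dots = \mu_{\cdot k}$ (treatment means)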
Computational formula for various sum of squares:
ANOVA table
Sources of variance   Sum of squares   Degrees of freedom   Mean square (variance)           Calculated F
Blocks                V1               r − 1                S_R^2 = V1/(r − 1)               F1 = S_R^2/S_E^2
Treatments            V2               k − 1                S_C^2 = V2/(k − 1)               F2 = S_C^2/S_E^2
Error                 V3               (r − 1)(k − 1)       S_E^2 = V3/((r − 1)(k − 1))
Total                 V                N − 1
Here the calculated ratios F1 and F2 follow the F distribution with degrees of freedom ( r − 1,(r − 1)(k − 1) ) and
( k − 1,(r − 1)(k − 1) ) respectively.
If the calculated value of F is less than the tabulated value, then the null hypothesis is accepted. Otherwise
it is rejected.
Thus a two-way analysis of variance is used to measure how two independent variables (factors), in combination, affect a
dependent variable. For example, agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of different varieties of fertilizers used.
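The quantities in the ANOVA table above (V, V1, V2, V3, F1, F2) can be computed mechanically. The following is a minimal sketch, assuming one observation per cell and the usual shortcut formulas (sum of squared totals divided by the number of entries per total, minus the correction factor T^2/N). The function name `two_way_anova` and the small demo table are made up for illustration; F1 and F2 are formed as in the general table (block or treatment mean square over error mean square).

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell, using shortcut formulas.

    `table` is a list of r rows (blocks), each holding k values (treatments).
    """
    r, k = len(table), len(table[0])
    n = r * k
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / n                                     # T^2 / N

    total_ss = sum(x * x for row in table for x in row) - correction       # V
    block_ss = sum(sum(row) ** 2 for row in table) / k - correction        # V1
    treat_ss = sum(sum(col) ** 2 for col in zip(*table)) / r - correction  # V2
    error_ss = total_ss - block_ss - treat_ss                              # V3

    ms_error = error_ss / ((r - 1) * (k - 1))                              # S_E^2
    return {
        "V": total_ss, "V1": block_ss, "V2": treat_ss, "V3": error_ss,
        "F1": (block_ss / (r - 1)) / ms_error,
        "F2": (treat_ss / (k - 1)) / ms_error,
        "df": (r - 1, k - 1, (r - 1) * (k - 1)),
    }


# Made-up 3 x 3 table, used only to exercise the function.
demo = [[1, 2, 3], [2, 4, 6], [3, 6, 10]]
result = two_way_anova(demo)
print(result)
```

Each calculated F is then compared with the tabulated F for the degrees of freedom reported in `df`. Note that the sketch would divide by zero for a perfectly additive table, where the error sum of squares vanishes.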
Solved Problems
                  Treatment-I
Treatment-II      1     2     3
1                 30    26    38
2                 24    29    28
3                 33    24    35
4                 36    31    30
5                 27    35    33
Use the coding method subtracting 30 from the given number.
This is two way classification. Calculation table. Subtract 30 from all the values.
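                  Treatment-I           Row total
Treatment-II      1     2     3         T_R         x_1^2   x_2^2   x_3^2
1                 0     −4    8         4           0       16      64
2                 −6    −1    −2        −9          36      1       4
3                 3     −6    5         2           9       36      25
4                 6     1     0         7           36      1       0
5                 −3    5     3         5           9       25      9
Column total T_C  0     −5    14        9           90      79      102
$\sum\sum x_{ij}^2 = 271$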
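Step 5. Sum of squares between treatment-II: $V_1 = \frac{(\sum T_{R_1})^2}{n_1} + \frac{(\sum T_{R_2})^2}{n_2} + \frac{(\sum T_{R_3})^2}{n_3} + \frac{(\sum T_{R_4})^2}{n_4} + \frac{(\sum T_{R_5})^2}{n_5} - \frac{T^2}{N}$
$V_1 = \frac{4^2}{3} + \frac{(-9)^2}{3} + \frac{2^2}{3} + \frac{7^2}{3} + \frac{5^2}{3} - 5.4 = 52.9$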
Step 6. Sum of squares between treatment-I: $V_2 = \frac{(\sum T_{C_1})^2}{n_1} + \frac{(\sum T_{C_2})^2}{n_2} + \frac{(\sum T_{C_3})^2}{n_3} - \frac{T^2}{N}$
$V_2 = \frac{0^2}{5} + \frac{(-5)^2}{5} + \frac{(14)^2}{5} - 5.4 = 38.8$
Source of variation   Sum of squares   Degrees of freedom       Mean square
Error                 173.9            (m − 1)(n − 1) = 8       173.9/8 = 21.7
Total                 265.6            N − 1 = 14
Step 7: Considering the difference between the levels of treatment-II, we find that the calculated value of F = 1.64 <
tabulated value of F5% = 6.04, so we accept H01 (the levels of treatment-II do not differ significantly).
Considering the difference between the levels of treatment-I, we find that the calculated value of F = 1.11 < tabulated
value of F5% = 19.37, so we accept H02 (the levels of treatment-I do not differ significantly).
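The coding method used in this problem (subtracting 30 from every observation) leaves all sums of squares unchanged, because a constant shift cancels in every deviation. A small sketch that checks this numerically is given below; the helper name `sums_of_squares` is an assumption made for the illustration.

```python
def sums_of_squares(table):
    """Return (V1, V2, V3): row, column and error sums of squares."""
    rows, cols = len(table), len(table[0])
    n = rows * cols
    total = sum(sum(row) for row in table)
    cf = total ** 2 / n                                       # correction factor T^2 / N
    v = sum(x * x for row in table for x in row) - cf         # total sum of squares
    v1 = sum(sum(row) ** 2 for row in table) / cols - cf      # between rows
    v2 = sum(sum(col) ** 2 for col in zip(*table)) / rows - cf  # between columns
    return v1, v2, v - v1 - v2


# Rows = treatment-II levels 1..5, columns = treatment-I levels 1..3.
raw = [[30, 26, 38], [24, 29, 28], [33, 24, 35], [36, 31, 30], [27, 35, 33]]
coded = [[x - 30 for x in row] for row in raw]   # coding: subtract 30

raw_ss = sums_of_squares(raw)
coded_ss = sums_of_squares(coded)
print([round(s, 1) for s in raw_ss])
print([round(s, 1) for s in coded_ss])
```

Both lines should agree with each other and with the values 52.9, 38.8 and 173.9 obtained in the steps above; coding only makes the hand arithmetic smaller.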
                  Chemists                    Row total
Coal              A     B     C     D         T_R         x_1^2   x_2^2   x_3^2   x_4^2
I                 8     5     5     7         25          64      25      25      49
II                7     6     4     4         21          49      36      16      16
III               3     6     5     4         18          9       36      25      16
Column total T_C  18    17    14    15        64          122     97      66      81
$\sum\sum x_{ij}^2 = 366$
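Step 5. Sum of squares between coals: $V_1 = \frac{(\sum T_{R_1})^2}{n_1} + \frac{(\sum T_{R_2})^2}{n_2} + \frac{(\sum T_{R_3})^2}{n_3} - \frac{T^2}{N}$
Step 6. Sum of squares between chemists: $V_2 = \frac{(\sum T_{C_1})^2}{n_1} + \frac{(\sum T_{C_2})^2}{n_2} + \frac{(\sum T_{C_3})^2}{n_3} + \frac{(\sum T_{C_4})^2}{n_4} - \frac{T^2}{N}$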
Source of variation   Sum of squares   Degrees of freedom      Mean square       Calculated F        Tabulated F
Between coals         6.2              m − 1 = 3 − 1 = 2       6.2/2 = 3.1       3.1/2.5 = 1.24      F0.05(2, 6) = 5.14
Between chemists      3.36             n − 1 = 4 − 1 = 3       3.36/3 = 1.12     2.5/1.12 = 2.2      F0.05(6, 3) = 8.94
Error                 15.04            (m − 1)(n − 1) = 6      15.04/6 = 2.5
Total                 24.6             N − 1 = 11
Step 7: Considering the difference between coals, we find that the calculated value of F = 1.24 < tabulated
value of F5% = 5.14, so we accept H01 (the coals do not differ significantly).
Considering the difference between chemists, we find that the calculated value of F = 2.2 < tabulated value of
F5% = 8.94, so we accept H02 (the chemists do not differ significantly).
Test at 5% level of significance whether the differences among the means obtained for the different
routes are significant and also whether the differences among the means obtained for the different
days of the week are significant.
This is two way classification. Let us arrange the data by subtracting 26 from each value.
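                  Days                              Row total
Routes            1     2     3     4     5         T_R         x_1^2   x_2^2   x_3^2   x_4^2   x_5^2
1                 −4    0     −1    −1    5         −1          16      0       1       1       25
2                 −1    1     2     0     3         5           1       1       4       0       9
3                 0     3     7     4     7         21          0       9       49      16      49
4                 0     2     1     4     4         11          0       4       1       16      16
Column total T_C  −5    6     9     7     19        36          17      14      55      33      99
$\sum\sum x_{ij}^2 = 218$
Correction factor $= \frac{T^2}{N} = \frac{(36)^2}{20} = 64.8$, so the total sum of squares is $V = 218 - 64.8 = 153.2$
Step 5. Sum of squares between routes: $V_1 = \frac{(\sum T_{R_1})^2}{n_1} + \frac{(\sum T_{R_2})^2}{n_2} + \frac{(\sum T_{R_3})^2}{n_3} + \frac{(\sum T_{R_4})^2}{n_4} - \frac{T^2}{N}$
$V_1 = \frac{(-1)^2}{5} + \frac{5^2}{5} + \frac{21^2}{5} + \frac{11^2}{5} - 64.8 = 52.8$
Step 6. Sum of squares between days: $V_2 = \frac{(\sum T_{C_1})^2}{n_1} + \frac{(\sum T_{C_2})^2}{n_2} + \frac{(\sum T_{C_3})^2}{n_3} + \frac{(\sum T_{C_4})^2}{n_4} + \frac{(\sum T_{C_5})^2}{n_5} - \frac{T^2}{N}$
$V_2 = \frac{(-5)^2}{4} + \frac{6^2}{4} + \frac{9^2}{4} + \frac{7^2}{4} + \frac{19^2}{4} - 64.8 = 73.2$
Source of variation   Sum of squares   Degrees of freedom       Mean square        Calculated F          Tabulated F
Between routes        52.8             m − 1 = 4 − 1 = 3        52.8/3 = 17.6      17.6/2.27 = 7.75      F0.05(3, 12) = 3.49
Between days          73.2             n − 1 = 5 − 1 = 4        73.2/4 = 18.3      18.3/2.27 = 8.06      F0.05(4, 12) = 3.26
Error                 27.2             (m − 1)(n − 1) = 12      27.2/12 = 2.27
Total                 153.2            N − 1 = 19
Step 7: Considering the difference between routes, we find that the calculated value of F = 7.75 >
tabulated value of F5% = 3.49, so we reject H01 (the routes differ significantly).
Considering the difference between days, we find that the calculated value of F = 8.06 > tabulated value of
F5% = 3.26, so we reject H02 (the days differ significantly).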
4. The following table gives the number of refrigerators sold by 4 salesmen in 3 months: May, June and July.
           Salesman
Month      1     2     3     4
May        50    40    48    39
June       46    48    50    45
July       39    44    40    39
                  Salesman                    Row total
Month             1     2     3     4         T_R         x_1^2   x_2^2   x_3^2   x_4^2
May               10    0     8     −1        17          100     0       64      1
June              6     8     10    5         29          36      64      100     25
July              −1    4     0     −1        2           1       16      0       1
Column total T_C  15    12    18    3         48          137     80      164     27
$\sum\sum x_{ij}^2 = 408$
Step 2. Total T = 48
Step 3. Correction factor $= \frac{T^2}{N} = \frac{(48)^2}{12} = 192$
Step 4. Total sum of squares $V = \sum\sum x_{ij}^2 - \frac{T^2}{N} = 408 - 192 = 216$
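Step 5. Sum of squares between months: $V_1 = \frac{(\sum T_{R_1})^2}{n_1} + \frac{(\sum T_{R_2})^2}{n_2} + \frac{(\sum T_{R_3})^2}{n_3} - \frac{T^2}{N}$
$V_1 = \frac{17^2}{4} + \frac{29^2}{4} + \frac{2^2}{4} - 192 = 283.5 - 192 = 91.5$
Step 6. Sum of squares between salesmen: $V_2 = \frac{(\sum T_{C_1})^2}{n_1} + \frac{(\sum T_{C_2})^2}{n_2} + \frac{(\sum T_{C_3})^2}{n_3} + \frac{(\sum T_{C_4})^2}{n_4} - \frac{T^2}{N}$
$V_2 = \frac{15^2}{3} + \frac{12^2}{3} + \frac{18^2}{3} + \frac{3^2}{3} - 192 = 234 - 192 = 42$
Error sum of squares $V_3 = V - V_1 - V_2 = 216 - 91.5 - 42 = 82.5$
Source of variation   Sum of squares   Degrees of freedom      Mean square          Calculated F            Tabulated F
Between months        91.5             m − 1 = 3 − 1 = 2       91.5/2 = 45.75       45.75/13.75 = 3.33      F0.05(2, 6) = 5.14
Between salesmen      42               n − 1 = 4 − 1 = 3       42/3 = 14            14/13.75 = 1.02         F0.05(3, 6) = 4.76
Error                 82.5             (m − 1)(n − 1) = 6      82.5/6 = 13.75
Total                 216              N − 1 = 11
Step 7: Considering the difference between months, we find that the calculated value of F = 3.33 <
tabulated value of F5% = 5.14, so we accept H01 (the months do not differ significantly).
Considering the difference between salesmen, we find that the calculated value of F = 1.02 < tabulated value
of F5% = 4.76, so we accept H02 (the salesmen do not differ significantly).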
This is two way classification. Let us arrange the data by subtracting 40 from each value.
                  Engine                Row total
Detergent         A     B     C         T_R         x_1^2   x_2^2   x_3^2
I                 5     −9    11        7           25      81      121
II                7     6     12        25          49      36      144
III               8     10    15        33          64      100     225
IV                2     −3    9         8           4       9       81
Column total T_C  22    4     47        73          142     226     571
$\sum\sum x_{ij}^2 = 939$
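Step 5. Sum of squares between detergents: $V_1 = \frac{(\sum T_{R_1})^2}{n_1} + \frac{(\sum T_{R_2})^2}{n_2} + \frac{(\sum T_{R_3})^2}{n_3} + \frac{(\sum T_{R_4})^2}{n_4} - \frac{T^2}{N}$
$V_1 = \frac{7^2}{3} + \frac{25^2}{3} + \frac{33^2}{3} + \frac{8^2}{3} - 444.1 = 164.9$
Step 6. Sum of squares between engines: $V_2 = \frac{(\sum T_{C_1})^2}{n_1} + \frac{(\sum T_{C_2})^2}{n_2} + \frac{(\sum T_{C_3})^2}{n_3} - \frac{T^2}{N}$
$V_2 = \frac{22^2}{4} + \frac{4^2}{4} + \frac{47^2}{4} - 444.1 = 233.15$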
Source of variation   Sum of squares   Degrees of freedom      Mean square            Calculated F            Tabulated F
Between detergents    164.9            m − 1 = 4 − 1 = 3       164.9/3 = 54.96        54.96/16.14 = 3.4       F0.05(3, 6) = 4.76
Between engines       233.15           n − 1 = 3 − 1 = 2       233.15/2 = 116.5       116.5/16.14 = 7.2       F0.05(2, 6) = 5.14
Error                 96.85            (m − 1)(n − 1) = 6      96.85/6 = 16.14
Total                 494.9            N − 1 = 11
Step 7: Considering the difference between detergents, we find that the calculated value of F = 3.4 <
tabulated value of F5% = 4.76.
Therefore, we accept H01 (the detergents do not differ significantly).
Considering the difference between engines, we find that the calculated value of F = 7.2 > tabulated value of
F5% = 5.14.
Therefore, we reject H02 (the engines differ significantly).
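The error row and the two F ratios of this problem follow from the quantities already obtained. As a worked check, assuming the total sum of squares is formed as the sum of squared coded values minus the same correction factor 444.1 that is used in Steps 5 and 6:

```latex
\begin{aligned}
V   &= \sum\sum x_{ij}^2 - \frac{T^2}{N} = 939 - 444.1 = 494.9,\\
V_3 &= V - V_1 - V_2 = 494.9 - 164.9 - 233.15 = 96.85,\\
F_1 &= \frac{54.96}{16.14} \approx 3.4, \qquad
F_2 = \frac{116.5}{16.14} \approx 7.2 .
\end{aligned}
```

Here both ratios are formed as the factor mean square over the error mean square, since each factor mean square exceeds the error mean square 16.14.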
H01 (the varieties of crops do not differ significantly with respect to yield)
Rewriting the data such that the rows represent the blocks and the columns represent the varieties of
crops, we have
Variety of Crops
Block A B C
1 6 7 8
2 4 6 5
3 8 6 10
4 6 9 9
                  Crops
Blocks            A     B     C     T_i       T_i^2/k              $\sum_j x_{ij}^2$
1                 6     7     8     21        (21)^2/3 = 147       149
2                 4     6     5     15        (15)^2/3 = 75        77
3                 8     6     10    24        (24)^2/3 = 192       200
4                 6     9     9     24        (24)^2/3 = 192       198
T_j               24    28    32    T = 84    $\sum T_i^2/k = 606$   $\sum\sum x_{ij}^2 = 624$
$\sum_i x_{ij}^2$   152   202   270   $\sum\sum x_{ij}^2 = 624$
Correction factor $= \frac{T^2}{N} = \frac{(84)^2}{12} = 588$
Sum of squares between blocks: $Q_1 = \sum \frac{T_i^2}{k} - \frac{T^2}{N} = 606 - 588 = 18$
Sum of squares between crops: $Q_2 = \sum \frac{T_j^2}{h} - \frac{T^2}{N} = 596 - 588 = 8$
ANOVA Table
Source of variation        Sum of squares   Degrees of freedom       Mean square       Calculated F value
Between rows (blocks)      Q1 = 18          h − 1 = 3                18/3 = 6          6/1.67 = 3.6
Between columns (crops)    Q2 = 8           k − 1 = 2                8/2 = 4           4/1.67 = 2.4
Error                      Q3 = 10          (h − 1)(k − 1) = 6       10/6 = 1.67       -
Total                      Q = 36           hk − 1 = 11              -                 -
Considering the difference between columns, we find that the calculated value of F = 2.4 < tabulated value
of F5% = 5.14, so we accept H02 (the varieties of crops do not differ significantly with respect to yield).
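The shortcut formulas used above (squared totals minus the correction factor) are algebraically the same as summing squared deviations of the block and crop means from the grand mean. The sketch below recomputes Q1, Q2 and Q3 for the crop data from that definition, as an independent check. The variable names are chosen here for illustration only.

```python
# Rows = blocks 1..4, columns = crops A, B, C (data from the problem above).
blocks = [[6, 7, 8], [4, 6, 5], [8, 6, 10], [6, 9, 9]]
h, k = len(blocks), len(blocks[0])
grand_mean = sum(sum(row) for row in blocks) / (h * k)

row_means = [sum(row) / k for row in blocks]          # block means
col_means = [sum(col) / h for col in zip(*blocks)]    # crop means

# Definitional sums of squares: squared deviations from the grand mean.
q = sum((x - grand_mean) ** 2 for row in blocks for x in row)   # total
q1 = k * sum((m - grand_mean) ** 2 for m in row_means)          # between blocks
q2 = h * sum((m - grand_mean) ** 2 for m in col_means)          # between crops
q3 = q - q1 - q2                                                # error

ms_error = q3 / ((h - 1) * (k - 1))
f_blocks = (q1 / (h - 1)) / ms_error
f_crops = (q2 / (k - 1)) / ms_error
print(q1, q2, q3, round(f_blocks, 1), round(f_crops, 1))
```

The values agree with Q1 = 18, Q2 = 8, Q3 = 10 and the F ratios 3.6 and 2.4 in the ANOVA table, which confirms that the shortcut and the definition give the same decomposition.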
7. Four air conditioning compressor designs were tested in four different regions of India. The test
was repeated by installing additional air conditioners in a second cooling season. The following are
the times to failure (to the nearest month) of each compressor tested.
               Replicate 1 (Designs)         Replicate 2 (Designs)
Region         A     B     C     D           A     B     C     D
Northeast      58    35    72    61          49    24    60    64
Southeast      40    18    54    38          38    22    64    50
Northwest      63    44    81    52          59    16    60    48
Southwest      36    9     47    30          29    13    52    41
Test at the 0.05 level of significance whether the differences among the means determined for designs,
for regions, and for replicates are significant, and test for significance of the interaction between
compressor designs and regions.