RELIABILITY
J. D. BROWN
University of Hawai'i
University of Hawai'i Working Papers in ESL, Vol. 8, No. 1, May 1989, pp. 79-113.
Surely there are other teacher trainers who have had the same experiences in
trying to provide teachers with useful information about measurement.
Could it be that the statistical concepts that have been developed to
estimate the reliability of criterion-referenced tests are so full of new and
esoteric-looking symbols and techniques that they scare off most measurement
specialists trained in classical measurement theory? This question seems more
to the point because I have myself found it very difficult to explain criterion-
referenced test concepts in my own testing course. An entirely new set of
symbols and analyses has evolved for criterion-referenced tests, and these
analyses are not covered in introductory educational or psychological
measurement books. They are also missing from the most widely available test
analysis software packages.
Since I feel that most classroom tests are best designed as criterion-
referenced tests and since most of my testing course students are teachers
responsible for administering such tests, I have long wanted to include
information about criterion-referenced test development techniques and
reliability analysis of criterion-referenced tests in my testing course. One result
was a recently published article (Brown 1989) which discussed criterion-
referenced test development techniques. The other was this paper, the purpose
of which is to explain and demonstrate the short-cut reliability-like estimates
that I have managed to find and/or derive from the literature on criterion-
referenced tests. To effectively explain these short-cut estimates, the paper will
first present background information about the similarities and differences
among norm-referenced, criterion-referenced and domain-referenced tests.
Then, a new criterion-referenced test development program in the English
Language Institute at the University of Hawai'i will be briefly described.
Example data from that program will be presented and explained so that they
can serve as a basis for illustrating the testing and reliability concepts that
follow.
The discussion of test reliability will start with a brief review of some of
the key concepts involved in any reliability estimate. Then the paper will focus
on traditional methods for arriving at consistency estimates for norm-
referenced tests in classical theory reliability. Finally, there will be a discussion
of criterion-referenced test consistency including presentation of four useful
and relatively easy-to-calculate estimates for criterion-referenced tests. These
short-cut techniques all have the advantage of being based on statistics familiar
to testers working in the world of traditional testing statistics. They are also
techniques that are straightforward enough to be calculated by hand with
relative ease.
BACKGROUND
In order to define and clarify the similarities and differences between norm-
referenced, criterion-referenced and domain-referenced tests, this paper will
begin by briefly focussing on some of the practical differences between norm-
referenced and criterion-referenced tests in the ways that scores are interpreted
and distributed, as well as in the purposes for giving each type of test and in
the students' knowledge of question content (for more details see Brown 1989).
There are also numerous contrasts between norm-referenced and criterion-
referenced tests in the ways that they are viewed empirically and treated
statistically, as will become apparent later in the paper (also see Hudson &
Lynch 1984). Domain-referenced tests will then be discussed within the
framework of criterion-referenced tests.
As shown in Table 1, norm-referenced tests (NRTs) are most often
designed to measure general language skills or abilities (e.g., overall English
language proficiency, academic listening ability, reading comprehension, etc.).
Each student's score on an NRT is interpreted relative to the scores of all other
students who took the test. Such interpretations are typically done with
reference to the statistical concept of normal distribution (familiarly known as
the "bell curve") of scores dispersed around a mean, or average. The purpose
of an NRT is to spread students out along a continuum of scores so that those
with "low'' abilities are at one end of the normal distribution, while those with
"high" abilities are found at the other (with the bulk of the students found
between the extremes, clustered around the mean). Another characteristic of
NRTs is that, even though the students may know the general form that the
questions will take on the examination (e.g., multiple-choice, true-false, etc.),
they typically have no idea what specific content will be tested by those
questions.
Example Data
Throughout the discussion that follows, reference will be made to the
example data shown in Table 2 on the following pages.
Table 2: Example Data Set - ELI Reading (ELI 72) Course Final Examination
Achievement Test
ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
3 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
39 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
15 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
5 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1
9 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0
2 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1
40 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1
31 1 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 0
16 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0
12 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1
25 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0
42 1 1 0 1 1 0 1 1 1 0 0 1 0 1 1 1 0
38 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1
22 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
10 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1
18 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1
32 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1
13 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1
41 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0
8 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0
1 0 1 1 0 1 1 1 0 1 0 0 0 1 1 1 1 0
11 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 1
26 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0
30 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0
14 1 1 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0
34 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0
35 1 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 1
27 0 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1
24 0 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1
7 0 1 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1
6 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1
29 0 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0
17 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1
20 0 1 0 0 1 0 1 1 1 0 0 1 1 1 1 1 1
23 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 1
36 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0
37 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0
4 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 1 1
19 1 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0
21 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
28 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0
IF   .7143 .8810 .5952 .3810 .6905 .6667 .9048 .7857 .9048 .3810 .4762 .7619 .6667 .8810 .8095 .9048 .5476
SI2  .2041 .1049 .2409 .2358 .2137 .2222 .0862 .1684 .0862 .2358 .2494 .1814 .2222 .1049 .1542 .0862 .2477
DI   .0839 .0983 .0300 .1201 .1687 .1667 .3178 .0683 .0787 .1201 .2153 .1315 .0362 .1201 .1356 .1004 .0694
     18 19 20 21 22 23 24 25 26 27 28 29 30   TOTAL   PROP
1 1 1 1 1 1 1 1 1 1 1 1 1 29 .9667
1 1 1 1 1 1 1 1 1 1 1 1 1 28 .9333
0 1 1 1 0 1 1 1 0 1 1 1 1 27 .9000
1 1 1 1 0 1 0 1 1 1 1 1 1 26 .8667
1 1 1 1 1 1 1 1 1 1 1 1 1 26 .8667
0 1 1 1 1 1 0 1 1 1 1 1 1 26 .8667
1 1 1 1 1 1 1 1 1 1 1 0 1 26 .8667
1 1 1 1 1 0 0 1 1 0 1 1 1 25 .8333
1 1 1 1 1 1 1 1 1 1 1 1 1 25 .8333
1 1 1 1 1 1 1 0 1 1 1 1 1 25 .8333
1 1 1 1 1 1 0 1 1 1 1 1 1 24 .8000
1 1 1 1 0 0 1 1 1 1 1 1 1 23 .7667
1 1 1 1 1 1 1 0 1 1 1 1 1 23 .7667
1 1 1 1 0 1 1 1 0 1 1 0 0 23 .7667
1 1 1 1 0 0 1 1 0 1 1 0 1 23 .7667
1 1 1 0 1 0 1 1 1 0 0 1 1 23 .7667
1 1 1 1 1 1 1 1 0 0 1 1 0 23 .7667
1 1 1 1 0 1 0 1 1 0 1 1 1 23 .7667
0 1 1 1 1 1 0 1 1 0 1 0 1 23 .7667
1 1 1 1 0 1 0 1 0 0 1 1 1 23 .7667
0 1 1 1 1 0 1 1 0 1 1 0 1 23 .7667
1 1 1 1 0 1 1 1 1 1 1 1 1 22 .7333
0 1 1 1 1 1 0 1 1 0 1 1 1 21 .7000
1 1 1 0 0 0 1 1 1 1 1 1 1 21 .7000
1 0 1 1 1 1 1 0 0 0 1 1 1 21 .7000
1 1 1 1 0 1 0 0 1 1 1 1 1 21 .7000
1 1 1 1 0 0 1 1 1 0 1 0 0 21 .7000
1 1 1 1 1 0 1 1 0 0 1 1 0 21 .7000
1 1 1 0 1 1 1 1 1 1 1 1 0 21 .7000
1 1 1 1 0 1 1 0 1 0 1 1 0 21 .7000
1 1 1 0 0 0 1 1 1 0 1 1 1 20 .6667
1 1 1 1 0 0 1 1 1 0 1 1 1 19 .6333
1 1 0 1 1 1 0 1 0 0 1 1 1 19 .6333
1 0 1 1 1 0 1 1 0 0 0 0 0 19 .6333
1 1 1 1 0 0 0 0 1 0 1 1 0 18 .6000
0 1 1 0 0 0 0 0 0 0 1 1 1 17 .5667
1 1 1 1 0 0 0 0 1 0 1 1 1 16 .5333
1 1 1 1 0 0 0 0 1 0 1 1 1 16 .5333
0 1 1 0 0 0 0 0 1 1 0 1 1 15 .5000
1 1 1 0 0 0 0 0 0 0 1 0 0 14 .4667
0 0 0 0 0 1 0 0 0 0 0 1 0 7 .2333
1 0 0 0 1 0 0 0 0 0 0 0 1 7 .2333
IF   .8095 .9048 .9286 .7857 .5000 .5714 .5714 .6905 .6667 .4762 .8810 .7857 .7619    TOTAL mean = 21.2857   PROP mean = .7095233
                                                                                       TOTAL SD = 4.6511     PROP SD = .1550366
SI2  .1542 .0862 .0663 .1684 .2500 .2449 .2449 .2137 .2222 .2494 .1049 .1684 .1814    sum of SI2 = 5.3991
DI   .0704 .0787 .0807 .0683 .1522 .0497 .1584 .0600 .0362 .1284 .1201 .0466 .0445
These data are based on one of the many criterion-referenced tests that have
recently been developed at the University of Hawai'i. There are seven courses
commonly offered in the English Language Institute-two in the academic
listening skill, two in academic reading, and three in academic writing. Each of
these courses now has CRTs to test the specific objectives of the course. In all
cases, there are two forms (cleverly labeled Forms A and B). These tests are
administered in a randomly assigned counterbalanced design such that each
student is tested for diagnosis at the beginning of the course (pretest) and for
achievement at the end (posttest) without taking the same test form twice.
TEST RELIABILITY
In general, the reliability of a test is defined as the extent to which the results
can be considered consistent or stable. For example, when a placement test is
administered, it is desirable for the test to be reliable because an unreliable test
might produce wildly different scores if the students were to take it again. The
decisions about which levels of language study the students should enter are
important ones that can make big differences for those students in terms of the
amounts of time, money and effort that they will have to invest in learning the
language. A responsible language professional will naturally want the
placement of students to be as accurate and consistent as possible.
The degree to which a test is consistent can be estimated by calculating a
reliability coefficient (rxx'). A reliability coefficient is interpreted like a
correlation coefficient in that it can go as high as +1.0 for a perfectly reliable
test. But it is also different from a correlation coefficient in that it can only go
as low as 0. In part, this is because a test cannot logically have less than no
reliability.
then scored separately as though they were different forms. Next, a correlation
coefficient is calculated for the two new sets of scores. The resulting half-test
correlation coefficient could be interpreted as a reliability estimate, except that it
represents the degree of reliability for only half of the test (either half, but still
just half of the test).
test. Since, all things held constant, a longer test will be more reliable than a
short one, the correlation calculated between the odd-numbered and even-
numbered items must be adjusted. This adjustment of the half-test correlation
to estimate the full-test reliability is accomplished using the Spearman-Brown
Prophecy formula.
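For readers who would like to check such computations by machine, the following minimal sketch (in Python, with made-up data and illustrative variable names, not part of the original analyses) carries out the odd/even split, the half-test correlation, and the Spearman-Brown adjustment:

# Minimal sketch: split-half reliability with the Spearman-Brown adjustment.
# Assumes dichotomously scored items arranged as a students-by-items matrix.
import numpy as np

def split_half_reliability(scores):
    scores = np.asarray(scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)   # odd-numbered items (1, 3, 5, ...)
    even_half = scores[:, 1::2].sum(axis=1)  # even-numbered items (2, 4, 6, ...)
    r_half = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation
    # Spearman-Brown Prophecy formula: adjust the half-test correlation
    # upward to estimate the reliability of the full-length test.
    return (2 * r_half) / (1 + r_half)

demo_scores = [[1, 1, 0, 1, 1, 0],   # made-up data: five students, six items
               [1, 0, 1, 1, 0, 1],
               [0, 1, 1, 0, 1, 1],
               [1, 1, 1, 1, 1, 0],
               [0, 0, 1, 0, 0, 1]]
print(round(split_half_reliability(demo_scores), 4))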
The procedures for the split-half reliability are still fairly complex
particularly when compared to the most commonly reported internal
consistency estimates which were worked out by Kuder and Richardson (1937).
These are known as the K-R21 and K-R20 formulas.
The Kuder-Richardson formula 21 (K-R21) is a relatively easy-to-
calculate estimate of the internal consistency of a test. K-R21 can be estimated
using the following formula:
K-R21 = (k / (k - 1)) (1 - (X̄ (k - X̄)) / (k ST²))   [1]
Where:
K-R21 = Kuder-Richardson formula 21
k = number of items
X̄ = mean of the scores on the test
ST² = variance of the scores on the test (that is, the standard deviation of the test scores squared)
For the data in Table 2, the appropriate values can be substituted into the formula as follows:

K-R21 = (k / (k - 1)) (1 - (X̄ (k - X̄)) / (k ST²))
      = (30 / 29) (1 - (21.2857 x (30 - 21.2857)) / (30 x 4.6511²))
      = 1.0345 (1 - .2858) = 1.0345 x .7142 = .7388 ≈ .74
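The same K-R21 computation can be sketched in a few lines of Python (the values are those from Table 2; the variable names are merely illustrative):

# Sketch: K-R21 from the number of items, the mean, and the standard
# deviation of the total scores (Table 2 values).
k = 30          # number of items
mean = 21.2857  # mean of the total scores
sd = 4.6511     # standard deviation of the total scores

kr21 = (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * sd ** 2))
print(round(kr21, 4))  # roughly .74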
The Kuder-Richardson formula 20 (K-R20) is calculated with the following formula:

K-R20 = (k / (k - 1)) (1 - ΣSi² / ST²)   [2]
Where:
K-R20 = Kuder-Richardson formula 20
k = number of items
Si² = item variance = IF(1 - IF)
ST² = variance of the scores on the test (that is, the standard deviation of the test scores squared)
This formula contains several new elements. The first is the sum of the item
variances, symbolized by ΣSi². These item values (see second row from bottom
of Table 2) are derived from the item facility values (see third row from the
bottom of Table 2). Using the first item in Table 2 as an example, the process
begins by getting the IF value for each item (.7143 for item 1). Recall that this
value represents the proportion of students who answered each item correctly.
Next, 1 - IF must be calculated for each item. The result of subtracting the IF
from 1.00 will yield the proportion of students who answered each item
incorrectly (1 - .7143 = .2857 for item 1). These results must then be multiplied
by the IF, which will yield the item variance, or Si² = IF(1 - IF) = .7143(.2857) =
.2041. In other words, the item variance for each item is equal to the
proportion of students who answered correctly multiplied times the
proportion who answered incorrectly. In Table 2, the item variances for each
item are shown in the second row from the bottom and are summed to the far
right in the table. This sum is substituted into the second numerator in the K-
R20 formula.
The other essential element in the K-R20 formula is the one symbolized
by ST². Again, this is a label for the variance of the whole test (i.e., the standard
deviation squared). Based on the information provided in Table 2, the sum of
the item variances (5.3991), the test variance (4.6511²) and the number of items (30) can
be substituted into the formula to calculate K-R20 as follows:
K-R20 = (k / (k - 1)) (1 - ΣSi² / ST²)
      = (30 / 29) (1 - 5.3991 / 4.6511²)
      = 1.0345 (1 - 5.3991 / 21.6327)
      = 1.0345 (1 - .2496) = 1.0345 x .7504 = .7763 ≈ .78
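Because item facility values and the total score variance are all that K-R20 requires, the whole procedure can also be sketched briefly in Python (the small score matrix below is made-up data for illustration; feeding in the full 42 x 30 matrix from Table 2 would reproduce the value of about .78):

# Sketch: K-R20 from a students-by-items matrix of 0/1 scores.
import numpy as np

def kr20(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                        # number of items
    item_facility = scores.mean(axis=0)        # IF for each item
    item_variances = item_facility * (1 - item_facility)  # Si2 = IF(1 - IF)
    total_variance = scores.sum(axis=1).var()  # ST2 (N formula)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

demo_scores = [[1, 1, 0, 1],   # made-up data: five students, four items
               [1, 0, 1, 1],
               [0, 1, 1, 0],
               [1, 1, 1, 1],
               [0, 0, 1, 0]]
print(round(kr20(demo_scores), 4))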
Note that the reliability estimates demonstrated here are only applicable when the items are scored
dichotomously, that is, when the items are either correct or incorrect, and when
the test is for norm-referenced purposes.
Because of the potential lack of variance in scores on CRTs as well as
because of the nature of the absolute decisions that are typically based on such
tests, other strategies have been worked out for demonstrating CRT
consistency. There are many such strategies but they generally fall into one of
three categories (Berk 1984, p. 235): threshold loss agreement, squared-error
loss agreement and domain score dependability. These three strategies can be
applied specifically to the estimation of CRT consistency.
It should be noted that most of the CRT consistency strategies presented
in the remainder of this paper have only fairly recently been developed and are
to some degree controversial. However, they do provide tools for analyzing
CRTs that may prove useful to language testers. Like all statistics, they should
be used with caution and interpreted carefully as just what they are: estimates
of CRT consistency.
You will notice, as the discussion proceeds, that the terms agreement or
dependability will be used in lieu of the term reliability to refer to estimates of
the consistency for CRTs. The term reliability is being reserved for the classical
theory NRT consistency estimates. This is done to emphasize the fact that CRT
estimates are different from classical theory estimates, as well as to insure that
it is clear which family of tests, NRT or CRT, is involved in any particular
discussion.
                                    ADMINISTRATION 2
                                 Masters    Non-masters
ADMINISTRATION 1    Masters          A           B          A + B
                    Non-masters      C           D          C + D
                                   A + C       B + D          N

Figure 1: Master/non-master classifications for two test administrations
Figure 1 shows a way of categorizing the results on the two tests so that
you can easily calculate po. In some cases, classifications agree between the
two tests. Thus when students are classified as masters on both
administrations of the test, you need only count them up and record the
number in cell A in Figure 1. Similarly, the number of students classified as
non-masters by both tests is put in cell D. In other cases, the classifications
disagree between the two administrations. Some students may be classified as
masters on the first administration and non-masters on the second; this
number would be tallied and put into cell B, while those students classified as
non-masters on the first administration and masters on the second would be
counted in cell C.
                                    ADMINISTRATION 2
                                 Masters    Non-masters
ADMINISTRATION 1    Masters         64            6            70
                    Non-masters      5           25            30
                                    69           31           100

Figure 2: Example master/non-master classifications for two test administrations
The agreement coefficient (po) is calculated with the following formula:

po = (A + D) / N   [3]

Where:
po = agreement coefficient
A = number of students in cell A
D = number of students in cell D
N = total number of students

For the data in Figure 2:

po = (A + D) / N = (64 + 25) / 100 = .89
This result indicates that the test administrations classified the students
with about 89 percent agreement. Thus the decision consistency is about 89
percent, and this CRT appears to be very consistent.
Notice that, if all of the students were classified in exactly the same way
by both administrations, the coefficient would be 1.00 [for example, (A + D)/N =
(80 + 20)/100 = 1.00, or (99 + 1)/100 = 1.00]. Thus, 1.00 is the maximum
value that this coefficient can take on. However, unlike the reliability
coefficients discussed above for NRTs, this coefficient can logically be no lower
than the value that would result from chance distribution across the four cells.
For 100 students, you might expect 25 students per cell by chance alone. This
would result in a coefficient of .50 [(A + D)/N = (25 + 25)/100 = 50/100 =
.50]. This is very different from NRT reliability estimates, which have a logical
lower limit of .00.
Kappa coefficient. The kappa coefficient (κ) adjusts for this problem of a
.50 lower limit. It reflects the proportion of consistency in classifications
beyond that which would occur by chance alone. The adjustment is given in
the following formula:
κ = (po - pchance) / (1 - pchance)   [4]
Where:
po = agreement coefficient
pchance = proportion of classification agreement that could occur by chance alone
        = [(A + B)(A + C) + (C + D)(B + D)] / N²
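Both the agreement coefficient and kappa can be computed directly from the four cells, as in the following minimal Python sketch (using the cell counts from Figure 2; the variable names are illustrative):

# Sketch: agreement (po) and kappa from the 2 x 2 cells in Figure 2.
# A = masters on both administrations, D = non-masters on both,
# B and C = the two kinds of disagreement.
A, B, C, D = 64, 6, 5, 25
N = A + B + C + D

p_o = (A + D) / N                                          # formula [3]
p_chance = ((A + B) * (A + C) + (C + D) * (B + D)) / N**2
kappa = (p_o - p_chance) / (1 - p_chance)                  # formula [4]

print(p_o)              # .89, as calculated above
print(round(kappa, 2))  # about .74 agreement beyond chance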
First, you will need to calculate a standardized cut-point score, z, using the following formula:

z = (c - .5 - X̄) / S   [5]

Where:
z = standardized cut-point score
c = raw cut-point score
X̄ = mean
S = standard deviation
Second, you will need to calculate any of the traditional NRT internal
consistency reliability estimates, including those described above (split-half
adjusted or K-R20).
          Reliability coefficient
|z|    .10  .20  .30  .40  .50  .60  .70  .80  .90
 .00   .53  .56  .60  .63  .67  .70  .75  .80  .86
 .10   .53  .57  .60  .63  .67  .71  .75  .80  .86
 .20   .54  .57  .61  .64  .67  .71  .75  .80  .86
 .30   .56  .59  .62  .65  .68  .72  .76  .80  .86
 .40   .58  .60  .63  .66  .69  .73  .77  .81  .87
 .50   .60  .62  .65  .68  .71  .74  .78  .82  .87
 .60   .62  .65  .67  .70  .73  .76  .79  .83  .88
 .70   .65  .67  .70  .72  .75  .77  .80  .84  .89
 .80   .68  .70  .72  .74  .77  .79  .82  .85  .90
 .90   .71  .73  .75  .77  .79  .81  .84  .87  .90
1.00   .75  .76  .77  .77  .81  .83  .85  .88  .91
1.10   .78  .79  .80  .81  .83  .85  .87  .89  .92
1.20   .80  .81  .82  .84  .85  .86  .88  .90  .93
1.30   .83  .84  .85  .86  .87  .88  .90  .91  .94
1.40   .86  .86  .87  .88  .89  .90  .91  .93  .95
1.50   .88  .88  .89  .90  .90  .91  .92  .94  .95
1.60   .90  .90  .91  .91  .92  .93  .93  .95  .96
1.70   .92  .92  .92  .93  .93  .94  .95  .95  .97
1.80   .93  .93  .94  .94  .94  .95  .95  .96  .97
1.90   .95  .95  .95  .95  .95  .96  .96  .97  .98
2.00   .96  .96  .96  .96  .96  .97  .97  .97  .98

Table 3: Approximate values of the agreement coefficient (from Subkoviak 1988, p. 49)
reliability estimate. Where the row for your z value meets the column for
your reliability coefficient, you will find the approximate value of threshold
agreement for your CRT. Table 3 gives the approximations for agreement
coefficients and Table 4 gives the same information for kappa coefficients.
          Reliability coefficient
|z|    .10  .20  .30  .40  .50  .60  .70  .80  .90
 .00   .06  .13  .19  .26  .33  .41  .49  .59  .71
 .10   .06  .13  .19  .26  .33  .41  .49  .59  .71
 .20   .06  .13  .19  .26  .33  .41  .49  .59  .71
 .30   .06  .12  .19  .26  .33  .40  .49  .59  .71
 .40   .06  .12  .19  .25  .32  .40  .48  .58  .71
 .50   .06  .12  .18  .25  .32  .40  .48  .58  .70
 .60   .06  .12  .18  .24  .31  .39  .47  .57  .70
 .70   .05  .11  .17  .24  .31  .38  .47  .57  .70
 .80   .05  .11  .17  .23  .30  .37  .46  .56  .69
 .90   .05  .10  .16  .22  .29  .36  .45  .55  .68
1.00   .05  .10  .15  .21  .28  .35  .44  .54  .68
1.10   .04  .09  .14  .20  .27  .34  .43  .53  .67
1.20   .04  .08  .14  .19  .26  .33  .42  .52  .66
1.30   .04  .08  .13  .18  .25  .32  .41  .51  .65
1.40   .03  .07  .12  .17  .23  .31  .39  .50  .64
1.50   .03  .07  .11  .16  .22  .29  .38  .49  .63
1.60   .03  .06  .10  .15  .21  .28  .37  .47  .62
1.70   .02  .05  .09  .14  .20  .27  .35  .46  .61
1.80   .02  .05  .08  .13  .18  .25  .34  .45  .60
1.90   .02  .04  .08  .12  .17  .24  .32  .43  .59
2.00   .02  .04  .07  .11  .16  .22  .31  .42  .58

Table 4: Approximate values of the kappa coefficient (from Subkoviak 1988, p. 50)
Consider the data in Table 2. Remember that these are a set of CRT
posttest scores with a mean of 21.29, a standard deviation of 4.65 and a K-R20
reliability estimate of .78. Assume that this CRT has a cut-point (c) of 22 out of
30. To obtain the standardized cut-point score, formula [5] above would be
applied as follows:
z = (c - .5 - X̄) / S = (22 - .5 - 21.29) / 4.65 = .21 / 4.65 = .0452
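In Python the standardization amounts to a single line (Table 2 values; the resulting z is then used, together with the K-R20 reliability of about .78, to enter Tables 3 and 4):

# Sketch: standardized cut-point score (formula [5]) for the Table 2 data.
c = 22        # raw cut-point score
mean = 21.29  # mean of the total scores
sd = 4.65     # standard deviation of the total scores

z = (c - 0.5 - mean) / sd
print(round(z, 4))  # about .0452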
The phi(lambda) dependability index can be estimated with the following formula (from Brennan 1984, adapted to the symbols used in this paper):

Φ(λ) = 1 - (1 / (k - 1)) x (X̄p(1 - X̄p) - Sp²) / ((X̄p - λ)² + Sp²)   [6]

Where:
Φ(λ) = phi(lambda) dependability index
k = number of items
λ = cut-point expressed as a proportion
X̄p = mean of the proportion scores
Sp = standard deviation of the proportion scores (using the N formula rather than the N - 1 formula)
Consider the example data in Table 2, but this time analyzing the test as
the CRT that it actually was designed to be. Using formula [6], the k indicates
the total number of items, or 30 in this case, and λ is the cut-point expressed
as a proportion. Let's say that the cut-point for mastery in this case was 60
percent (.60, if expressed as a proportion). Notice that the proportion scores
given in the column furthest to the right are the raw scores divided by the total
possible (i.e., k). The mean (.7095) and standard deviation (.1550) of these
proportion scores are X̄p and Sp, respectively. [It is important to realize that it
is not necessary to calculate the proportion scores before getting these values.
They can instead be calculated directly from the raw score mean and standard
deviation by dividing each by k. Using the example data in Table 2, the mean
divided by k turns out to be 21.2857/30 = .7095 and the standard deviation
divided by k is 4.6511/30 = .1550.]
Substituting all of these values into formula [6] (from Brennan 1984, but
adapted to the symbols used in this paper), you will get:
Φ(λ) = 1 - (1 / (30 - 1)) x (.7095(1 - .7095) - .1550²) / ((.7095 - .60)² + .1550²)
     = 1 - (1 / 29) x (.2061097 - .0240250) / (.0119902 + .0240250)
     = 1 - .0344828 x (.1820847 / .0360152)
     = 1 - (.0344828 x 5.0557736)
     = 1 - .1743372 = .8258 ≈ .83
Remember this is a short-cut index of dependability that takes into account the
distances of students from the cut-point for the master/non-master
classification. A more complete analysis is provided in Appendix A for a
number of different cut-points, but such additional analyses are beyond the
purpose of this paper (also see Brennan 1984 for much more on this topic).
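A minimal Python sketch of formula [6], using the Table 2 values and the cut-point of .60, reproduces the result just obtained:

# Sketch: the phi(lambda) dependability short-cut (formula [6]).
k = 30             # number of items
lam = 0.60         # cut-point expressed as a proportion
xp = 21.2857 / 30  # mean of the proportion scores
sp = 4.6511 / 30   # standard deviation of the proportion scores (N formula)

phi_lambda = 1 - (1 / (k - 1)) * (xp * (1 - xp) - sp**2) / ((xp - lam)**2 + sp**2)
print(round(phi_lambda, 2))  # about .83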
The phi (Φ) dependability index for domain-referenced score interpretations can be estimated with the following short-cut formula:

Φ = ((n Sp² / (n - 1)) [K-R20]) / ((n Sp² / (n - 1)) [K-R20] + (X̄p(1 - X̄p) - Sp²) / (k - 1))   [7]
Where:
n = number of persons who took the test
k = number of items
X̄p = mean of the proportion scores
Sp = standard deviation of the proportion scores (using the N formula rather than the N - 1 formula)
K-R20 = Kuder-Richardson formula 20 reliability estimate
For the data in Table 2, these values can be substituted into the formula as follows:

Φ = ((n Sp² / (n - 1)) [K-R20]) / ((n Sp² / (n - 1)) [K-R20] + (X̄p(1 - X̄p) - Sp²) / (k - 1))

  = ((42 x .1550366² / (42 - 1)) [.776296]) / ((42 x .1550366² / (42 - 1)) [.776296] + (.7095233(1 - .7095233) - .1550366²) / (30 - 1))

  = ((1.0095246 / 41) [.776296]) / ((1.0095246 / 41) [.776296] + .1820636 / 29)

  = .0191124 / (.0191124 + .0062781) = .7527 ≈ .75
It is important to note that this result in calculating phi matches exactly the
result obtained in Appendix C using the full set of generalizability procedures
including analysis of variance, estimation of G Study variance components,
estimation of D Study variance components and finally calculation of the G
coefficient for absolute error. In other words, while the full generalizability
study is clearer conceptually, precisely the same result has been obtained here
using only n, k, Xp, Sp and the K-R20 reliability. In addition, this short-cut for
estimating the dependability of domain-referenced scores is analogous to the
reliability estimates for NRTs in terms of interpretation.
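The following minimal Python sketch reproduces the short-cut phi calculation from the five quantities just listed (n, k, X̄p, Sp and the K-R20 reliability):

# Sketch: the phi dependability short-cut (formula [7]) for the Table 2 data.
n = 42           # number of persons
k = 30           # number of items
xp = 0.7095233   # mean of the proportion scores
sp = 0.1550366   # standard deviation of the proportion scores (N formula)
kr20 = 0.776296  # K-R20 for the same data

person_var = (n * sp**2 / (n - 1)) * kr20          # estimated person variance
abs_error_var = (xp * (1 - xp) - sp**2) / (k - 1)  # estimated absolute error variance
phi = person_var / (person_var + abs_error_var)
print(round(phi, 2))  # about .75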
DISCUSSION
You have seen that there are fairly straightforward ways to estimate the
consistency of criterion-referenced tests, ways that are no more difficult
conceptually or computationally than the classical theory approaches. There are several
additional points that must be stressed. The first is that some of the coefficients
presented in this paper are related in rather predictable ways. The second is
that there are a number of cautions that must be kept in mind when making the
calculations, particularly for the phi coefficient.
Cautions
In doing all of the calculations necessary to apply any of the reliability or
dependability estimates demonstrated in this paper, particularly the phi
coefficient, there are three cautions that should be kept in mind. The first is
that these techniques are only applicable when the items on the test are
dichotomously scored (i.e., right or wrong). The second is that the N formula
(rather than the N - 1 formula) should always be used in calculating the means
and standard deviations of the proportion scores that are used in the phi and
phi(lambda) formulas. The third caution is that, particularly for phi or
phi(lambda) coefficients, as much accuracy as possible should be used when
doing all of the calculations. In other words, throughout the calculations, as
many places should be carried to the right of the decimal point as possible, i.e.,
rounding should be avoided until the final coefficient is estimated.
In addition, it is important to remember that the phi and phi(lambda)
coefficients are dependent on the relative variance components involved in the
test and, as Brennan states, "it is strongly recommended that whenever
possible one report variance components, and estimate indices of
dependability in terms of variance components" (1984, p. 332). Thus if the
resources are available for doing a full-fledged generalizability study, that is
the best way to proceed.
CONCLUSION
In summary, this paper began by reviewing the similarities and differences
between the NRT and CRT families of tests. Then the classical theory
REFERENCES
Brown, J.D. & KM. Bailey. (1984). A categorical instrument for scoring second
language writing skills. Language Learning,34, 4, 21-42.
Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational
and Psychological Measurement,20,37-46.
Cronbach, L.J. (1970). Essentials of psychological testing (3rd ed.). New York:
Harper and Row.
Cziko, G.A. (1982). Improving the psychometric, criterion-referenced, and
practical qualities of integrative language tests. TESOL Quarterly, 16,
367-379.
Cziko, G.A. (1983). Psychometric and edumetric approaches to language
testing. In J.W. Oller, Jr. (Ed.) Issues in language testing research. Rowley,
MA: Newbury House.
Glaser, R. (1963). Instructional technology and the measurement of learning
outcomes: Some questions. American Psychologist, 18,519-521.
Guilford, J.P. (1954). Psychometric Methods. New York: McGraw-Hill.
Hively, W., H.L. Patterson & S.A. Page. (1968). A universe-definedu system of
11
[Appendix A table: phi(lambda) values for a range of cut-points; columns: phi(lambda), lambda, X̄p, Sp, k]
The derivation of formula [7] is based on combining (in a novel way) three
important elements from Brennan's work on Φ and generalizability theory (for
a p x I design). To begin with, it is useful to realize that the Φ coefficient is
defined as a ratio of the estimated person variance, σ²(p), to the estimated
person variance plus the estimated absolute error variance, σ²(Δ), as follows
(Brennan 1984, p. 308):

Φ = σ²(p) / (σ²(p) + σ²(Δ))   [8]
Estimating σ²(p)
Let's begin with the definition of a generalizability coefficient for
relative error, which is defined as a ratio of the estimated person variance,
σ²(p), to the estimated person variance plus the estimated relative error
variance, σ²(δ), as follows (Brennan 1983, p. 17):

Eρ²(δ) = σ²(p) / (σ²(p) + σ²(δ))   [9]
Then, it should be noted that, when items are dichotomously scored, Eρ²(δ) and
K-R20 are algebraic identities for the p x I design (Brennan 1984, p. 306). In
short, Eρ²(δ) = K-R20. In fairness, it must be pointed out that Brennan stresses
that this "is an algebraic identity with respect to numerical estimates, not a
conceptual identity with respect to underlying theoretical models" (1983,
p. 18).
With regard to relative error, Brennan points out that "... for the p x I
design σ²(δ) is identical to the estimated error variance in classical theory,
which is observed score variance multiplied by one minus reliability" (1983, p.
17). Since the estimated test variance is nSp²/(n - 1), the estimated relative error
variance can be calculated using the following equation:

σ²(δ) = (n Sp² / (n - 1)) [1 - K-R20]   [12]

In other words, that portion (based on K-R20) of the observed score variance
that is due to error is equal algebraically, if not conceptually, to σ²(δ).
Since K-R20 is made up of error variance and true score variance (see the
body of the paper), it follows that the portion (based on K-R20) of the observed
score variance that remains is true score variance, equal algebraically, if not
conceptually, to σ²(p).
Expressed as an equation:

σ²(p) = (n Sp² / (n - 1)) [K-R20]   [13]
Similarly, the estimated absolute error variance can be expressed in terms of these familiar statistics:

σ²(Δ) = (X̄p(1 - X̄p) - Sp²) / (k - 1)   [14]
Recall that the equation for Φ was as follows:

Φ = σ²(p) / (σ²(p) + σ²(Δ))   (equation [8] above)
Substituting equations [13] and [14] into this expression yields:

Φ = ((n Sp² / (n - 1)) [K-R20]) / ((n Sp² / (n - 1)) [K-R20] + (X̄p(1 - X̄p) - Sp²) / (k - 1))   (the same as equation [7])
Mean = .7095
Sum of squared person (p) means = 22.1533
Sum of squared item means = 15.8866
Total number of correct answers = 894
n = 42
k = 30
(1) 664.6000
(2) 634.3143
(3) 667.2381
(4) 894
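The bracketed terms above appear to be the usual p x i ANOVA building blocks: (1) = k times the sum of squared person means, (2) = the squared grand total divided by nk, (3) = n times the sum of squared item means, and (4) = the sum of the squared item scores. Assuming that reading of the terms (the figures shown are consistent with it, but it is only an assumption here), the full generalizability route can be sketched in Python as follows; it yields the same phi of about .75 as the short-cut formula [7]:

# Sketch of the full generalizability (p x i) route, assuming the bracketed
# terms (1)-(4) above are the usual ANOVA building blocks described in the
# preceding paragraph.
n, k = 42, 30
t1, t2, t3, t4 = 664.6000, 634.3143, 667.2381, 894.0

ss_p = t1 - t2              # sum of squares for persons
ss_i = t3 - t2              # sum of squares for items
ss_pi = t4 - t1 - t3 + t2   # residual (person-by-item) sum of squares

ms_p = ss_p / (n - 1)                # mean squares
ms_i = ss_i / (k - 1)
ms_pi = ss_pi / ((n - 1) * (k - 1))

var_p = (ms_p - ms_pi) / k   # G study variance component for persons
var_i = (ms_i - ms_pi) / n   # G study variance component for items
var_pi = ms_pi               # G study component for the person-by-item interaction

abs_error = (var_i + var_pi) / k   # D study absolute error variance
phi = var_p / (var_p + abs_error)  # G coefficient for absolute error
print(round(phi, 2))               # about .75, matching the short-cut phi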