
CRITERION-REFERENCED TEST

RELIABILITY

J. D. BROWN
University of Hawai'i

The distinction between norm-referenced and criterion-referenced tests is
relatively new in the language testing literature. As recently as 1982, Cziko
(1982 & 1983) called for the use of criterion-referenced interpretations in
language testing. Brown (1984a) listed some of the key differences between
norm-referenced and criterion-referenced tests, but it remained for Hudson
and Lynch (1984) to outline the basic principles of criterion-referenced test
construction and analysis. Bachman added to the small but growing literature
in this area with his 1987 article discussing the place of criterion-referenced
tests in language program evaluation and curriculum development. And
finally, Brown (1989) has shown one way that criterion-referenced item
analysis, in the form of the difference index, can be used in conjunction with
norm-referenced item analysis and revision techniques.
It seems strange that criterion-referenced tests have not been discussed
much more in language testing given that they have been a part of educational
testing for many years, dating back to Glaser 1963 (see Popham & Husek 1969,
Popham 1978, 1981, and Berk 1980 for much more on criterion-referenced
testing and its background). Despite the fact that they have been around for
over a quarter century, criterion-referenced approaches to testing have only
been touched on in the language testing field. In addition, only a handful of
people seem interested in working on the topic. Since I am one of those people,
I have had to ask myself if I am simply on the wrong track pursuing a line of
research that has no meaning, or if, instead, there is some element in criterion-
referenced testing that makes it unattractive to practicing language testers.
Could it be that criterion-referenced approaches to testing are simply not
important? I think not. I find a great demand, particularly among the teachers
that I work with, for information on techniques that they can use to develop
sound classroom level achievement and diagnostic tests. It seems to me that
criterion-referenced tests are ideal for these purposes (discussed below).

University of Hawai'i Working Papers in ESL, Vol. 8, No. 1, May 1989, pp. 79-113.

Surely there are other teacher trainers who have had the same experiences in
trying to provide teachers with useful information about measurement.
Could it be that the statistical concepts that have been developed to
estimate the reliability of criterion-referenced tests are so full of new and
esoteric looking symbols and techniques that they scare off most measurement
specialists trained in classical measurement theory? This question seems more
to the point because I have myself found it very difficult to explain criterion-
referenced test concepts in my own testing course. An entirely new set of
symbols and analyses has evolved for criterion-referenced tests, and these
analyses are not covered in introductory educational or psychological
measurement books. They are also missing from the most widely available test
analysis software packages.
Since I feel that most classroom tests are best designed as criterion-
referenced tests and since most of my testing course students are teachers
responsible for administering such tests, I have long wanted to include
information about criterion-referenced test development techniques and
reliability analysis of criterion-referenced tests in my testing course. One result
was a recently published article (Brown 1989) which discussed criterion-
referenced test development techniques. The other was this paper, the purpose
of which is to explain and demonstrate the short-cut reliability-like estimates
that I have managed to find and/or derive from the literature on criterion-
referenced tests. To effectively explain these short-cut estimates, the paper will
first present background information about the similarities and differences
among norm-referenced, criterion-referenced and domain-referenced tests.
Then, a new criterion-referenced test development program in the English
Language Institute at the University of Hawai'i will be briefly described.
Example data from that program will be presented and explained so that they
can serve as a basis for illustrating the testing and reliability concepts that
follow.
The discussion of test reliability will start with a brief review of some of
the key concepts involved in any reliability estimate. Then the paper will focus
on traditional methods for arriving at consistency estimates for norm-
referenced tests in classical theory reliability. Finally, there will be a discussion
of criterion-referenced test consistency including presentation of four useful
and relatively easy-to-calculate estimates for criterion-referenced tests. These
short-cut techniques all have the advantage of being based on statistics familiar
to testers working in the world of traditional testing statistics. They are also
techniques that are straightforward enough to be calculated by hand with
relative ease.

BACKGROUND
In order to define and clarify the similarities and differences between norm-
referenced, criterion-referenced and domain-referenced tests, this paper will
begin by briefly focussing on some of the practical differences between norm-
referenced and criterion-referenced tests in the ways that scores are interpreted
and distributed, as well as in the purposes for giving each type of test and in
the students' knowledge of question content (for more details see Brown 1989).
There are also numerous contrasts between norm-referenced and criterion-
referenced tests in the ways that they are viewed empirically and treated
statistically, as will become apparent later in the paper (also see Hudson &
Lynch 1984). Domain-referenced tests will then be discussed within the
framework of criterion-referenced tests.
As shown in Table 1, norm-referenced tests (NRTs) are most often
designed to measure general language skills or abilities (e.g., overall English
language proficiency, academic listening ability, reading comprehension, etc.).
Each student's score on an NRT is interpreted relative to the scores of all other
students who took the test. Such interpretations are typically done with
reference to the statistical concept of normal distribution (familiarly known as
the "bell curve") of scores dispersed around a mean, or average. The purpose
of an NRT is to spread students out along a continuum of scores so that those
with "low" abilities are at one end of the normal distribution, while those with
"high" abilities are found at the other (with the bulk of the students found
between the extremes, clustered around the mean). Another characteristic of
NRTs is that, even though the students may know the general form that the
questions will take on the examination (e.g., multiple-choice, true-false, etc.),
they typically have no idea what specific content will be tested by those
questions.

Table 1: Differences between Norm-referenced, Domain-referenced and
Criterion-referenced Tests (from Brown 1984a, 1989)

CHARACTERISTIC        NRT                                  CRT

1. Type of            To measure general language          To measure specific
   Measurement        abilities or proficiencies.          objectives-based language
                                                           points.

2. Type of            Relative (a student's performance    Absolute (a student's
   Interpretation     is compared to that of all           performance is compared only
                      other students).                     to the amount, or percent,
                                                           of material learned).

3. Score              Normal distribution of scores        If all students know all of
   Distribution       around a mean.                       the material, all should
                                                           score 100%.

4. Purpose of         Spread students out along a          Assess the amount of material
   Testing            continuum of general abilities       known, or learned, by each
                      or proficiencies.                     student.

5. Knowledge of       Students have little or no idea      Students know exactly what
   Questions          what content to expect in the        content to expect in test
                      questions.                           questions.

Criterion-referenced tests (CRTs), on the other hand, are usually
produced to measure well-defined and fairly specific instructional objectives.
Often these objectives are unique to a particular program and serve as the basis
for the curriculum. Hence it is important that the teachers and students know
exactly what those objectives are so that appropriate time and attention can be
focused on teaching and learning them. The interpretation of CRT results is
considered absolute in the sense that a student's score is meaningful unto itself,
without reference to the other students' performances. Instead, each student's
score on a particular objective indicates the percentage of the skill or
knowledge in that objective which has been learned. Moreover, the
distribution of scores on a CRT need not necessarily be normal. If all of the
students know 100 percent of the material on all of the objectives, it follows
that all of the students will receive the same score with no variation among
them. The purpose of CRTs, then, is to measure the degree to which students
have developed knowledge or skill on a specific objective or set of objectives.
In most cases, the students would know in advance what types of test
questions, tasks and content to expect for each objective because they would be
implied, or perhaps explicitly stated, in the objectives of the course.
Domain-referenced tests (DRTs) must also be discussed here because of
their relationship to CRTs. As defined here, DRTs are considered a type of
CRT differing primarily in the way that the item sampling is defined. With
DRTs, the items are sampled from a well-defined domain of behaviors. The
results on a DRT are used to describe a student's status with regard to that
domain similar to the way that CRT results are used to describe the student's
status on smaller subtests based on course or program objectives. Thus the
term domain-referenced is NOT being used in the original and rather restricted
sense that Hively, Patterson & Page (1968) proposed, but rather as a variation
within the overall concept of criterion-referenced testing (after Popham 1978).

Example Data
Throughout the discussion that follows, reference will be made to the
example data shown in Table 2 on the following pages.

Table 2: Example Data Set - ELI Reading (ELI 72) Course Final Examination
Achievement Test

ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
3 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
39 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
15 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
5 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1
9 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0
2 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1
40 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1
31 1 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 0
16 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0
12 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1
25 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0
42 1 1 0 1 1 0 1 1 1 0 0 1 0 1 1 1 0
38 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1
22 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
10 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1
18 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1
32 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1
13 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1
41 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0
8 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0
1 0 1 1 0 1 1 1 0 1 0 0 0 1 1 1 1 0
11 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 1
26 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0
30 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0
14 1 1 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0
34 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0
35 1 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 1
27 0 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1
24 0 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1
7 0 1 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1
6 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1
29 0 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0
17 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1
20 0 1 0 0 1 0 1 1 1 0 0 1 1 1 1 1 1
23 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 1
36 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0
37 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0
4 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 1 1
19 1 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0
21 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
28 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0
IF   .7143 .8810 .5952 .3810 .6905 .6667 .9048 .7857 .9048 .3810 .4762 .7619 .6667 .8810 .8095 .9048 .5476
Si²  .2041 .1049 .2409 .2358 .2137 .2222 .0862 .1684 .0862 .2358 .2494 .1814 .2222 .1049 .1542 .0862 .2477
DI   .0839 .0983 .0300 .1201 .1687 .1667 .3178 .0683 .0787 .1201 .2153 .1315 .0362 .1201 .1356 .1004 .0694

18 19 20 21 22 23 24 25 26 27 28 29 30   TOTAL   PROP

1 1 1 1 1 1 1 1 1 1 1 1 1 29 .9667
1 1 1 1 1 1 1 1 1 1 1 1 1 28 .9333
0 1 1 1 0 1 1 1 0 1 1 1 1 27 .9000
1 1 1 1 0 1 0 1 1 1 1 1 1 26 .8667
1 1 1 1 1 1 1 1 1 1 1 1 1 26 .8667
0 1 1 1 1 1 0 1 1 1 1 1 1 26 .8667
1 1 1 1 1 1 1 1 1 1 1 0 1 26 .8667
1 1 1 1 1 0 0 1 1 0 1 1 1 25 .8333
1 1 1 1 1 1 1 1 1 1 1 1 1 25 .8333
1 1 1 1 1 1 1 0 1 1 1 1 1 25 .8333
1 1 1 1 1 1 0 1 1 1 1 1 1 24 .8000
1 1 1 1 0 0 1 1 1 1 1 1 1 23 .7667
1 1 1 1 1 1 1 0 1 1 1 1 1 23 .7667
1 1 1 1 0 1 1 1 0 1 1 0 0 23 .7667
1 1 1 1 0 0 1 1 0 1 1 0 1 23 .7667
1 1 1 0 1 0 1 1 1 0 0 1 1 23 .7667
1 1 1 1 1 1 1 1 0 0 1 1 0 23 .7667
1 1 1 1 0 1 0 1 1 0 1 1 1 23 .7667
0 1 1 1 1 1 0 1 1 0 1 0 1 23 .7667
1 1 1 1 0 1 0 1 0 0 1 1 1 23 .7667
0 1 1 1 1 0 1 1 0 1 1 0 1 23 .7667
1 1 1 1 0 1 1 1 1 1 1 1 1 22 .7333
0 1 1 1 1 1 0 1 1 0 1 1 1 21 .7000
1 1 1 0 0 0 1 1 1 1 1 1 1 21 .7000
1 0 1 1 1 1 1 0 0 0 1 1 1 21 .7000
1 1 1 1 0 1 0 0 1 1 1 1 1 21 .7000
1 1 1 1 0 0 1 1 1 0 1 0 0 21 .7000
1 1 1 1 1 0 1 1 0 0 1 1 0 21 .7000
1 1 1 0 1 1 1 1 1 1 1 1 0 21 .7000
1 1 1 1 0 1 1 0 1 0 1 1 0 21 .7000
1 1 1 0 0 0 1 1 1 0 1 1 1 20 .6667
1 1 1 1 0 0 1 1 1 0 1 1 1 19 .6333
1 1 0 1 1 1 0 1 0 0 1 1 1 19 .6333
1 0 1 1 1 0 1 1 0 0 0 0 0 19 .6333
1 1 1 1 0 0 0 0 1 0 1 1 0 18 .6000
0 1 1 0 0 0 0 0 0 0 1 1 1 17 .5667
1 1 1 1 0 0 0 0 1 0 1 1 1 16 .5333
1 1 1 1 0 0 0 0 1 0 1 1 1 16 .5333
0 1 1 0 0 0 0 0 1 1 0 1 1 15 .5000
1 1 1 0 0 0 0 0 0 0 1 0 0 14 .4667
0 0 0 0 0 1 0 0 0 0 0 1 0 7 .2333
1 0 0 0 1 0 0 0 0 0 0 0 1 7 .2333
IF   .8095 .9048 .9296 .7857 .5000 .5114 .5114 .6905 .6667 .4762 .8810 .7857 .7619   21.2857 (raw)  .7095233 (prop.)  MEAN
                                                                                      4.6511 (raw)   .1550366 (prop.)  SD
Si²  .1542 .0862 .0663 .1684 .2500 .2449 .2449 .2137 .2222 .2494 .1049 .1684 .1814    5.3991 = sum of Si²
DI   .0704 .0787 .0807 .0683 .1522 .0497 .1584 .0600 .0362 .1284 .1201 .0466 .0445

These data are based on one of the many criterion-referenced tests that have
recently been developed at the University of Hawai'i. There are seven courses
commonly offered in the English Language Institute: two in the academic
listening skill, two in academic reading, and three in academic writing. Each of
these courses now has CRTs to test the specific objectives of the course. In all
cases, there are two forms (cleverly labeled Forms A and B). These tests are
administered in a randomly assigned counterbalanced design such that each
student is tested for diagnosis at the beginning of the course (pretest) and for
achievement at the end (posttest) without taking the same test form twice.

TEST RELIABILITY
In general, the reliability of a test is defined as the extent to which the results
can be considered consistent or stable. For example, when a placement test is
administered, it is desirable for the test to be reliable because an unreliable test
might produce wildly different scores if the students were to take it again. The
decisions about which levels of language study the students should'enter are
important ones that can make big differences for those students in terms of the
amounts of time, money and effort that they will have to invest in learning the
language. A responsible language professional will naturally want the
placement of students to be as accurate and consistent as possible.
The degree to which a test is consistent can be estimated by calculating a
reliability coefficient (rxx'). A reliability coefficient is interpreted like a
correlation coefficient in that it can go as high as +1.0 for a perfectly reliable
test. But it is also different from a correlation coefficient in that it can only go
as low as 0. In part, this is because a test cannot logically have less than NO
reliability.

True Scores and Observed Scores


In estimating reliability, at least in a theoretical sense, the purpose is to
approximate the degree to which the students' scores are likely to be the same
if they were to take the test repeatedly for an infinite number of times. An
average of each student's infinite number of scores on a test would be that
student's true score on the test. The true score would be the best possible
estimate of the underlying ability that the test is trying to measure. Since it is
obviously impossible to get students to take an examination an infinite number
of times, it is necessary to settle for an estimate of how close the students'
scores on a test are to those ideal true score values. The reliability coefficient is
one way testers estimate the percentage of variance in the actual scores
obtained on a test, also called observed scores, which can be attributed to
variance in true scores. Since observed score variance is all of the variance on a
given test, it is logically set at 1.00, or 100 percent. Typically, a reliability
estimate produces a value that can be interpreted in relation to observed score
variance as a ratio of true score variance to observed score variance as follows:

\[
\text{Reliability} = \frac{\text{true score variance}}{\text{observed score variance}} = \frac{.60}{1.00}
\]

For example, a reliability coefficient of .60 (.60/1.00) indicates that 60
percent of the variance in observed scores was due to true score variance, while
the remaining 40 percent cannot be accounted for and is called error variance,
or unsystematic variance. In other words, the observed score variance in a test
is made up of true score variance plus error variance (i.e., observed score
variance = true score variance + error variance = .60 + .40). Substituting those
components for the observed score variance in the formula given above, you
obtain:

\[
\text{Reliability} = \frac{\text{true score variance}}{\text{true score variance} + \text{error variance}} = \frac{.60}{.60 + .40}
\]

Notice that the reliability coefficient is directly interpretable as the
proportion of true score variance found in the observed score variance. Thus,
unlike the correlation coefficient, it is not necessary to square the value of the
reliability coefficient in order to interpret it in percent terms. If a test has a
reliability coefficient of rxx' = .92, it can be said to contain 92 percent true score
variance. Since true score variance is completely reliable variance, it can also
be said that the test contains 92 percent consistent, or reliable, variance, with 8
percent (100 - 92 = 8) error, or random variance. If rxx' = .20, the variance on
the test is only 20 percent systematic and 80 percent random, and so forth.

CONSISTENCY ESTIMATES FOR NRTS


In language testing for norm-referenced purposes, there are three classical
theory strategies that are commonly used to estimate reliability: test-retest,
equivalent-forms and internal-consistency approaches. Let's review these
estimates very briefly so that they can serve as a basis for the discussion of
CRTs that follows.
Test-retest reliability is the closest conceptually to estimating the
correlation of observed scores to true scores. This strategy involves first
administering a test to one group of students on two separate occasions, then
calculating the correlation coefficient between the scores of the two
administrations. The correlation coefficient that results is a conservative
estimate (that is, an underestimate) of the correlation between the observed
scores and the true scores. Thus it is a reliability estimate and can be
interpreted as the percent of reliable variance on the test, or as the percent of
true score variance accounted for by the observed scores.
An alternative strategy is the equivalent-forms approach (also called
parallel-forms reliability). Equivalent-forms reliability is similar to test-retest
reliability except that, instead of administering the same test twice, two
different tests (say Forms A and B which were designed to be equivalent) are
administered to a group of students. Again a correlation coefficient is
calculated for the degree of relationship between the scores on the two forms.
Also similar to the test-retest strategy, the resulting reliability coefficient can be
directly interpreted as the percent of reliable, or consistent, variance on the test,
or as the percent of true score variance accounted for by the observed scores.
To avoid the work and complexity involved in the test-retest or
equivalent-forms strategies, the internal-consistency reliability strategies are
considered very useful. As the name implies, all of these internal-consistency
approaches have the advantage of estimating the reliability of a test with only
one form and only one administration of that form.
The easiest internal-consistency strategy to understand conceptually is
called the split-half method. This approach is very similar to the equivalent-
forms technique except that in the internal-consistency method, equivalent
forms are created from the single test being analyzed. This entails dividing the
test into two equal parts. The test is usually split on the basis of odd and even
numbered items. The odd-numbered and even-numbered items on the test are
then scored separately as though they were different forms. Next, a correlation
coefficient is calculated for the two new sets of scores. The resulting coefficient
could be interpreted as a reliability estimate except that it represents the
degree of reliability for only half of the test, either half, but still just half of the
test. Since, all things held constant, a longer test will be more reliable than a
short one, the correlation calculated between the odd-numbered and even-
numbered items must be adjusted. This adjustment of the half-test correlation
to estimate the full-test reliability is accomplished using the Spearman-Brown
Prophecy formula.
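
As a concrete illustration of the procedure just described, the following sketch computes a split-half reliability and applies the Spearman-Brown adjustment. It is not part of the original paper: the language (Python), the function names and the small 0/1 data matrix are all illustrative assumptions.

```python
# Split-half reliability with the Spearman-Brown adjustment: a minimal sketch.
# Rows of the (hypothetical) matrix are students; columns are dichotomously
# scored items (1 = correct, 0 = incorrect).

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half_reliability(items):
    """Correlate odd- and even-item half scores, then step the half-test
    correlation up to full-test length with Spearman-Brown: 2r / (1 + r)."""
    odd = [sum(row[0::2]) for row in items]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in items]   # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return (2 * r_half) / (1 + r_half)

if __name__ == "__main__":
    data = [
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 1],
        [1, 1, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
    ]
    print(round(split_half_reliability(data), 2))   # about .70 for this matrix
```
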
The procedures for the split-half reliability are still fairly complex
particularly when compared to the most commonly reported internal
consistency estimates which were worked out by Kuder and Richardson (1937).
These are known as the K-R21 and K-R20 formulas.
The Kuder-Richardson formula 21 (K-R21) is a relatively easy-to-
calculate estimate of the internal consistency of a test. K-R21 can be estimated
using the following formula:

\[
\text{K-R21} = \frac{k}{k-1}\left(1 - \frac{\bar{X}(k - \bar{X})}{k\,S_T^2}\right) \qquad [1]
\]

Where:
K-R21 = Kuder-Richardson formula 21
k = number of items
X̄ = mean of the scores on the test
ST² = variance of the scores on the test (that is, the standard deviation of the test scores squared)

This formula contains several important elements. The first of these is
k, which is simply the number of items on the test. It is assumed that these
items are dichotomously scored (i.e., right or wrong). Next, it is necessary to
know X̄, or the raw score mean of the scores on the test. Last, it is necessary
to have in hand ST², which is the variance of the raw scores on the test (i.e., the
standard deviation squared). In other words, all that is needed to calculate K-R21
is the mean, standard deviation and number of items. Based on the
information provided in Table 2, the mean (21.2857), test variance (4.6511²) and
number of items (30) can be substituted into the formula to calculate K-R21 as
follows:
\[
\text{K-R21} = \frac{k}{k-1}\left(1 - \frac{\bar{X}(k - \bar{X})}{k\,S_T^2}\right)
= \frac{30}{29}\left(1 - \frac{21.2857 \times (30 - 21.2857)}{30 \times 4.6511^2}\right)
\]
\[
= 1.0345\,(1 - .2858) = 1.0345 \times .7142 = .7388 \approx .74
\]

While K-R21 is relatively simple to calculate, it is important to realize that the
result is usually an underestimate of the reliability of a test and is sometimes a
serious underestimate for some types of language tests (see Brown 1983,
1984b).
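
A minimal sketch of formula [1] follows; it is not part of the original paper, the function name is arbitrary, and the argument values are simply the summary statistics reported for the Table 2 data.

```python
# K-R21 (formula [1]) from nothing more than the number of items, the mean,
# and the standard deviation of the test scores.

def kr21(k, mean, sd):
    """Kuder-Richardson formula 21 for a dichotomously scored test."""
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * sd ** 2))

if __name__ == "__main__":
    print(round(kr21(k=30, mean=21.2857, sd=4.6511), 2))   # about .74
```
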
The Kuder-Richardson formula 20 (K-R20) appears to avoid the problem
of seriously underestimating the reliability. Though marginally more difficult
to calculate, K-R20 is also a much more accurate estimate of reliability than is
K-R21. K-R20 can be estimated using the following formula:

\[
\text{K-R20} = \frac{k}{k-1}\left(1 - \frac{\sum S_i^2}{S_T^2}\right) \qquad [2]
\]

Where:
K-R20 = Kuder-Richardson formula 20
k = number of items
Si² = item variance = IF(1 − IF)
ST² = variance of the scores on the test (that is, the standard deviation of the test scores squared)
This formula contains several new elements. The first is the sum of the item
variances, symbolized by ΣSi². These item values (see second row from bottom
of Table 2) are derived from the item facility values (see third row from the
bottom of Table 2). Using the first item in Table 2 as an example, the process
begins by getting the IF value for each item (.7143 for item 1). Recall that this
value represents the proportion of students who answered each item correctly.
Next, 1 - IF must be calculated for each item. The result of subtracting the IF
from 1.00 will yield the proportion of students who answered each item
incorrectly (1 - .7143 = .2857 for item 1). These results must then be multiplied
by the IF, which will yield the item variance, or Si² = IF(1 − IF) = .7143(.2857) =
.2041. In other words, the item variance for each item is equal to the
proportion of students who answered correctly multiplied times the
proportion who answered incorrectly. In Table 2, the item variances for each
item are shown in the second row from the bottom and are summed to the far
right in the table. This sum is substituted into the second numerator in the K-
R20 formula.
The other essential element in the K-R20 formula is the one symbolized
by ST². Again, this is a label for the variance of the whole test (i.e., the standard
deviation squared). Based on the information provided in Table 2, the sum of
the item variances (5.3991), test variance (4.6511²) and number of items (30) can
be substituted into the formula to calculate K-R20 as follows:

\[
\text{K-R20} = \frac{k}{k-1}\left(1 - \frac{\sum S_i^2}{S_T^2}\right)
= \frac{30}{29}\left(1 - \frac{5.3991}{4.6511^2}\right)
\]
\[
= 1.0345\left(1 - \frac{5.3991}{21.6327}\right) = 1.0345\,(1 - .2496)
= 1.0344828 \times .7504196 = .7763 \approx .78
\]


If accuracy is the main concern, the K-R20 formula should be applied
rather than the K-R21. In either case, any error in estimating the reliability will
be in the direction of an underestimate of the actual reliability of the test, i.e.,
all of these statistics are conservative in the sense that they should not
overestimate the existing state of affairs.
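
The same kind of sketch can be written for K-R20 (formula [2]), working directly from a matrix of dichotomously scored items; again this is not from the paper itself, and the small data matrix below is hypothetical rather than the Table 2 data.

```python
# K-R20 (formula [2]) computed directly from a matrix of dichotomously scored
# items: each item variance is IF(1 - IF), and the test variance is the squared
# standard deviation of the total scores (N formula, as in Table 2).

def kr20(items):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores."""
    n = len(items)        # number of students
    k = len(items[0])     # number of items
    totals = [sum(row) for row in items]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n        # S_T^2
    item_vars = []
    for j in range(k):
        item_facility = sum(row[j] for row in items) / n        # IF
        item_vars.append(item_facility * (1 - item_facility))   # S_i^2
    return (k / (k - 1)) * (1 - sum(item_vars) / var_total)

if __name__ == "__main__":
    data = [
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 1],
        [1, 1, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
    ]
    print(round(kr20(data), 2))
```
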
This coverage of internal-consistency reliability has necessarily been
brief. There are numerous other ways to approach the estimation of NRT
reliability. For instance, references are often made to Cronbach alpha,
Flanagan's coefficient, Rulon's coefficient, or others like the Guttman
coefficient [please refer to Cronbach 1970, J.P. Guilford 1954 or Stanley 1971 for
more on these]. The strategies chosen for presentation here were selected
because they facilitate the discussion of CRT consistency estimates.
Remember, internal-consistency estimates are the ones most often reported by
language testers because they have the distinct advantages of being estimable
from a single form of a test administered only once. Again, this is completely
unlike test-retest and equivalent forms reliabilities, which require either two
administrations or two forms. However, the internal consistency strategies
demonstrated here are only applicable when the items are scored
dichotomously, that is, when the items are either correct or incorrect, and when
the test is for norm-referenced purposes.

CONSISTENCY ESTIMATES FOR CRTS


As has been previously noted (Brown 1984a; Hudson & Lynch 1984),
CRTs may not produce normal distributions if they are functioning correctly.
On some occasions, like at the beginning of instruction, they may form normal
distributions, but the point is that the tester cannot count on the normal
distribution as part of the strategy for demonstrating the reliability of a test. If
all of the students have learned all of the material, they should all score near
100 percent on the end-of-course achievement CRT. Hence, a CRT that
produces little variation in scores may be ideal from a CRT point of view. In
other words, a low standard deviation on a CRT posttest may actually be a by-
product of developing a sound CRT for making absolute decisions. This is
quite opposite from the goals and results when developing a good NRT for
relative decisions.
Looking back over the reliability strategies described above, it seems
clear that they all depend in one way or another on the magnitude of the
standard deviation. Consider the test-retest and equivalent-forms strategies.
In both cases, a correlation coefficient is calculated. Since correlation
coefficients are designed to estimate the degree to which two sets of numbers
go together, or vary in similar manner, scores that are very tightly grouped
(that is, have a low standard deviation) will probably not line the students up
in a similar manner. In fact, as the standard deviation approaches zero, so do
any associated correlation coefficients. In addition, a quick glance back at the
K-R20 and K-R21 will indicate that as the standard deviation goes down,
relative to all other factors, so will the internal consistency estimates (see
Brown 1984b for a demonstration of the degree to which this relationship exists
in language testing data).
In short, all of the above strategies for reliability are fine for NRTs
because they are sensitive to the magnitude of the standard deviation, and a
relatively high standard deviation is one result of developing a test that
spreads students out along a continuum of abilities. However, those same
strategies may be inappropriate for CRTs because CRTs are not developed for
the purpose of producing variance in scores as represented by the standard
deviation.
Because of the potential lack of variance in scores on CRTs as well as
because of the nature of the absolute decisions that are typically based on such
tests, other strategies have been worked out for demonstrating CRT
consistency. There are many such strategies but they generally fall into one of
three categories (Berk 1984, p. 235): threshold loss agreement, squared-error
loss agreement and domain score dependability. These three strategies can be
applied specifically to the estimation of CRT consistency.
It should be noted that most of the CRT consistency strategies presented
in the remainder of this paper have only fairly recently been developed and are
to some degree controversial. However, they do provide tools for analyzing
CRTs that may prove useful to language testers. Like all statistics, they should
be used with caution and interpreted carefully as just what they are: estimates
of CRT consistency.
You will notice, as the discussion proceeds, that the terms agreement or
dependability will be used in lieu of the term reliability to refer to estimates of
the consistency for CRTs. The term reliability is being reserved for the classical
theory NRT consistency estimates. This is done to emphasize the fact that CRT
estimates are different from classical theory estimates, as well as to insure that
it is clear which family of tests, NRT or CRT, is involved in any particular
discussion.

Threshold Loss Agreement Approaches


Of the methods that exist today for estimating threshold loss agreement,
two are particularly appropriate for this paper because they are prominent in
the literature, yet straightforward enough mathematically so that they can be
calculated in most language teaching situations. These are the agreement
coefficient (Subkoviak 1980) and the kappa coefficient (Cohen 1960). Both of
these coefficients measure the consistency of master /non-master
classifications. A master is defined as a student who knows the material or has
mastered the ability being tested, while a non-master is a student who does
not. These two threshold loss agreement approaches are sometimes termed
decision consistency estimates because they approximate the degree to which
decisions classifying students as masters or non-masters are consistent. In
principle, these estimates require the administration of a test on two occasions.
The conceptual examples will be based on this relatively impractical strategy.
However, recently developed methods based on a single administration will
also be presented.
Agreement coefficient. Simply put, the agreement coefficient (p_o) is an
estimate of the proportion of students who have been consistently classified as
masters and non-masters on two administrations of a CRT. To apply this
approach, the test should be administered twice such that there is enough time
between administrations for the students to "forget" the test, yet not so much
time that they have learned any substantial amount. Using a predetermined
cut-point the students are then classified on the basis of their scores either into
the master or non-master groups on each test. The cut-points are usually
determined by the purpose of the test. On an achievement test, for instance, a
passing score might be considered 60 percent or higher. If this achievement
test were administered twice near the end of a term of instruction, it would
only be necessary to tally the number of students who passed (masters) and
those who failed (non-masters) on the two administrations.

                                      ADMINISTRATION 2
                                  Masters      Non-masters

ADMINISTRATION 1   Masters           A              B             A+B
                   Non-masters       C              D             C+D

                                    A+C            B+D          A+B+C+D

Figure 1: Master/non-master classifications for two test administrations

Figure 1 shows a way of categorizing the results on the two tests so that
you can easily calculate p_o. In some cases, classifications agree between the
two tests. Thus when students are classified as masters on both
administrations of the test, you need only count them up and record the
number in cell A in Figure 1. Similarly, the number of students classified as
non-masters by both tests is put in cell D. In other cases, the classifications
disagree between the two administrations. Some students may be classified as
masters on the first administration and non-masters on the second. This
number would be tallied and put into cell B, while those students classified as
non-masters on the first administration and masters on the second would be
tallied and put into cell C. Notice that A+B and C+D are totaled to the right of
the figure; in addition, A+C and B+D are totaled below it. Observe also that
A+B+C+D is shown in the bottom right corner. These are all called marginals.
Consider a contrived example for the sake of understanding how the
agreement coefficient works. Let's say that the master/non-master
classifications for one group of 100 students on two administrations of a
posttest turn out as shown in Figure 2. This would mean that 64 out of the 100
students were classified as masters by both tests, while 25 other students were
classified by both as non-masters. It also turns out that 11 students (11 = 6 + 5
students in cells B and C, respectively) were classified differently by the two
tests.
                                      ADMINISTRATION 2
                                  Masters      Non-masters

ADMINISTRATION 1   Masters           64             6              70
                   Non-masters        5            25              30

                                     69            31             100

Figure 2: Example master/non-master classifications for two test
administrations

With this information in hand, the calculation of the agreement
coefficient merely requires solving the following formula:

\[
p_o = \frac{A + D}{N} \qquad [3]
\]

Where:
p_o = agreement coefficient
A = number of students in cell A
D = number of students in cell D
N = total number of students

Substituting the values found in Figure 2 into formula [3], the
calculations turn out as follows:
\[
p_o = \frac{A + D}{N} = \frac{64 + 25}{100} = .89
\]

This result indicates that the test administrations classified the students
with about 89 percent agreement. Thus the decision consistency is about 89
percent, and this CRT appears to be very consistent.
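
For readers who would rather compute p_o from two lists of scores than from a filled-in version of Figure 1, the following sketch (not part of the original paper) is one way to do it; the score lists and the cut score in the example call are hypothetical.

```python
# Agreement coefficient p_o (formula [3]): classify each student as a master
# or a non-master on each of two administrations, then count the proportion of
# matching classifications.

def classify(scores, cut):
    """1 = master (score at or above the cut), 0 = non-master."""
    return [1 if s >= cut else 0 for s in scores]

def agreement_coefficient(scores1, scores2, cut):
    """Proportion of students classified the same way on both administrations
    (the students who fall in cells A and D of Figure 1)."""
    c1, c2 = classify(scores1, cut), classify(scores2, cut)
    same = sum(1 for a, b in zip(c1, c2) if a == b)
    return same / len(c1)

if __name__ == "__main__":
    admin1 = [25, 22, 18, 27, 16, 24, 21, 29, 15, 23]
    admin2 = [26, 21, 17, 28, 19, 25, 20, 27, 14, 22]
    print(agreement_coefficient(admin1, admin2, cut=22))   # 0.9 for these lists
```
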
Notice that, if all of the students were classified in exactly the same way
by both administrations, the coefficient would be 1.00 [for example, (A + D) /
N = (80 + 20) / 100 = 1.00, or (99 + 1) / 100 = 1.00]. Thus, 1.00 is the maximum
value that this coefficient can take on. However, unlike the reliability
coefficients discussed above for NRTs, this coefficient can logically be no lower
than the value that would result from chance distribution across the four cells.
For 100 students, you might expect 25 students per cell by chance alone. This
would result in a coefficient of .50 [(A + D) / N = (25 + 25) / 100 = 50/100 =
.50]. This is very different from NRT reliability estimates which have a logical
lower limit of .00.
Kappa coefficient. The kappa coefficient (κ) adjusts for this problem of a
.50 lower limit. It reflects the proportion of consistency in classifications
beyond that which would occur by chance alone. The adjustment is given in
the following formula:
\[
\kappa = \frac{p_o - p_{\text{chance}}}{1 - p_{\text{chance}}} \qquad [4]
\]

Where:
p_o = agreement coefficient
p_chance = proportion of classification agreement that could occur by chance alone
         = [(A+B)(A+C) + (C+D)(B+D)] / N²

As mentioned above, two-way classifications like those shown in the example
will always have a p_chance. Hence, before calculating the κ value, you must
calculate the p_chance level for the particular classification table involved.
These levels will of course differ depending on the score used as a cut-point in
making the absolute decision. For the example data, the calculations for
formula [4] would be as follows:

\[
p_{\text{chance}} = \frac{(A+B)(A+C) + (C+D)(B+D)}{N^2}
= \frac{(70)(69) + (30)(31)}{10000} = \frac{4830 + 930}{10000} = \frac{5760}{10000} = .5760 \approx .58
\]

\[
\kappa = \frac{p_o - p_{\text{chance}}}{1 - p_{\text{chance}}} = \frac{.89 - .58}{1 - .58} = \frac{.31}{.42} = .7381 \approx .74
\]

This coefficient is an estimate of the classification agreement that
occurred beyond what would be expected by chance alone and can be
interpreted as a percentage of that agreement. Since it is the percentage of
classification agreement beyond chance, it will generally be lower than the
agreement coefficient. Like the agreement coefficient, it has an upper limit of
1.00, but unlike the agreement coefficient with its chance lower limit, the kappa
coefficient has the more familiar lower limit of .00.
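
A corresponding sketch for formula [4] follows, working from the four cell counts of Figure 1; it is not part of the original paper, and the cell values in the example call are simply the ones shown in Figure 2.

```python
# Kappa (formula [4]) from the four cells of Figure 1: agreement beyond what
# the marginal totals would produce by chance alone.

def kappa(a, b, c, d):
    """a = master/master, b = master/non-master, c = non-master/master,
    d = non-master/non-master (rows = administration 1, columns = administration 2)."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_o - p_chance) / (1 - p_chance)

if __name__ == "__main__":
    print(round(kappa(64, 6, 5, 25), 2))   # about .74 (unrounded p_chance is used)
```
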
Single administration Agreement and Kappa coefficients. Because
administering a test twice is cumbersome and hard on everyone involved,
many approaches have been worked out to estimate threshold agreement from
one administration (see, for instance, Huynh 1976, Marshall 1976 and
Subkoviak 1980). Typically, these have been far too complex for practical
application by anyone but a trained statistician. Recently however, Subkoviak
(1988) presented practical approaches for approximating both the agreement
and kappa coefficients. In order to approximate either of these coefficients
from a single test administration, you will need to calculate two values. The
first is a value for the cut-point score converted to a standard score. This is
calculated using the following formula:

\[
z = \frac{c - .5 - \bar{X}}{S} \qquad [5]
\]

Where:
z = standardized cut-point score
c = raw cut-point score
X̄ = mean
S = standard deviation

Second, you will need to calculate any of the traditional NRT internal
consistency reliability estimates including those described above (Split-half
adjusted, or K-R20).

|z|     .10  .20  .30  .40  .50  .60  .70  .80  .90
.00     .53  .56  .60  .63  .67  .70  .75  .80  .86
.10     .53  .57  .60  .63  .67  .71  .75  .80  .86
.20     .54  .57  .61  .64  .67  .71  .75  .80  .86
.30     .56  .59  .62  .65  .68  .72  .76  .80  .86
.40     .58  .60  .63  .66  .69  .73  .77  .81  .87
.50     .60  .62  .65  .68  .71  .74  .78  .82  .87
.60     .62  .65  .67  .70  .73  .76  .79  .83  .88
.70     .65  .67  .70  .72  .75  .77  .80  .84  .89
.80     .68  .70  .72  .74  .77  .79  .82  .85  .90
.90     .71  .73  .75  .77  .79  .81  .84  .87  .90
1.00    .75  .76  .77  .77  .81  .83  .85  .88  .91
1.10    .78  .79  .80  .81  .83  .85  .87  .89  .92
1.20    .80  .81  .82  .84  .85  .86  .88  .90  .93
1.30    .83  .84  .85  .86  .87  .88  .90  .91  .94
1.40    .86  .86  .87  .88  .89  .90  .91  .93  .95
1.50    .88  .88  .89  .90  .90  .91  .92  .94  .95
1.60    .90  .90  .91  .91  .92  .93  .93  .95  .96
1.70    .92  .92  .92  .93  .93  .94  .95  .95  .97
1.80    .93  .93  .94  .94  .94  .95  .95  .96  .97
1.90    .95  .95  .95  .95  .95  .96  .96  .97  .98
2.00    .96  .96  .96  .96  .96  .97  .97  .97  .98

Table 3: Approximate values of the agreement coefficient (from Subkoviak 1988, p. 49)

Once you have the standardized cut-point score and a traditional
reliability estimate in hand, you simply enter the appropriate table and look in
the first column for the z value (regardless of sign, + or -) closest to your
obtained value and across the first row for the rxx' value closest to your obtained
reliability estimate. Where the row for your z value meets the column for
your reliability coefficient, you will find the approximate value of threshold
agreement for your CRT. Table 3 gives the approximations for agreement
coefficients and Table 4 gives the same information for kappa coefficients.

|z|     .10  .20  .30  .40  .50  .60  .70  .80  .90
.00     .06  .13  .19  .26  .33  .41  .49  .59  .71
.10     .06  .13  .19  .26  .33  .41  .49  .59  .71
.20     .06  .13  .19  .26  .33  .41  .49  .59  .71
.30     .06  .12  .19  .26  .33  .40  .49  .59  .71
.40     .06  .12  .19  .25  .32  .40  .48  .58  .71
.50     .06  .12  .18  .25  .32  .40  .48  .58  .70
.60     .06  .12  .18  .24  .31  .39  .47  .57  .70
.70     .05  .11  .17  .24  .31  .38  .47  .57  .70
.80     .05  .11  .17  .23  .30  .37  .46  .56  .69
.90     .05  .10  .16  .22  .29  .36  .45  .55  .68
1.00    .05  .10  .15  .21  .28  .35  .44  .54  .68
1.10    .04  .09  .14  .20  .27  .34  .43  .53  .67
1.20    .04  .08  .14  .19  .26  .33  .42  .52  .66
1.30    .04  .08  .13  .18  .25  .32  .41  .51  .65
1.40    .03  .07  .12  .17  .23  .31  .39  .50  .64
1.50    .03  .07  .11  .16  .22  .29  .38  .49  .63
1.60    .03  .06  .10  .15  .21  .28  .37  .47  .62
1.70    .02  .05  .09  .14  .20  .27  .35  .46  .61
1.80    .02  .05  .08  .13  .18  .25  .34  .45  .60
1.90    .02  .04  .08  .12  .17  .24  .32  .43  .59
2.00    .02  .04  .07  .11  .16  .22  .31  .42  .58

Table 4: Approximate values of the kappa coefficient (from Subkoviak 1988, p. 50)

Consider the data in Table 2. Remember that these are a set of CRT
posttest scores with a mean of 21.29, a standard deviation of 4.65 and a K-R20
reliability estimate of .78. Assume that this CRT has a cut-point (c) of 22 out of
30. To obtain the standardized cut-point score, formula [5] above would be
applied as follows:

\[
z = \frac{c - .5 - \bar{X}}{S} = \frac{22 - .5 - 21.29}{4.65} = \frac{.21}{4.65} = .0452
\]
To approximate the agreement coefficient, enter Table 3 at the row for z
that is the closest to .0452 (.00 in this case). Look across the top for the
reliability closest to .78 (.80 in this case). Where the identified row and column
meet you should find a value of .80 for the approximate agreement coefficient.
Following the same steps in Table 4 will give you an estimate for the Kappa
coefficient of .59.
According to Subkoviak (1988), approximations obtained in this way
will yield underestimates of the values that would be obtained using two
administrations. Thus they are safe estimates, but will generally be on the low
side of what would be obtained in a two administration situation. Thus they
can be used to give you an idea, or rough approximation, of the consistency of
your test. If they are high, this will be fine. However, if they are low, you
might want to double check the consistency of your test using other
approaches. It is a good idea to use a variety of approaches in any case.
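
The arithmetic in the single-administration approximation is only the standardized cut-point score of formula [5]; a sketch follows (not part of the original paper, with arbitrary function and variable names). The table lookup itself still has to be done in Subkoviak's (1988) tables, so the code simply reports which row and column of Tables 3 and 4 to consult.

```python
# Standardized cut-point score z (formula [5]) for Subkoviak's single-
# administration approximations.

def standardized_cut(cut, mean, sd):
    """z = (c - .5 - mean) / standard deviation."""
    return (cut - 0.5 - mean) / sd

if __name__ == "__main__":
    z = standardized_cut(cut=22, mean=21.29, sd=4.65)
    print(round(z, 4))        # about .0452 for the Table 2 data
    print(round(abs(z), 1))   # nearest |z| row to consult (.00 here)
    # With K-R20 = .78, the nearest reliability column is .80; Tables 3 and 4
    # then give approximately .80 (agreement) and .59 (kappa).
```
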

Squared-error Loss Agreement Approaches


Threshold loss agreement coefficients focus on the degree to which
classifications in clear-cut categories (master or non-master) have been
consistent. Squared-error loss agreement strategies do this too, but they do so
with "...sensitivity to the degrees of mastery and nonmastery along the score
continuum..." (Berk 1984, p. 246). Thus these approaches attempt to account
for the distances of students' scores from the cut-point, that is, for the degree of
mastery or non-mastery rather than the simple categorization.
Only the phi(lambda) dependability index (Brennan 1980, 1984) will be
explained here because it can be estimated using one test administration and
because Brennan has provided a formula that can be based on raw score test
statistics. Using common test statistics, the following formula can be used to
calculate phi(lambda):
\[
\Phi(\lambda) = 1 - \frac{1}{k-1}\left[\frac{\bar{X}_p(1 - \bar{X}_p) - S_p^2}{(\bar{X}_p - \lambda)^2 + S_p^2}\right] \qquad [6]
\]

Where:
Φ(λ) = phi(lambda) dependability index
k = number of items
λ = cut-point expressed as a proportion
X̄p = mean of proportion scores
Sp = standard deviation of proportion scores (using the N formula rather than N − 1)

Consider the example data in Table 2, but this time analyzing the test as
the CRT that it actually was designed to be. Using formula [6], k indicates
the total number of items, or 30 in this case, and λ is the cut-point expressed
as a proportion. Let's say that the cut-point for mastery in this case was 60
percent (.60, if expressed as a proportion). Notice that the proportion scores
given in the column furthest to the right are the raw scores divided by the total
possible (i.e., k). The mean (.7095) and standard deviation (.1550) of these
proportion scores are X̄p and Sp, respectively. [It is important to realize that it
is not necessary to calculate the proportion scores before getting these values.
They can instead be calculated directly from the raw score mean and standard
deviation by dividing each by k. Using the example data in Table 2, the mean
divided by k turns out to be 21.2857/30 = .7095 and the standard deviation
divided by k is 4.6511/30 = .1550.]
Substituting all of these values into formula [6] (from Brennan 1984, but
adapted to the symbols used in this paper), you will get:
\[
\Phi(.60) = 1 - \frac{1}{29}\left[\frac{.7095\,(1 - .7095) - .1550^2}{(.7095 - .60)^2 + .1550^2}\right]
= 1 - .0344828\left[\frac{.2061097 - .0240250}{.0119902 + .0240250}\right]
\]
\[
= 1 - .0344828\left[\frac{.1820847}{.0360152}\right] = 1 - (.0344828 \times 5.0557736)
= 1 - .1743 = .8257 \approx .83
\]
(Carrying the unrounded mean and standard deviation throughout gives .8258, the value reported in Appendix A.)
Remember this is a short-cut index of dependability that takes into account the
distances of students from the cut-point for the master/non-master
classification. A more complete analysis is provided in Appendix A for a
number of different cut-points, but such additional analyses are beyond the
purpose of this paper (also see Brennan 1984 for much more on this topic).
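
A sketch of the phi(lambda) calculation in formula [6] follows; it is not part of the original paper, the function name is arbitrary, and the argument values are simply the ones used in the worked example above.

```python
# Phi(lambda) dependability index (formula [6]) from the number of items, the
# cut-point expressed as a proportion, and the mean and standard deviation of
# the proportion scores (N formula).

def phi_lambda(k, cut, mean_p, sd_p):
    """Brennan's short-cut phi(lambda) estimate for a single administration."""
    numerator = mean_p * (1 - mean_p) - sd_p ** 2
    denominator = (mean_p - cut) ** 2 + sd_p ** 2
    return 1 - (1 / (k - 1)) * (numerator / denominator)

if __name__ == "__main__":
    print(round(phi_lambda(k=30, cut=.60, mean_p=.7095233, sd_p=.1550366), 4))
    # about .8258, matching the Appendix A value for a cut-point of .60
```
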

Domain Score Dependability


All three of the threshold loss and squared-error loss agreement
coefficients given above have been criticized because they are dependent in
one way or another on the cut-point score. The domain score estimates of
dependability are independent of such cut-scores. However, it is important to
note that they apply solely to domain-referenced interpretations as defined
above.
Only the phi (Φ) coefficient will be explained here. This coefficient, also
known as the generalizability coefficient for absolute error (for more on
generalizability theory, see Bolus, Hinofotis & Bailey 1982; Brown 1984c;
Brown & Bailey 1984), can be interpreted as a general purpose estimate of the
domain-referenced dependability of a test. This interpretation assumes that
the items are sampled from a well-defined domain and gives no information
about the reliability of the individual objectives-based subtests. Nevertheless,
it does provide a handy way of estimating the overall dependability of the
scores. The formula that is presented here was derived from information
provided in Brennan (1983, 1984) as shown in Appendix B. The resulting
formula is as follows:

r6p2 [K-R20]
n-1
~= ------=,......,...---=-=----
r6p2 Xp(1-Xp}-Sp2
n-1 [KR-20] + k-1
[6]
Where:
n = number of persons who took the test
k = number of items
XP = mean of proportion scores
SP = standard deviation of proportion scores (using theN
formula rather than N - 1)
K-R20 =Kuder-Richardson formula 20 reliability estimate

All that is necessary for calculating this coefficient of dependability is
the number of students, the number of items, the mean of the proportion
scores, the standard deviation of the proportion scores and the K-R20 reliability
estimate. The only two elements that are perhaps a bit unfamiliar are the mean
and standard deviation of the proportion scores. Remember, they are simply
the raw score mean and standard deviation divided by the number of items (k).
Once again, consider the CRT example shown in Table 2. Again, k is the
number of items (or 30 in this case); n is the number of students (42); X̄p is the
mean (.7095) of the proportion scores; and Sp is the standard deviation (.1550)
of the same proportion scores.
Substituting all of these values into formula [7], the result is:

\[
\Phi = \frac{\dfrac{n S_p^2}{n-1}\,[\text{K-R20}]}{\dfrac{n S_p^2}{n-1}\,[\text{K-R20}] + \dfrac{\bar{X}_p(1 - \bar{X}_p) - S_p^2}{k-1}}
= \frac{\dfrac{42 \times (.1550366)^2}{42 - 1}\,[.776296]}{\dfrac{42 \times (.1550366)^2}{42 - 1}\,[.776296] + \dfrac{.7095233\,(1 - .7095233) - (.1550366)^2}{30 - 1}}
\]
\[
= \frac{\dfrac{1.0095246}{41}\,[.776296]}{\dfrac{1.0095246}{41}\,[.776296] + \dfrac{.1820636}{29}}
= \frac{.019114}{.019114 + .006278} = \frac{.019114}{.025392} = .752757 \approx .75
\]

It is important to note that this result in calculating phi matches exactly the
result obtained in Appendix C using the full set of generalizability procedures
including analysis of variance, estimation of G Study variance components,
estimation of D Study variance components and finally calculation of the G
coefficient for absolute error. In other words, while the full generalizability
study is clearer conceptually, precisely the same result has been obtained here
using only n, k, Xp, Sp and the K-R20 reliability. In addition, this short-cut for
estimating the dependability of domain-referenced scores is analogous to the
reliability estimates for NRTs in terms of interpretation.
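
The short-cut phi coefficient of formula [7] reduces to a few lines of arithmetic, sketched below with the values used above; this sketch is not from the paper itself, and the function name is an arbitrary choice.

```python
# Short-cut phi coefficient (formula [7]) for domain score dependability,
# from n, k, the proportion-score mean and standard deviation (N formula),
# and the K-R20 estimate.

def phi(n, k, mean_p, sd_p, kr20):
    """Phi = person variance / (person variance + absolute error variance)."""
    person_var = (n * sd_p ** 2 / (n - 1)) * kr20
    abs_error_var = (mean_p * (1 - mean_p) - sd_p ** 2) / (k - 1)
    return person_var / (person_var + abs_error_var)

if __name__ == "__main__":
    print(round(phi(n=42, k=30, mean_p=.7095233, sd_p=.1550366, kr20=.776296), 4))
    # about .7528, matching the full generalizability study in Appendix C
```
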

DISCUSSION
You have seen that there are fairly straightforward ways to estimate the
consistency of criterion-referenced tests, ways that are no more difficult
conceptually or computationally than the classical theory approaches. There are several
additional points that must be stressed. The first is that some of the coefficients
presented in this paper are related in rather predictable ways. The second is
that there are a number of cautions that must be kept in mind when making
calculations, particularly for the phi coefficient.

Relationships Among Types of Coefficients


At some point it might be helpful to realize that there are predictable
relationships between some of the NRT coefficients of reliability and the
domain-referenced estimates of dependability. One interesting relationship
that Brennan (1984, pp. 315-316) demonstrates is that, for a given test, K-R21
will be less than Φ, which will in turn be less than K-R20, as follows:

K-R21 < Φ < K-R20

Using the example data above (where K-R21 = .74; Φ = .75; K-R20 = .78), it is
clear that indeed:

.74 < .75 < .78

This has one important implication: if indeed K-R21 is lower than Φ, then K-R21
can be used as a conservative "rough and ready" underestimate of the
domain-referenced dependability (Φ) of a test (Brennan 1984: 331-332).

Cautions
In doing all of the calculations necessary to apply any of the reliability or
dependability estimates demonstrated in this paper, particularly the phi
coefficient, there are three cautions that should be kept in mind. The first is
that these techniques are only applicable when the items on the test are
dichotomously scored (i.e., right or wrong). The second is the N formula
(rather than the N-1 formula) should always be used in calculating the means
and standard deviations of the proportion scores that are used in the phi and
phi(lambda) formulas. The third caution is that, particularly for phi or
phi(lambda) coefficients, as much accuracy as possible should be used when
doing all of the calculations. In other words, throughout the calculations, as
many places should be carried to the right of the decimal point as possible, i.e.,
rounding should be avoided until the final coefficient is estimated.
In addition, it is important to remember that the phi and phi(lambda)
coefficients are dependent on the relative variance components involved in the
test and, as Brennan states, "it is strongly recommended that whenever
possible one report variance components, and estimate indices of
dependability in terms of variance components" (1984, p. 332). Thus if the
resources are available for doing a full-fledged generalizability study, that is
the best way to proceed.

CONCLUSION
In summary, this paper began by reviewing the similarities and differences
between the NRT and CRT families of tests. Then the classical theory
approaches to estimating reliability (test-retest, equivalent-forms and internal
consistency) were briefly reviewed for comparison with the CRT techniques
presented here. These short-cut CRT techniques turned out to be
straightforward enough to be calculated with relative ease by hand or with the
help of a simple calculator. Each of the techniques was demonstrated using
example data from a CRT administered in Fall semester 1988 to all sections of
the intermediate reading course in the English Language Institute at the
University of Hawai'i.
Two coefficients within the threshold loss agreement approach to CRT
decision consistency were demonstrated. These were the agreement coefficient
and kappa coefficient, both of which could, until recently, only be applied in
pretest-posttest CRT situations. Alternative and simpler methods (suggested
by Subkoviak 1988) were given for estimating these same coefficients from the
results of a single CRT administration.
Next, the squared-error loss agreement approach to CRT consistency
was represented by the phi(lambda) dependability index, for which simplified
calculations, derived from Brennan (1983, 1984), were given. The formula was
explained and additional example solutions were provided in Appendix A.
Finally, the domain score dependability approach to test consistency
was represented by a short-cut method for estimating a phi coefficient (derived
from Brennan's 1983, 1984 discussions of phi and generalizability theory).
Appendix B shows the derivation of this formula and Appendix C provides a
full-blown generalizability study on the same data with exactly equivalent
results.
It is hoped that the short-cut estimates provided in this paper were
adequately explained, demonstrated and compared so that the paper will
stimulate further discussion of the principles of criterion-referenced language
testing, which will in turn encourage more widespread use of those principles
in the development of sound criterion-referenced tests.

Received March 20, 1989


Author's address for correspondence:
J.D. Brown
Department of English as a Second Language
1890 East-West Road
University of Hawai'i
Honolulu, HI 96822

REFERENCES

Bachman, L.F. (1987). The development and use of criterion-referenced tests of
language proficiency in language program evaluation. In A. Wangsotorn,
K. Prapphal, A. Maurice and B. Kenny (Eds.) Trends in language
programme evaluation (pp. 252-274). Bangkok: Chulalongkorn University.
Berk, R.A. (1980). Criterion-referenced measurement: the state of the art. Baltimore:
Johns Hopkins University Press.
Berk, R.A. (1984). Selecting the index of reliability. In R.A. Berk (Ed.) A guide
to criterion-referenced test construction. Baltimore: Johns Hopkins
University Press.
Bolus, R.E., F.B. Hinofotis & K.M. Bailey. (1982). An introduction to
generalizability theory in second language research. Language Learning,
32, 245-258.
Brennan, R.L. (1980). Applications of generalizability theory. In R.A. Berk (Ed.)
Criterion-referenced measurement: the state of the art. Baltimore: Johns
Hopkins University Press.
Brennan, R.L. (1983). Elements of generalizability theory. Iowa City, IA:
American College Testing Program.
Brennan, R.L. (1984). Estimating the dependability of the scores. In R.A. Berk
(Ed.) A guide to criterion-referenced test construction. Baltimore: Johns
Hopkins University Press.
Brown, J.D. (1983). A closer look at cloze: Part II, reliability. In J.W. Oller, Jr.
(Ed.) Issues in language testing. Rowley, MA: Newbury House.
Brown, J.D. (1984a). Criterion-referenced language tests: what, how and why?
Gulf Area TESOL Bi-annual, 1, 32-34.
Brown, J.D. (1984b). A cloze is a cloze is a cloze? In J. Handscombe, R.A.
Orem, & B.P. Taylor (Eds.) On TESOL '83: the question of control (pp.
109-119). Washington, D.C.: TESOL.
Brown, J.D. (1984c). A norm-referenced engineering reading test. In A.K. Pugh
& J.M. Ulijn (Eds.) Reading for professional purposes: studies and practices in
native and foreign languages. London: Heinemann Educational Books.
Brown, J.D. (1989). Improving ESL placement tests using two perspectives.
TESOL Quarterly, 23, 1.

Brown, J.D. & K.M. Bailey. (1984). A categorical instrument for scoring second
language writing skills. Language Learning, 34(4), 21-42.
Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational
and Psychological Measurement, 20, 37-46.
Cronbach, L.J. (1970). Essentials of psychological testing (3rd ed.). New York:
Harper and Row.
Cziko, G.A. (1982). Improving the psychometric, criterion-referenced, and
practical qualities of integrative language tests. TESOL Quarterly, 16,
367-379.
Cziko, G.A. (1983). Psychometric and edumetric approaches to language
testing. In J.W. Oller, Jr. (Ed.) Issues in language testing research. Rowley,
MA: Newbury House.
Glaser, R. (1963). Instructional technology and the measurement of learning
outcomes: Some questions. American Psychologist, 18, 519-521.
Guilford, J.P. (1954). Psychometric Methods. New York: McGraw-Hill.
Hively, W., H.L. Patterson & S.A. Page. (1968). A "universe-defined" system of
arithmetic achievement tests. Journal of Educational Measurement, 5,
275-290.
Hudson, T. & B. Lynch. (1984). A criterion-referenced approach to ESL
achievement testing. Language Testing, 1, 171-201.
Huynh, H. (1976). On the reliability of decisions in domain-referenced testing.
Journal of Educational Measurement, 13, 253-264.
Kuder, G.F. & M.W. Richardson. (1937). The theory of estimation of test
reliability. Psychometrika, 2, 151-160.
Marshall, J.L. (1976). The mean split-half coefficient of agreement and its relation to
other test indices (Technical Report # 350). Madison, WI: Wisconsin
Research and Development Center for Cognitive Learning.
Popham, W.J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ:
Prentice-Hall.
Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, NJ:
Prentice-Hall.
Popham, W.J. & T.R. Husek. (1969). Implications of criterion-referenced
measurement. Journal of Educational Measurement, 6, 1-9.

Stanley, J.C. (1971). Reliability. In R.L. Thorndike (Ed.) Educational measurement
(2nd ed.). Washington, DC: American Council on Education.
Subkoviak, M.J. (1980). Decision-consistency approaches. In R.A. Berk (Ed.)
Criterion-referenced measurement: the state of the art (pp. 129-185). Baltimore:
Johns Hopkins University Press.
Subkoviak, M.J. (1988). A practitioner's guide to computation and interpretation
of reliability indices for mastery tests. Journal of Educational Measurement,
25, 47-55.

Appendix A: More complete phi(lambda) analysis

phi(lambda)                   lambda    X̄p       Sp²      k
Φ(.90) = 0.8959               0.9000    0.7095   0.0240   30
Φ(.80) = 0.8052               0.8000    0.7095   0.0240   30
Φ(mean = .7095) = 0.7388      0.7095    0.7095   0.0240   30
Φ(.70) = 0.7398               0.7000    0.7095   0.0240   30
Φ(.60) = 0.8258               0.6000    0.7095   0.0240   30
Φ(.50) = 0.9076               0.5000    0.7095   0.0240   30
Φ(.40) = 0.9476               0.4000    0.7095   0.0240   30
Φ(.30) = 0.9673               0.3000    0.7095   0.0240   30
Φ(.20) = 0.9779               0.2000    0.7095   0.0240   30
Φ(.10) = 0.9841               0.1000    0.7095   0.0240   30

Appendix B: Derivation of formula [7]

The derivation of formula [7] is based on combining (in a novel way) three
important elements from Brennan's work on Φ and generalizability theory (for
a p × i design). To begin with, it is useful to realize that the Φ coefficient is
defined as a ratio of the estimated person variance, σ²(p), to the estimated
person variance plus the estimated absolute error variance σ²(Δ) as follows
(Brennan 1984, p. 308):

\[
\Phi = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\Delta)} \qquad [8]
\]

To develop a short-cut estimate of this coefficient, it was only necessary to find
accessible and easy ways to calculate the component parts, σ²(p) and σ²(Δ),
then recombine them.

Estimating σ²(p)
Let's begin with the definition of a generalizability coefficient for
relative error, which is defined as a ratio of the estimated person variance,
σ²(p), to the estimated person variance plus the estimated relative error
variance σ²(δ) as follows (Brennan 1983, p. 17):

\[
E\rho^2(\delta) = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)} \qquad [9]
\]

Then, it should be noted that, when items are dichotomously scored, Eρ²(δ) and
K-R20 are algebraic identities for the p × i design (Brennan 1984, p. 306). In
short, Eρ²(δ) = K-R20. In fairness, it must be pointed out that Brennan stresses
that this "is an algebraic identity with respect to numerical estimates, not a
conceptual identity with respect to underlying theoretical models." (1983,
p. 18).
With regard to relative error, Brennan points out that "... for the p × i
design σ²(δ) is identical to the estimated error variance in classical theory,
which is observed score variance multiplied by one minus reliability." (1983, p.
17). Since the estimated test variance is nSp²/(n − 1), the estimated relative error
variance can be calculated using the following equation:

\[
\sigma^2(\delta) = \frac{n S_p^2}{n-1}\,[1 - \text{K-R20}] \qquad [12]
\]

In other words, that portion (based on K-R20) of the observed score variance
that is due to error is equal algebraically, if not conceptually, to σ²(δ).
Since observed score variance is made up of error variance and true score variance (see
the body of the paper), it follows that the portion (based on K-R20) of the observed
score variance that remains is true score variance and is equal algebraically, if not
conceptually, to σ²(p).

Expressed as an equation:

\[
\sigma^2(p) = \frac{n S_p^2}{n-1}\,[\text{K-R20}] \qquad [13]
\]

Estimating σ²(Δ) and Φ

Brennan (1983, p. 19) further points out that, for dichotomously scored
data, absolute error, σ²(Δ), is equivalent to the following (symbols changed to
those used in this paper):

\[
\sigma^2(\Delta) = \frac{\bar{X}_p(1 - \bar{X}_p) - S_p^2}{k-1} \qquad [14]
\]
Recall that the equation for Φ was as follows:

\[
\Phi = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\Delta)} \qquad \text{(equation [8] above)}
\]

Substituting equations [13] and [14] into the above equation:

\[
\Phi = \frac{\dfrac{n S_p^2}{n-1}\,[\text{K-R20}]}{\dfrac{n S_p^2}{n-1}\,[\text{K-R20}] + \dfrac{\bar{X}_p(1 - \bar{X}_p) - S_p^2}{k-1}} \qquad \text{(the same as equation [7])}
\]

Appendix C: Generalizability study

After Brennan 1983 pp. 11-15

Mean = .7095
Sum of p means squared = 22.1533
Sum of item means squared = 15.8866
Total # correct answers = 894
n = 42
k = 30

(1) 664.6000
(2) 634.3143
(3) 667.2381
(4) 894

SSp = (1) − (2)
SSi = (3) − (2)
SSpi = (4) − (1) − (3) + (2)

SOURCE        SS         df     MS       G Study       N     D Study
                                         Var. Comp.          Var. Comp.
persons (p)   30.2857    41     0.7387   .019114       1     .019114 (p)
items (i)     32.9238    29     1.1353   .023097       30    .000770 (i)
p × i         196.4762   1189   0.1652   .165245       30    .005508 (pi)
total         259.6857   1259

Relative error variance = .005508;  G coefficient (relative, Eρ²) = 0.776296
Absolute error variance = .006278;  G coefficient (absolute, Φ)  = 0.752757
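
For readers who would rather reproduce a table like the one above directly from their own 0/1 data than work from summary statistics, the following sketch estimates the p × i variance components and the two G coefficients with the standard random-effects ANOVA estimators. It is not taken from Brennan (1983) or from this paper, and the tiny data matrix is hypothetical.

```python
# Variance components and G coefficients for a persons x items (p x i) design,
# estimated from a matrix of dichotomously scored items.

def g_study(items):
    n_p, n_i = len(items), len(items[0])
    grand = sum(sum(row) for row in items) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in items]
    i_means = [sum(row[j] for row in items) / n_p for j in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_total = sum((x - grand) ** 2 for row in items for x in row)
    ss_pi = ss_total - ss_p - ss_i
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))
    var_pi = ms_pi                         # G study variance components
    var_p = (ms_p - ms_pi) / n_i
    var_i = (ms_i - ms_pi) / n_p
    rel_error = var_pi / n_i               # D study error variances (n_i' = n_i)
    abs_error = var_i / n_i + var_pi / n_i
    e_rho2 = var_p / (var_p + rel_error)   # G coefficient, relative decisions
    phi = var_p / (var_p + abs_error)      # G coefficient (phi), absolute decisions
    return {"var_p": var_p, "var_i": var_i, "var_pi": var_pi,
            "E_rho2": e_rho2, "phi": phi}

if __name__ == "__main__":
    data = [
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 1],
        [1, 1, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
    ]
    print(g_study(data))
```
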
