
Module 5

The concept of reliability


5.0. Introduction
5.1. Reliability
5.2. Methods of estimating reliability
5.2.1. Stability (test-retest reliability)
5.2.2. Internal consistency
5.2.2.1. Split-half reliability estimates
5.3. True score and error score
5.4. Kuder-Richardson reliability coefficients (KR-20 & KR-21)
5.5. Rater consistency
5.6. Concluding Remarks
The concept of reliability

•Reliability refers to the consistency of measurement, that is, how consistent test results and other assessment results are from one measurement to another. The reliability of a language test is an essential criterion. The concern here is with how far we can rely on the results that a test produces.
•Three aspects of reliability should be taken into consideration. The first concerns the consistency of rating among different markers, e.g., when scoring a test of written or spoken expression. This is called inter-rater reliability, the degree of which is established by correlating the scores obtained by test takers from rater A with those from rater B. The consistency of each individual rater (intra-rater reliability) is established by getting them to re-mark a selection of scripts at a later date and correlating the scores given on the two occasions (see the rating process module).
•The tester should consider how to enhance the agreement
between raters by establishing, and maintaining adherence to,
explicit guidelines for the conduct of this rating. The criteria of
assessment should be established and agreed upon and then
raters should be trained in the application of these criteria
through rigorous standardisation procedures (see Murphy, 1979).
During the rating process there needs to be a degree of cross-
checking to ensure that agreed standards are being maintained.
•It is also essential to try to ensure that relevant sub-tests are
internally consistent in the sense that all items in a sub-test are
judged to be measuring the same attribute or construct. The
Kuder-Richardson formulae for estimating this internal consistency
are readily available in most statistics manuals.
•The third aspect of reliability is that of parallel-forms reliability,
the requirements of which have to be kept in mind when future
alternative forms of a test have to be devised. This is often very
difficult to achieve for both theoretical and practical reasons. To
achieve it, two alternative versions of a test need to be produced
which are in effect clones of each other.
•The more similar the results obtained when the two versions are administered to the same test population, the more reliable the versions are judged to be. Less frequently, reliability is checked by the test-retest method, where the same test is re-administered to the same sample population after a short intervening period of time.
•The concept of reliability is particularly important when considering language tests within the communicative paradigm (see Porter, 1983). For as Davies (1965, p. 14) stressed: 'reliability is the first essential for any test; but for certain kinds of language test it may be very difficult to achieve'.
•Linn and Gronlund (1995, p. 82) note the following points concerning reliability:
•Reliability refers to the results obtained with an assessment
instrument and not to the instrument itself. It means it is more
appropriate to speak of the reliability of test scores or of
assessment results than of the test or the assessment.
•An estimate of reliability always refers to a particular type of consistency. Assessment results are not reliable in general; they are reliable or generalizable over different periods of time, over different samples of tasks, over different raters, and the like. The appropriate type of consistency in a particular case is dictated by the use to be made of the results. For instance, if we are interested in what the test takers will be like at some future time, the consistency of scores over time will be important. Different interpretations thus require different analyses of consistency, and treating reliability as a general characteristic can lead to wrong interpretations.
•Reliability is a necessary but not sufficient
condition for validity.
To provide valid information about the performance being
measured, we need consistent assessment results. It does not
mean that highly consistent assessment results always measure
the right thing. In brief, reliability only provides the consistency
that makes validity possible.
•Reliability is primarily statistical.
To evaluate the consistency of the scores assigned by different raters, two or more raters must score the same set of student performances. Similarly, an evaluation of the consistency of the scores assigned to different forms of a test, or to different collections of performance-based assessment tasks, requires the administration of both test forms, or both collections of tasks, to an appropriate group of learners.
Whether the focus is on inter-rater consistency or on consistency across forms or collections of tasks, the consistency may be expressed in terms of shifts in the relative standing of persons in the group, or in terms of the amount of variation to be expected in an individual's score. In the first case it is reported by means of a correlation coefficient, called a reliability coefficient. In the second case it is reported by means of the standard error of measurement. Both methods are essential for the interpretation of assessment or test results.
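As a rough illustration (not part of the module's text), the sketch below computes both quantities for made-up paired scores from two raters; the formula SEM = S·sqrt(1 − r) is the standard classical estimate, and all data are hypothetical.

```python
# A minimal sketch of the two ways of reporting consistency:
# a reliability coefficient and the standard error of measurement (SEM).
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two paired lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rater_a = [12, 15, 9, 18, 14, 11, 16, 13]   # hypothetical scores from rater A
rater_b = [11, 16, 10, 17, 15, 10, 15, 14]  # same scripts scored by rater B

r = pearson(rater_a, rater_b)            # reliability coefficient
sem = pstdev(rater_a) * (1 - r) ** 0.5   # SEM = S * sqrt(1 - r)

print(f"reliability coefficient r = {r:.2f}")
print(f"standard error of measurement = {sem:.2f}")
```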
Methods of estimating reliability

Test-retest method
Type of reliability measure: measure of stability.
Procedure: give the same test to the same group with a time interval between administrations, from several minutes to several years.

Equivalent-forms method
Type of reliability measure: measure of equivalence.
Procedure: give two forms of the test to the same group in close succession.

Test-retest with equivalent forms
Type of reliability measure: measure of stability and equivalence.
Procedure: give two forms of the test to the same group with an increased time interval between forms.

Split-half method
Type of reliability measure: measure of internal consistency.
Procedure: give the test once. Score two equivalent halves of the test (e.g., odd items and even items); correct the correlation between the halves to fit the whole test by the Spearman-Brown formula.

Kuder-Richardson method and coefficient alpha
Type of reliability measure: measure of internal consistency.
Procedure: give the test once. Score the total test and apply the Kuder-Richardson formula.

Inter-rater method
Type of reliability measure: measure of consistency of ratings.
Procedure: give a set of student responses requiring judgmental scoring to two or more raters and have them independently score the responses.
The Classical True Score (CTS) model underlies the different methods of estimating the reliability of language tests. Within this theory, there are generally three major approaches to estimating reliability, each of which addresses different sources of measurement error: internal consistency estimates, stability estimates, and equivalence estimates.
1. Internal consistency estimates deal with the sources of errors
within the test and the scoring procedures.
2. Stability estimates show how consistent test scores are over
time.
3. Finally, equivalence estimates indicate how scores on alternate
forms of a test are equivalent.
5.2.1. Stability (test-retest reliability)
In order to compute the reliability estimate of cloze and dictation tests, we should use a method other than internal consistency estimates, because the test items are not independent of each other. In such situations, reliability can be computed by giving the test more than once to the same candidates. This approach is called the test-retest approach, and it provides an estimate of the stability of the test scores.
5.2.2. Internal consistency
Internal consistency indicates how consistent the performances of test takers on different parts of a given test are with each other. For instance, performance on the parts of a reading comprehension test might be inconsistent if the passages are of different lengths and difficulty levels.
5.2.2.1. Split-half reliability estimates
One way to estimate the internal consistency of a given test is the
split-half method, in which the designed test is divided into two
equal halves to determine whether the scores of these two halves
are consistent or not.
The two halves are independent of each other; in other words, an individual's performance on one half does not affect his or her performance on the other. The two halves are expected to correlate because they are assumed to test the same trait or ability. The simplest way to split the test into two halves is to divide it into first and second halves.
Because most language tests are designed as power tests, test items are ordered from easy to difficult; hence, after splitting the test this way, the first half would be quite easy and the second half quite difficult. This violates
• the premise of equivalence of the two halves, and
• the homogeneity of the items in the two halves.
Instead, we can split the test into random halves. Random selection is the basis of the odd-even method, in which all odd-numbered items are grouped in one half and all even-numbered items in the other.
This method is mostly applicable to tests that measure a single trait, for instance a multiple-choice test of vocabulary or grammar, whose items are normally independent of each other. Once the test is divided into two equal halves, it is re-scored, yielding two sets of scores, one for each half, for each test taker. In order to compute the reliability of the whole test, we first compute the correlation coefficient between the two sets of scores and then adjust it by means of the Spearman-Brown prophecy formula, also known as the Spearman-Brown prediction formula (Spearman, 1910; Brown, 1910).
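A minimal sketch of this procedure follows, assuming hypothetical 0/1 item data: the odd-even halves are correlated, and the half-test correlation is corrected to full length with the Spearman-Brown formula r_full = 2r / (1 + r).

```python
# Odd-even split-half estimate with the Spearman-Brown correction.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Rows = test takers, columns = items scored 0/1 (hypothetical data).
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 1],
]

odd  = [sum(row[0::2]) for row in items]  # scores on odd-numbered items
even = [sum(row[1::2]) for row in items]  # scores on even-numbered items

r_half = pearson(odd, even)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length
print(f"half-test r = {r_half:.2f}, corrected full-test r = {r_full:.2f}")
```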
"Reliability can be increased by pooling raters, using the Spearman-
Brown equation. ... If the reliability of a single rating is .50, then the
reliability of two, four, or six parallel ratings will be
approximately .67, .80, and .86, respectively" (Houston, Raymond, &
Svec, 1991, p. 409). ……. Averaging ratings (or using Spearman-Brown)
if one rater is, for example, systematically lenient, simply does not fit
the assumption.
If essays are each rated by two raters, one more lenient than the
other, the problem is like that of using two multiple choice tests of
unequal difficulty (nonparallel forms). Scores based on different
(unequated) test forms are not comparable. So it is with mixing lenient
and difficult raters; the reliability of the pooled ratings is incorrectly
estimated by the Spearman-Brown equation of classical test theory.
Matters are worse if each judge defines a construct a bit differently."
(Guion, R. M, 2011. p.477)
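The pooled-ratings figures quoted above can be checked with the general Spearman-Brown formula r_k = k·r / (1 + (k − 1)·r); the snippet below is a quick illustration, not taken from the sources.

```python
# Quick check of the quoted figures with the general Spearman-Brown formula.
def spearman_brown(r, k):
    """Predicted reliability of k pooled parallel ratings, each of reliability r."""
    return k * r / (1 + (k - 1) * r)

for k in (2, 4, 6):
    print(k, round(spearman_brown(0.50, k), 2))  # -> 0.67, 0.8, 0.86
```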
Charles Spearman and William Brown separately derived a formula
to predict the reliability of a test when its length was altered by the
addition or subtraction of parallel items. Each presented their
formula in 1910 in Volume Three of The British Journal of
Psychology. Because their articles were published at the same time,
the formula is known as the Spearman–Brown prophecy formula.
The basic premise of classical test theory is that an observed test
score consists of the sum of the following two components: true
score and error score. For any individual examinee, the two cannot
be separated, but for a group of examinees, the variance attributed
to each source can be estimated. Test reliability, which is an
important characteristic of test score quality, is defined as the ratio
of true score variance to observed score variance.
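In symbols (a standard statement of the classical model, not a formula given in the slides):

```latex
% Classical test theory: observed score = true score + error score
X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E
% Reliability as the ratio of true-score variance to observed-score variance
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
```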
5.3. True score and error score

•The classical true score (CTS) model (CTS measurement theory) involves a set of assumptions about the relationship between observed test scores and the factors that affect such scores. The first tenet of this model holds that an observed score, the score we normally get through common, everyday scoring procedures, consists of two components: a true score, which reflects the true ability of the candidate, and an error score, which is due to miscellaneous factors other than the ability being tested.
The second set of assumptions concerns the relationship between true and error scores. These tenets state that error scores are unsystematic, or random: such variations in measurement are not correlated with the true score. The CTS measurement theory thus defines two sources of variance: true score variance and error score variance. Systematic variations are reasonable or logical variations; if a candidate studies hard, he or she is going to do better in the second administration, and since the reason for such a variation is known, it is called systematic. If, however, the rater cannot find logical reasons for the observed variations, they are called unsystematic. Reliability is estimated in order to approximate the true scores of the candidates. True scores are difficult to calculate directly, since all tests and measurements contain error.
Another approach to estimating reliability from the split-half method is the Guttman split-half estimate (1945), which does not assume two equal halves and does not require computing a correlation coefficient.
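A minimal sketch of the Guttman estimate as it is usually stated, r = 2(1 − (Va + Vb) / Vt), follows; it uses only the variances of the two half scores and of the total score, with hypothetical data.

```python
# Guttman split-half estimate: no correlation coefficient needed.
from statistics import pvariance

def guttman_split_half(half_a, half_b):
    """r = 2 * (1 - (Va + Vb) / Vt), from half-score and total-score variances."""
    total = [a + b for a, b in zip(half_a, half_b)]
    va, vb, vt = pvariance(half_a), pvariance(half_b), pvariance(total)
    return 2 * (1 - (va + vb) / vt)

# Hypothetical half-test scores for six test takers.
print(round(guttman_split_half([3, 2, 4, 2, 3, 4], [3, 1, 3, 1, 3, 4]), 2))
```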
5.4. Kuder-Richardson reliability coefficients (KR-20)

Reliability estimates based on item variances call for splitting the test into halves in every way possible, computing a reliability coefficient for each split, and then finding the average of these coefficients. For instance, in a four-item test, the three possible splits would be (1) items 1 and 2 in one half and items 3 and 4 in the other; (2) items 1 and 3, and 2 and 4; (3) items 1 and 4, and 2 and 3. Using the split-half method, we could divide the test into a series of halves according to every possible combination of items and, for each division, correlate the scores from the two halves, obtaining a series of correlation coefficients.
This way of estimating reliability, which requires the use of the prophecy formula and Fisher Z transformations, is quite cumbersome. Instead, the Kuder-Richardson reliability coefficient (KR-20) allows us to arrive at the same conclusion more conveniently, without computing the reliability of every possible split-half combination: KR-20 = (k / (k − 1)) × (1 − Σpq / S²), where k is the number of items in the test, p is the proportion of test takers answering an item correctly, q = 1 − p, and S² is the total variance of the test scores.
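A minimal sketch of the KR-20 computation on hypothetical dichotomous item data follows; the function name and data are illustrative only.

```python
# KR-20 = (k / (k - 1)) * (1 - sum(p*q) / S2) on 0/1 item data.
from statistics import pvariance

def kr20(items):
    """items: list of rows, one per test taker, each item scored 0/1."""
    k = len(items[0])                     # number of items
    n = len(items)                        # number of test takers
    totals = [sum(row) for row in items]  # total score per test taker
    s2 = pvariance(totals)                # total score variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in items) / n  # proportion correct on item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / s2)

items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 1],
]
print(round(kr20(items), 2))
```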
Kuder-Richardson reliability coefficients (KR-21)

If the items are of nearly equal difficulty and independent of each other, the reliability coefficient can be computed using a formula that is both easier to compute and requires less information. KR-21 is a shortcut method, although it is less accurate than KR-20.
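The usual statement of the shortcut is KR-21 = (k / (k − 1)) × (1 − M(k − M) / (k·S²)), where M is the mean and S² the variance of the total scores; the sketch below, on made-up totals, is illustrative only.

```python
# KR-21 needs only the number of items k and the mean and variance of totals.
from statistics import mean, pvariance

def kr21(total_scores, k):
    m, s2 = mean(total_scores), pvariance(total_scores)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * s2))

# Hypothetical total scores on an 8-item test.
print(round(kr21([6, 3, 7, 3, 6], k=8), 2))
```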
5.5. Rater consistency
•Test scores may be obtained through subjective or objective evaluation. Scores that are obtained subjectively, as in oral interviews and compositions, bring with them an additional source of error: inconsistency in the ratings. In the case of a single rater, we are concerned with the consistency within that individual's ratings, or intra-rater reliability. When there are several raters, we want to examine the consistency among the raters, or inter-rater reliability.
With intra-rater reliability, the consistency of scores one rater
gives to the same candidates is sought. In order to do so, the
rater applies a set of criteria consistently in rating the language
performance of the candidates, whether spoken or written, in
order to yield a reliable set of ratings. In order to compute the
amount of reliability of such ratings, we need to obtain at least
two independent ratings.
This is normally carried out by rating the candidates at one time and then re-rating them after a time interval. Once we obtain the two sets of ratings, the reliability can be estimated in two ways. One way is to treat the two sets of ratings as scores from parallel tests and compute the correlation coefficient between the two sets of ratings.
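A minimal sketch of this first approach follows, with hypothetical ratings; the two occasions are simply correlated as parallel measures.

```python
# Intra-rater reliability: correlate the same rater's two rating occasions.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

first_rating  = [4, 3, 5, 2, 4, 3]  # same rater, first occasion (hypothetical)
second_rating = [4, 2, 5, 3, 4, 3]  # same scripts, re-rated later
print(f"intra-rater reliability = {pearson(first_rating, second_rating):.2f}")
```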
The crucial point of language testing is to identify the potential sources of error in measuring the communicative language ability of learners and to minimize the impact of such debilitating factors on the learner's true performance. As applied linguists, teachers, test designers, and raters, we ought to be concerned about errors of measurement, or unreliability, because we know that test performance is affected by factors other than the abilities we have decided to measure. Factors like test-wiseness, lack of motivation and interest, fatigue, poor health, bad mood, and poor conditions can affect learners' performance, yet they are not among the factors we intend to measure.
If we minimize such sources of measurement error, we maximize the reliability of our measures. When we maximize the reliability of our measures, we are satisfying a necessary condition for validity: in order for a test to be valid, it must first be reliable. The concerns of reliability and validity can thus be seen as leading to two complementary objectives in designing and developing tests: (1) to minimize the effects of measurement error, and (2) to maximize the effects of the language abilities we want to measure (Bachman, 1995).
"Thus, reliability refers to the consistency of measurements; that is, in
case they are repeated, almost the same results are obtained. This is
what every rater expects. If measurements of weight and height of
people are not consistent, people look funny. In fact, reliability of a test
is a statistical concept, which involves both logical analysis and empirical
research.
5.6. Concluding Remarks
The main concern is often, by necessity, with content, construct, and face validity, though the predictive and concurrent validity of tests should always be examined as far as circumstances allow. Validation may prove an ineffective effort unless care has also been taken over test reliability.
References
Bachman, L. F., 1995, Fundamental Considerations in Language Testing, Oxford: Oxford University Press.
Brown, H. D., 1994, Principles of Language Learning and Teaching, Englewood Cliffs, NJ: Prentice Hall Regents.
Cohen, A. D., 1991, 'Second Language Testing' in Celce-Murcia, M. (ed.), Teaching English as a Second or Foreign Language, Boston: Heinle & Heinle.
Guion, R. M., 2011, Assessment, Measurement, and Prediction for Personnel Decisions, 2nd edition, New York: Routledge.
Mousavi, S. A., 1997, A Dictionary of Language Testing, Tehran: Rahnama Publications.
Weir, C. J., 1990, Communicative Language Testing, Hemel Hempstead: Prentice Hall International (UK).
http://wps.prenhall.com/chet_miller_measurement_10/88/22580/5780717.cw/index.html