
In the name of God

Language Test Reliability


Teacher: Dr. Golshan

Prepared by: Tahere Bakhshi

November 2015
A test should have:
Reliability: it gives the same result under the same conditions.
Validity: it measures what it is meant to measure (a scale should measure the size of the head, not something else).
Usability or Practicality: it is not too difficult and is practical to use.

• The problem of measuring mental traits: language proficiency, motivation, and so on.
• Tests should measure these traits consistently.
Variance

Variance measures how far a set of numbers is spread out.

• Variance of zero: all values are identical.
• Small variance: values cluster close to the mean.
• High variance: values are spread out, far from the mean.
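As a quick illustration of these three cases, here is a minimal Python sketch using the standard-library statistics module; the score lists are invented for illustration.

```python
# Minimal sketch: population variance for three invented sets of scores.
from statistics import pvariance

identical = [15, 15, 15, 15]    # variance of zero: identical values
clustered = [14, 15, 15, 16]    # small variance: values close to the mean
spread    = [5, 10, 15, 25]     # high variance: values far from the mean

print(pvariance(identical))  # 0
print(pvariance(clustered))  # 0.5
print(pvariance(spread))     # 54.6875
```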
Sources of Variance

Meaningful Variance
Those sources that produce variance related to the purpose of the test.
To achieve this goal, the items must be related to the purpose of the designed test and to the students’ knowledge of the topic; this is a test validity issue (see Table 8.1, p. 170).

Measurement Error or Error Variance
Those sources that produce variance related to other, extraneous variables unrelated to the aim of the test.
Types of issues related to error variance

1. Variance due to the environment: noise, classroom temperature, outside noises, distractions, amount of space per person, lighting, ventilation, or other environmental factors.

2. Variance due to the administration procedures: the test directions, the quality of the equipment, and the timing (cassette player or teacher). (See Table 2.5, p. 35.)

3. Variance due to the examinees: the students’ condition (fatigue, health, hearing, or vision); psychological factors (motivation, memory, concentration, forgetfulness, impulsiveness, carelessness, etc.); and the students’ test-wiseness and strategies.

4. Variance due to the scoring procedures: errors made in scoring; the subjective nature of the scoring procedure.

5. Variance due to the test and test items: printing, prior knowledge of the answer sheet, number of items, item selection, quality of the test items, and test security.
The sources of measurement error mentioned above should be minimized so that they produce no error variance in students’ scores.
Reliability of NRTs (norm-referenced tests)
(Dependable or trustworthy)
A test is considered reliable if it gives us the same result over and over again.

How is reliability measured?

By comparing two sets of scores for a single assessment (such as two raters’ scores for the same person). Once we have two sets of scores for a group of students, we can determine how similar they are by computing a statistic known as the reliability coefficient.

Reliability Coefficient:
A numerical index of reliability, ranging from 0 to 1.

• A number closer to 1 = higher reliability; a low reliability coefficient indicates more error in the assessment results.
• Reliability is considered good or acceptable if the reliability coefficient is .80 or above.
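As a hedged sketch of how such a coefficient can be computed, the snippet below correlates two raters’ scores for the same students; the scores and the pearson_r helper are illustrative and not part of the original slides.

```python
# Sketch: a reliability coefficient as the Pearson correlation between
# two sets of scores (e.g. two raters scoring the same eight students).
from statistics import mean, pstdev

rater_a = [78, 85, 90, 62, 70, 88, 95, 73]   # invented scores
rater_b = [75, 88, 92, 60, 72, 85, 97, 70]

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

r = pearson_r(rater_a, rater_b)
print(f"reliability coefficient = {r:.2f}")   # .80 or above is usually acceptable
```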
Three Basic Strategies to Estimate the Reliability of a Test:
1. Test-Retest Reliability:
Situation: the same people take two administrations of the same test.
Procedure: correlate the scores on the two tests, which yields the coefficient of stability.
Meaning: the extent to which scores on a test can be generalized over different occasions (temporal stability).
Appropriate use: provides information about the stability of the trait over time.
Disadvantages: requires two testing sessions; learning and test effects.
2. Parallel / Equivalent-Forms Reliability:
Situation: the same people are tested on different but comparable forms of the test (Forms A & B).
Procedure: correlate the scores from the two tests, which yields a coefficient of equivalence.
Meaning: the consistency of responses to different item samples (where testing is immediate) and across occasions (where testing is delayed).
Appropriate use: to provide information about the equivalence of forms.

Example of comparable items from two forms:
Ali usually ………… late at night.   a. study   b. studies   c. studying
Reza often ………… the shopping in the afternoon.   a. do   b. does   c. doing
3. Internal Consistency Reliability:
• Situation: a single administration of one test form. All items in an internally consistent scale assess the same construct.
• Procedure: divide the test into comparable halves and correlate the scores from the two halves.
– Split-half with the Spearman-Brown adjustment
– Kuder-Richardson formulas #20 and #21
– Cronbach’s Alpha
• Meaning: consistency across the parts of a measuring instrument (“parts” = individual items or subgroups of items).
• Appropriate use: where the focus is on the degree to which the same characteristic is being measured; a measure of test homogeneity.
Internal Consistency Strategies
All items in the test should be homogeneous, and there should be a relationship among them.

• Split-Half Reliability
• Cronbach’s Alpha
• Kuder-Richardson Formulas
Split-Half Reliability
In split-half reliability we randomly divide all the items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores; in the example it is .87.

Odd/even items are used, with easy and difficult items equally distributed between the halves.


Spearman-Brown Prophecy Formula

\[ r_{kk} = \frac{k \cdot r_{11}}{1 + (k - 1)\, r_{11}} \]

where k = the number of items I WANT to estimate the reliability for, divided by the number of items I HAVE.
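Below is a hedged sketch of how this formula is used together with a split-half estimate: the half-test correlation r_11 is stepped up with k = 2, because the full test has twice as many items as each half. The 0/1 item scores are invented, and statistics.correlation requires Python 3.10+.

```python
# Sketch: split-half reliability (odd/even items) with Spearman-Brown adjustment.
from statistics import correlation   # Python 3.10+

def spearman_brown(r, k):
    """Predicted reliability of a test k times as long as one with reliability r."""
    return (k * r) / (1 + (k - 1) * r)

# item_scores[s][i] = 1 if student s answered item i correctly, else 0 (invented data)
item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
]

odd_totals  = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, 7
even_totals = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, 8

r_half = correlation(odd_totals, even_totals)   # reliability of half the test
r_full = spearman_brown(r_half, k=2)            # the whole test is twice as long
print(f"half-test r = {r_half:.2f}, full-test r = {r_full:.2f}")
```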
Cronbach’s Alpha
Cronbach’s coefficient alpha is used when the item scores are other than 0 and 1 (such as Likert-scale items). It is advisable for essay items, problem-solving items, and 5-point-scale items. It is based on two or more parts of the test and requires only one administration of the test.
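A minimal sketch of how coefficient alpha is computed for non-0/1 item scores; the 5-point responses below are invented for illustration.

```python
# Sketch: Cronbach's alpha for Likert-type items (5 students x 4 items, invented data).
from statistics import pvariance

scores = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 5, 4, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
]

k = len(scores[0])                                     # number of items
item_vars = [pvariance(col) for col in zip(*scores)]   # variance of each item
total_var = pvariance([sum(row) for row in scores])    # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")   # about .94 for these invented data
```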
Kuder-Richardson Formulas
Kuder and Richardson believed that all items in a test are designed to measure a single trait.

• KR-21 is the most practical, most frequently used, and most convenient method of estimating reliability.
• KR-20 is most advisable if the p values (item difficulties) vary a lot.
• KR-21 is most advisable if the items do not vary much in difficulty, i.e., the p values are more or less similar.
• The KR-21 formula is a simplified version of KR-20.
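A sketch of both formulas on invented 0/1 data, showing why KR-21 is the simpler of the two: it needs only the mean and variance of the total scores, not each item’s p value.

```python
# Sketch: KR-20 and KR-21 for 0/1-scored items (invented data, 5 students x 8 items).
from statistics import mean, pvariance

item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
]

k = len(item_scores[0])
totals = [sum(row) for row in item_scores]
var_total = pvariance(totals)

# KR-20: uses each item's difficulty p (proportion correct) and q = 1 - p
p_values = [mean(col) for col in zip(*item_scores)]
kr20 = (k / (k - 1)) * (1 - sum(p * (1 - p) for p in p_values) / var_total)

# KR-21: uses only the mean total score; assumes items are similar in difficulty
m = mean(totals)
kr21 = (k / (k - 1)) * (1 - m * (k - m) / (k * var_total))

print(f"KR-20 = {kr20:.2f}, KR-21 = {kr21:.2f}")
```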
Inter-rater Reliability
Having a sample of test papers (essays) scored independently by two examiners.

Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
Intra-rater Reliability
The degree of stability observed when a measurement is repeated
under identical conditions by the same rater.
• Note: intra-rater reliability makes it possible to determine the degree to which the results obtained by a measurement procedure can be replicated.
Standard Error of Measurement
• All test scores contain some error.
• For any test, the higher the reliability estimate, the lower the error.
• The standard error of measurement is the average standard deviation of the error variance over the number of people in the sample.
• It can be used to estimate a range within which a true score would likely fall.
• We never know the true score.
• By knowing the S.E.M. and by understanding the normal curve, we can assess the likelihood of the true score being within certain limits.
• The higher the reliability, the lower the standard error of measurement, and hence the more confidence we can place in the accuracy of a person’s test score.
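These points can be made concrete with the common formula SEM = SD × √(1 − reliability), which the slide itself does not spell out; the numbers below are invented for illustration.

```python
# Sketch: standard error of measurement and a likely range for the true score.
import math

sd = 10.0           # standard deviation of the observed scores (invented)
reliability = 0.84  # reliability coefficient of the test (invented)
observed = 72       # one student's observed score (invented)

sem = sd * math.sqrt(1 - reliability)   # higher reliability -> lower SEM

# Using the normal curve: about 68% of the time the true score lies within
# 1 SEM of the observed score, and about 95% of the time within 2 SEMs.
print(f"SEM = {sem:.1f}")
print(f"68% band: {observed - sem:.1f} to {observed + sem:.1f}")
print(f"95% band: {observed - 2 * sem:.1f} to {observed + 2 * sem:.1f}")
```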
Factors That Affect the Reliability Coefficient
• Test length
• Range of scores
• Item similarity
Questions?
