Reliability and Validity
RELIABILITY
Measurement Error
• the difference between the observed score and the true score
Sources of measurement error:
1. Item Selection
2. Test Administration
3. Test Scoring
FORMS OF RELIABILITY
Test-retest Reliability
• a measure of stability; the most straightforward of all
• established by comparing the scores obtained from two successive
measurements of the same individuals and calculating a
correlation between the two sets of scores.
• also known as time-sampling reliability since it measures the
error associated with administering a test at two different times
• commonly used when measuring traits or characteristics that are
relatively stable over time (e.g., IQ, EQ, personality)
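A minimal sketch of how a test-retest coefficient could be computed, assuming hypothetical scores for ten examinees tested on two occasions (the data and the two-week interval are illustrative, not from the source):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical IQ scores for the same 10 examinees, tested about two weeks apart
time1 = np.array([98, 105, 110, 92, 120, 101, 88, 115, 107, 95])
time2 = np.array([100, 103, 112, 90, 118, 99, 91, 113, 110, 97])

# The correlation between the two administrations is the test-retest coefficient
r, _ = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")
```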
Limitations of Test-Retest
• Carryover Effect - occurs when the first testing session
influences the results of the second session.
FORMS OF RELIABILITY
Parallel-Forms Reliability
• a measure of equivalence
• also known as item sampling reliability or alternate
form reliability since it compares two equivalent forms of a
test that measure the same attribute to make sure that the
items indeed assess a specific characteristic
• the correlation between the scores obtained on the two forms
represents the reliability coefficient of the test
FORMS OF RELIABILITY
Inter-rater Reliability
• the degree of agreement between two or more raters who score the
same individuals or behavior
Consider a job performance assessment scored by office managers. If the
employee being rated received a score of 9 (where 10 is a perfect score)
from three managers, and a score of 2 from another manager, then
inter-rater reliability could be used to determine that something
is wrong with the method of scoring. There could be many
explanations for this lack of consensus (managers did not
understand how the scoring system worked and did it incorrectly,
the low-score manager had a grudge against the employee, etc.)
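One simple way to quantify this kind of disagreement is to correlate each pair of raters across several employees. The sketch below uses made-up ratings in which a fourth manager deliberately mirrors the scenario above; more formal indices (such as the intraclass correlation) exist, this only illustrates the idea.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical ratings: rows = 6 employees, columns = 4 managers (1-10 scale).
# The fourth manager consistently disagrees with the other three.
ratings = np.array([
    [9, 9, 8, 2],
    [7, 8, 7, 3],
    [5, 6, 5, 9],
    [8, 8, 9, 2],
    [6, 5, 6, 8],
    [9, 8, 9, 1],
])

# Average pairwise correlation between raters as a rough inter-rater index
pairs = list(combinations(range(ratings.shape[1]), 2))
rs = [pearsonr(ratings[:, i], ratings[:, j])[0] for i, j in pairs]
print("pairwise rater correlations:", np.round(rs, 2))
print("mean inter-rater correlation:", round(float(np.mean(rs)), 2))
```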
FORMS OF RELIABILITY
Kuder-Richardson
- measures the homogeneity of the items in a test; mean of all the
possible split-half correlations
- Kuder-Richardson 20 (KR-20)
– for dichotomous data
- Kuder-Richardson 21 (KR-21)
– for items that do not vary much in difficulty
- Cronbach’s Alpha
– for non-dichotomous data
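As an illustration, both KR-20 and Cronbach's alpha can be computed directly from an item-response matrix (rows = respondents, columns = items). The function names and response data below are hypothetical, not from the source:

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomous (0/1) items: k/(k-1) * (1 - sum(p*q) / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion passing each item
    q = 1 - p
    total_var = items.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

def cronbach_alpha(items):
    """Cronbach's alpha for non-dichotomous items: k/(k-1) * (1 - sum(item variances) / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of 5 examinees to a 4-item dichotomous test
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
print("KR-20 =", round(kr20(responses), 2))
```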
FORMS OF VALIDITY
Face Validity
• the extent to which a test appears, on superficial inspection, to
measure what it claims to measure
Content Validity
• inspection of items through expert judgment
• a panel of experts can review the test items and rate them in
terms of how closely they match the objective or domain
specification
• if the test items adequately represent the domain of possible
items for a variable, then the test has good content validity
FORMS OF VALIDITY
Criterion Validity
• involves the correlation between the test scores and scores on
some measure representing the criterion of interest.
• the R (validity coefficient) can be computed between the
scores on the test being validated (predictor) and the scores
on the criterion
TYPES OF CRITERION VALIDITY
• Predictive Validity
- the extent to which a test score obtained from a measure
accurately predicts scores on a criterion measure
EXAMPLE
1. The correlation between the high-school grade point average
(GPA) and the GPA achieved in the 1st year of studies in a
university.
2. The correlation between an applicant's scores on a job
performance questionnaire and the applicant's actual job
performance rating six months after being hired.
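For instance, the validity coefficient in example 1 is simply the correlation between the predictor and the criterion; a minimal sketch with hypothetical GPA data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: high-school GPA (predictor) and first-year university GPA (criterion)
hs_gpa  = np.array([3.2, 3.8, 2.9, 3.5, 3.9, 2.5, 3.0, 3.6])
uni_gpa = np.array([3.0, 3.7, 2.7, 3.3, 3.8, 2.4, 3.1, 3.4])

r, _ = pearsonr(hs_gpa, uni_gpa)   # the validity coefficient
print(f"predictive validity coefficient = {r:.2f}")
```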
TYPES OF CRITERION VALIDITY
• Concurrent Validity
- measures how well a new test compares to a well-
established test
EXAMPLE
1. A psychologist creates a test for anxiety and compares it to a
well-established anxiety scale to check its validity. She administers
her self-made test today and the established scale the following day;
for concurrent validity, the two should be given at about the same time.
2. The correlation between the laboratory test scores and
paper test scores of Medtech students.
FORMS OF VALIDITY
Construct Validity
• A test has good construct validity if there is an existing
psychological theory which can support what the test items
are measuring
• Establishing construct validity involves both logical
analysis and empirical data
TYPES OF CONSTRUCT VALIDITY
• Convergent Validity
- takes two measures that are supposed to be measuring
the same construct and shows that they are related.
EXAMPLE
• Let’s say you were researching depression in college students.
In order to measure depression (the construct), you use two
measurements: a survey and participant observation. If the
scores from your two measurements are close enough (i.e.
they converge), this demonstrates that they are measuring
the same construct. If they don’t converge, this could indicate
they are measuring different constructs (for example, anger
and depression or self-worth and depression).
TYPES OF CONSTRUCT VALIDITY
• Divergent Validity
- shows that an instrument is only weakly correlated with instruments
that measure different variables
EXAMPLE
• Your newly constructed psychological test about self-esteem should
have a weak correlation with a test which measures gender
identity
• A test which measures spelling ability should have a low
correlation with a test of abstract reasoning
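Both ideas can be checked with a simple correlation matrix. The sketch below uses made-up scores for two depression measures and one unrelated self-esteem measure; the numbers are illustrative only:

```python
import numpy as np

# Hypothetical scores for 8 students on three measures
survey      = np.array([12, 18, 7, 22, 15, 9, 20, 14])   # depression survey
observation = np.array([11, 17, 8, 21, 14, 10, 19, 13])  # depression observation rating
self_esteem = np.array([32, 30, 27, 31, 25, 28, 27, 32]) # unrelated construct

corr = np.corrcoef([survey, observation, self_esteem])

# Convergent validity: the two depression measures should correlate highly
print("survey vs observation:", round(corr[0, 1], 2))
# Divergent validity: depression and self-esteem should correlate weakly
print("survey vs self-esteem:", round(corr[0, 2], 2))
```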
Error
• Random Error
- occurs when the value of what is being measured unpredictably
goes up or down from one measurement to the next. A very simple
example is our blood pressure.
• Systematic Error
- occurs when the error consistently pushes measurements in the same
direction. For example, this could happen with blood pressure
measurements if, just before the measurements were to be made,
something always or often caused the blood pressure to go up
(a simulation contrasting the two kinds of error appears at the end
of this section).
• Construct underrepresentation
- failure to capture important components of a construct.
(e.g., an English test which contains only vocabulary items but
no grammar items will have poor content validity)
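A small simulation can make the contrast between random and systematic error concrete; the blood-pressure values and error sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_bp = 120.0   # a person's "true" systolic blood pressure
n = 1000

# Random error: readings scatter in both directions and tend to cancel out
random_readings = true_bp + rng.normal(0, 5, n)

# Systematic error: something (e.g. stress just before measuring) pushes every reading up
biased_readings = true_bp + 8 + rng.normal(0, 5, n)

print("mean with random error only :", round(random_readings.mean(), 1))  # ~120
print("mean with systematic error  :", round(biased_readings.mean(), 1))  # ~128
```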