Reliability and Validity
RELIABILITY
Measurement Error
• the difference between the observed score and the true score
Sources of measurement error:
1. Item Selection
2. Test Administration
3. Test Scoring
FORMS OF RELIABILITY
Test-retest Reliability
• a measure of stability; the most straightforward of all
• established by comparing the scores obtained from two successive
measurements of the same individuals and calculating a
correlation between the two sets of scores.
• also known as time-sampling reliability since it measures the
error associated with administering a test at two different times
• commonly used when measuring traits or characteristics that are
relatively stable over time (e.g., IQ, EQ, personality)
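A minimal sketch of how a test-retest coefficient could be computed, assuming hypothetical scores for ten examinees tested on two occasions (the data and the two-week interval are illustrative, not from the source):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical IQ scores for the same 10 examinees, tested about two weeks apart
time1 = np.array([98, 105, 110, 92, 120, 101, 88, 115, 107, 95])
time2 = np.array([100, 103, 112, 90, 118, 99, 91, 113, 110, 97])

# The correlation between the two administrations is the test-retest coefficient
r, _ = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")
```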
Limitations of Test-Retest
• Carryover Effect - occurs when the first testing session
influences the results of the second session.
FORMS OF RELIABILITY
Parallel-Forms Reliability
• a measure of equivalence
• also known as item sampling reliability or alternate
form reliability since it compares two equivalent forms of a
test that measure the same attribute to make sure that the
items indeed assess a specific characteristic
• the correlation between the scores obtained on the two forms
represents the reliability coefficient of the test
FORMS OF RELIABILITY
Inter-rater Reliability
• the degree of agreement between two or more raters who score the
same individuals or behavior
Consider a job performance assessment scored by office managers. If the
employee being rated received a score of 9 (where 10 is a perfect score)
from three managers, and a score of 2 from another manager, then
inter-rater reliability could be used to determine that something
is wrong with the method of scoring. There could be many
explanations for this lack of consensus (managers did not
understand how the scoring system worked and did it incorrectly,
the low-score manager had a grudge against the employee, etc.)
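One simple way to quantify this kind of disagreement is to correlate each pair of raters across several employees. The sketch below uses made-up ratings in which a fourth manager deliberately mirrors the scenario above; more formal indices (such as the intraclass correlation) exist, this only illustrates the idea.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical ratings: rows = 6 employees, columns = 4 managers (1-10 scale).
# The fourth manager consistently disagrees with the other three.
ratings = np.array([
    [9, 9, 8, 2],
    [7, 8, 7, 3],
    [5, 6, 5, 9],
    [8, 8, 9, 2],
    [6, 5, 6, 8],
    [9, 8, 9, 1],
])

# Average pairwise correlation between raters as a rough inter-rater index
pairs = list(combinations(range(ratings.shape[1]), 2))
rs = [pearsonr(ratings[:, i], ratings[:, j])[0] for i, j in pairs]
print("pairwise rater correlations:", np.round(rs, 2))
print("mean inter-rater correlation:", round(float(np.mean(rs)), 2))
```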
FORMS OF RELIABILITY
Kuder-Richardson
- measures the homogeneity of the items in a test; mean of all the
possible split-half correlations
- Kuder-Richardson 20 (KR-20)
– for dichotomous data
- Kuder-Richardson 21 (KR-21)
– for items that do not vary much in difficulty
- Cronbach’s Alpha
– for non-dichotomous data
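As an illustration, both KR-20 and Cronbach's alpha can be computed directly from an item-response matrix (rows = respondents, columns = items). The function names and response data below are hypothetical, not from the source:

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomous (0/1) items: k/(k-1) * (1 - sum(p*q) / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion passing each item
    q = 1 - p
    total_var = items.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

def cronbach_alpha(items):
    """Cronbach's alpha for non-dichotomous items: k/(k-1) * (1 - sum(item variances) / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of 5 examinees to a 4-item dichotomous test
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
print("KR-20 =", round(kr20(responses), 2))
```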
FORMS OF VALIDITY
Face Validity
• the extent to which a test appears, on superficial inspection, to
measure what it claims to measure
Content Validity
• inspection of items through expert judgment
• a panel of experts can review the test items and rate them in
terms of how closely they match the objective or domain
specification
• if the test items adequately represent the domain of possible
items for a variable, then the test has good content validity
FORMS OF VALIDITY
Criterion Validity
• involves the correlation between the test scores and scores on
some measure representing the criterion of interest.
• the R (validity coefficient) can be computed between the
scores on the test being validated (predictor) and the scores
on the criterion
TYPES OF CRITERION VALIDITY
• Predictive Validity
- the extent to which a test score obtained from a measure
accurately predicts scores on a criterion measure
EXAMPLE
1. The correlation between the high-school grade point average
(GPA) and the GPA achieved in the 1st year of studies in a
university.
2. The correlation between an applicant's scores on a job
performance questionnaire and the applicant's actual job
performance rating six months after being hired.
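For instance, the validity coefficient in example 1 is simply the correlation between the predictor and the criterion; a minimal sketch with hypothetical GPA data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: high-school GPA (predictor) and first-year university GPA (criterion)
hs_gpa  = np.array([3.2, 3.8, 2.9, 3.5, 3.9, 2.5, 3.0, 3.6])
uni_gpa = np.array([3.0, 3.7, 2.7, 3.3, 3.8, 2.4, 3.1, 3.4])

r, _ = pearsonr(hs_gpa, uni_gpa)   # the validity coefficient
print(f"predictive validity coefficient = {r:.2f}")
```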
TYPES OF CRITERION VALIDITY
• Concurrent Validity
- measures how well a new test compares to a well-
established test
EXAMPLE
1. A psychologist creates a test for anxiety and compares it to a
well-established anxiety scale to check its validity. She administers
her self-made test today and the established scale the following day;
for concurrent validity, the two should be given at about the same time.
2. The correlation between the laboratory test scores and
paper test scores of Medtech students.
FORMS OF VALIDITY
Construct Validity
• A test has good construct validity if there is an existing
psychological theory which can support what the test items
are measuring
• Establishing construct validity involves both logical
analysis and empirical data
TYPES OF CONSTRUCT VALIDITY
• Convergent Validity
- takes two measures that are supposed to be measuring
the same construct and shows that they are related.
EXAMPLE
• Let’s say you were researching depression in college students.
In order to measure depression (the construct), you use two
measurements: a survey and participant observation. If the
scores from your two measurements are close enough (i.e.
they converge), this demonstrates that they are measuring
the same construct. If they don’t converge, this could indicate
they are measuring different constructs (for example, anger
and depression or self-worth and depression).
TYPES OF CONSTRUCT VALIDITY
• Divergent Validity
- shows that an instrument is only weakly correlated with instruments
that measure different variables
EXAMPLE
• Your newly constructed psychological test about self-esteem should
have a weak correlation with a test which measures gender
identity
• A test which measures spelling ability should have a low
correlation with a test of abstract reasoning
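Both ideas can be checked with a simple correlation matrix. The sketch below uses made-up scores for two depression measures and one unrelated self-esteem measure; the numbers are illustrative only:

```python
import numpy as np

# Hypothetical scores for 8 students on three measures
survey      = np.array([12, 18, 7, 22, 15, 9, 20, 14])   # depression survey
observation = np.array([11, 17, 8, 21, 14, 10, 19, 13])  # depression observation rating
self_esteem = np.array([32, 30, 27, 31, 25, 28, 27, 32]) # unrelated construct

corr = np.corrcoef([survey, observation, self_esteem])

# Convergent validity: the two depression measures should correlate highly
print("survey vs observation:", round(corr[0, 1], 2))
# Divergent validity: depression and self-esteem should correlate weakly
print("survey vs self-esteem:", round(corr[0, 2], 2))
```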
Error
• Random Error
- occurs when the value of what is being measured unpredictably
goes up or down from one measurement to the next. A very simple
example is our blood pressure.
• Systematic Error
- occurs when the error consistently pushes measurements in the same
direction. For example, this could happen with blood pressure
measurements if, just before the measurements were to be made,
something always or often caused the blood pressure to go up
(a simulation contrasting the two kinds of error appears at the end
of this section).
• Construct underrepresentation
- failure to capture important components of a construct.
(e.g., an English test which contains only vocabulary items but
no grammar items will have poor content validity)
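A small simulation can make the contrast between random and systematic error concrete; the blood-pressure values and error sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_bp = 120.0   # a person's "true" systolic blood pressure
n = 1000

# Random error: readings scatter in both directions and tend to cancel out
random_readings = true_bp + rng.normal(0, 5, n)

# Systematic error: something (e.g. stress just before measuring) pushes every reading up
biased_readings = true_bp + 8 + rng.normal(0, 5, n)

print("mean with random error only :", round(random_readings.mean(), 1))  # ~120
print("mean with systematic error  :", round(biased_readings.mean(), 1))  # ~128
```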