VALIDITY

A. Concept of validity
Validity, as applied to a test, is a judgment or estimate of how well a test
measures what it purports to measure in a particular context. More specifically, it is a
judgment based on evidence about the appropriateness of inferences drawn from test
scores. An inference is a logical result or deduction. Characterizations of the validity
of tests and test scores are frequently phrased in terms such as “acceptable” or
“weak.” These terms reflect a judgment about how adequately the test measures what
it purports to measure.
Validation is the process of gathering and evaluating evidence about validity.
Both the test developer and the test user may play a role in the validation of a test for
a specific purpose. It is the test developer’s responsibility to supply validity evidence
in the test manual.
One way measurement specialists have traditionally conceptualized validity is
according to three categories:
1. Content validity. This is a measure of validity based on an evaluation of the
subjects, topics, or content covered by the items in the test.
2. Criterion-related validity. This is a measure of validity obtained by evaluating
the relationship of scores obtained on the test to scores on other tests or
measures.
3. Construct validity. This is a measure of validity that is arrived at by executing
a comprehensive analysis of
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical
framework for understanding the construct that the test was designed to
measure.
In this classic conception of validity, referred to as the trinitarian view (Guion,
1980), it might be useful to visualize construct validity as being “umbrella validity”
because every other variety of validity falls under it. Why construct validity is the
overriding variety of validity will become clear as we discuss what makes a test valid
and the methods and procedures used in validation. Indeed, there are many ways of
approaching the process of test validation, and these different plans of attack are often
referred to as strategies. We speak, for example, of content validation strategies,
criterion-related validation strategies, and construct validation strategies.
a. Face Validity
Face validity relates more to what a test appears to measure to the person
being tested than to what the test actually measures. Face validity is a judgment
concerning how relevant the test items appear to be. In reality, a test that lacks
face validity may still be relevant and useful. However, if the test is not perceived
as relevant and useful by testtakers, parents, legislators, and others, then negative
consequences may result. These consequences may range from poor testtaker
attitude to lawsuits filed by disgruntled parties against a test user and test
publisher. Ultimately, face validity may be more a matter of public relations than
psychometric soundness. Still, it is important nonetheless, and (much like Rodney
Dangerfield) deserving of respect.

b. Content Validity
Content validity describes a judgment of how adequately a test samples
behavior representative of the universe of behavior that the test was designed to
sample. For example, the universe of behavior referred to as assertive is very
wide-ranging. A content-valid, paper-and-pencil test of assertiveness would be
one that is adequately representative of this wide range.
With respect to educational achievement tests, it is customary to consider a
test a content-valid measure when the proportion of material covered by the test
approximates the proportion of material covered in the course. A cumulative
final exam in introductory statistics would be considered content-valid if the
proportion and type of introductory statistics problems on the test approximates
the proportion and type of introductory statistics problems presented in the
course.
The early stages of a test being developed for use in the classroom—be it one
classroom or those throughout the state or the nation—typically entail research
exploring the universe of possible instructional objectives for the course.
Included among the many possible sources of information on such objectives are
course syllabi, course textbooks, teachers of the course, specialists who develop
curricula, and professors and supervisors who train teachers in the particular
subject area.
B. Criterion-Related Validity
Criterion-related validity is a judgment of how adequately a test score can be
used to infer an individual’s most probable standing on some measure of interest
—the measure of interest being the criterion. Two types of validity evidence are
subsumed under the heading criterion-related validity. Concurrent validity is an
index of the degree to which a test score is related to some criterion measure
obtained at the same time (concurrently). Predictive validity is an index of the
degree to which a test score predicts some criterion measure. Before we discuss
each of these types of validity evidence in detail, it seems appropriate to raise
(and answer) an important question.
Characteristics of a criterion An adequate criterion is relevant. By this
we mean that it is pertinent or applicable to the matter at hand. We would
expect, for example, a test purporting to advise testtakers whether they
share the same interests as successful actors to have been validated using
the interests of successful actors as a criterion.
An adequate criterion is also uncontaminated. Criterion contamination is
the term applied when a criterion measure is itself based, at least in
part, on predictor measures. Here is an example. Suppose that a
team of researchers from a company called Ventura International Psychiatric
Research (VIPR) just completed a study of how accurately a test called the
MMPI-2-RF predicted psychiatric diagnosis in the psychiatric population of
the Minnesota state hospital system. As we will see in Chapter 12, the
MMPI-2-RF is, in fact, a widely used test. In this study, the predictor is the
MMPI-2-RF, and the criterion is the psychiatric diagnosis that exists in the
patient’s record. Further, let’s suppose that while all the data are being
analyzed at VIPR headquarters, someone informs these researchers that the
diagnosis for every patient in the Minnesota state hospital system was
determined, at least in part, by an MMPI-2-RF test score. Should they still
proceed with their analysis? The answer is no. Because the predictor
measure has contaminated the criterion measure, it would be of little value to
find, in essence, that the predictor can indeed predict itself.
a. Concurrent Validity
If test scores are obtained at about the same time as the criterion
measures are obtained, measures of the relationship between the test scores
and the criterion provide evidence of concurrent validity. Statements of
concurrent validity indicate the extent to which test scores may be used to
estimate an individual’s present standing on a criterion. If, for example,
scores (or classifications) made on the basis of a psychodiagnostic test were
to be validated against a criterion of already diagnosed psychiatric patients,
then the process would be one of concurrent validation. In general, once the
validity of the inference from the test scores is established, the test may
provide a faster, less expensive way to offer a diagnosis or a classification
decision. A test with satisfactorily demonstrated concurrent validity may
therefore be appealing to prospective users because it holds out the potential
of savings of money and professional time.
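In concrete terms, "measures of the relationship" between test scores and a
criterion are usually correlation coefficients. The following minimal sketch,
in Python with entirely invented scores, shows how a concurrent validity
coefficient might be computed as a Pearson r between scores on a new test and
scores on an established criterion measure administered in the same session;
the data and variable names are hypothetical.

# Minimal sketch (hypothetical data): concurrent validity quantified as the
# Pearson r between new-test scores and criterion scores obtained at the
# same time.
import numpy as np
from scipy import stats

test_scores = np.array([12, 18, 25, 31, 36, 44, 49, 55])  # new, brief test
criterion = np.array([15, 20, 22, 35, 33, 47, 50, 58])    # established measure, same session

r, p = stats.pearsonr(test_scores, criterion)
print(f"validity coefficient r = {r:.2f} (p = {p:.3f})")

A high, statistically reliable r of this kind is what would make the faster,
less expensive test attractive as a stand-in for the costlier criterion
measure.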
b. Predictive Validity
Test scores may be obtained at one time and the criterion measures
obtained at a future time, usually after some intervening event has taken
place. The intervening event may take varied forms, such as training,
experience, therapy, medication, or simply the passage of time. Measures of
the relationship between the test scores and a criterion measure obtained at a
future time provide an indication of the predictive validity of the test; that is,
how accurately scores on the test predict some criterion measure. Measures
of the relationship between college admissions tests and freshman grade
point averages, for example, provide evidence of the predictive validity of
the admissions tests. Several related concepts are useful in evaluating such
predictions. A base rate is the extent to which a particular
trait, behavior, characteristic, or attribute exists in the population (expressed
as a proportion). In psychometric parlance, a hit rate may be defined as the
proportion of people a test accurately identifies as possessing or exhibiting a
particular trait, behavior, characteristic, or attribute. For example, hit rate
could refer to the proportion of people accurately predicted to be able to
perform work at the graduate school level or to the proportion of
neurological patients accurately identified as having a brain tumor. In like
fashion, a miss rate may be defined as the proportion of people the test fails
to identify as having, or not having, a particular characteristic or attribute.
Here, a miss amounts to an inaccurate prediction. The category of misses
may be further subdivided. A false positive is a miss wherein the test
predicted that the testtaker did possess the particular characteristic or
attribute being measured when in fact the testtaker did not. A false negative
is a miss wherein the test predicted that the testtaker did not possess the
particular characteristic or attribute being measured when the testtaker
actually did.
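These definitions can be made concrete with a small numerical sketch. Assuming
invented counts from a hypothetical validation study in which test predictions
are compared against known criterion status, the base rate, hit rate, and miss
rate fall out of a simple two-by-two tally (following the text's usage, a hit
is any accurate identification and a miss is any inaccurate prediction):

# Minimal sketch (invented counts): base rate, hit rate, miss rate, and the
# two kinds of misses, from a 2x2 tally of test prediction vs. criterion.
true_positives = 40    # test predicted the attribute; person actually has it
false_positives = 10   # test predicted the attribute; person does not have it
false_negatives = 15   # test predicted no attribute; person actually has it
true_negatives = 135   # test predicted no attribute; person does not have it

total = true_positives + false_positives + false_negatives + true_negatives

base_rate = (true_positives + false_negatives) / total   # who truly has the attribute
hit_rate = (true_positives + true_negatives) / total     # accurate identifications
miss_rate = (false_positives + false_negatives) / total  # inaccurate predictions

print(f"base rate = {base_rate:.3f}")   # 0.275
print(f"hit rate  = {hit_rate:.3f}")    # 0.875
print(f"miss rate = {miss_rate:.3f}")   # 0.125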
C. Construct Validity
Construct validity is a judgment about the appropriateness of inferences
drawn from test scores regarding individual standings on a variable called a
construct. A construct is an informed, scientific idea developed or hypothesized
to describe or explain behavior. Intelligence is a construct that may be invoked to
describe why a student performs well in school. Anxiety is a construct that may
be invoked to describe why a psychiatric patient paces the floor. Other examples
of constructs are job satisfaction, personality, bigotry, clerical aptitude,
depression, motivation, self-esteem, emotional adjustment, potential
dangerousness, executive potential, creativity, and mechanical comprehension, to
name but a few.

a. Evidence of Construct Validity
A number of procedures may be used to provide different kinds of evidence
that a test has construct validity. The various techniques of construct validation
may provide evidence, for example, that
■ the test is homogeneous, measuring a single construct;
■ test scores increase or decrease as a function of age, the passage of time, or an
experimental manipulation as theoretically predicted;
■ test scores obtained after some event or the mere passage of time (that is,
posttest scores) differ from pretest scores as theoretically predicted;
■ test scores obtained by people from distinct groups vary as predicted by the
theory;
■ test scores correlate with scores on other tests in accordance with what would
be predicted from a theory that covers the manifestation of the construct in
question.

A brief discussion of each type of construct validity evidence and the procedures
used to obtain it follows.

Evidence of homogeneity When describing a test and its items, homogeneity
refers to how uniform a test is in measuring a single concept. A test developer
can increase test homogeneity in several ways. Consider, for example, a test of
academic achievement that contains subtests in areas such as mathematics,
spelling, and reading comprehension. The Pearson r could be used to correlate
average subtest scores with the average total test score. Subtests that in the
test developer's judgment do not correlate very well with the test as a whole
might have to be reconstructed (or eliminated) lest the test not measure the
construct academic achievement. Correlations between subtest scores and total
test score are generally reported in the test manual as evidence of
homogeneity.
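The mechanics of this check are straightforward, as the minimal sketch below
illustrates with fabricated subtest scores. Because each subtest contributes
to the total, the sketch correlates each subtest with the total minus that
subtest, a common refinement that avoids spurious inflation of the
correlation.

# Minimal sketch (fabricated scores): subtest-to-total Pearson correlations
# as evidence of homogeneity. Each array position is one examinee.
import numpy as np
from scipy import stats

math = np.array([10, 14, 9, 16, 12, 18, 11, 15])
spelling = np.array([8, 13, 7, 15, 11, 17, 10, 14])
reading = np.array([9, 12, 8, 14, 13, 16, 10, 15])
total = math + spelling + reading

for name, subtest in [("math", math), ("spelling", spelling), ("reading", reading)]:
    r, _ = stats.pearsonr(subtest, total - subtest)  # total minus the subtest itself
    print(f"{name:8s} vs. rest of test: r = {r:.2f}")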

Evidence of changes with age Some constructs are expected to change over
time. Reading rate, for example, tends to increase dramatically year by year from
age 6 to the early teens. If a test score purports to be a measure of a construct that
could be expected to change over time, then the test score, too, should show the
same progressive changes with age to be considered a valid measure of the
construct. For example, if children in grades 6, 7, 8, and 9 took a test of eighth-
grade vocabulary, then we would expect that the total number of items scored as
correct from all the test protocols would increase as a function of the higher
grade level of the testtakers.
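Since the prediction here is a monotonic rise in scores across grades, a
rank-order correlation between grade level and mean score is one simple way to
express it; the sketch below uses invented group means.

# Minimal sketch (invented means): checking that vocabulary-test scores rise
# with grade level, using a Spearman rank-order correlation.
import numpy as np
from scipy import stats

grade = np.array([6, 7, 8, 9])
mean_score = np.array([14.2, 18.9, 24.5, 27.1])  # invented mean raw scores

rho, _ = stats.spearmanr(grade, mean_score)
print(f"Spearman rho = {rho:.2f}")  # 1.00 here: scores rise with every grade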

Evidence of pretest–posttest changes Evidence that test scores change as a
result of some experience between a pretest and a posttest can be evidence of
construct validity. Some of the more typical intervening experiences responsible
for changes in test scores are formal education, a course of therapy or
medication, and on-the-job experience. Of course, depending on the construct
being measured, almost any intervening life experience could be predicted to
yield changes in score from pretest to posttest. Reading an inspirational book,
watching a TV talk show, undergoing surgery, serving a prison sentence, or the
mere passage of time may each prove to be a potent intervening variable.
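A conventional way to evaluate such evidence is a paired-samples t test on the
pretest and posttest scores of the same examinees. The sketch below uses
invented scores and a hypothetical intervening experience.

# Minimal sketch (invented scores): paired-samples t test on pretest vs.
# posttest scores for the same examinees.
import numpy as np
from scipy import stats

pretest = np.array([52, 61, 47, 55, 68, 50, 59, 63])
posttest = np.array([58, 66, 50, 62, 71, 57, 60, 70])  # after the experience

t, p = stats.ttest_rel(posttest, pretest)
print(f"t = {t:.2f}, p = {p:.3f}")  # a reliable rise is consistent with the theory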

Evidence from distinct groups Also referred to as the method of contrasted
groups, one way of providing evidence for the validity of a test is to
demonstrate that scores on the test vary in a predictable way as a function of
membership in some group. The rationale here is that if a test is a valid
measure of a particular construct, then groups of people presumed to differ
with respect to that construct should have correspondingly different test
scores. Consider in this context a test of depression wherein the higher the
test score, the more depressed the testtaker is presumed to be. We would expect
individuals psychiatrically hospitalized for depression to score higher on this
measure than a random sample of Walmart shoppers.
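The contrasted-groups comparison is typically evaluated with an
independent-groups significance test. The minimal sketch below, with invented
depression-scale scores for the two groups just described, shows the form such
an analysis might take.

# Minimal sketch (invented scores): method of contrasted groups, comparing
# depression-scale scores of hospitalized patients and a comparison sample.
import numpy as np
from scipy import stats

hospitalized = np.array([34, 41, 38, 45, 36, 40, 43, 39])
comparison = np.array([12, 18, 15, 22, 10, 17, 14, 19])

t, p = stats.ttest_ind(hospitalized, comparison)
print(f"t = {t:.2f}, p = {p:.4f}")  # markedly higher scores in the hospitalized group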

D. Validity, Bias, and Fairness
In the eyes of many laypeople, questions concerning the validity of a test are
intimately tied to questions concerning the fair use of tests and the issues of bias
and fairness. Let us hasten to point out that validity, fairness in test use, and test
bias are three separate issues. It is possible, for example, for a valid test to be
used fairly or unfairly.
a. Test bias
For the general public, the term bias as applied to psychological and
educational tests may conjure up many meanings having to do with prejudice and
preferential treatment (Brown et al., 1999). For federal judges, the term bias as it
relates to items on children’s intelligence tests is synonymous with “too difficult
for one group as compared to another” (Sattler, 1991). For psychometricians,
bias is a factor inherent in a test that systematically prevents accurate, impartial
measurement. Psychometricians have developed the technical means to identify
and remedy bias, at least in the mathematical sense. As a simple illustration,
consider a test we will call the “flip-coin test” (FCT). The “equipment” needed to
conduct this test is a two-sided coin. One side (“heads”) has the image of a
profile and the other side (“tails”) does not. The FCT would be considered biased
if the instrument (the coin) were weighted so that either heads or tails appears
more frequently than by chance alone. If the test in question were an intelligence
test, the test would be considered biased if it were constructed so that people who
had brown eyes consistently and systematically obtained higher scores than
people with green eyes—assuming, of course, that in reality people with brown
eyes are not generally more intelligent than people with green eyes. Systematic is
a key word in our definition of test bias.
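The FCT makes the statistical meaning of "systematic" easy to demonstrate:
under the no-bias hypothesis, heads should come up about half the time, and a
binomial test flags departures too large to attribute to chance. The flip
counts below are invented.

# Minimal sketch (invented counts): testing the "flip-coin test" for bias.
# Uses scipy.stats.binomtest (available in SciPy 1.7+).
from scipy import stats

n_flips = 200
n_heads = 132  # invented; noticeably above the ~100 expected by chance alone

result = stats.binomtest(n_heads, n_flips, p=0.5)
print(f"p = {result.pvalue:.4f}")  # small p: the deviation looks systematic, i.e., biased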
Rating error A rating is a numerical or verbal judgment (or both) that places
a person or an attribute along a continuum identified by a scale of numerical or
word descriptors known as a rating scale. Simply stated, a rating error is a
judgment resulting from the intentional or unintentional misuse of a rating scale.
Thus, for example, a leniency error (also known as a generosity error) is, as its
name implies, an error in rating that arises from the tendency on the part of the
rater to be lenient in scoring, marking, and/or grading. From your own
experience during course registration, you might be aware that a section of a
particular course will fill up quickly if the instructor teaching it has a
reputation for being a lenient grader.

