
Unit 2: Test Construction and Standardisation: Item writing, Item analysis, Norms and Test

Standardisation, Reliability, and Validity.

Definition of a psychological test

A psychological test is a standardized instrument designed to measure objectively one or more aspects
of behaviour. An objective test is one which is free from subjective biases. A good test is one which is
objective and should have the following qualities:

 Item analysis
 Validity
 Reliability
 Norms
 Practicability: a test may have all the above qualities yet may not be useful as it may not be
practical. A test needs to be efficient in terms of time taken for completion, scoring, energy and
effort required.

Test development can be broken down into three distinct stages: (Ref: Murphy)
A. Test construction: The first stage involves test construction. The following issues are considered
here:
1. Defining the test, scaling the test, and item writing
2. Item analysis
3. Reliability
4. Validity
B. Test Standardization: This includes not only defining the test, scaling the test, item writing,
item analysis, reliability, and validity, but also norms. New tests are standardized for use on
particular target populations, norms are developed, and considerable research is undertaken to
establish estimates of the tests’ reliability and validity.
C. Test Revision: Finally, with the passage of time, tests need to be revised to keep them
contemporary and current, both with respect to the norms available and the item content.

Test construction
1. Defining the Test, Scaling the test and item writing (Ref: Gregory)
Defining the Test: In order to construct a new test, the developer must have a clear idea of what
the test is to measure and how it is to differ from existing instruments. Insofar as psychological
testing is now entering its second one hundred years, and insofar as thousands of tests have
already been published, the burden of proof clearly rests on the test developer to show that a
proposed instrument is different from, and better than, existing measures.

In proposing the Kaufman Assessment Battery for Children (K-ABC), a new test of general
intelligence in children, the authors listed six primary goals that define the purpose of the test
and distinguish it from existing measures (Kaufman & Kaufman, 1983):
1. Measure intelligence from a strong theoretical and research basis
2. Separate acquired factual knowledge from the ability to solve unfamiliar problems
3. Yield scores that translate to educational intervention
4. Include novel tasks
5. Be easy to administer and objective to score
6. Be sensitive to the diverse needs of preschool, minority group, and exceptional children

Selecting a Scaling Method: The immediate purpose of psychological testing is to assign
numbers to responses on a test so that the examinee can be judged to have more or less of the
characteristic measured. The rules by which numbers are assigned to responses define the
scaling method.
a. Expert Rankings: This method involves experts in the field subjectively ranking items
or responses based on their professional judgment and expertise. Example: In a
depression assessment, mental health professionals may rank symptoms based on their
perceived severity or clinical significance. For instance, suicidal ideation might be
ranked higher than feelings of sadness.
b. Method of Equal-Appearing Intervals: In the early 1900s, L. L. Thurstone came up with
a method called equal-appearing intervals for making scales that measure attitudes. This
method starts by gathering a lot of true or false statements about a topic, like physical
exercise. Then, judges put these statements into categories (usually 11) based on how
positive or negative they are. After that, the average positivity or negativity of each
statement is computed from the categories in which the judges place it. Statements on which the
judges' ratings vary too widely are discarded. Finally, about 20 to 30 statements are selected to
make up the scale.
c. Method of Absolute Scaling: In 1925, Thurstone introduced the method of absolute
scaling as a means to quantify the difficulty levels of test items objectively. This
approach involves administering a set of common test items to multiple age groups,
with one group serving as the reference or anchor. By comparing the relative difficulty
of items across different age groups, typically using standard units like standard
deviations, an absolute measure of item difficulty is derived. For instance, if a math
problem proves challenging for younger children but manageable for older ones, it
indicates higher item difficulty. Thurstone illustrated this methodology using data from
testing 3,000 schoolchildren on the 65 questions of the original Binet test. He
established a scale ranging from negative to positive values, representing varying
degrees of difficulty, and positioned each question accordingly. This analysis
highlighted clusters of questions at certain difficulty levels and shortages at others.
d. Likert Scales: Respondents indicate their level of agreement or disagreement with a
series of statements on a scale with predefined response options. A 5-point Likert scale
ranging from "Strongly Disagree" to "Strongly Agree" is commonly used to assess
attitudes or opinions. Participants rate their agreement with statements like "I enjoy
socializing with others."
e. Guttman Scales: Items are arranged hierarchically based on the extent to which
respondents endorse them, with each item reflecting a more extreme manifestation of
the construct being measured. In a scale measuring assertiveness, items may progress
from less assertive behaviors (e.g., "I avoid confrontation") to more assertive ones (e.g.,
"I confidently express my opinions").
f. Method of Empirical Keying: Test items are selected based on their ability to
empirically discriminate between groups known to differ in the characteristic being
measured. Items on a personality test are chosen based on their effectiveness in
distinguishing between individuals with high and low levels of extraversion, as
determined through statistical analysis.
g. Rational Scale Construction: Scale items are developed based on theoretical principles,
expert judgment, and logical reasoning, aiming to capture different facets or levels of
the construct being measured. Items on a self-esteem scale are designed to assess
various aspects of self-worth, such as feelings of competence, social acceptance, and
self-respect, guided by theoretical frameworks of self-esteem.

These scaling methods offer different approaches to quantifying responses in psychological
testing, each suited to specific measurement needs and contexts. Test developers select the most
appropriate method based on factors such as the nature of the construct being measured, the
intended use of the test, and the available resources.

ITEM WRITING (Ref: Kaplan): The first step in the test construction process is to generate an item
pool. Test developers face a choice between a variety of formats as follows:

a. Item Formats: The format of the items needs to be chosen while writing them. There are a number of
formats to choose from:
Dichotomous format: The dichotomous format, exemplified by true-false tests, presents two choices per
statement, ensuring easy construction and scoring. Although it streamlines administration, its drawbacks
include promoting memorization and guessing. Dichotomous items simplify scoring, particularly in tests
with multiple subscales. This format is also prevalent in personality tests, where respondents make
absolute judgments on statements like "I often worry about my sexual performance."
Polytomous format: The polytomous format, similar to dichotomous but with more than two alternatives
per item, is often exemplified by multiple-choice exams. When taking a multiple-choice examination,
you must determine which of several alternatives is “correct.” Incorrect choices are called distractors
and effective distractor selection during item writing is crucial. While traditional practices involve four
or five alternatives, psychometric analysis indicates that three-option multiple-choice items can be as
effective or better in terms of validity and reliability. However, poorly written distractors can
compromise test quality.

Guessing: The scoring of multiple-choice exams involves considerations for guessing. If a correction for
guessing is applied, random guessing becomes ineffective. The standard correction formula is:

Corrected score = R − W / (k − 1)

where R is the number of items answered correctly, W is the number answered incorrectly, and k is the
number of choices per item. For example, suppose a test had 100 items, each with four choices, and an
examinee answered 25 items correctly and the remaining 75 incorrectly. The expected score corrected
for guessing would be 25 − 75/3 = 0, which is exactly what would be expected by chance.
Research indicates students are more likely to guess when expecting a low grade. Some methods
discourage guessing by giving partial credit for blank items.

Likert Format: The Likert format, commonly used in attitude and personality scales, asks respondents to
indicate their degree of agreement with statements on a 3-point, 5-point, or 7-point scale. This format is
popular in attitude measurement, allowing researchers to assess people's endorsement of statements.

Category Format: The Category Format is similar to Likert but with more response options-- often
employs 10-point scales for rating. E.g.- Employee Feedback: "On a scale of 1 to 10, how satisfied are
you with your work environment?" A 10-point scale generally provides adequate discrimination, but the
optimal number of categories depends on respondent engagement or involvement. An approach related
to category scales is the visual analogue scale. The visual analogue scale is a straightforward method
where participants mark their response on a 100-millimeter line between two clear points. It's commonly
used for self-rated health but not as much for scales with multiple items due to the time-consuming
scoring process (Clark & Watson, 1998).

Free-response format or Essay type: The free-response format offers the advantage of allowing
examinees to express open ended and thoughtful responses, fostering a deeper engagement with the
subject matter. However, the variability in responses poses challenges in consistent scoring for
examiners.
Adjective checklist and Q sort: In personality measurement, the adjective checklist is sometimes used.
With this method, a subject receives a long list of adjectives and indicates whether each one is
characteristic of himself/herself or even others. On the other hand, the Q-sort method broadens
categories by asking individuals to sort statements into nine piles based on their perceived relevance. In
a typical Q-sort scenario, participants might receive statements about personal traits and arrange them
into piles ranging from highly descriptive to not descriptive at all. The resulting distribution often
resembles a bell-shaped curve.

Example of Q sort: Imagine you are asked to rate your friend's personality using a Q-sort. You receive
cards with statements like:
1. Has a great sense of humor.
2. Enjoys spending time alone.
3. Is always energetic and outgoing.
4. Prefers routine and dislikes change.
5. Easily adapts to new situations.
6. Often seeks adventure and thrills.
You have nine piles to sort these statements based on how well they describe your friend. Pile 1 might be
for statements that don't describe them at all, and pile 9 for those that perfectly match. Most cards might
end up in piles 4, 5, and 6, creating a bell-shaped distribution. This Q-sort gives a more nuanced
understanding of your friend's personality traits.

b. Item Content: Test developers employ distinct strategies when writing items. Some use a specific
theoretical framework, translating its concepts into test items. For instance, Edwards, using Murray's
personality theory, created items reflecting autonomy, like "I like to avoid conventional ways" and "I
avoid responsibilities." However, theory-based items can be transparent, prompting respondents to
shape responses to appear a certain way rather than reflecting genuine perceptions. Conversely, an
atheoretical approach involves creating a diverse item pool without a common theme. The Minnesota
Multiphasic Personality Inventory (MMPI) exemplifies this method, aiming to differentiate groups
without strict theoretical ties. These items are generally less transparent, reducing the likelihood of
examinee distortion.

c. General guidelines for item writing (Ref: Kaplan; DeVellis, 2016):

 Define clearly what you want to measure. To do this, use substantive theory as a guide and try
to make items as specific as possible.
 Generate an item pool. Theoretically, all items are randomly chosen from a universe of item
content. In practice, however, care in selecting and developing items is valuable. Avoid
redundant items. In the initial phases, you may want to write three or four items for each one that
will eventually be used on the test or scale.
 Avoid exceptionally long items. Long items are often confusing or misleading.
 Keep the level of reading difficulty appropriate for those who will complete the scale.
 Avoid “double-barreled” items that convey two or more ideas at the same time. For example,
consider an item that asks the respondent to agree or disagree with the statement, “I vote
Democratic because I support social programs.” There are two different statements with which
the person could agree: “I vote Democratic” and “I support social programs.”
 Consider mixing positively and negatively worded items. Sometimes, respondents develop the
“acquiescence response set.” This means that the respondents will tend to agree with most items.
To avoid this bias, you can include items that are worded in the opposite direction. For example,
in asking about depression, the CES-D uses mostly negatively worded items (such as “I felt
depressed”). However, the CES-D also includes items worded in the opposite direction (“I felt
hopeful about the future”)
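As a small illustration of scoring a scale that mixes positively and negatively worded items, the sketch below reverse-keys the positively worded item before summing. The items, the 1–4 response scale, and the scores are invented for illustration (the actual CES-D uses a 0–3 response scale).

```python
# Hypothetical 1-4 response scale; reverse-keyed items must be flipped before
# summing so that a high total always means "more of" the construct.
MAX_RESPONSE = 4
responses = {"I felt depressed": 3,
             "I felt hopeful about the future": 4}   # positively worded item
reverse_keyed = {"I felt hopeful about the future"}

total = sum((MAX_RESPONSE + 1 - score) if item in reverse_keyed else score
            for item, score in responses.items())
print(total)  # 3 + (5 - 4) = 4
```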

Times change, and tests can get outdated (Kline, 2015). When writing items, you need to be sensitive
to ethnic and cultural differences. For example, items on the CES-D concerning appetite, hopefulness,
and social interactions may have a different meaning for African American respondents than for white
respondents (DeVellis, 2012; Foley, Reed, Mutran, & DeVellis, 2002).

2. Item Analysis
Meaning: The effectiveness and usefulness of any test depends upon the qualities of the items that are
included in it. A good test has good items. Item analysis is the process by which each item is analyzed
critically so that only good or suitable items are included in the test. Its purpose is to identify and reduce
errors of measurement. Item analysis is usually designed to answer questions such as the following:

 Did the item function as intended?


 Were the test items of appropriate difficulty?
 Were the test items free of irrelevant clues and other defects?
 Was each of the distractors effective (in multiple-choice items)?

Item analysis is used to eliminate ambiguous or misleading items in a test. It does so by providing
information regarding the difficulty and discrimination of items

a. Item Selection and item writing: This depends on the judicious selection of items. Some
items may be adapted from previous tests while others are written by the test constructor. A good test
item must have clarity in meaning. This set of items is given to a panel of judges who rate each item on
whether it is relevant for the test. An item is selected if the majority of the judges feel that it should be
included in the test.
b. Difficulty level: The items that are selected in step 1 are then given to a sample of examinees to
determine item difficulty. An item should not be so easy that it is answered by all examinees, nor
should it be so difficult that it is failed by all respondents. So extremes are not taken.
There are 3 ways of determining item difficulty:
1) Judgmental method (ranking): by the judgment of competent people who rank items in
order of difficulty
2) Time: by how quickly the item can be solved
3) Empirical method: by the number of examinees in the group who get the item right
The first two procedures are usually the first step while the third is a standard procedure. This method is
statistical as contrasted with the judgmental approach to item difficulty. Difficulty level is a measure of
the proportion of examinees who answered the item correctly; for this reason it is frequently called
the p-value and hence can be seen as an easiness index. Thus, the difficulty index is simply a measure of
the number of people who answered a given item correctly and is expressed as a decimal, for
example, .80 means 80 percent of the people taking the test answered the item correctly. The difficulty
index can range from .00 (nobody got the item right) to 1.00 (everyone got the item right) with a higher
value indicating that a greater proportion of examinees responded to the item correctly, and it was thus
an easier item. To calculate item difficulty, you simply count the number of examinees that responded
correctly and divide by the number of respondents.
p = R/N
 where p is the index of difficulty,
 R is the number of examinees who pass the item, and
 N is the total number of examinees who take the test

In an illustrative data set of this kind (not reproduced here), each examinee is given 0 for a wrong answer and 1 for a right answer.

In that data set, Item 6 has a high difficulty index, meaning that it is very easy. Items 4 and 5 are typical items,
where the majority of persons respond correctly. Item 1 is extremely difficult; no one got it right!
Thus, in practice, if an item is to distinguish among individuals it should not be so easy that it is
passed by all persons (p = 1), nor should it be so difficult that no one is able to pass it (p = 0). The closer
the difficulty level approaches 0.50, the more differentiating the item can be. For maximum
differentiation, then, it would seem that one should choose all items at the 0.50 level. The decision is
complicated by the fact that all items may tend to be highly intercorrelated. In an extreme case, if all
items were perfectly intercorrelated and all were at the 0.50 difficulty level, then the same 50% would pass
each item and one item would suffice to differentiate the groups. Hence, half of the examinees would
obtain perfect scores while the other half would obtain zero. Because of such intercorrelations, it is best to
choose items with an approximately moderate level of difficulty, i.e., between 0.30 and 0.70.
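As a hedged illustration of p = R/N, the short Python sketch below computes difficulty indices for an invented 0/1 response matrix that loosely mirrors the pattern described above (one item nobody passes, one item everybody passes); the data are assumptions, not the original data set.

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows are examinees, columns are items
# (1 = correct, 0 = wrong). The data are invented for illustration.
responses = np.array([
    [0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1],
    [0, 1, 1, 1, 1, 1],
])

# p = R / N for each item: the proportion of examinees answering correctly
p_values = responses.mean(axis=0)
print(p_values)  # item 1 has p = 0.0 (nobody right), item 6 has p = 1.0 (everybody right)
```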

c. Discrimination power/validity index: This is a measure of how well an item (i.e., a question) distinguishes
respondents who are high on the measured attribute from those who are low. The researcher will want
to examine those items that are found to be non-discriminating. These items may be ambiguous, or they
may be statements that do not really express feelings about the object being measured. Revising these
items may make them usable.

The principal measure of item discrimination is the discrimination index. There are many methods of
calculating the index two of which are as follows:

Method 1: Kelly’s 27% criterion (Extreme Group Method)


 Arrange the scores in descending (or ascending) order.

 Divide the scores into groups: a top and a bottom group. Kelly suggested the 27% criterion, i.e.,
the top 27% make one group (high scorers) and the bottom 27% make the other group (low scorers),
while the middle group is discarded. (In the worked example referred to here, the top 3 and bottom 3
students are considered for analysis.)
 Analyze each item by counting the number of students in the upper and lower group who got the
item correct. E.g., for item 1, only 2 students in the top 27% got it right, while 2 students in the bottom
27% also got it right. Now see if there is a difference between the top and bottom groups for
each item (use a ‘t’ test). If there is a significant difference between the top and bottom groups, then include
the item in the test. This shows that the item differentiates between those who are at the top and those
who are at the bottom. In the case of item 1 there is no difference, and hence the item is discarded.
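Alongside the t-test comparison described above, a commonly used shortcut (assumed here; not part of the text's worked example) is the extreme-group discrimination index D = P(upper) − P(lower). A minimal Python sketch with invented data:

```python
import numpy as np

def discrimination_index(scores_on_item, total_scores, fraction=0.27):
    """Extreme-group discrimination index D = P_upper - P_lower.

    scores_on_item: 0/1 array, whether each examinee got the item right.
    total_scores:   total test score for each examinee.
    fraction:       proportion in each extreme group (Kelly's 27% criterion).
    """
    scores_on_item = np.asarray(scores_on_item)
    order = np.argsort(total_scores)            # ascending by total score
    n = max(1, int(round(fraction * len(order))))
    lower, upper = order[:n], order[-n:]
    return scores_on_item[upper].mean() - scores_on_item[lower].mean()

# Hypothetical data: 10 examinees, one item, total test scores
item = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
total = [55, 52, 20, 48, 25, 50, 22, 30, 47, 28]
print(discrimination_index(item, total))  # compares the top ~3 and bottom ~3 examinees
```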

Method 2: Based on difficulty level: Determining the difficulty level is significant for a test’s
discriminative power/index, which is given by p × q, where p is the proportion who pass an item and q
is the proportion who fail it (q = 1 − p). As stated earlier, maximum discrimination
occurs when p = q = 0.5, i.e., half of the sample answers the item correctly and the other half answers it
incorrectly. The discrimination power of an item is therefore at its maximum when p × q = .25. Practically,
we do not choose only items with a p × q of exactly 0.25, but items that are approximately 0.25.

p    q    p × q
.9   .1   .09
.8   .2   .16
.7   .3   .21
.9   .1   .09
.7   .3   .21
.6   .4   .24
.4   .6   .24
.1   .9   .09
.2   .8   .16
.5   .5   .25

 As a rule of thumb:
1. If D ≥ 0.30, then the item is functioning satisfactorily

2. If D ≤ 0.19, then the item should be eliminated or completely revised

d Item consistency- Item consistency includes two types of statistical information: item test correlation
and inter item correlation

a) Item test correlation: In the item test correlation, each item of the test is correlated with the total
score with the help of special correlations like point biserial correlations. To be useful, an item should
correlate at least .25 with the total score. Items that have very low correlation or negative correlation
with the total score should be eliminated because they are not measuring the same thing as the total
scale and hence are not contributing to the measurement of the attitude. If the correlation is negative or
low then the item should be eliminated from the test.

The average item-total correlation involves computing a total score for the test and treating that total
score as though it were another item (i.e., Item 6 in a five-item example). We then compute the item-to-
total correlations. In the sample analysis referred to here (shown as a correlation matrix in the source
text), the item-to-total correlations range from .82 to .88, with an average of .85.

b) Interitem correlation: Next one should look at the interitem correlation among the items (i.e., the
questions on the test) to see if they are reliable measures of the constructs they are presumed to measure.
This simply means that if we have a number of questions that are intended to measure intelligence in a
population, then those who score high on one of those questions (indicating that they are highly
intelligent) should score high on other questions of the same type. The interitem correlation for those
items that one thinks measure the same concept should be very high (correlation should be over 0.80),
and the interitem correlation between items that one thinks measure different concepts should be very
low (below 0.20 at the uppermost limit, and preferably below 0.10).

For example, if you have six items, you will have fifteen different item pairings (fifteen correlations).
The average inter-item correlation is simply the average or mean of all these correlations. In the
example, you find an average inter-item correlation of .90 with the individual correlations ranging
from .84 to .95.
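A minimal Python sketch of both statistics, using an invented 5-item response matrix (so there are 10 item pairings rather than the 15 mentioned for six items); the numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 5-item Likert responses (rows = respondents, columns = items)
data = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 2, 1],
    [3, 3, 4, 3, 3],
], dtype=float)

total = data.sum(axis=1)

# Item-total correlation: correlate each item with the total score
item_total_r = np.array([np.corrcoef(data[:, j], total)[0, 1]
                         for j in range(data.shape[1])])
print(item_total_r)

# Average inter-item correlation: mean of the unique off-diagonal correlations
r = np.corrcoef(data, rowvar=False)            # 5 x 5 item correlation matrix
off_diag = r[np.triu_indices_from(r, k=1)]     # the 10 unique item pairings
print(off_diag.mean())
```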

Item distractor analysis: This is done in multiple choice tests. The incorrect alternatives for an item
are referred to as distractors. If a distractor is selected by a majority of examinees in the upper group
then that means that this item needs revision.

Item characteristic curves (ICC) and Item Response Theory (IRT): Perhaps the most important new
development relevant to psychometrics is item response theory. The item characteristic curve is the
basic building block of item response theory. Each item in a test has its own item characteristic curve.
The total test score (an estimate of an individual's trait level) is plotted on the X axis, while the Y axis
shows the probability of answering the item correctly (based on the percentage of people at that trait level
who answer the item correctly, which reflects the item's difficulty). For example, someone of average
ability (i.e., an IQ of 100) may have a 50% chance of getting the item correct, while an individual of
lower ability, say an IQ of 85, may have only a 25% chance of getting it correct.
If the curve flattens out, the item has less ability to discriminate, i.e., it provides a narrower range of
probabilities of a correct or incorrect response.
If we were administering our test via computer, we could provide different questions for different
ability levels depending on how the previous question was answered. If Jo Anne got our first question,
at the 100 IQ level, correct, we could then give her a question at the 115 level. If she missed this item,
we could give her one at the 107 level. We can continually fine-tune Jo Anne’s score this way. With this
approach, test takers do not have to suffer the embarrassment of attempting multiple items beyond their
ability. Conversely, they do not need to waste their time and effort on items far below their capability.
Of course, there are many difficulties with applications of IRT. For instance, the method requires a bank
of items that have been systematically evaluated for level of difficulty. Considerable effort must go into
test development, and complex computer software is required.
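A hedged sketch of these ideas in Python: a standard two-parameter logistic item characteristic curve and a toy "closest difficulty" selection rule standing in for a real adaptive-testing algorithm. The parameters and function names are illustrative assumptions, not any specific test's procedure.

```python
import numpy as np

def icc(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic item characteristic curve:
    probability of a correct response as a function of trait level (theta)."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

print(icc(0.0, 0.0))                            # 0.5 at the item's own difficulty level
print(icc(-1.0, 0.0), icc(1.0, 0.0))            # steep curve: large spread of probabilities
print(icc(-1.0, 0.0, 0.3), icc(1.0, 0.0, 0.3))  # flat curve: small spread, poor discrimination

# Adaptive-testing flavour: pick the unused item whose difficulty is closest
# to the current ability estimate (a simplified rule, not a full CAT algorithm).
def next_item(ability_estimate, item_difficulties, used):
    remaining = [i for i in range(len(item_difficulties)) if i not in used]
    return min(remaining, key=lambda i: abs(item_difficulties[i] - ability_estimate))
```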

3. Reliability
Definition: Reliability refers to the extent to which the scores are ‘consistent’ from one administration
to another and also within the test itself. Synonyms of word reliability include terms like consistency,
stability and repeatability. For example, if a test is designed to measure a trait (such as introversion),
then each time the test is administered to a subject, the results should be approximately the same.
Statistically, reliability is also defined as the self correlation of the test.

Reliability in terms of classical theory: Classical test theory emerged from the work of Charles
Spearman. Classical test score theory assumes that each person has a true score that would be obtained
if there were no errors in measurement. However, because measuring instruments are imperfect, the
score observed for each person almost always differs from the person’s true ability or characteristic. The
difference between the true score and the observed score results from measurement error. In symbolic
representation, the observed score (X) has two components: a true score (T) and an error component
(E):

X = T + E

Or we can say that the difference between the score we obtain and the score we are really interested in is
the error of measurement:

X − T = E
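A small simulation sketch of the classical model (the means and SDs used here are illustrative assumptions): observed scores are built as X = T + E, and the reliability works out to the ratio of true-score variance to observed-score variance.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
true = rng.normal(loc=100, scale=15, size=n)   # T: the (unobservable) true scores
error = rng.normal(loc=0, scale=5, size=n)     # E: random, unsystematic error
observed = true + error                        # X = T + E

# Reliability under the classical model: var(T) / var(X)
print(true.var() / observed.var())             # roughly 15^2 / (15^2 + 5^2) = 0.9
print(np.corrcoef(true, observed)[0, 1] ** 2)  # nearly the same value
```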

There are two types of errors:

 Systematic errors are constant errors that do not fluctuate. An example of a systematic error is a test
question that contains a typographical error, so that everyone who takes the test reads that same error.
Reliability does not capture this kind of error because it is systematic.
 Unsystematic errors are presumed to be random. Reliability can thus be defined as the extent to
which the test is free from random or unsystematic errors. Unsystematic errors are those errors
that are not consistent, such as a typographical error on just one person's test. According to
Layman (1978), random errors may be related to 5 factors: (a) examinee-specific factors such as
fatigue, boredom, and momentary lapses of memory; (b) test-examiner factors such as poor directions by
the examiner; (c) test-situation factors such as noise in the environment; (d) test-content factors such as
ambiguous or tricky items; and (e) time influence.

Importance of reliability: A test’s reliability is important for two reasons.


 First reliability is an indicator of the extent to which the test is free from random/measurement
errors when the test is administered. In an unreliable test, students’ scores consist largely of
measurement error. An unreliable test offers no advantage over randomly assigning test scores to
students.
 The second reason to be concerned with reliability is that it is a precursor to test validity. If the
test is unreliable, one need not spend time investigating whether it is valid: it will not be. If
the test has adequate reliability, however, then a validation study would be worthwhile.

TYPES OF RELIABILITY
The reliability coefficient is symbolized with the letter "r" and a subscript that contains two of the
same letters or numbers (e.g., ''rxx''). The subscript indicates that the correlation coefficient was
calculated by correlating a test with itself rather than with some other measure.

1. Test-Retest Reliability: Time sampling

The same test is administered twice to the same group after a time gap (an often-suggested interval is
about a fortnight). A correlation coefficient is then computed between the two sets of scores. This is
referred to as the reliability coefficient or coefficient of stability, as it measures the stability of the test
over a period of time. A reliable test will have a high correlation, indicating that a participant would
perform equally well (or as poorly) on both testings.

The determination of test-retest reliability appears quite simple and straightforward, but there is one
main problem associated with it: the choice of a “suitable” interval before retesting. If the
interval is too short, for example a couple of hours, we may obtain substantial consistency of scores, but
that may be more reflective of the relative consistency of people’s memories over a short interval than
of the actual measurement device. If the interval is quite long, for example a couple of years, then
people may have actually changed from the first testing to the second testing and this may affect the
reliability. Usually then, test-retest reliability is assessed over a short period of time (a few days to a few
weeks or a few months), and the obtained correlation coefficient is accompanied by a description of
what the time period was. In effect, test retest reliability can be considered a measure of the stability of
scores over time. Different periods of time may yield different estimates of stability. Note also that some
variables, by their very nature, are more stable than others.

The primary sources of error for test retest reliability are random factors related to the time gap between
two administrations (called ‘time sampling’). These include random fluctuations in examinees over time
(anxiety, motivation), and random variations in testing situation. For instance, we usually assume that
an intelligence test measures a consistent general ability. As such, if an IQ test administered at two
points in time produces different scores, then we might conclude that the lack of correspondence is the
result of random measurement error. Usually we do not assume that a person got more or less intelligent
in the time between tests.

Advantage:

 It is easy to find as it does not require development of two equivalent forms.

Limitations:

 One concern of test-retest reliability is termed practice or carryover effects. This effect occurs
when the first testing session influences scores from the second session. If the time interval is
short, then due to practice people may be overly consistent because they remember some of the
questions and their responses.
 If the interval is too long, then the results may be influenced by learning and maturation, that is,
changes in the persons themselves.
 This type of analysis is of value only when we measure “traits” or characteristics that do not
change over time.

2. Alternate (Equivalent, Parallel) Forms Reliability: Item sampling

Parallel forms reliability compares two equivalent forms of a test that measure the same attribute. When
two equivalent forms of the test are available, one can compare performance on one form versus the
other. A correlation coefficient is then found between the two sets of scores. A reliable test will have
high correlation, indicating that a participant would perform equally well (or as poorly) on both forms of
the test.

Most standardized tests provide equivalent forms that are identical in all respects except the actual item
that are included in the test. These forms can be used interchangeably. These alternate forms have same
variables, same number of items, same structure, same difficulty level, same direction for
administration, scoring and interpretation. The alternate forms therefore measure the same traits to the
same extent. The equivalent forms are developed by selecting two or more than two sets of items from
the same behavioral domain. Statistically two forms are said to be parallel or equivalent if the mean and
SDs of a group of examinees get the same score. Three forms are said to be parallel if the means, SDs
and correlations are equal.

Good experimental practice requires that to eliminate any practice or transfer effects, half of the subjects
take form A followed by form B, and half take form B followed by form A.

The primary source of error is due to different content (called content sampling) included in the two
forms (Form A and Form B). When there is a time gap between the testing then error is also due to time
sampling.

Advantage:
 Some researchers consider this as the best method for assessing reliability.
 Used for speed tests

Disadvantage:
 Even with the best test items, each form would contain slightly different content, and it may be
difficult to obtain truly parallel tests.
 Should not be used when scores are likely to be affected by repeated measurement.
 Often test developers find it burdensome to develop two forms of the same test, and practical
constraints make it difficult to retest the same group of individuals

3. Internal Consistency Reliability


This form of reliability is used to judge the consistency within the same test. It measures whether
different items of the same test produce the same scores. This method involves administering the test
once to a single group of examinees. Methods of measuring internal consistency:

a) Split-half reliability: This is the most popular method of finding internal consistency, i.e.,
consistency within a test. The main requirement for calculating split-half reliability is that items should
be arranged in order of difficulty and according to type. If the items are arranged randomly, then this
method cannot be used.

The first step is to administer the test to a group of examinees. The test is then split into two
equivalent parts. Although the test can be split by many methods, such as first half versus second half or
the odd-even method, the odd-even method is preferred. The odd-even method is the best way of
obtaining two halves that are equivalent in terms of difficulty level, type of items, fatigue, and
motivation. The next step is to obtain the odd-even correlation. However, since the odd-even correlation
is based on only half the test length, and reliability decreases as the length of a test decreases, a
correction formula has to be applied to correct for the reduced test length. This correction formula is
called the Spearman-Brown prophecy formula (which estimates the reliability of the full-length test):

rtt = 2r / (1 + r)

where r is the correlation between the two halves of the test. (The general form of the formula,
rnn = nr / [1 + (n − 1)r], estimates the reliability of a test lengthened n times.)
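A minimal Python sketch of the odd-even split with the Spearman-Brown correction; the response matrix is invented for illustration.

```python
import numpy as np

def split_half_reliability(responses):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    responses: 2-D array, rows = examinees, columns = items (0/1 or Likert scores).
    """
    responses = np.asarray(responses, dtype=float)
    odd_half = responses[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_half = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_halves = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_halves / (1 + r_halves)         # Spearman-Brown prophecy formula

# Hypothetical 0/1 responses for 6 examinees on 8 items (invented data)
data = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
]
print(split_half_reliability(data))
```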

b) Kuder-Richardson Formula (KR). In addition to the split-half technique, there are many other
methods for estimating the internal consistency of a test. Many years ago, Kuder and Richardson
(1937) greatly advanced reliability assessment by developing methods for evaluating reliability
within a single test administration rather than splitting the test in two halves. It has two formulas:

 Formula 20. The formula for calculating the reliability of a test in which the items are
dichotomous, scored 0 or 1 (usually for right or wrong), is known as the Kuder Richardson 20,
or KR20.The formula came to be labeled this way because it was the 20th formula presented in
the famous article by Kuder and Richardson.

 Formula 21: KR21 is a shortcut method that will yield reliability coefficients identical to KR20
but can be used only when all items are equally difficult

Advantage: It is a method for evaluating reliability within a single test administration, without splitting the test into halves.

Disadvantage: The KR20 formula is not appropriate for evaluating internal consistency in some cases.
The KR20 formula requires that you find the proportion of people who got each item “correct.” There
are many types of tests, though, for which there are no right or wrong answers, such as many
personality and attitude scales.

c) Cronbach's α (or coefficient alpha): Cronbach’s alpha is also a measure of internal consistency, that
is, how closely related a set of items are as a group. Cronbach’s α is similar to the KR formulas, except
that it can also be used for non-dichotomous measures. It is based upon the inter-item correlations: the
more strongly the items are interrelated, the more internally consistent the test.

High Cronbach’s alpha values indicate that response values for each participant across a set of
questions are consistent. For example, when participants give a high response for one of the items,
they are also likely to provide high responses for the other items. This consistency indicates the
measurements are reliable and the items might measure the same characteristic. Conversely, low
values indicate the set of items do not reliably measure the same construct. High responses for one
question do not suggest that participants rated the other items highly. Consequently, the questions are
unlikely to measure the same property because the measurements are unreliable.

Analysts frequently use 0.7 as a benchmark value for Cronbach’s alpha. At this level and higher, the
items are sufficiently consistent to indicate the measure is reliable.

Advantage: Surveys and assessment instruments frequently ask multiple questions about the same
concept, characteristic, or construct. By including several items on the same aspect, the test can
develop a more nuanced assessment of the phenomenon.
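A minimal Python sketch of coefficient alpha using the standard variance formula (for 0/1 items the same computation gives KR-20); the Likert data are invented for illustration.

```python
import numpy as np

def cronbach_alpha(responses):
    """Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total).

    For dichotomously scored (0/1) items this reduces to KR-20.
    """
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point Likert responses (invented data)
data = [
    [5, 4, 5, 4],
    [4, 4, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
]
print(cronbach_alpha(data))   # values of about .7 or higher are usually taken as acceptable
```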

4. Inter-rater (Inter-scorer, Inter-observer) Reliability:


This type of reliability is assessed by having two or more independent judges score the test. The scores
are then compared to determine the consistency of the ratings. Consistency is found by calculating a
correlation coefficient (kappa coefficient or coefficient of concordance)
Sources of error
- factors related to the raters (motivation, biases)
- characteristics of the measuring device
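As one hedged illustration of an agreement statistic, the sketch below computes Cohen's kappa for two raters with invented ratings; other rating designs would call for the intraclass correlation or Kendall's coefficient of concordance instead.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical codes to the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of 10 responses as "pass" / "fail" (invented data)
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]
print(cohens_kappa(a, b))   # observed agreement .80, chance agreement .52, kappa ~ .58
```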

Reliability Coefficient

The reliability coefficient is generally stated in terms of a coefficient of correlation between two sets of
scores. There are many ways of estimating the reliability coefficient, as discussed in the sections above.

Factors affecting reliability


When interpreting a reliability coefficient, it is important to remember that there is no single index of
reliability for a test. Instead, the reliability coefficient will differ from sample to sample and from
situation to situation.

Besides the sources of errors discussed above, there are many variables that affect reliability:

 Test length: the longer the test the higher the reliability coefficient
 Range of test scores: Reliability coefficient is high when the range is large.
 Heterogeneity of group: The more heterogeneous the group, the higher the
reliability.
 Homogeneity of test items: If the items measure different functions and the inter-correlations of
items are very low, then the reliability is very low, and vice versa.
 Difficulty level and discrimination: When all items are either easy or difficult the range of scores
will be restricted and this will lower the reliability. Therefore it is best to have items which will
have moderate difficulty level as this will increase reliability. Also high discrimination will lead
to high reliability.
 Guessing: As the probability of guessing the right answer increases, the reliability decreases. All
things being equal, the chance of guessing the right answer is lower on multiple-choice items than
on yes/no items, so reliability is higher for multiple-choice tests.
 Momentary fluctuations in the examinees: Decrease reliability
 Unstable environmental conditions: Decrease reliability
 Complicated and ambiguous instructions give rise to difficulties in understanding the questions,
ultimately leading to low reliability.

Methods of improving reliability

 Test Length: In general, longer tests produce higher reliabilities up to a point.


 Item Quality: The item should be of good quality in that it should be clearly written, should be
able to discriminate between the two groups and should have a moderate level of difficulty.
 Homogeneity of test items: Items should be homogenous

4. Validity
The term validity means truth or fidelity. It refers to a test's accuracy. Validity is the extent to which a
test measures what it claims to measure as compared to a given criterion. For e.g. an intelligence test
should measure only intelligence and not any other variable. For finding the validity of the test, the test
has to be compared with a criterion that is assumed to measure the ‘truth’. A criterion could be any
reliable and valid measure of the same construct. For e.g. while constructing on intelligence test, a
criterion could be WAIS, teacher ratings, school performance, peer ratings etc. It is vital for a test to be
valid in order for the results to be accurately applied and interpreted.

Classical test theory assumes that the obtained score is composed of the true value plus some random
error value. But all error is not random. Some errors may also be systematic. One way to deal with this
notion is to revise the classical test theory by dividing the error component into two subcomponents,
random error and systematic error. While reliability is the extent to which the test is free from
random scores, validity can be seen as the extent to which the test is free from systematic errors.
Systematic errors may arise when, for example, the test claims to measure intelligence but instead also
measures aspects of aptitude, fatigue, etc.
Validity is found by finding the coefficient of correlation between the predictor and the criterion and is
referred to as the ‘validity coefficient’. There are three basic types of validity:

1. Content validity: Content Validity is the extent to which the content of the test represents the
universe of items. A test high on content validity must cover all major aspects of the content and in the
right proportion. Often a panel of experts or judges may rate each item’s relevance to the test.
Traditionally content validity is most relevant for achievement tests. The history test for example should
have test items that pertain to varied aspects of history.

In order to be adequate, content validity must contain both item validity and sampling validity. Item
validity reflects whether the items of the test actually represent the intended area. For e.g. a test
measuring creativity should have items relevant to creativity only. “Sampling validity” is concerned
with whether the test covers the full range of the construct’s meaning, i.e., covers all dimensions of a
concept. For e.g. a test of creativity should cover all dimensions of creativity.

Two new concepts that are relevant to content validity are construct underrepresentation which
emphasizes the failure of the test to capture important components of a construct (for e.g. if a test that
tries to measure maths knowledge includes algebra and not geometry then it is not a valid test) and
construct irrelevant variance when scores are influenced by factors irrelevant to the construct. For e.g. a
test of intelligence may be influenced by test anxiety or illness.

2. Criterion-related Validity: A test is said to have criterion-related validity when the test is
effective in predicting performance in certain activities called the ‘criterion’. It can best be considered as
‘practical validity’ for a specific purpose. The tester is not interested in knowing ‘why’ the test is able to
predict some performance but it is enough that it is able to predict. In other words the tester is not
interested in the theory underlying the test but only in prediction.

For e.g. if an engineering test (predictor) is able to predict who will go on to become good engineers
(i.e. criterion) then the engineering test is said to have criterion related validity. The tester is not
interested in knowing why the test is able to make certain predictions.

There are two different types of criterion validity:

 Concurrent Validity occurs when the criterion measures are obtained at the same time as the
test scores. A new test of adult intelligence, would have concurrent validity if it had a high
positive correlation with the Wechsler Adult Intelligence Scale since the Wechsler is an
accepted measure of the construct we call intelligence.
 Predictive Validity occurs when the criterion measures are obtained at a time after the test.
Examples of test with predictive validity are career or aptitude tests, which are helpful in
predicting who is likely to succeed or fail in certain subjects or occupations in future.

3. Construct Validity:

A test has construct validity if it measures a theoretical trait or construct (like intelligence,
personality, anxiety). Unlike criterion related validity in which the researcher is not interested in
the theory behind the test, here the question of ‘theory’ becomes very important. To find the
construct validity it is important to find the convergent and divergent validity of the test.
Convergent validity - A test should have high correlations with other methods of measuring the
same construct. (e.g. a test measuring intelligence should have high correlations with peer ratings
of intelligence and observational reports). Divergent validity - A test should have low correlation
with variables, which are unrelated. (e.g. measures of aggression and assertiveness should have
low correlation).

Out of all the different types of validity this is the most important one as it goes into the very
philosophy of the test. Some theorists believe that construct validity is an umbrella term and both
content and criterion related validity can be subsumed under it. What makes construct validity
different is that the theory must be specified here. “all validation is one, and in a sense all is
construct validation” (Cronbach)

Convergent Validity and Discriminant validity: Campbell and Fiske (1959) proposed that to show
construct validity, one must show that a particular test correlates highly with variables, which on
the basis of theory, it ought to correlate with; they called this convergent validity. They also
argued that a test should not correlate significantly with variables that it ought not to correlate
with, and called this discriminant validity. They then proposed an experimental design called the
multitrait-multimethod matrix to assess both convergent and discriminant validity.

Suppose we have a true-false inventory of depression that we wish to validate.

 We need first of all to find a second measure of depression that does not use a true-false or
similar format - perhaps a physiological measure or a 10-point psychiatric diagnostic
scale.
 Next, we need to find a different dimension than depression, which our theory suggests
should not correlate but might be confused with depression, for example, anxiety.
 We now locate two measures of anxiety that use the same format as our two measures of
depression.
 We administer all four tests to a group of subjects and correlate every measure with every
other measure.
To show convergent validity, we would expect our two measures of depression to correlate highly
with each other (same trait but different methods). To show discriminant validity we would expect
our true-false measure of depression not to correlate significantly with the true-false measure of
anxiety (different traits but same method). Thus the relationship within a trait, regardless of
method, should be higher than the relationship across traits.
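A hedged simulation sketch of the expected multitrait-multimethod pattern: the four measures, their names, and the data below are invented, but the printed correlations show the convergent values (same trait, different method) running higher than the discriminant values (different traits).

```python
import numpy as np

# Simulated scores for 200 respondents on four measures: depression and
# anxiety, each measured by two different (hypothetical) methods.
rng = np.random.default_rng(1)
n = 200
dep_trait = rng.normal(size=n)
anx_trait = rng.normal(size=n)

dep_truefalse = dep_trait + rng.normal(scale=0.5, size=n)
dep_clinician = dep_trait + rng.normal(scale=0.5, size=n)
anx_truefalse = anx_trait + rng.normal(scale=0.5, size=n)
anx_clinician = anx_trait + rng.normal(scale=0.5, size=n)

r = np.corrcoef([dep_truefalse, dep_clinician, anx_truefalse, anx_clinician])

# Convergent validity: same trait, different method (should be high)
print(r[0, 1], r[2, 3])
# Discriminant validity: different traits (should be low)
print(r[0, 2], r[1, 3])
```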

Other ways to establish construct validity:

a) Empirical method: Most methods of test validation begin with a theory about the
nature of the construct. For example, if a researcher wants to develop a creativity test and
believes that creativity is unrelated to intelligence, is innate, and that creative people
generate more solutions, then the tester would want to determine the correlation between
scores on the creativity test and IQ tests, see whether a training course in creativity affects
scores, and see whether test scores distinguish between people who differ in the number of
solutions they generate.

b) Factor analysis: Factor analysis is a complex statistical procedure, based on
correlations, which is used to identify the underlying structure and theory of the test.
Through factor analysis, a large number of variables is reduced to a small number of
variables called factors.

4) Face validity: It is the extent to which the test appears to test what it intends to measure. Face
validity and content validity may appear similar but are very different. Face validity is the
acceptability of the test to the test taker while content validity is the appropriateness of the test as
judged by professionals. Face validity is not validity in the technical sense. Some tests may have
face validity but not be valid or some tests such as projective tests may not have face validity but
be valid.

5) Cross validation: In cross validation we validate the test on a population other than the one on
which it was originally standardized.

Factors affecting validity

When interpreting validity it is important to remember that there is no single index of validity
coefficient. Instead the validity coefficient will differ from sample to sample and situation to
situation. There are many variables that affect validity:

 Test length: the longer the test the higher the validity
 Homogeneity of items: When the test is homogeneous validity is also very high
 Range of test scores: Validity coefficient is high when the range is large.
 Heterogeneity of group: The more heterogeneous the group, the higher
the validity.
 Homogeneity of test items: If the items measure different functions and the inter-
correlations of items are very low, then the validity is very low, and vice versa.
 Difficulty level and discrimination: When all items are either easy or difficult, the range
of scores will be restricted and this will lower the validity. Therefore it is best to have
items of moderate difficulty level, as this will increase validity. Also, high
discrimination will lead to high validity.
 Guessing: As the probability of guessing the right answer increases, the validity
decreases. All things being equal, the chance of guessing the right answer is lower on
multiple-choice items than on yes/no items, so validity is higher for multiple-choice tests.
 Momentary fluctuations in the examinees: Decrease validity
 Unstable environmental conditions: Decrease validity
 Complicated and ambiguous instructions give rise to difficulties in understanding the
questions, ultimately leading to low validity.

Improving validity
 Test Length: In general, longer tests produce higher validities up to a point.

 Item Quality: The item should be of good quality in that it should be clearly written,
should be able to discriminate between the two groups and should have a moderate
level of difficulty.

 Homogeneity of test items: Items should be homogenous

Relationship between reliability and validity


Both are indicators of a test's quality. But reliability is a matter of self-correlation, while validity goes
beyond the test itself: it is a correlation with an external criterion. A valid test is a reliable test,
but the reverse may or may not be true, i.e., a reliable test may or may not be valid.

A test cannot be valid if it is not reliable i.e. reliability is necessary but not sufficient for validity.
Validity cannot exceed the square root of the reliability coefficient. If the reliability coefficient is .81,
the validity coefficient can be at most .90 (since √.81 = .90). This is true for homogeneous tests. But in
the case of a heterogeneous test, internal consistency/reliability may be low while validity may be high.

Validity is more important as validity goes into the very philosophy of the test especially construct
validity. Reliability is more a matter of statistics and does not go deep into the philosophy of the
test. First high validity should be established and then its reliability rather than going the other way
round.

Range of reliability and validity

Both are expressed in terms of correlation and range of correlation is from -1 to +1. However, reliability
is only from 0 to 1 and can never be either negative or perfect. Most large-scale tests report reliability
coefficients that exceed .80 and often exceed .90. Validity ranges from -1 to +1 but it can never be
perfect. Most psychological tests report validity coefficients above .60. Unlike reliability, however,
validity can be negative, and a negative validity coefficient may sometimes even be desirable (for
example, when the predictor is expected to relate inversely to the criterion). However, one must remember
that reliability and validity coefficients are not static attributes but differ from sample to sample.

NORMS

Definition of norms
Norms are scores obtained from a representative (normative) sample, with which you compare the
participant’s score.

Importance of norms
An individual's performance on any psychological or educational test is recorded in terms of raw
scores. These raw scores convey no meaning in themselves. For example, when A has a score of 40
on an arithmetic reasoning test and a score of 30 on a history test, does it mean that A is superior,
inferior, or equivalent on the arithmetic reasoning test compared to the history test? In the absence of some
interpretative data, it is difficult to answer. Usually, there are two reference points that are applied in
interpreting test scores (that is, in interpreting what the scores tell us about the examinees with
respect to the characteristics being measured).

 Norm referenced Testing: The first way is to compare an examinee's test score with the score of
a specific group of examinees on that test or a norm. Norms may be defined as the average
performance on a particular test made by a standardization sample. This process is known as
norm-referencing and a test based on norm-referencing is called as norm-referenced test. The
purpose of a norm-referenced test is to classify the persons from low to high across a continuum
of ability or achievement. The researchers might want to classify individuals in this way for the
purpose of selection to a specialized course or placement in any remedial or gifted programme.

 Criterion-referenced testing and assessment may be defined as a method of evaluation and a way
of deriving meaning from test scores by evaluating an individual’s score against some absolute
standard. No comparisons are made with other individuals. An example: to be eligible for a
college, students must score at least 75%.
Steps in developing norms
 Defining the target population: In the initial phase of developing a psychological test, the focus
is on defining the target population, often referred to as the normative group. This involves a
clear delineation of the specific characteristics and demographics that make up the intended
group of test-takers. This normative group serves as the reference point for interpreting
individual test scores.

 Selecting a representative sample: Following the definition of the target population, the next step
involves selecting a representative sample from within this group. The sample size is carefully
chosen to adequately reflect the diversity and characteristics of the broader population for which
the test is designed. This selected sample is then utilized for the final trial runs of the test and for
establishing the norms that will be used in the interpretation of future test results.

 Standardizing the conditions: This involves establishing consistent and uniform procedures for
test administration. By standardizing the conditions, variations in the administration process are
minimized, contributing to the reliability and validity of the test results.

Types of Norms
Percentile Norms: Percentile norms express an individual's performance in terms of percentiles. A percentile rank indicates an individual's position relative to the standardization sample. It represents the percentage of the standardization group who scored below a given raw score. For example, if a raw score of 33 corresponds to a percentile rank of 80, it means that 80% of the standardization group scored below that raw score. The score at the 50th percentile is the median and describes the average performance in a percentile distribution. For example, scores on the Raven's Progressive Matrices, the Wechsler Intelligence Scale for Children, and the NEO-FFI can be converted into percentile ranks.
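As an illustration only, the following short Python sketch computes a percentile rank from a hypothetical set of normative scores, using the common convention of counting the scores below the raw score plus half of any ties.

norm_scores = [12, 15, 18, 21, 22, 24, 25, 27, 28, 30, 31, 33, 35, 38, 40]

def percentile_rank(raw, norms):
    """Percentage of the normative group scoring below `raw`
    (counting half of any tied scores, a common convention)."""
    below = sum(1 for s in norms if s < raw)
    ties = sum(1 for s in norms if s == raw)
    return 100.0 * (below + 0.5 * ties) / len(norms)

print(percentile_rank(33, norm_scores))   # about 76.7 in this hypothetical group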

The major advantages of percentile ranks are as follows: They have universal meaning. A score with a
percentile rank of 89 is high in any distribution. A score with a percentile rank of 32 is somewhat low in
any distribution.
Limitations:
1. As with other ordinal statistics, percentile ranks cannot be added, subtracted, multiplied, or
divided.
2. Equal differences between percentile ranks do not represent equal differences between the scores
in the original distribution. If there are many scores clustered together, a small change in score
will produce a major change in percentile rank. If there are few scores, a large change in score
will produce a small change in percentile rank.
Standard Scores: Standard scores, such as z-scores, T-scores, or stanines, represent an individual's
performance in terms of standard deviations from the mean of the normative group. Standard scores are
scores with a common mean and common standard deviation. There are various types of standard
scores:

a) Non Normalized Standard Scores
1. Z scores are standard scores that have a mean of 0 and an SD of 1. Performance on tests such as the Wechsler Adult Intelligence Scale (WAIS) and the Wechsler Intelligence Scale for Children (WISC) can be expressed as z-scores when interpreting an individual's performance on various cognitive tasks.

In psychological and educational research, z-scores are rarely reported directly because their decimal points and negative values are difficult for a layperson to understand; they are used mainly in statistical and theoretical work. This disadvantage is removed to some extent by transforming z-scores into the following types of standard scores.
2. T scores: T scores are standard scores with a mean of 50 and a standard deviation of 10. The T scores described here are linear transformations and are non-normalized; there is another type of derived score, also called a T score, which is normalized, and the two should not be confused. (A short computational sketch of z and linear T scores follows this list.)
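A minimal sketch, assuming a hypothetical normative mean of 25 and SD of 5, of how a raw score is linearly converted to a z score and then to a non-normalized T score:

norm_mean, norm_sd = 25.0, 5.0      # hypothetical normative mean and SD

def z_score(raw):
    """z = (raw - mean) / SD: mean 0, SD 1 in the normative group."""
    return (raw - norm_mean) / norm_sd

def linear_t_score(raw):
    """Linear T = 50 + 10*z: mean 50, SD 10, but the shape of the raw-score
    distribution is preserved (i.e., the scores are not normalized)."""
    return 50 + 10 * z_score(raw)

print(z_score(33), linear_t_score(33))   # 1.6  66.0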

b) Normalized standard scores
The transformation to normalized scores involves forcing the distribution of transformed scores to be as
close as possible to a normal distribution. Thus, the raw score distribution is forced into a normal
distribution.
 Advantages: Scores on different tests, if normalized, become directly comparable, avoiding the complications that arise when frequency distributions have different shapes.
 Disadvantages: However, the use of normalized scores may not be reasonable if the underlying
trait has a very non-normal distribution.

Types of normalized standard scores:
1. Sten scores: The name “sten” comes from “standard ten”, because stens are expressed as a whole
number ranging from 1 to 10 (with 5.5 representing the mean). One psychological test that uses
Sten scores is the California Psychological Inventory (CPI). The CPI is a self-report inventory
designed to measure various personality traits and behavioral patterns in adults. It assesses areas
such as socialization, self-control, responsibility, and emotional stability. The CPI provides Sten
scores for each of its scales, allowing for the interpretation of an individual's personality profile in
relation to a normative sample. Sten scores are divided into ten categories, ranging from 1 (low) to 10 (high), with the mean of the normative group falling at 5.5.

2. Stanines: Stanines (from “standard nine”) are similar to stens except that the mean is 5.0 and the standard deviation is approximately 2. The scale ranges from 1 to 9, and scores between 4 and 6 are considered average. One psychological test that uses stanine scores is the Differential Aptitude Test
(DAT). Raw scores are converted into percentiles and then into stanines. The DAT is a widely
used assessment tool designed to measure specific aptitudes in adolescents and adults, including
verbal reasoning, numerical ability, abstract reasoning, clerical speed and accuracy, mechanical
reasoning, and spatial relations. Stanine scores are commonly used in the interpretation of the
DAT results.

3. Normalized T-scores always have a mean of 50 and a standard deviation of 10. Many personality assessments, such as the Minnesota Multiphasic Personality Inventory (MMPI) and its revisions (e.g., the MMPI-2 and MMPI-2-RF), use T scores; a T score greater than 70 is conventionally taken to indicate a clinically significant elevation on that scale. The Beck Depression Inventory (BDI), a self-report questionnaire used to assess the severity of depression symptoms, can likewise be reported as a T score based on its total score, allowing clinicians to interpret an individual's standing relative to the normative sample. (A sketch of normalization via percentile ranks follows this list.)
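The sketch below illustrates one way normalization can be carried out: a raw score is first converted to a percentile proportion within a hypothetical normative sample, that proportion is mapped to a normal-deviate z via the inverse normal distribution, and the result is rescaled to a normalized T, stanine, and sten. The normative scores are invented, and the stanine and sten conversions here use a simple linear rule on the normalized z (truncated to the scale's range), whereas published tests often use fixed percentage bands.

from statistics import NormalDist

norm_scores = [12, 15, 18, 21, 22, 24, 25, 27, 28, 30, 31, 33, 35, 38, 40]

def percentile_proportion(raw, norms):
    """Proportion of the normative group below `raw` (half-counting ties)."""
    below = sum(1 for s in norms if s < raw)
    ties = sum(1 for s in norms if s == raw)
    return (below + 0.5 * ties) / len(norms)

def normalized_scores(raw, norms):
    p = percentile_proportion(raw, norms)
    z = NormalDist().inv_cdf(p)                   # normal deviate for that proportion
    t = 50 + 10 * z                               # normalized T: mean 50, SD 10
    stanine = max(1, min(9, round(5 + 2 * z)))    # mean 5, SD ~2, kept within 1-9
    sten = max(1, min(10, round(5.5 + 2 * z)))    # mean 5.5, SD ~2, kept within 1-10
    return {"normalized_T": round(t, 1), "stanine": stanine, "sten": sten}

print(normalized_scores(33, norm_scores))   # {'normalized_T': 57.3, 'stanine': 6, 'sten': 7}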

Age Norms: Here the score is expressed in terms of age level at which the examinee is performing.
Items in the test are grouped into age levels and the test developer measures the average performance at
each level. The examinee’s test score is expressed in terms of the age at which s/he is functioning. For
example, one edition of the Stanford-Binet Intelligence Test has a vocabulary subtest that consists of 46
items, arranged from the easiest to the most difficult. The procedure is to read each word and have the
examinee give a definition. This procedure continues until the person is unable to correctly define
several words in a row. This yields a score for the examinee that represents the number of words defined
correctly. Normative data for this vocabulary test show, for example, that for the typical child at age 6 the average number of correct definitions is about 18, and for age 12 it is about 28. Therefore, if an examinee correctly defined about 25 words, his or her performance is similar to that of the typical child who is 10 years of age.
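For illustration, the Python sketch below converts a raw vocabulary score to an age equivalent by interpolating within a small, hypothetical age-norm table whose entries loosely follow the figures quoted above (about 18 correct definitions at age 6 and about 28 at age 12).

age_norms = {6: 18, 8: 22, 10: 25, 12: 28}   # age -> median raw score (hypothetical)

def age_equivalent(raw):
    """Age at which `raw` is the typical score, by linear interpolation."""
    ages = sorted(age_norms)
    if raw <= age_norms[ages[0]]:
        return ages[0]
    if raw >= age_norms[ages[-1]]:
        return ages[-1]
    for lo, hi in zip(ages, ages[1:]):
        if age_norms[lo] <= raw <= age_norms[hi]:
            frac = (raw - age_norms[lo]) / (age_norms[hi] - age_norms[lo])
            return round(lo + frac * (hi - lo), 1)

print(age_equivalent(25))   # 10.0 -> performance typical of a 10-year-old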

Limitations:
1. They lack a standard and uniform unit across the developmental period for physical and psychological traits. For example, the growth in the level of general intelligence from age 8 to 9 is in no way equal to the growth from age 3 to 4.
2. Age norms are usually used with children, as their performance changes with time. Many psychological characteristics change over time; vocabulary, mathematical ability, and moral reasoning are examples. However, with adults, age norms are typically less important because we would not expect, for example, the average 50-year-old person to know more (or less) arithmetic than the average 40-year-old.

Grade Norms: A grade norm expresses a test score in terms of the grade (class) at which the examinee is performing. Similar to age norms, grade norms compare an individual's performance to that of others in the same grade level. They are often used in educational assessments and are helpful in evaluating academic achievement and identifying potential learning difficulties. The interpretation of grade norms is similar to that for age norms.

For example, the median score for children just beginning the second grade (grade equivalent 2.0) is
about 22, whereas the typical score for a beginning fourth grader (grade equivalent 4.0) is
approximately 37. What happens if a student who is just beginning grade 2 earns a scale score of 37 on
this test? In this case, the child would receive a grade-equivalent score of 4.0.
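In the same spirit, a grade equivalent can be read off or interpolated from tabled grade medians. The tiny sketch below is purely illustrative and uses only the two medians quoted above (22 at grade 2.0 and 37 at grade 4.0).

grade_norms = {2.0: 22, 4.0: 37}   # grade level -> median scale score (from the text)

def grade_equivalent(score, low=2.0, high=4.0):
    """Grade level at which `score` is the median, by linear interpolation."""
    lo_med, hi_med = grade_norms[low], grade_norms[high]
    score = max(lo_med, min(hi_med, score))          # clamp to the tabled range
    frac = (score - lo_med) / (hi_med - lo_med)
    return round(low + frac * (high - low), 1)

print(grade_equivalent(37))   # 4.0 -- a beginning 2nd grader scoring 37 gets GE 4.0
print(grade_equivalent(30))   # 3.1 -- roughly midway between the two medians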

There are some limitations of these norms.
1. Grade-equivalent norms in educational testing assume a uniform curriculum across schools, though this may not be true.
2. These norms are not well suited to subjects in which growth is rapid in the elementary grades but slow in the higher grades.

Despite these limitations, grade-equivalent norms are commonly used in educational assessments and intelligence tests because of their simplicity.
Gender Norms: Gender norms take into account gender differences and provide separate norms for males and females. They are useful when there are known gender-related differences in the construct being measured.
Cultural Norms: Cultural norms consider the cultural background of the individuals being assessed, acknowledging that performance may vary across different cultural groups. They are important for ensuring fairness and avoiding cultural bias in assessments.
