Test Construction


Psychology 162 Psychological Measurement

The term achievement test refers to:

1. Examinations in individual courses of instruction, in schools of all kinds and at all levels
2. Standardized measures of achievement used routinely by all the instructors in a particular unit of instruction
3. Commercially distributed tests of achievement used throughout the country

Ensuring the content validity of an achievement test through an explicit plan for constructing the test is more appropriate than determining content validity after the test has been constructed.

If representative persons who are to use the test agree in advance on the appropriateness of the plan, arriving at an acceptable instrument is mainly a matter of technical skill and applied research.

A test plan should include:

1. An outline of content for the instrument to be constructed
2. A description of the types of items to be employed, the approximate number of items in each section and subsection of the test, and examples of the types of items to be used
3. How long the test will take and how it will be administered
4. How it will be scored and what types of norms will be obtained
5. When the plan is completed, it is reviewed by numerous persons (teachers, subject matter experts, administration officials, etc.)
6. Suggestions for revision are given
7. The revised plan is resubmitted until it is approved by the reviewers

Even in ordinary final exams:

1. Make an outline of intended coverage
2. Discuss the outline with fellow instructors

Whether a test is good or bad depends largely on the test items. Types of items:

1. Short-answer essay questions
2. Longer essay questions
3. Objective items (e.g., multiple-choice items)

Reasons why commercially distributed achievement tests rely mostly on multiple-choice items:

1. They are very easy to administer and score
2. Expert item writers who are highly skilled at composing such items are available
3. When multiple-choice items are skillfully composed, they can accurately measure almost anything
4. They have higher reliability: they sample the topic much more broadly than would be possible with essay questions

Essay questions are not as reliable because of:

1. Measurement error due to sampling of content. A 50-minute exam could easily employ 50 multiple-choice items, compared with only 5 one-page essay questions; the essays take longer to answer and cover less content.
2. Measurement error due to subjectivity of scoring.

With 50 or more students, it is easier to use multiple-choice items rather than essays. With 15 students or fewer, it saves time to construct and score essay examinations.

1. Define clearly what you want to measure. Use substantive theory as a guide and try to make items as specific as possible.
2. Generate an item pool. Theoretically, all items are randomly chosen from a universe of item content. In practice, however, care in selecting and developing items is valuable. Avoid redundant items.
3. Avoid exceptionally long items, which are rarely good.
4. Keep the level of reading difficulty appropriate for those who will complete the scale.
5. Avoid double-barreled items, which convey two or more ideas at the same time. For example, consider an item that asks the respondent to agree or disagree with the statement "I vote Democratic because I support social programs." There are two different statements with which the person could agree: "I vote Democratic" and "I support social programs."

6. Mix positively and negatively worded items. Sometimes respondents develop the acquiescence response set, meaning they tend to agree with most items. To avoid this bias, you can include some items worded in the opposite direction. For example, in asking about depression, the CES-D (Center for Epidemiologic Studies Depression Scale) uses mostly negatively worded items (such as "I felt depressed"). However, the CES-D also includes items worded in the opposite direction ("I felt hopeful about the future").

Dichotomous format

Offers two alternatives; a point is given for the selection of one of them. Presents students with a series of statements, and the task is to determine which statements are true and which are false. Easy to construct, easy to administer, easy to score. The mere chance of getting any item correct is 50%.

Polytomous format

Each item has more than two alternatives. Multiple-choice items are easy to score, and the probability of obtaining a correct response by chance is lower than it is for true/false items. A major advantage is that it takes very little time for test takers to respond to a particular item, because they do not have to write, so the test can cover a large amount of information in a relatively short time.

When taking a multiple-choice test, you must determine which of several alternatives is correct. The incorrect choices are called distractors. It is usually best to develop three or four good distractors for each item; well-chosen distractors are an essential ingredient of good items. Psychometric analyses show that validity and reliability are equal for five-alternative and three-alternative multiple-choice items.

The formula to correct for guessing on a test is:

corrected score = R - W / (n - 1)

where
R = the number of right responses
W = the number of wrong responses
n = the number of choices for each item
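The correction can be computed directly. Below is a minimal sketch; the function name and the example numbers (40 right, 10 wrong on a four-choice test) are illustrative, not from the source.

```python
def corrected_score(right, wrong, n_choices):
    """Raw score corrected for guessing: R - W / (n - 1)."""
    return right - wrong / (n_choices - 1)

# Hypothetical example: 40 right, 10 wrong on a 4-choice test
# subtracts 10/3, i.e., roughly 3.33 points
print(corrected_score(40, 10, 4))
```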

The Likert format. One popular format for attitude and personality scales. It requires that a respondent indicate the degree of agreement with a particular attitudinal statement. Used as part of Likert's method of attitude scale construction. Example item: "I am afraid of heights."

Five alternatives are offered:

strongly disagree, disagree, neutral, agree, strongly agree. In some six responses are used: SD, moderately disagree, mildly disagree, mildly agree, moderately agree, SA. Scoring requires that any negatively worded items be reverse scored and the responses then be summed.
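Reverse scoring and summing can be sketched in a few lines. The item numbers, responses, and 1-to-5 response coding below are hypothetical, chosen only to illustrate the procedure.

```python
def score_likert(responses, reverse_items, n_points=5):
    """Sum Likert responses (coded 1..n_points), reverse-scoring
    negatively worded items so that all items point the same way."""
    total = 0
    for item, value in responses.items():
        if item in reverse_items:
            value = (n_points + 1) - value  # e.g., 5 -> 1 on a 5-point scale
        total += value
    return total

# Hypothetical 4-item scale; item 2 is negatively worded:
print(score_likert({1: 4, 2: 2, 3: 5, 4: 3}, reverse_items={2}))  # 4+4+5+3 = 16
```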

The Category format


Similar to the Likert format but uses an even greater number of choices; for example, measures in which people rate items on a 10-point scale. It can have more or fewer categories.

Checklists and Q-sorts


The adjective checklist is common in personality measurement. A subject receives a long list of adjectives and indicates whether each one is characteristic of herself (or himself) or someone else. The Q-sort increases the number of categories: a subject is given statements and asked to sort them into nine piles. If a statement really hits home, the subject would place it in pile 9; statements that are not at all descriptive would be placed in pile 1. Most of the cards are usually placed in piles 4, 5, and 6, so the frequency of items placed in each category usually looks like a bell-shaped curve. The items that end up in the extreme categories usually say something interesting about the person.

A good test has good items. But what are good items? Item analysis is a general term for a set of methods used to evaluate test items, and it is one of the most important aspects of test construction. The basic methods involve assessment of item difficulty and item discriminability.

For a test that measures achievement or ability, item difficulty is defined by the proportion of people who get a particular item correct. For example, if 84% of the people taking a particular test get item 24 correct, then the difficulty level for that item is .84. How hard should items be in a good test? This depends on the uses of the test and the types of items.

The first thing to be determined is the probability that an item could be answered correctly by chance alone. A true/false item with a difficulty level of .50 is not a good item, because guessing alone would produce that level of success. A multiple-choice item with four alternatives could be answered correctly 25% of the time, so we would require a difficulty greater than 25% for an item to be reasonable in this context. An item answered correctly by 100% of the respondents offers little value, since it does not discriminate among individuals.

The optimum difficulty is usually about halfway between 100% of the respondents getting the item correct and the level of success expected by chance alone.

Step 1. Find half of the difference between 100% success and chance performance:

(1.00 - .25) / 2 = .75 / 2 = .375

Step 2. Add this value to the probability of performing correctly by chance:

.375 (midway point) + .25 (chance performance) = .625 (optimum item difficulty)

A simpler method for obtaining the same result is to add 1.00 to chance performance and divide by 2.0. For this example the result would be:

(.25 + 1.0) / 2.0 = .625
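Both routes give the same answer, which is easy to verify in code. This is a sketch; the function name is mine, and the argument is the number of alternatives per item.

```python
def optimum_difficulty(n_choices):
    """Optimum item difficulty: halfway between chance performance
    (1 / n_choices) and 1.00, i.e., (chance + 1.0) / 2.0."""
    chance = 1.0 / n_choices
    return (chance + 1.0) / 2.0

print(optimum_difficulty(4))  # 0.625 for four-alternative multiple choice
print(optimum_difficulty(2))  # 0.75 for true/false items
```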

In most tests, the items should have a variety of difficulty levels, because a good test discriminates at many levels. For example, a professor who wants to determine how much his or her students have studied might like to: discriminate between students who have not studied at all and those who have studied just a little. Discriminate between students who studied a little and students who studied a fair amount. distinguish those students who have studied more than average from those who have worked and studied exceptionally hard.

For most tests, items in the difficulty range of .30 to .70 tend to maximize information about the differences among individuals. However, some tests require a concentration of more-difficult items. In constructing a good test, one must also consider human factors e.g., though items answered correctly by all students will have poor psychometric qualities, they may help the morale of the students who take the test.

A few easier items may help keep test anxiety in check, which in turn adds to the reliability of the test.

#3 One experiment that studied the expression of aggression looked at children's viewing of violent television cartoons and how this affected behavior toward peers. What is the dependent variable in this experiment? a. TV viewing of violent cartoons b. TV viewing of nonviolent cartoons c. Behavior toward peers d. Mental processes while viewing TV e. All of the above

#4 For nos. 4 to 8, refer to the following situation: The last three times Alex visited Dr. Sackeet, he was administered a painful immunization injection that made him cry in pain. When his mother brought Alex for another visit, Alex began to cry as soon as he saw Dr. Sackeet. Question: The painful injection that Alex received during each visit was a ___ that elicited tears from Alex. a. punishment b. negative reinforcement c. classical conditioning d. unconditioned stimulus e. conditioned stimulus

#8 Fortunately, Dr. Sackeet gave Alex no more injections for quite some time. Over that time, Alex gradually stopped crying and even came to like him. ___ had occurred. a. forgetting b. friendship c. generalization d. second-order conditioning e. extinction

#2 Which of the following characteristics/disorders is linked to the 23rd chromosome? a. Huntington's disease b. schizophrenia c. alcoholism d. color blindness e. phenylketonuria

Item no.   No. correct   Total no. of individuals   % correct
3          31            39                         79%
4          31            39                         79%
8          37            39                         95%
2          23            39                         59%
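The percentages above are just item difficulties (proportion correct). The loop below recomputes them from the counts in the table; the function name is mine.

```python
def item_difficulty(n_correct, n_total):
    """Item difficulty = proportion of test takers answering correctly."""
    return n_correct / n_total

# Counts from the table: (item number, number correct) out of 39 students
for item, correct in [(3, 31), (4, 31), (8, 37), (2, 23)]:
    print(f"Item {item}: p = {item_difficulty(correct, 39):.2f}")
# prints .79, .79, .95, and .59, matching the table
```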

Which items will I include in my exam?

Item discriminability is another way to evaluate test items: it examines the relationship between performance on particular items and performance on the whole test. We ask: who gets this item correct? Item discriminability determines whether the people who have done well on particular items have also done well on the whole test.

This method compares people who have done very well with those who have done very poorly on the test. For example, you might find the students with test scores in the top third and those in the bottom third of the class. Then you would find the proportions of people in each group who got each item correct. The difference between these proportions is called the discrimination index.

Item No.   Proportion correct,       Proportion correct,          Discriminability
           top third of class (Pt)   bottom third of class (Pb)   index (di = Pt - Pb)
1          .89                       .34                          .55
2          .76                       .36                          .40
3          .97                       .45                          .52
4          .98                       .95                          .03
5          .56                       .74                          -.18
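The discrimination indices follow directly from the two proportions. This minimal sketch recomputes the table's last column from its Pt and Pb values:

```python
def discrimination_index(p_top, p_bottom):
    """di = proportion correct in the top third of the class
    minus proportion correct in the bottom third."""
    return p_top - p_bottom

# (item, Pt, Pb) triples from the table above
for item, pt, pb in [(1, .89, .34), (2, .76, .36), (3, .97, .45),
                     (4, .98, .95), (5, .56, .74)]:
    print(f"Item {item}: di = {discrimination_index(pt, pb):+.2f}")
```

A negative value, as for item 5, flags a negative discriminator: weaker students outperform stronger ones on that item.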

Items 1, 2, and 3 appear to discriminate reasonably well. Item 4 does not discriminate well because the level of success is high for both groups; it must be too easy. Item 5 appears to be a bad item because it is a negative discriminator. This sometimes happens on multiple-choice examinations when overprepared students find some reason to disqualify the response keyed as correct.
