

11. In objective-test items the student’s task and the basis on which the examiner will judge the degree to which it has been accomplished are stated more clearly than they are in essay tests.

12. An objective test permits, and occasionally encourages, guessing. An essay test permits, and occasionally encourages, bluffing.

13. The distribution of numerical scores obtained from an essay test can be controlled to a
considerable degree by the grader; that from an objective test is determined almost entirely
by the examination itself.

In view of these similarities and differences, when might it be most appropriate and
beneficial to use essay items? Essay tests are favored for measuring educational achievement
when:

1. The group to be tested is small, and the test will not be reused.
2. The instructor wishes to provide for the development of student skill in written expression.
3. The instructor is more interested in exploring student attitudes than in measuring achievements. (Whether instructors should be more interested in attitudes than achievement and whether they should expect an honest expression of attitudes in a test situation seem open to question.)
4. The instructor is more confident of his or her proficiency as a critical essay reader than as an imaginative writer of good objective-test items.
5. Time available for test preparation is shorter than time available for test scoring.

Essay tests have important uses in educational measurement, but they also have some serious limitations. Teachers should be wary of unsubstantiated claims that essay tests can measure “higher-order thinking skills” if such skills have not been defined. They also should question the validity of using essay tests to determine how well students can analyze, organize, synthesize, and develop original ideas if the efforts of instruction were not directed toward such
goals. Unfortunately, there is a tendency in some classrooms for instruction to be geared toward
establishing a knowledge base and for evaluation to be directed toward application of that
knowledge— “What can they do with it?” One of the purposes of planning is to prevent the
occurrence of such inconsistencies.

Comparison of Objective Formats

The most commonly used kinds of objective items are multiple choice, true-false,
matching, classification, and short answer. Many other varieties have been described in other
treatments of objective-test item writing (Wesman, 1971). However, most of these special
varieties have limited merit and applicability. Their unique features often do more to change the
appearance of the item or to increase the difficulty of using it than to improve the item as a
measuring tool.

Multiple-choice and true-false test items are widely applicable to a great variety of tasks.
Because of this and because of the importance of developing skill in using each one effectively,
separate chapters are devoted to true-false and multiple-choice item formats later in this text.
The multiple-choice form of test item is relatively high in ability to discriminate between high
and low achieving students. It is somewhat more difficult to write than some other item types,
but its advantages seem so apparent that it has become the type most widely used in tests
constructed by specialists. Theoretically, and this has been verified in practice, a given multiple-choice test can be expected to show as much score reliability as a typical true-false test containing nearly twice as many items.
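
To make this relationship concrete, the following sketch applies the Spearman-Brown prophecy formula, a standard result from classical test theory that the text itself does not present; the 50-item length and the reliability of 0.70 are hypothetical values chosen only for illustration.

    # A rough sketch of the Spearman-Brown prophecy formula (standard classical
    # test theory, not derived in this text); the reliability value is hypothetical.
    def spearman_brown(reliability, length_factor):
        # Predicted reliability when a test is lengthened by length_factor.
        return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

    # If a 50-item true-false test had a score reliability of 0.70, a test of
    # twice that length would be predicted to reach about 0.82:
    print(spearman_brown(0.70, 2))   # 0.8235...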

The true-false item is simpler to prepare and is also quite widely adaptable. It tends to be somewhat less discriminating, item for item, than the multiple-choice type, and somewhat more
subject to ambiguity and misinterpretation. Although theoretically a high proportion of true-false
items could be answered correctly by blind guessing, in practice the error introduced into true-
false test scores by blind guessing tends to be small (Ebel, 1968). This is true because well
motivated examinees taking a reasonable test do very little blind guessing. They almost always
find it possible and more advantageous to give a rational answer than to guess blindly. The
problem of guessing on true-false test questions will be discussed in greater detail in Chapter 8.
Here is an example of the true-false format.

Those critics who urge test makers to abandon the “traditional” multiple-choice and true-
false formats and to invent new formats to measure a more varied and more significant array of
educational achievement are misinformed about two important points:

1. Any aspect of cognitive educational achievement can be tested by either the multiple-choice or the true-false format.

2. What a multiple-choice or true-false item measures is determined much more by its content than by its format.

The matching type is efficient in that an entire set of responses can be used with a cluster
of related stimulus words. But this is also a limitation since it is sometimes difficult to formulate
clusters of questions or stimulus words that are sufficiently similar to make use of the same set of
responses. Furthermore, questions whose answers can be no more than a word or a phrase tend to
be somewhat superficial and to place a premium on purely verbalistic learning. An example of
the matching type is given here.

The classification type is less familiar than the matching type, but possibly more useful in certain
situations. Like the matching type, it uses a single set of responses but applies these to a large
number of stimulus situations. An example of the classification type is the following.
The short-answer item, in which students must supply a word, phrase, number, or other
symbol, is inordinately popular and tends to be used excessively in classroom tests. It is easy to
prepare. In the elementary grades, where emphasis is on the development of vocabulary and the
formation of concepts, it can serve a useful function. It has the seeming advantage of requiring
the examinee to think of the answer, but this advantage may be more apparent than real. Some
studies have shown a very high correlation between scores on tests composed of parallel short-
answer and multiple-choice items, when both members of each pair of parallel items are intended
to test the same knowledge or ability (Eurich, 1931; Cook, 1955).

This means that students who are best at producing correct answers tend also to be best at
identifying them among several alternatives. Accurate measures of how well students can
identify correct answers tend to be somewhat easier to get than accurate measures of their ability
to produce them. There may be special situations, of course, where the correlation would be
much lower.

The disadvantages of the short-answer form are that it is limited to questions that can be
answered by a word, phrase, symbol, or number and that its scoring tends to be subjective and
tedious. Item writers often find it difficult to phrase good questions about principles,
explanations, applications, or predictions that can be answered by one specific word or phrase.
Here are some examples of short-answer items.
Some test specialists suggest that a variety of item types be used in each examination in order to diversify the tasks presented to the examinee. They imply that this will improve the validity of the scores or make the test more interesting. Others suggest that test constructors should choose the particular item type that is best suited to the material they wish to examine. There is more merit in the second of these suggestions than in the first, but even suitability of item form should not be accepted as an absolute imperative. Several item forms are quite widely adaptable. A test constructor can safely decide to use primarily a single item type, such as multiple choice, and
to turn to one of the other forms only when it becomes clearly more efficient to do so. The
quality of a classroom test depends much more on giving proper weight to various aspects of
achievement and on writing good items of whatever type than on the choice of this or that type of
item.

Item Complexity

There continues to be interest among some test developers in the use of items that present complex tasks, often based on lengthy or detailed descriptions of real or contrived situations. Some require the interpretation of complex data, diagrams, or background information. Figure 7-2 shows some examples of complex items presented by Bloom and his colleagues (1956). In some fields it is common to use items of this nature on licensure and certification written examinations, particularly if the examinee pool is not very large.

There are several reasons why complex items appear to be attractive. Since these tasks obviously call for the use of knowledge, they provide an answer to critics who assert that objective questions test only recognition of isolated facts. Furthermore, since the situations and background materials used in the task are complex, the items presumably require the examinee to use higher mental processes. Finally, the items are attractive to those who believe that education should be concerned with developing a student’s ability to think rather than mere command of knowledge (as if knowledge and thinking were independent attainments!).

However, these complex tasks have some undesirable features as test items. Because they tend to be bulky and time consuming, they limit the number of responses an examinee can make per hour of testing time; that is, they limit the size of the sample of observable behaviors. Hence, because of the associated reduction in reliability, tests composed of complex tasks tend to be less efficient than is desirable in terms of accuracy of measurement per hour of testing.

Furthermore, the more complex the situation and the higher the level of mental process required to make some judgment about the situation, the more difficult it becomes to defend any one answer as the best answer. For this reason, complex test items tend to discriminate poorly between high and low achievers. They also tend to be unnecessarily difficult unless the examiner manages to avoid this. Even advocates of complex situational or interpretative test items do not claim that good items of this type are easy to write.

The inefficiency of these items, the uncertainty of the best answer, and the difficulty of
writing good ones could all be tolerated if the complex items actually did measure more
important aspects of achievement than can be measured by simpler types. However, there is no
good evidence that this is the case. A simple question like, “Will you marry me?” can have the
most profound consequences. It can provide a lifetime’s crucial test of the wisdom of the person
who asks it and of the one who answers.

Some item writers are drawn to complex items because they are perceived as requiring
the application of knowledge. But any good item tests application of knowledge: good multiple-
choice items, for example, require more than recall. And some items test for knowledge
indirectly by giving examinees a task that requires knowledge. Numerical problems, discussed
earlier, test for application of knowledge, as do error-recognition spelling tests, tests that require
the examinee to add or correct punctuation and capitalization, tests requiring editing of text, or
those that require the dissection and labeling of sentence parts. When examinees are asked to
interpret the meaning of a table, graph, musical score, cartoon, poem, or passage of text material,
they are asked to apply their knowledge.

Items that require interpretation of materials often are referred to as context-dependent items. (They have no meaning outside the context of the material about which they are written.) They are widely used in tests of general educational development, tests whose purposes are to measure the abilities of students with widely different educational backgrounds. (Most succeed
quite well in doing so.) However, they are less appropriate, convenient, and efficient in testing
for achievement in learning specific subject matter. Test users should be skeptical of claims that
context-dependent items measure abilities rather than knowledge, because the abilities they
measure are almost wholly the results of knowledge.

Many of the indirect tests of knowledge, through special applications of the knowledge or
the use of complex situations, can be presented in true-false, multiple-choice, short-answer, or
matching form. Some are more conveniently presented in open-ended fashion, such as requiring
the examinee to produce a diagram, sketch, or set of editorial corrections. The main point to be
made here is that, while achievement can be tested most conveniently with one of the common
item formats, there are occasions when other means may be more convenient, satisfactory, or
palatable to those who are charged with providing evidence for valid score use.

NUMBER OF ITEMS

The number of questions to include in a test is determined largely by the amount of time
available for it. Many tests are limited to 50 minutes, more or less, because that is the scheduled
length of the class period. Special examination schedules may provide periods of 2 hours or
longer. In general, the longer the period and the examination, the more reliable the scores
obtained from it. However, it is seldom practical or desirable to prepare a classroom test that will
require more than 3 hours.

A reasonable goal is to make tests that include few enough questions so that most
students have time to attempt all of them when working at their own normal rates. One reason for
this is that speed of response is not a primary objective of instruction in most K-12 and college
courses and hence is not a valid indication of achievement. In many areas of proficiency, speed
and accuracy are not highly correlated. Consider the data in Table 7-3. The sum of the scores for
the first ten students who finished the test was 965. The highest score in that group was 105. The
lowest was 71. Thus, the range of scores in that group was 35 score units. Note that, though the
range of scores varies somewhat from group to group, there is no clear tendency for students to
do better or worse depending on the amount of time spent. One can conclude from these data that
on this test there was almost no relation between time spent in taking the test and the number of
correct answers given.

A second reason for giving students ample time to work on a test is that examination
anxiety, severe enough even in untimed tests, is accentuated when pressure to work rapidly as
well as accurately is applied. A third is that efficient use of an instructor’s painstakingly
produced test requires that most students respond to all of it.

In some situations, speeded tests may be appropriate and valuable, but these situations seem to be the exception, not the rule. Though there are no absolute standards for judging speededness, measurement specialists have come to adopt this one: A test is speeded if fewer than 90 percent of the test takers are able to attempt all items.
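
As a concrete illustration of this convention, the short sketch below checks a set of hypothetical attempt counts against the 90 percent rule; the function name and the data are illustrative only.

    # A minimal sketch of the 90 percent convention stated above; the function
    # name and the attempt counts are illustrative, not taken from the text.
    def is_speeded(items_attempted, total_items, threshold=0.90):
        # A test is "speeded" if fewer than `threshold` of examinees attempt every item.
        finished = sum(1 for n in items_attempted if n == total_items)
        return finished / len(items_attempted) < threshold

    attempts = [50, 50, 50, 48, 50, 50, 49, 50, 50, 50]  # items attempted by 10 examinees
    print(is_speeded(attempts, total_items=50))  # True: only 8 of 10 (80%) attempted all 50 items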

The number of questions that an examinee can answer per minute depends on the kind of
questions used, the complexity of the thought processes required to answer them, and the
examinee's work habits. The fastest student in a class may finish a test in half the time required
by the slowest. For these reasons, it is difficult to specify precisely how many items to include in
a given test. Rules such as “use one multiple-choice item per minute” or “allow 30 seconds per true-false item” are misleading and unsubstantiated generalizations. Only experience with
similar tests in similar classes can provide useful test-length information.

Finally, the number of items needed depends also on how thoroughly the domain must be
sampled. And that, of course, depends on the type of score interpretation desired. For example, a
test covering 10 instructional objectives may require a minimum of 30 items when objectives-
referenced interpretations are wanted, but 20 items might suffice for norm-referenced purposes.

Content Sampling Errors


If the amount of time available for testing does not determine the length of a test, the
accuracy desired in the scores should determine it. In general, the larger the number of items
included in a test, the more reliable the scores will be. In statistical terminology, the items that
make up a test constitute a sample from a much larger collection, or population, of items that
might have been used in that test. A 100-word spelling test might be constructed by selecting
every fifth word from a list of the 500 words studied during the term. The 500 words constitute
the population from which the 100-word sample was selected.

Consider now a student who, asked to spell all 500 words, spells 325 (65 percent) of them
correctly. Of the 100 words in the sample, he spells 69 (69 percent) correctly. The difference
between the 65 percent for the population and the 69 percent for the sample is known as a
sampling error.
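
A small simulation, sketched below, shows how a discrepancy of this kind arises even when the 100-word sample is drawn strictly at random; the numbers mirror the hypothetical student described above.

    # A minimal sketch of the spelling example above: 1 marks a word the student
    # can spell (325 of 500), and a 100-word test is drawn at random.
    import random

    population = [1] * 325 + [0] * 175
    random.seed(0)                           # fixed seed for a reproducible illustration
    sample = random.sample(population, 100)  # the 100-word test
    sample_pct = 100 * sum(sample) / len(sample)
    print(sample_pct)  # close to, but usually not exactly, 65; the gap is the sampling error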

In the case of the spelling test, the population of possible questions is real and definite.
But for most tests it is not. That is, there is almost no limit to the number of problems that could
be invented for use in an algebra test or to the number of questions that could be formulated for a
history test. Constructors of tests in these subjects, as in most other subjects, have no
predetermined limited list from which to draw representative samples of questions. But their tests
are samples, nevertheless, because they include only a fraction of the questions that could be
asked in each case. A major problem of test constructors is thus to make their samples fairly
represent a theoretical population of questions on the topic.

The larger the population of potential questions, the more likely it is that the content
domain is heterogeneous; that is, it includes diverse and semi-independent areas of knowledge or
ability. To achieve equally accurate results, a somewhat larger sample is required in a
heterogeneous than in a homogeneous domain. And as we have already noted, generally a larger
sample will yield a sample statistic closer to the population parameter than a more limited
sample.

Since any test is a sample of tasks, every test score is subject to sampling errors. The
larger the sample, the smaller the sampling errors are likely to be. Posey (1932) has shown that
examinees’ luck, or lack of it, in being asked what they happen to know is a much greater factor
in the grade they receive from a 10-question test than from one of 100 questions. Sampling errors
are present in practically all educational test scores. However, it is important to realize that such
errors are not caused by mistakes in sampling. A perfectly chosen random sample will still be
subject to sampling errors simply because it is a sample.
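
The effect of sample size can be made concrete with the standard error of a proportion, a routine statistical result that the text does not derive. The sketch below compares a 10-item and a 100-item test for an examinee who knows 65 percent of the domain; the figures are illustrative, not drawn from Posey's data.

    # A minimal sketch using the standard error of a proportion, sqrt(p*(1-p)/n);
    # p = 0.65 mirrors the spelling example, and the two test lengths mirror
    # the 10-question versus 100-question comparison above.
    import math

    def standard_error(p, n):
        return math.sqrt(p * (1 - p) / n)

    p = 0.65
    print(round(standard_error(p, 10), 3))   # 0.151 -> roughly 15 percentage points
    print(round(standard_error(p, 100), 3))  # 0.048 -> roughly 5 percentage points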

LEVEL AND DISTRIBUTION OF DIFFICULTY

There are two ways in which the problem of test difficulty can be approached. One is to include
in the test only those items that any student who has studied successfully should be able to
answer. If this is done, most of the students can be expected to answer the majority of the items
correctly. Put somewhat differently, so many correct answers are likely to be given that many of
the items will not be effective in discriminating among various levels of achievement—best,
good, average, weak, and poor. The score distribution in this circumstance will be very
homogeneous, as reflected by a small standard deviation. But when our goal is to make norm-
referenced score interpretations, clearly such a test would yield scores of disappointingly low
reliability.

The other approach, for norm-referenced testing, is to choose items of appropriate content
on the basis of their ability to reveal different levels of achievement among the students tested. This requires preference for moderately difficult questions. The ideal difficulty of these items should be at a point on the difficulty scale (percent correct) midway between perfect (100 percent correct response) and the chance level of difficulty (50 percent correct for true-false items, 25 percent correct for four-alternative multiple-choice items). This means the proportion of correct responses, the item p-value, should be about 75 percent correct for an ideal true-false item and about 62.5 percent correct for an ideal multiple-choice item. (The term p-value is used
to refer to the difficulty of an item.) This second approach generally will yield more reliable
scores than the first for a constant amount of testing time.
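
The midpoint rule described above amounts to a one-line calculation; the sketch below simply reproduces the two figures given in the text.

    # A minimal sketch of the midpoint rule stated above: ideal difficulty is
    # halfway between a perfect score (1.0) and the chance level for the format.
    def ideal_p_value(chance_level):
        return (1.0 + chance_level) / 2

    print(ideal_p_value(0.50))  # 0.75  -> true-false items
    print(ideal_p_value(0.25))  # 0.625 -> four-alternative multiple-choice items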

As we will see in the upcoming chapters on item writing, there are several methods item
writers can use to manipulate the difficulty level of a test item prepared for a specific group. And
for norm-referenced testing, such manipulations must be employed to create items of the desired
difficulty level. Though it is possible to use the same methods to control the difficulty of items
written for a criterion-referenced test, such manipulations would be inappropriate. For criterion-
referenced measurement, the difficulty is built into the tasks or the knowledge descriptions that
specify the content domain. When item writers manipulate item content to adjust perceived
difficulty, they are in effect creating a mismatch between item content and the domain definition.
These mismatches impact content relevance by underrepresenting legitimate content and by
introducing irrelevant (or less relevant) content. In sum, part of the reason for not specifying the
norm-referenced content domain too precisely is that it gives license to the item writer to create
items of the most appropriate difficulty.

Some instructors believe that a good test should include some difficult items to “test” the
better students and some easy items to give poorer students a chance. But neither of these kinds
of items tends to affect the rank ordering of student scores appreciably. The higher-scoring
students generally would answer the harder items and, therefore, earn higher scores yet. Nearly
everyone would answer the easy items. The effect of easy items is to add a constant amount to
each examinee’s score, to raise all scores, but without affecting the rank order of students’
scores. For good norm-referenced achievement measures, items of moderate difficulty, not too hard and not too easy, contribute most to discriminating between students who have learned
varying amounts of the content of instruction.

Tests designed to yield criterion-referenced score interpretations likely will be easier in difficulty level than their norm-referenced counterparts. When testing for minimum competency
or for mastery, the expectation is that most students have reached the minimum level or have
achieved mastery. The items in these tests should be easy for most students but should be
difficult for those who have not mastered the content the items represent. It should be clear that a
test item in isolation is not easy or difficult. The difficulty of an item relates to the nature of the
group and depends on the extent to which those in the group possess the ability presented by the
task.
