A Physics Diagnostic Test
Leo Sutrisno
Dept. Math and Science Education
Faculty of Education
Tanjungpura University
Pontianak, Indonesia
List of contents
List of Figures
Figure 1: A 2x2 table of response patterns of item A and item B on dichotomous events
List of Tables
Table 1: Characteristics of items of the physics diagnostic test
The test is designed primarily to detect common student errors in learning about sound; in other words, it is used as a diagnostic test (Gronlund, 1981). The results of the test will then be used to design learning experiences to remedy these errors. The test is used again to detect whether or not these errors have been overcome through remedial activities. In this study, the expected ("correct") model of performance is scientists' conceptions and the items of the test are based on the students' pre-conceptions.
domain of this Taxonomy will be adopted as the basis for a list of behavioural objectives.
The course content outline is based on the Indonesian
Curriculum-1984 for the General Senior High Schools (SMA). The
unit of instruction about sound consists of several sub units such
as the sources of sound, the transmission of sound, the medium
of transmission, velocity of sound in several media, musical
instruments, human ears and the Doppler Effect.
Gronlund (1981, p.126) suggests that a diagnostic test
should have a relatively low level of difficulty and that most
items should be focused on the knowledge level (51%) with few
items at the level of synthesis or evaluation. At the knowledge
level, items are grouped into knowledge of specific facts,
knowledge of conventions and knowledge of principles.
Knowledge of specific facts refers to "those facts which can only be known in a larger context" (Bloom, 1956, p.65), while knowledge of principles refers to "particular abstractions which summarize observations of phenomena" (p.75). The items at this level are intended to require recall and recognition of facts, conventions or principles in learning about sound.
At the comprehension level, items deal with the interpretation of a given illustration, while at the application level, items deal with selecting an appropriate principle or principles to solve a given problem. Items at the analysis level focus on analysing the elements of a given problem.
As a broad classification, tests can be grouped as either
essay or objective types. Karmel (1970) makes a comparison
between essay and objective tests on the abilities measured,
scope, incentive to pupils, preparation and method of scoring.
The essay test uses few questions, requires the students to use
their own words to express their knowledge, generally covers
only a limited field of knowledge, encourages pupils to learn how
to express and organize their knowledge, and is very time
consuming to score.
On the other hand, the objective test requires the students
to answer, at most, in a few words, can cover a broad range of
student preparation, is time consuming to construct but can be
scored quickly. "Objective" tests are objective in the sense that
once an intended correct response has been decided on, the test
can be scored objectively, by a clerk or mechanically. On the other
hand, "essay" tests require an "expert" to decide on the worth of
the response and so involve an element of subjectivity in scoring (see also Gronlund, 1981).
Hopkins and Stanley (1981) present a list of limitations of
essay tests based on research findings: reader unreliability, halo
effects, item-to-item carryover effects, test-to-test carryover
effects, order effects, and language mechanics effects. Theobald (1974) presents similar criticisms: evaluation is difficult, scoring is generally unreliable and costly in time and effort, and the sampling of student behaviours is usually inadequate. However, Blum and Azencot (1986), who compared the results of students on multiple-choice test items and equivalent essay questions in an Israeli examination, concluded that there were no significant differences between the mean scores.
Because of the need to comprehensively sample the field of
learning in a diagnostic test (Hopkins and Stanley, 1981),
objective tests are to be preferred over essay tests for this
purpose.
patterns on N items with A options in each item. The function A^N is maximized by A = e when the product N·A is held fixed. Since e ≈ 2.718, the optimum is approximated by A = 3. Costin (1970) compared the discrimination
indices, the difficulty indices and the reliabilities (KR-20) of test
results which were based on 3-option and 4-option multiple-choice
tests on perception (N = 25), motivation (N = 30), learning (N =
30) and intelligence (N = 25). He found that the 3-option forms
produced higher values than the 4-option forms.
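The maximization argument can be sketched briefly: writing the fixed product as C = N·A, the number of distinguishable response patterns becomes a function of A alone, and setting its derivative to zero gives the stationary point at A = e:

\[
A^{N} = A^{C/A} = \exp\!\left(\frac{C}{A}\ln A\right),
\qquad
\frac{d}{dA}\!\left(\frac{C}{A}\ln A\right)
= \frac{C\,(1-\ln A)}{A^{2}} = 0
\;\Longrightarrow\; A = e \approx 2.718,
\]

so the nearest whole number of options is three.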
The second approach was proposed by Grier (1975) and is based on test reliability. The expected reliability of a multiple-choice test is a function of the number of options per item. Grier concluded that for C > 54, three options per item maximize the expected reliability coefficient (for large C the optimal number of options approaches 2.50).
The third approach examines the knowledge-or-random-guessing assumption. Lord (1977) used this approach and found that it supported Grier's result of three options. The fourth approach is based on the use of the item characteristic curve (Lord, 1977). The item characteristic curve of an item gives the probability of a correct answer to the item as a function of examinee ability. Lord found that a test in which the pseudo chance-score level of items is 0.33 is superior to the others.
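For reference, one common form of item characteristic curve that includes a pseudo chance-score level is the three-parameter logistic model; the exact functional form used by Lord is not quoted in the text, so this is illustrative only:

\[
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}},
\]

where \(\theta\) is the examinee's ability, \(a_i\) the item discrimination, \(b_i\) the item difficulty, and \(c_i\) the pseudo chance-score level (0.33 for a three-option item answered by blind guessing).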
Green, Sax and Michael (1982) analysed the reliability and the validity of tests by the number of options per item. They found that four options per item produced the highest reliability when compared with three and five options, but that three options had the highest predictive validity (correlation with the course grade).
The experimental evidence quoted above argues in favour of using three options per item and this form will be adopted in this study. A multiple-choice item has two parts: an introductory statement that poses the problem and a series of possible responses. The first part is called the "stem" and the second part the "options", "alternatives", or "choices". In this study the term "options" is used. The options comprise the correct response and the distracters, also called foils or decoys.
Using the table of specifications, items were constructed for each sub unit of instruction.
Several studies have been conducted on the effects of violating item construction principles. McMorris, Brown, Snyder and Pruzek (1972) studied the relationship between providing cues to the correct response and the test results. Three types of cues were used in their study: words or phrases in the stem which point to the correct answer; grammar, where the correct answer was the only option grammatically consistent with the stem; and length, where the correct answer was longer than the distracters. They found that these factors were positively correlated with the test results (.48, .39, and .47 for words or phrases, grammar, and length respectively).
Austin and Lee (1982) studied the relationship between the readability of test items and item difficulty. The aspects of readability considered were the number of sentences in each item, the number of words in each item, and the number of "tokens" (words, digits or mathematical symbols) in each item. They found that these aspects of readability were negatively correlated with item difficulty (-.22, -.24, and -.15 for the number of sentences, words and tokens respectively): increasing the number of sentences, words or tokens in an item would decrease the number of correct answers.
Schrock and Mueller (1982) studied the effect of the stem
form (incomplete and complete statement), and the presence or
absence of extraneous material as cues to attract students to the
correct answers. They suggest that "a complete sentence stem is
to be used rather than an incomplete sentence stem" and that
"extraneous material should not be present in any type of stem"
(p.317).
Green (1984) studied the effects of the difficulty of language and of the similarity among the options on item difficulty. The difficulty of language was varied through the length of the stems, the syntactic complexity, and the substitution of an uncommon term in the stem with a familiar term. She found that the similarity among the options significantly affected item difficulty (F = 72.21, p < .01) but that the difficulty of language did not.
There are standard guidelines for constructing multiple
choice items (Hopkins and Stanley, 1981; Noll et al., 1979;
Gronlund, 1981; Theobald, 1974).
In relation to the item, there are many suggestions:
1. Items should cover an important achievement.
2. Each item should be as short as possible.
3. The reading and linguistic difficulty of items should be
low.
4. Items that reveal the answer to another should be
avoided.
5. Items which use the textbook wording style should be
avoided.
6. Items which use specific determiners (e.g., always,
never) should be avoided.
7. Each item should have only one correct answer.
8. In a power test, there should not be too many items, otherwise the test would become a speed test.
Summing up, items should be as clear and short as possible
to pose the problem and they should not contain cues which
attract students' attention to the correct response.
Test items used in this study were written to follow the
recommendations quoted above as closely as possible. Ninety-two
items were constructed and divided into two tests for ease of
administration: form-A (45 items) and form-B (47 items). The first
drafts were written in English and the final drafts for trying out
were in Bahasa Indonesia.
2.1 Item analysis
There are many standard item analysis techniques
(Hopkins and Stanley, 1981; Anastasi, 1976; Hills, 1981;
Gronlund, 1981; and Theobald, 1974). The most common indices
used are the facility value and the discriminating power of the
item. The means and standard deviations of the total correct
answers for each form of the tests are 54.91 and 22.92 (form A);
53.26 and 26.04 (form B).
Facility values
Facility value is defined as "the percentage of students
completing an item correctly" (Theobald, 1974, p.33). Some
authors use the reverse concept, that of "item difficulty" (Noll et
al., 1979, p.83; Anastasi, 1976, p.199; Gronlund, 1981, p.258). In
this study the term "facility value" is adopted as its use will lead
to less confusion. The greater the facility value of the item the
more students are able to answer correctly.
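As a concrete illustration, a minimal sketch of how facility values might be computed from a students-by-items matrix of scored responses; the matrix layout and the function name are assumptions for the example, not a description of the procedure actually used in the study:

import numpy as np

def facility_values(responses):
    """Facility value of each item: the percentage of students who answered
    it correctly. `responses` is a students x items matrix of 0/1 scores."""
    scores = np.asarray(responses, dtype=float)
    return 100.0 * scores.mean(axis=0)

# Example: 4 students, 3 items -> facility values of 75%, 50% and 25%.
demo = [[1, 1, 0],
        [1, 0, 0],
        [0, 1, 0],
        [1, 0, 1]]
print(facility_values(demo))   # [75. 50. 25.]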
The means and the standard deviations of the facility values
of items of the test are 55 percent (SD = 10) for the form-A and
48.6 percent (SD = 21.5) for the form-B. The minimum values are
24 percent (item no 24 of form-A) and 19 percent (item no 25 of
the form-B) while the maximum values are 90 percent (item no 7
of the form-A) and 100 percent (item no 5 of the form-B).
The facility values of items will be considered as an aspect
which should be taken into account when selecting items to produce
the final test. Theobald (1974) suggests that "items should lie
within the range of 20 per cent - 80 per cent difficulty" (p.34).
This suggestion will be implemented but with consideration given
to Anastasi's warning that "the decisions about item difficulty
cannot be made routinely, without knowing how the test scores
will be used" (Anastasi, p.201).
As mentioned earlier the test is designed to be used as a
diagnostic test in an attempt to measure in detail academic
strength and weaknesses in a specific area, in contrast to the
survey test which is an attempt to measure overall progress
(Karmel, 1970, p.283). If an item does not appear to suffer from
technical faults in construction, a low facility value could indicate that, in most students, pre-conceptions have not been replaced by scientists' conceptions. Such an item points to a common student weakness and would tend to be retained.
item score and the criterion score. Usually the total score on the
test itself is used as the criterion score.
The biserial correlation method is based on the assumption
that "the knowledge of the test item and the knowledge of the
entire test are both distributed normally" (Mosier and McQuitty,
1940, p.57).
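A minimal sketch of how such an item-criterion coefficient might be computed, assuming a 0/1 item vector and the total test score as the criterion: the point-biserial form is computed first and then converted to the biserial with the usual normal-ordinate correction. The function names are illustrative, and this is not the computational procedure quoted from the study:

import numpy as np
from scipy.stats import norm

def point_biserial(item, total):
    """Point-biserial correlation between 0/1 item scores and criterion scores."""
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    p = item.mean()                      # facility (proportion correct)
    m1 = total[item == 1].mean()         # mean criterion score of the correct group
    m0 = total[item == 0].mean()         # mean criterion score of the incorrect group
    return (m1 - m0) / total.std() * np.sqrt(p * (1.0 - p))

def biserial(item, total):
    """Biserial correlation, assuming a normally distributed latent trait."""
    p = np.asarray(item, dtype=float).mean()
    y = norm.pdf(norm.ppf(p))            # ordinate of N(0,1) at the p/q split point
    return point_biserial(item, total) * np.sqrt(p * (1.0 - p)) / y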
The means and standard deviations of the biserial correlation coefficients were .20 (SD = 0.12) for the test form-A and .15 (SD = 0.13) for the test form-B. Item no 1 of form-A had the lowest correlation. This item dealt with the generation of sound. Item no 6 of form-B, which attempted to investigate students' understanding of waves by using a diagram of a wave, had the lowest correlation among items of form-B. Item number 43 of form-A and number 17 of form-B had the highest correlation coefficients (.46 and .38 respectively). Item 43 dealt with the Doppler Effect, while item 17 dealt with the transmission of sound in a metal bar.
The use of the phi coefficient was developed by Guilford
(1941) based upon "the principle of the correlation between an
item and some criterion variable" (p.11). This method is applicable if the two groups being compared are equal in number.
Several computational aids have been developed for
arriving at phi coefficient values: tables (Jurgensen, 1947, for
equal groups; Edgerton, 1960, for unequal groups), nomograph
(Lord, 1944), and abacs (Mosier and McQuitty, 1940; Guilford,
1941).
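The abac is a graphical aid; computed directly from the 2 x 2 table of item success against membership of the upper or lower criterion group, the same coefficient can be sketched as follows (an illustration, not Theobald's abac procedure itself):

import math

def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient from a 2 x 2 table of counts.

    n11: item correct & upper group      n10: item correct & lower group
    n01: item incorrect & upper group    n00: item incorrect & lower group
    """
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator if denominator else 0.0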
In this study the phi coefficients of items are calculated
following Theobald's procedures for using the abac method.
Table 4.4.7 presents the means and standard deviations of the phi coefficients of the tests form-A and form-B. There is no significant difference between the phi coefficients of the two forms (t = 0.02; not significant at the .01 level).
Findley (1956, p.177) also proposed two formulas to
measure the discrimination power of items. One of the potential
virtues of this method is that it can be used to provide "a precise
measure of the distractive power of each option" (Findley, p. 179).
In this regard, Findley's method can provide more information than either the biserial correlation coefficient or the phi coefficient measured by the abac method, so the index of discrimination power of items measured using Findley's method will be used to select items.
The means and standard deviations of these indices are .22 and 0.17 for form-A, and .21 and 0.18 for form-B. There is no significant difference between the indices of the two forms (t = 0.02; not significant at the .01 level). The Findley index and the phi coefficient are highly correlated (.96 for form-A and .97 for form-B).
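The computation behind such an index can be sketched as an upper-lower comparison, which is the usual reading of Findley's approach; Findley's exact formulas are not reproduced in the text, so the details below, including the per-option sign convention, are assumptions for illustration:

import numpy as np

def option_discrimination(options_chosen, key, total_scores):
    """Upper-lower discrimination values for every option of one item.

    options_chosen: the option label each student chose for the item
    key:            the label of the correct option
    total_scores:   criterion (total test) scores used to form equal-sized
                    upper and lower groups
    """
    options_chosen = np.asarray(options_chosen)
    order = np.argsort(total_scores)
    half = len(order) // 2
    lower, upper = order[:half], order[-half:]      # equal-sized halves
    values = {}
    for opt in np.unique(options_chosen):
        p_upper = np.mean(options_chosen[upper] == opt)
        p_lower = np.mean(options_chosen[lower] == opt)
        # For the keyed option this difference is the discrimination index D;
        # for a distracter the sign is reversed, so that a distracter chosen
        # mainly by weaker students receives a positive value.
        values[opt] = (p_upper - p_lower) if opt == key else (p_lower - p_upper)
    return values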
The official class-period time in Indonesian secondary schools is 45 minutes but the effective period would be about 40 minutes. Students have wide experience of multiple-choice testing in Indonesian secondary schools, and experience suggests that for a test to be completed by nearly all students the appropriate number of items would be about 30. A final form of the physics diagnostic test was constructed from the 92 items which were trialled. The 30 items were selected mainly from the test form-A; where needed, items from form-B were included.
In selecting items for inclusion in a test, Hopkins and
Stanley (1981, p.284) believe that an item which has high
difficulty and low discrimination power may, on occasions, be
acceptable. On the other hand, Gronlund (1981) says that "a low
D should alert us to the possible presence of technical defect in a
test item" (p.262). Noll et al. (1979) stated that there is little
reason to retain items which have negative discrimination indices
unless the other important values can be shown. Theobald (1974)
suggested that "items should lie within the range of 20 per cent -
80 per cent difficulty" (p.34) and "items should be carefully
scrutinized whenever D < +.2" (p.32). Anastasi claims that items with difficulty and discrimination values of around 50 per cent are preferable.
Although "item analysis is no substitute for meticulous
care in
planning, constructing, criticizing, and editing items" (Hopkins and
Stanley 1981, p.270) Theobald's quantitative guidelines will be
adopted. He states, however, that even when test items are
chosen to reflect precisely stated behavioural objectives, and the
test as a whole is an adequate and representative sample of the
46
criterion behaviours for the course of study, additional
considerations also apply (P32).
There are three additional considerations: the students'
and the teachers' comments about the items, prerequisite
relationships among sub units of study and caution indices.
Students' and teachers' comments on the items are available.
These were also considered carefully when selecting or rejecting
items.
There are many methods used to test models of
prerequisite relationships among sub units. The first method is the
Proportion Positive Transfer method which was pioneered by
Gagne and Paradise (Barton, 1979), and its refinement suggested
by Walbesser and Eisenberg (in White, 1974b). White observed
that these methods do not take into account errors of
measurement, so he and Clark proposed another method (White, 1974a, 1974b). The scalogram method has also been widely used, applying Guttman's coefficient (Yeany, Kuch and Padilla, 1986) and the phi coefficient (Barton, 1979; White, 1974a). Barton also proposed a method called the Maximum Likelihood method. Proctor (1970) suggested the use of chi-square procedures (Bart and Krus, 1973, p.293).
Dayton and Macready (1976) used a probabilistic method
(Yeany et al., 1986). An Ordering Theory method has been used
by Bart and Krus (1973); Airasian and Bart (1973); Krus and Bart
(1974). Bart and Read (1984) tried to adopt Fisher's exact
probability method.
Although all these methods have their own particular
advantages and limitations most of them share a similar problem
of determining whether a certain model of prerequisite
relationship occurred by chance or not. Bart and Read's method
suggests procedures to solve this problem, so their method has
been adopted in this study. This method is based on the axiom:
For dichotomous items, with a correct response scored "1" and an incorrect response scored "0", success on item i is considered a prerequisite to success on item j if and only if the response pattern (0,1) for items i and j respectively does not occur. (Bart and Read, 1984, p.223)
Given items A and B which have been administered to N students, we produce a 2 x 2 table of response patterns as follows:

Figure 1: A 2x2 table of response patterns of item A and item B on dichotomous events

                              Item B
                     Success (1)   Fail (0)    Total
Item A  Success (1)     N11           N12        N1.
        Fail (0)        N21           N22        N2.
        Total           N.1           N.2        N
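A minimal sketch of the axiom applied to two 0/1 response vectors; Bart and Read's full method adds a statistical test of whether a non-zero count in the (0,1) cell could have arisen by chance, which is not reproduced here:

import numpy as np

def response_pattern_table(item_i, item_j):
    """2 x 2 table of joint response patterns for items i and j (1 = success)."""
    i = np.asarray(item_i)
    j = np.asarray(item_j)
    return {
        (1, 1): int(np.sum((i == 1) & (j == 1))),
        (1, 0): int(np.sum((i == 1) & (j == 0))),
        (0, 1): int(np.sum((i == 0) & (j == 1))),   # fail i, succeed j
        (0, 0): int(np.sum((i == 0) & (j == 0))),
    }

def is_prerequisite(item_i, item_j):
    """Item i is treated as prerequisite to item j iff the (0,1) pattern
    (fail on i, succeed on j) does not occur, as in the axiom quoted above."""
    return response_pattern_table(item_i, item_j)[(0, 1)] == 0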
need to be given remedial activities. The method adopted in this
study is based on Student-Problem (S-P) curve theory. Harnisch
(1983) states that this method also provides information about
each respective item by observing the distribution of distracters
above the P-curve.
An unusual item is one that has a large number of better than average students answering it incorrectly while an equal number of less than average students answer it correctly. (Harnisch, 1983, p.199)
Harnisch and Linn (1981), and Harnisch (1983) proposed the
Modified Caution Index (MCI) Formula. This formula was originally used to detect the characteristics of students based on their
responses. Harnisch (1983) stated that the MCI can be used to
detect the characteristics of items as well by reversing the roles
of students and items. "High MCI's for items indicate an unusual
set of responses by students of varying ability, and thus these
items should be examined closely by the test constructors and
the classroom teacher" (p.199). This criterion will be adopted as
one of the additional considerations in selecting items.
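For reference, a sketch of one published formulation of the modified caution index, assuming the Harnisch and Linn (1981) form; the exact formula is not quoted in the text. For a student MCI the row holds that student's item scores weighted by item facilities; for an item MCI the roles are reversed, as described above:

import numpy as np

def modified_caution_index(row, weights):
    """One common formulation of the Modified Caution Index for a 0/1 vector.

    For a student, `row` holds the student's item scores and `weights` the item
    facility values; for an item, the roles are reversed (the item's scores over
    students, weighted by student ability proportions).
    """
    row = np.asarray(row, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(-w)                 # easiest (highest weight) first
    k = int(row.sum())                     # number-right score
    if k == 0 or k == len(row):
        return 0.0                         # all-wrong/all-right patterns show no misfit
    best = w[order[:k]].sum()              # Guttman-perfect pattern: k easiest correct
    worst = w[order[-k:]].sum()            # most aberrant pattern: k hardest correct
    actual = (row * w).sum()               # weights of the items actually correct
    return (best - actual) / (best - worst)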
Thus the criteria used for the selection of appropriate test
items to be used in this study can be restated as:
1. The facility value of the item lies between 20%-80%.
2. The discrimination index (D) is equal to or more
than .20.
If these requirements are not met, additional information
about the items is needed such as:
3. The students' and teachers' comments about the item.
4. The prerequisite relationships of the item for other
items.
5. The Modified Caution Index (MCI) of the item.
The final form of the test
There were 22 items of form-A which met the first two criteria. Several items which did not meet these criteria were considered for inclusion in the final form of the test after additional consideration. For example, item no 8 (facility value = 64, D = .15) received several comments that indicate confusion between a vacuum pump (pompa hampa udara) and a vacuum (ruang hampa udara). It is suggested that the description of a vacuum pump within the stem be rephrased. In addition, the MCI of this item is low (.05). Item no.15 was also included in the final test because the MCI of the item is low (.21).
Other items were not acceptable for inclusion, although some of them could be acceptable after slight revision. For example, item no.21, which has a 50 percent facility value and a .18 index of discrimination, received some comments revealing that many students had not heard the term bulk modulus. Providing an explanation of the meaning of the bulk modulus might be expected to increase its facility and might alter the index of discrimination. However, this item is excluded because the MCI of this item is high (.77).
Six items from form-B were chosen to replace items of form-A on the basis of prerequisite relationships among sub units. Item no.1 (form-B) replaces item no.1 (form-A), item no.12 (B) replaces item no.19 (A), items no.13 (B) and 14 (B) replace items no.13 (A) and 14 (A) respectively, item no.17 (B) replaces item no.16 (A), item no.21 (B) replaces item no.21 (A), and items no.42 (B) and 44 (B) replace items no.40 (A) and 45 (A) respectively.
Table 4.1 presents the number of items, their numbering in
the original forms and facility values, discrimination indices, and
the MCIs of each item calculated from the initial trialling, and from
the second investigation. The facility values of the 32 items in the
final version of the test fell between 20 and 80 percent: the
increase in facility values was to be expected over those in form-A
and form-B because these students had received instruction in
physics of sound. Similarly all but 2 items had discrimination
indices above .20.
There was strong pressure from physics teachers to include,
in the final form of the test, items which would test students'
conceptions of the transmission of sound at night and the
influence of the force of gravity on sound. There were no such
items in form-A or form-B and two new items were constructed
with great care and included in the final form. If these items
performed poorly in the second investigation (where the test was administered to 596 students from 19 schools) they could have
been dropped for the experimental stage of the study. As it was
they performed as well as many other items and contributed to a
total test reliability of .85 (Spearman-Brown; standard error of measurement = 1.77).
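A minimal sketch of a split-half reliability estimate with the Spearman-Brown correction; the study does not state which split was used, so the odd/even item split below is only an assumption:

import numpy as np

def split_half_reliability(responses):
    """Split-half reliability with the Spearman-Brown correction.

    `responses` is a students x items matrix of 0/1 scores; an odd/even item
    split is used here purely for illustration.
    """
    scores = np.asarray(responses, dtype=float)
    odd = scores[:, 0::2].sum(axis=1)            # half-test scores
    even = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]        # correlation of the two halves
    return 2.0 * r_half / (1.0 + r_half)         # Spearman-Brown prophecy formula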
Table 4.1: Characteristics of the items of the physics diagnostic test. Columns: item no.; original item no.; facility value; D; MCI (a = initial trialling, b = second investigation); sub unit.
Important terms
3-option
4-option
5-option
abacs
analysis level
application level
behavioural objectives
biserial correlation
Bloom's Taxonomy
common student errors
comprehension level
content validity
diagnostic test
difficulty
difficulty indices
difficulty of language
discrimination function
discrimination indices
discrimination of the items
Educational Objectives
effects of violating item construction principles
essay and objective tests
Facility values
framework for educational measurement
Guttman's coefficient
halo effects
index of discrimination
instructional objectives
item-to-item carryover effects
knowledge level
knowledge of conventions
knowledge of principles.
knowledge of specific facts
knowledge-or-random-guessing assumption
language mechanics effects
length of stems
level of difficulty
level of synthesis
Modified Caution Index (MCI) Formula
optimal number of options per item
order effects
Ordering Theory
own conceptions
phi coefficient
prerequisite relationships
probabilistic method
readability of the test items and item difficulty
reader unreliability
recall and recognize facts, conventions or principles
reliability
scientists' conceptions
standard guidelines for constructing multiple choice items
stem form
student errors
Student-Problem (S-P) curve theory
students' observable outcomes
students' pre-conceptions
syntactic complexity
table of specifications
test-to-test carryover effects
The discrimination power of the test
the reliabilities (KR-20)
the substitution of an uncommon term
true-false items
validity of tests
List of references
Anastasi, A., (1976). Psychological testing (4th ed.). New York: Macmillan.
Bart, W.M., & Krus, D.J., (1973). An ordering-theoretic method to
determine hierarchies among items. Educa t i ona l and
Psycho log i ca l Measu rement, 33, 291-300.
Bart, W.M. & Read, S.A., (1984). A statistical test for prerequisite
relations. Educational and Psychological Measurement, 44,
223227.
Barton, A.R., (1979). A new statistical procedure for the analysis
of hierarchy validation data. Research in Science Education,
9, 23-31
Bloom, B.S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl,
D. R., (1956). Taxonomy of educational objectives:
handbook 1: cognitive domain. London: Longman.
Blum, A., (1979). The remedial effect of a biological game. Journal
of Research in Science Teaching, 16(4), 333-338.
Blum, A., & Azencot, M., (1986). Multiple choice versus equivalent
essay questions in a national examination. European Journal
of Science Education, 8(2), 225-228.
Costin, F., (1970). The optimal number of alternatives in multiple-
choice achievement tests: Some empirical evidence for a
mathematical proof. Educational and Psychological
Measurement, 30, 353-358.
Fisher, K.M., & Lipson, J.I., (1986). Twenty questions about student errors. Journal of Research in Science Teaching, 23(9), 783-803.
Gagne, R.M., & Paradise, N.E., (1961). Abilities and learning sets in knowledge acquisition. Psychological Monographs, 75(14, Whole No. 518).
Green, K., (1984). Effects of item characteristics on multiple choice item difficulty. Educational and Psychological Measurement, 44, 551-561.
Green, K., Sax, G., & Michael, W.B., (1982). Validity and reliability of tests having differing numbers of options for students of differing levels of ability. Educational and Psychological Measurement, 42, 239-245.
Grier, J.B., (1975). The number of alternatives for optimum test reliability. Journal of Educational Measurement, 12, 109-113.
McMorris, R.F., Brown, J.A., Snyder, G.W., & Pruzek, R.M., (1972). Effects of violating item construction principles. Journal of Educational Measurement, 9(4), 287-295.
Mosier, C.I., & McQuitty, J.V., (1940). Methods of item validation and abacs for item-test correlation and critical ratio of upper-lower difference. Psychometrika, 5(1), 57-85.
Noll, V.H., Scannell, D.P., & Craig, R.C., (1979). Introduction to
educational measurement (4th ed.). Boston: Houghton
Mifflin.
Proctor, C.H., (1970). A probabilistic formulation and statistical analysis of Guttman scaling. Psychometrika, 35, 73-78.
Theobald, J.H., (1974). Classroom testing. Principles and practice
(2nd ed.). Melbourne: Longman Cheshire.
Theobald, J.H., (1977). Attitudes and achievement in biology.
Unpublished Ph.D. Thesis, Monash University.
Wade, R.K., (1984/85). What makes a difference in inservice
teacher education? A meta-analysis of research. Educational Leadership, 42(4), 48-54.
Walbesser, N.H., & Eisenberg, T.A., (1972). A review of research
on behavioural objectives and learning hierarchies.
Mathematics Education Reports. Columbus, Ohio: ERIC
information analysis centre for science, mathematics and
environmental education. ERIC no.ED059900.
White, F.A., (1975). Our acoustic environment. New York: Wiley.
White, H.E., (1968). Introduction to college physics. New York: Van
Nostrand.
White, M.W., Manning, K.V., & Weber, R.L., (1968). Basic physics.
New York: McGraw-Hill.
White, R.T., (1974a). A model for validation of learning hierarchies. Journal of Research in Science Teaching, 11(1), 1-3.
White, R.T., (1974b). Indexes used in testing the validity of
learning hierarchies. Journal of Research in Science Teaching,
11(1), 61-66.
Yeany, R.H., Dsst, R.J., & Mathews, R.W., (1980). The effects of
diagnostic-prescriptive instruction and locus of control on
the achievement and attitudes of university students.
Journal of Research in Science Teaching, 17(6), 537-543.
Yeany, R.H., Kuch, Chin Yap, & Padilla, M.J., (1986). Analysing
hierarchical relationships among modes of cognitive
reasoning and integrated science process skills. Journal of
Research in Science Teaching, 23(4), 277-291.
Yeany, R.H., & Miller, P.A., (1980). The effect of
diaqnostic/remediation: instruction on science learning: A
58
meta - ana lys i.sPaper presented at the annual meeting of the
National Association for Research in Science Teaching.
Boston, MA, April 11-13. ERIC no.ED187533.
Yeany, R.H., Waugh, M.L., & Blalock, A.L., (1979). The effects of achievement diagnosis with feedback on the science achievement and attitude of university students. Journal of Research in Science Teaching.