Chapter 3

INTRODUCTION
Research designs can be classified as either non-experimental or experimental.
In non-experimental designs the researcher studies phenomena as they exist.
In contrast, the various experimental designs all involve researcher intervention
(Gall, Gall & Borg, 2003). This research study is non-experimental in design,
and as the purpose of this study is prediction, a correlational research design is
used. Correlational research refers to studies in which the purpose is to
discover relationships between variables through the use of correlational
statistics. The basic design in correlational research is very simple, involving
collecting data on two or more variables for each individual in a sample and
computing a correlation coefficient.
Many studies in education have been done with this design. As in most
research, the quality of correlational studies is determined not by the complexity
of the design or the sophistication of analytical techniques, but by the depth of
the rationale and theoretical constructs that guide the research design. The
likelihood of obtaining an important research finding is greater if the researcher
uses theory and the results of previous research to select variables to be
correlated with one another (Gall, Gall & Borg, 2003).
In this study, first year Mathematics Major students from the University of the
Witwatersrand were selected from the MATH109 course and their performance
on assessment in the PRQ format was compared to their performance on
assessment in the CRQ format. In addition, students were asked to indicate a
confidence of response corresponding to each test item, in both the CRQ and
PRQ assessment formats. Further data was collected from experts who
indicated their opinions of the difficulty of the test items, both PRQs and CRQs,
independent of the students’ performance in each question. Further discussion
on the research methodology is presented in section 3.4.
The objective of this research study is to design a model to measure how good a
mathematics question is and to use the proposed model to determine which of
the mathematics assessment components can be successfully assessed with
respect to the PRQ format, and which can be successfully assessed with
respect to the CRQ format.
To meet the objective of the study described above, the study will be designed
according to the following steps:
[1] Three measuring criteria are used to develop a model for determining the
quality of a mathematics question (the QI model).
[2] The quality of all PRQs and CRQs is determined by means of the QI model.
[3] A comparison is made within each assessment component between PRQ
and CRQ assessment.
Based on these design steps and having defined the concept of a good
mathematics question, the research question is formulated as follows:
Research question:
Can we successfully use PRQs as an assessment format in undergraduate
mathematics?
Subquestion 1:
How do we measure the quality of a good mathematics question?
Subquestion 2:
Which of the mathematics assessment components can be successfully
assessed using the PRQ assessment format and which of the mathematics
assessment components can be successfully assessed using the CRQ
assessment format?
Subquestion 3:
What are student preferences regarding different assessment formats?
3.3.1 Qualitative data collection
The qualitative data will be used to address the third research subquestion of
what student preferences are regarding different assessment formats.
Interviews
Educationally significant human interactions do not involve abstract bearers of
cognitive structures but real people who develop a variety of interpersonal
relationships with one another in the course of their shared activity in a given
institutional context. … For example, appropriating the speech or actions of
another person requires a degree of identification with that person and cultural
community he or she represents (p6).
I was able to engage far more effectively with some students than with others
in the interview situations (in the sense of being able to generate more
penetrating probes). For example, with certain students whose home language
was not English, much of my time was spent on interpreting what they said.
At the commencement of the interview, I reminded each student that I was doing
research to probe their beliefs, attitudes and inner experiences about the
different assessment formats they had been exposed to in their tests and
examinations. My opening questions were to find out about the background of
each student i.e. why they registered for Mathematics I Major; career choice etc.
This seemed to put the student at ease and they found the situation less
threatening. I then moved on to the ten interview questions.
Interview questions:
[1] I’m interested in your feelings about the different ways in which we asked
questions in your maths tests, one being multiple-choice provided response
questions and the other the more traditional open-ended constructed response
questions. Do you like the different formats of assessment?
[2] Why / Why not?
[3] Which type of question do you prefer in maths?
[4] Why do you prefer type A to type B?
[5] Which type of questions did you perform better in? Why?
[6] Do you feel that the mark you got for the MCQ sections is representative of your
knowledge? What about the mark you got for the traditional long questions? Do
you feel this is representative of your knowledge?
[7] Do you have confidence in answering questions in maths tests which are
different to the traditional types of questions? Elaborate.
[8] What percentage of the maths tests do you recommend should be multiple-
choice questions, and what percentage should be open-ended long questions?
[9] How would you ask questions in maths tests if you were responsible for the
course?
[10] Is there opportunity for cheating in these different formats of assessment?
Please tell me about them.
After asking these ten questions, I concluded the interview by asking each
student if they had anything else to add or if they had any questions for me.
Examples of responses will be given and discussed in greater detail in the
qualitative data analysis presented in section 4.1.
The Rasch model was used as the quantitative research methodology in this
study. It is a probabilistic model that estimates person ability and item difficulty
(Rasch, 1960). Although it is common practice in the South African educational
setting to use raw scores in tests and examinations as a measure of a student’s
ability, research has shown that misleading and even incorrect results can stem
from an erroneous assumption that raw scores are in fact linear measures
(Planinic, Boone, Krsnik & Beilfuss, 2006). Linear measures, as used in the
Rasch model, on the other hand, are on an interval scale, where arithmetic and
statistical techniques can be applied and useful inferences can be made about
the results (Rasch, 1980).
In the following poem written by Tang (1996), each verse highlights a different
characteristic of the Rasch model: A model of probability; uniformity; sufficiency;
invariance property; diagnosticity and ubiquity.
Poem: What is Rasch?
3.4.1.1 Historical background
The Rasch model was developed during the years 1952 to 1960 by the Danish
mathematician and statistician Georg Rasch (1901-1980). The development of
the Rasch model began with the analysis of slow readers in 1952.
The data in question were from children who had trouble reading during their
time in school and for that reason were given supplementary education. There
were several problems in the analysis of the slow readers. One was that the
data had not been systematically collected. The children had for example not
been tested with the same reading tests, and no effort had been made to
standardise the difficulty of the tests. Another problem was that World War II
had taken place between the two testings. This made it almost impossible to
reconstruct the circumstances of the tests. It was therefore not possible to
evaluate the slow readers by standardisation as was the usual method at the
time (Andersen & Olsen, 1982).
Accordingly, it was necessary for Rasch to develop a new method where the
individual could be measured independent of which particular reading test had
been used for testing the child. The method was as follows: two of the tests
that had been used to test the slow readers were given to a sample of school
children in January 1952. Rasch graphically compared the number of
misreadings in the two tests by plotting the number of misreadings in test 1
against the number of misreadings in test 2 for all persons. This is illustrated in
Figure 3.1.
Figure 3.1: Number of misreadings of nine subjects in two tests (the number of misreadings αv1 in test 1 plotted on the horizontal axis against the number of misreadings αv2 in test 2 on the vertical axis).
The graphical analysis showed that, apart from random variations, the number
of misreadings in the two tests was proportional for all persons. Further, this
relationship held, no matter which pair of reading tests he considered.
$$\frac{\lambda_{v1}}{\lambda_{vi}} = \frac{\lambda_{01}}{\lambda_{0i}} \quad \Leftrightarrow \quad \lambda_{vi} = \frac{\lambda_{v1}}{\lambda_{01}}\,\lambda_{0i} = \theta_v \delta_i \qquad (1.2)$$
Thus the parameter of the model factorised into a product of two parameters, a
person parameter θv and an item parameter δi. Inserting factorisation (1.2) in
the Poisson distribution (1.1) gives

$$P(\alpha_{vi}) = e^{-\theta_v \delta_i}\,\frac{(\theta_v \delta_i)^{\alpha_{vi}}}{\alpha_{vi}!} \qquad (1.3)$$
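For illustration only (this sketch is not part of the original study, and the values of θv and δi are invented), the following short Python fragment evaluates the multiplicative Poisson probability (1.3) for a range of misreading counts.

```python
from math import exp, factorial

def poisson_misreading_prob(alpha: int, theta_v: float, delta_i: float) -> float:
    """P(alpha_vi) under the multiplicative Poisson model (1.3), where the
    expected number of misreadings factorises as lambda_vi = theta_v * delta_i."""
    lam = theta_v * delta_i                      # person parameter x item parameter
    return exp(-lam) * lam ** alpha / factorial(alpha)

# Invented illustrative values: theta_v for the person, delta_i for the reading test
theta_v, delta_i = 4.0, 2.5
for alpha in range(0, 21, 5):
    print(f"P(alpha = {alpha:2d}) = {poisson_misreading_prob(alpha, theta_v, delta_i):.4f}")
```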
The way Rasch arrived at the multiplicative Poisson model was characteristic of
his methods. He used graphical methods to understand the nature of a data set
and then transferred his findings to a mathematical and a statistical formulation
of the model.
The graphical analysis, however, was not Rasch’s only reason to choose the
multiplicative Poisson model. Rasch (1977) wrote:
Obviously it is not a small step from Figure 1 [our Figure 3.1] to the Poisson
distribution (1.1) with the parameter decomposition (1.2). I readily admit that I
introduced this model with some mathematical hindsight: I realized that if the
model thus defined was proven adequate, the statistical analysis of the
experimental data and thus the assessment of the reading progress of the weak
readers, would rest on a solid – and furthermore mathematically rather elegant –
foundation.
Fortunately the experimental result turned out to correspond satisfactorily to the
model which became known as the multiplicative Poisson model (p63).
only depending on the item parameters, this estimate, Ŝi, may be inserted in the
distribution (1.3) giving

$$P(\alpha_{vi}) = e^{-\theta_v \hat{S}_i}\,\frac{(\theta_v \hat{S}_i)^{\alpha_{vi}}}{\alpha_{vi}!} \qquad (1.4)$$
The way Rasch solved the problem of parameter separation for the slow readers
was not the method he used later. But it represents the first trace of the idea of
separating the estimation of item parameters from the estimation of person
parameters.
Rasch analysis has been the method of choice for moderate size data sets since
1965. Now the theoretical advantages and directly meaningful results of Rasch
analysis can be easily obtained for large data sets, as follows:
● Scores and analyses dichotomous items, or sets of items with the same
or different rating scale, partial credit, rank or count structures for up to
254 ordered categories per structure, with useful estimation of perfect
scores.
● Missing responses or non-administered items are no problem.
● Analyse several partially linked forms in one analysis.
● Analyse responses from computer-adaptive tests.
● Item reports and graphical output include calibrations, standard errors, fit
statistics, detailed reports of the particular improbable person responses
which cause item misfit, distracter counts, and complete DOS files for
additional analysis of item statistics.
● Person reports and graphical output include measures, standard errors,
fit statistics, detailed reports of the particular improbable item responses
which cause person misfit, a table of measures for all possible complete
scores, and complete DOS files for additional analysis of person statistics.
● Rating scale, partial credit, rank and count structures are reported
numerically and graphically.
● Complete output files of observations, residuals and their errors for
additional analyses of differential item function and other residual
analyses.
● Observations listed in conjoint estimate order to display extent of
stochastic Guttman order. The Guttman scale (also called ‘scalogram’) is
a data matrix where the items are ranked from easy to difficult and the
persons likewise are ranked from lowest achiever on the test to highest
achiever on the test.
● Option to pre-set and/or delete some or all person measures and/or item
calibrations for anchoring, equating and banking, and also to pre-set
rating scale step calibrations (Rasch, 1980).
The advantages of the Rasch model over other statistical procedures, used as
the quantitative research methodology in this study, will be clarified further in
section 3.4.1.4.
3.4.1.2 Latent trait
One of the basic assumptions of the Rasch model is that a relatively stable
latent trait underlies test results (Boone & Rogan, 2005). For this reason, the
model is also sometimes called the ‘latent trait model’.
Latent trait models focus on the interaction of a person with an item, rather than
upon total test score (Wright & Stone, 1979). They use total test scores, but the
mathematical model commences with a modelling of a person’s response to an
item. They are concerned with how likely a person v of an ability βv on the
latent trait is to respond correctly to an item i of difficulty δi.
If the person’s ability βv is above the item’s difficulty δi, we would expect the
probability of a correct response to be greater than 0.5, i.e.
if (βv − δi) > 0, then P{χvi = 1} > 0.5
If the person’s ability is below the item’s difficulty, we would expect the
probability of a correct response to be less than 0.5, i.e.
if (βv − δi) < 0, then P{χvi = 1} < 0.5
In the intermediate case where the person’s ability and the item’s difficulty are at
the same point on the scale, the probability of a successful response would be 0.5, i.e.
if (βv − δi) = 0, then P{χvi = 1} = 0.5
Figure 3.2 illustrates how differences between person ability and item difficulty
ought to affect the probability of a correct response.
Figure 3.2: How differences between person ability and item difficulty ought to affect the probability of a correct response.
1. When βv > δi, (βv − δi) > 0 and P{χvi = 1} > ½
2. When βv < δi, (βv − δi) < 0 and P{χvi = 1} < ½
3. When βv = δi, (βv − δi) = 0 and P{χvi = 1} = ½
The curve in Figure 3.3 summarises the implications of Figure 3.2 for all
reasonable relationships between probabilities of correct responses and
differences between person ability and item difficulty. This curve specifies the
conditions a response model must fulfil. The difference (βv − δi) could arise in two
ways. It could arise from a variety of person abilities reacting to a single item, or
it could arise from a variety of item difficulties testing the ability of one person.
When the curve is drawn with ability β as its variable so that it describes an
item i , it is called an item characteristic curve, because it shows the way the
item elicits responses from persons of every ability.
Figure 3.3: The item characteristic curve. The probability of a correct response, P{χvi = 1 | βv, δi} = f(βv − δi), is plotted against the difference (βv − δi): it rises from 0.0 towards 1.0, passing through 0.5 where βv = δi, with P{χvi = 1} < ½ when βv < δi and P{χvi = 1} > ½ when βv > δi.
In Figure 3.3 if we thought of the horizontal axis as the latent trait, the item
characteristic curve would show the probability of persons of varying abilities
responding correctly to a particular item. The point on the latent trait at which
this probability is 0.50 would be the point at which the item should be located.
The model relates βv for person ability and δi for item difficulty through their
difference (βv − δi). We want this difference to be able to vary from minus infinity
to plus infinity, while the probability of a successful response must remain
between zero and one. That is
$$0 \le P\{\chi_{vi} = 1\} \le 1 \qquad (1)$$
$$-\infty \le \beta_v - \delta_i \le +\infty \qquad (2)$$
If we use the difference between ability and difficulty as an exponent of the base
e , the expression will have the limits of zero and infinity. That is
$$0 \le e^{(\beta_v - \delta_i)} \le +\infty \qquad (3)$$
With a further adjustment we can obtain an expression which has the limits zero
and one and therefore could perhaps be a formula for the probability of a correct
response. The expression and its limits are:
$$0 \le \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}} \le 1 \qquad (4)$$

Taking this expression as the probability of a correct response gives the response model

$$P\{\chi_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}} \qquad (5)$$
The left hand side of (5) represents the probability of person v being correct on
item i (or of the response of person v to item i being scored 1), given the
person’s ability βv and the item’s difficulty δ i .
The function (5) which gives us the probability of a correct response is a simple
logistic function. It provides a simple, useful response model that makes both
linearity of scale and generality of measure possible. It is the formula Rasch
chose when he developed the latent trait test theory. Rasch calls the special
characteristic of the simple logistic function which makes generality in
measurement possible specific objectivity (Rasch, 1960). He and others have
shown that there is no alternative mathematical formula for the ogive curve in
Figure 3.3 that allows estimation of the person measures βv and the item
calibrations δi independently of one another.
The responses of individual persons to individual items provide the raw data.
Through the application of the Rasch model, raw scores undergo logarithmic
transformations that render an interval scale where the intervals are equal,
expressed in log odds units, or logits (Linacre, 1994), as illustrated in the sketch
following the list below. The Rasch model
takes the raw data and makes from them item calibrations and person measures
resulting in the following:
● valid items which can be demonstrated to define a variable
● valid response patterns which can be used to locate persons on the
variable
● test-free measures that can be used to characterise persons in a general
way
● linear measures that can be used to study growth and to compare groups
(Bond & Fox, 2007).
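As a minimal illustration of why raw scores are not linear measures while logits are, the sketch below (not part of the study; the scores are invented) converts raw scores to log odds units. Equal raw-score gaps do not translate into equal logit gaps, which is the motivation for the transformation; the full Rasch analysis, of course, also calibrates the items.

```python
from math import log

def raw_score_to_logit(raw_score: int, max_score: int) -> float:
    """Convert a raw test score to a log odds unit (logit): ln(p / (1 - p)),
    where p is the proportion of the maximum score obtained.
    Illustration only: full Rasch estimation also calibrates item difficulties."""
    p = raw_score / max_score
    if not 0 < p < 1:
        raise ValueError("extreme scores (0 or maximum) have no finite logit")
    return log(p / (1 - p))

# Equal raw-score gaps (10 marks) do not correspond to equal logit gaps
for score in (10, 20, 30, 40, 50):
    print(f"raw score {score}/60 -> {raw_score_to_logit(score, 60):+.2f} logits")
```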
Through the years the Rasch model has been developed to include a family of
models, not only addressing dichotomies, but also inter alia rating scale and
partial credit models.
The dichotomous Rasch model expresses the probability of a correct response in
terms of the person’s ability βv and the item difficulty level δi. Formula (5) in a
simpler form is used:

$$P_{vi} = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}}$$
As discussed before, this formula is a simple logistic function and the units are
called ‘logits’.
For example, if a person with ability βv = 5 answers an item of difficulty δi = 2,
the probability of the person answering the item correctly will be:

$$P\{\chi_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{e^{(5-2)}}{1 + e^{(5-2)}} = \frac{e^{3}}{1 + e^{3}} = \frac{20.086}{21.086} = 0.95$$
Table 3.2 is a table of more examples of the probabilities generated from
differences between ability and difficulty.
Table 3.2: Probabilities of correct responses for persons on items of different relative
difficulties.
βv − δ i Probability
3 0.95
2 0.88
1 0.73
0 0.50
-1 0.27
-2 0.12
-3 0.05
The explanation of the dichotomous Rasch model is based on Andrich and Marais (2006).
One can generate many more probabilities from such differences and then
represent the resulting function graphically. This graph is also known as the item
characteristic curve.
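As an illustration (not drawn from the study’s data), the short Python sketch below evaluates the simple logistic function for a range of ability-difficulty differences. It reproduces the probabilities in Table 3.2 and, if plotted, traces the item characteristic curve of Figure 3.4.

```python
from math import exp

def rasch_probability(beta_v: float, delta_i: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model (5):
    P{chi_vi = 1 | beta_v, delta_i} = e^(beta_v - delta_i) / (1 + e^(beta_v - delta_i))."""
    diff = beta_v - delta_i
    return exp(diff) / (1 + exp(diff))

# Reproduce Table 3.2: probabilities for differences from 3 logits down to -3 logits
for diff in range(3, -4, -1):
    print(f"beta_v - delta_i = {diff:+d}  ->  P = {rasch_probability(diff, 0.0):.2f}")

# Worked example from the text: ability 5 logits, item difficulty 2 logits
print(f"P(beta_v = 5, delta_i = 2) = {rasch_probability(5.0, 2.0):.2f}")   # approx. 0.95
```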
Figure 3.4 displays the function of the dichotomous Rasch model graphically.
Figure 3.4: The item characteristic curve of the dichotomous Rasch model, showing the conditional probability of a correct response (from 0.0 to 1.0) plotted against ability relative to item difficulty (from −5.0 to +5.0 logits), across the regions βv < δi, βv = δi and βv > δi.
The item characteristic curve provides the opportunity to directly establish the
probability of a person of ability βv answering an item of difficulty δ i correctly.
For example, if in Figure 3.4 a person with ability βv = 0.0 interacts with an item
of difficulty δi = 0.0, the probability is 50% that the answer will be correct.
Rasch-Andrich rating scale model
Andrich (as cited in Linacre, 2007, p7), in a conceptual breakthrough,
recognised that a rating scale, for example a Likert-type scale, could be
considered as a series of Rasch dichotomies. Linacre (2007) makes the point
that, similar to the original dichotomous Rasch model, a person’s ability or
attitude is represented by βv , whereas δ i is the item difficulty or the ‘difficulty to
endorse’. The difficulty or endorsability value is the ‘balance point’ of the item
according to Bond and Fox (2007, p8), and is situated at the point where the
probability of observing the highest category is equal to the probability of
observing the lowest category (Linacre, 2007).
According to Linacre (2005), the Rasch-Andrich rating scale model specifies the
probability, Pvix, that person v of ability βv is observed in category x of a rating
scale applied to item i with difficulty level δi, as opposed to the probability Pvi(x−1)
of being observed in category x − 1:

$$\log\!\left(\frac{P_{vix}}{P_{vi(x-1)}}\right) = \beta_v - \delta_i - F_x$$

where Fx is the Rasch-Andrich threshold between categories x − 1 and x.
In this research study, the categories for the Rasch-Andrich rating scale were:
1: Complete guess
2: Partial guess
3: Almost certain
4: Certain
A high raw score on an item would indicate a lot of confidence. When this figure
is transformed to a log odds or logit, as is done in the Rasch model, a low
Rasch measure of endorsability is obtained. According to Planinic and Boone
(2006), it is better to invert the scale for easier interpretation, since a high logit
would then correspond to high confidence. This is the strategy adopted in this
study.
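To show how the Rasch-Andrich rating scale model distributes probability across the four confidence categories used in this study, the sketch below (illustrative only; the threshold values Fx are invented, not estimated from the study’s data) computes the category probabilities for a given person-item combination.

```python
from math import exp

def rating_scale_probs(beta_v: float, delta_i: float, thresholds: list[float]) -> list[float]:
    """Category probabilities under the Rasch-Andrich rating scale model:
    log(P_vix / P_vi(x-1)) = beta_v - delta_i - F_x for each threshold F_x."""
    numerators = [1.0]                 # exp(0) for the lowest category (empty sum)
    cumulative = 0.0
    for f_x in thresholds:
        cumulative += beta_v - delta_i - f_x
        numerators.append(exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]

# Invented thresholds between the four confidence categories
labels = ["complete guess", "partial guess", "almost certain", "certain"]
probs = rating_scale_probs(beta_v=0.5, delta_i=0.0, thresholds=[-1.5, 0.0, 1.5])
for label, p in zip(labels, probs):
    print(f"{label:15s}: {p:.2f}")
```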
When each item is allowed its own threshold structure, the threshold Fx of the
rating scale model becomes Fix in the partial credit model, and mathematically
the Rasch-Andrich rating scale model and the partial credit model then take the
same form.
In both traditional test theory and in the Rasch latent trait theory, total scores
play a special role. In traditional test theory, test scores are test-bound and test
scores do not mark locations on their variable in a linear way. In traditional test
theory, the observed measure used for a person’s performance would be the
total score on the test. A higher total score on the test would be taken to reflect
a higher level of understanding than would a lower total score on the test. The
advice about item difficulties which develops from a traditional theory framework
is that all items should be at a difficulty level of 0.5. Just how difficult an item
needs to be for it to have a difficulty of 0.5 depends on how able the persons are
who will take it. How able the persons are, is in turn judged from their
performance on a set of items. There is no way within traditional test theory of
breaking out of this reciprocal relationship other than through the performance of
some carefully sampled normative reference group. The performance of
individuals on subsequent uses of the test can be judged against the spread of
performances in the normative group.
The Rasch model focuses on the interaction of a person with an item rather than
upon the total test score. Total test scores are used, but the model commences
with a modelling of a person’s response to an item. The total score emerges as
the key statistic with information about the ability β v . A feature of traditional test
theory is that its various properties depend on the distribution of the abilities of
the persons. Many of the statistics depend on the assumption that the true
scores of people are normally distributed (Andrich, 1988). An important
advantage of the Rasch latent trait model is that no assumptions need to be
made about this distribution, and indeed, the distribution of abilities may be
studied empirically. It was for this reason that the Rasch model was chosen
above other traditional statistical procedures for the quantitative research
methodology of this study.
If we intend to use test results to study growth and to compare groups, then we
must make use of the Rasch model for making measures from test scores that
mark locations along the variable in an equal interval or linear way.
Bond and Fox (2007) argue strongly for the same rigour in measurement in the
physical sciences to be applied in the field of psychology. This proposed rigour
in measurement should be extended also to the field of education in South
Africa. The Rasch model provides an avenue to attain this goal.
Reliability and validity are approached differently in traditional test theory from
the way they are approached in latent trait theory. The process of mapping the
amount of a trait on a line necessarily involves numbers. The use of numbers in
this way gives precision to certain kinds of work. However, there is always a
trade-off in the use of such numbers – in particular, they can be readily
over-interpreted because they appear to be so precise, hence affecting the reliability
of the data. In addition, the instrument may not measure what we really want to
measure and this affects the validity of the research.
In the latent trait model, the use of a total score from a set of items implies an
assumption of a single, unidimensional underlying trait which the items, and
therefore the test, measure. Those reliability indices which reflect internal
consistency provide a direct indication of whether a clear single dimension is
present. If the reliability is low, there may be only a single dimension but one
measured by items with considerable error. Alternatively, there may be other
dimensions which the items tap to varying degrees.
The calculation of a reliability index is not very common in latent trait theory.
However, it is possible to calculate such an index, and in a simple way, once the
ability estimates and the standard errors of the persons are known. Instead of
using the raw scores in the reliability index formula, the ability estimates are
used, where the ability estimate βv for each person v can be expressed as the
person’s true ability plus an error of measurement.
The key feature of reliability in traditional test theory is that it indicates the
degree to which there is systematic variance among the persons relative to the
error variance i.e. it is the ratio of the estimated true variance relative to the true
variance plus the error variance. In traditional test theory, the reliability index
gives the impression that it is a property of the test, when it is actually a property
of the persons as identified by the test. The same test administered to people of
the same class or population but with a smaller true variance, would be shown
to have a lower reliability.
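As a hedged illustration of how such an index can be obtained from Rasch output (the ability estimates and standard errors below are invented), the sketch computes a person reliability as the ratio of the estimated true variance of the person measures to their observed variance.

```python
from statistics import mean, pvariance

def person_reliability(abilities: list[float], std_errors: list[float]) -> float:
    """Person reliability: estimated true variance of the ability estimates
    divided by their observed variance (true variance plus error variance)."""
    observed_var = pvariance(abilities)               # variance of the person measures
    error_var = mean(se ** 2 for se in std_errors)    # average error variance
    true_var = max(observed_var - error_var, 0.0)     # adjusted ("true") variance
    return true_var / observed_var

# Invented example: ability estimates (logits) and their standard errors
abilities = [-1.2, -0.4, 0.1, 0.6, 1.3, 2.0]
std_errors = [0.45, 0.40, 0.38, 0.39, 0.42, 0.50]
print(f"Person reliability = {person_reliability(abilities, std_errors):.2f}")
```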
The facility to capture the most well-known and commonly used discrimination
index of traditional test theory, to provide evidence of the degree of conformity
of a set of responses to a Guttman or ‘scalogram’ scale in a probabilistic sense,
and to provide these from a latent trait formulation, indicates that Rasch’s simple
logistic model provides an extremely economical and reliable perspective from
which to evaluate test data (Andrich, 1982).
In the years of this study, July 2004 to July 2006, student numbers registering
for MATH109 were high with 483 in 2004, 414 in 2005 and 376 in 2006. The
reduction in numbers in 2006 coincided with the increase in the entrance
requirements to the Faculty of Science at the University of the Witwatersrand. In
each of these years, the students were allocated, subject to timetable
constraints, to one of two parallel courses presented by different lecturers. The
lectures took place six times a week (45 minutes per lecture) in a large lecture
theatre. MATH109 consists of a Calculus and an Algebra component. In
Semester 1, Algebra constituted one-third and Calculus two-thirds of each
assessment task, corresponding to the same ratio of lectures. In Semester 2,
Algebra and Calculus were weighted equally with students receiving 3 lectures
of Algebra and 3 lectures of Calculus per week. I lectured one set of Calculus
and one set of Algebra classes while my colleagues lectured the other parallel
courses. All the students from the MATH109 classes constituted the group from
which data was collected for this study. As course co-ordinator for the duration
of the study, I had more contact with these students than my colleagues did. I was
personally involved, either as examiner or as moderator, for all the tests and
projects which contributed to the assessment programme. I was also directly
responsible for the invigilation duties of this group and hence administered all
the tests at which the data was collected.
The collection of data for this study was directly related to the Mathematics I
Major assessment programme as illustrated in Figure 3.5.
Test instruments
Data was collected from the two MCQ tutorial tests, the three class tests (CRQs
and PRQs) (1 hour) in March, May and August, the mid-year test (CRQs and
PRQs) (1.5 hours) in June, and the final examination (CRQs and PRQs) (3 hours)
in November, in each of the years 2004, 2005 and 2006.
Tutorial tests
Two tutorial MCQ tests were written during the course of the year in March and
August respectively. Each test, of duration 20 minutes, consisted of 8 multiple-
choice questions (total = 16 marks), 4 MCQs on Algebra content and 4 MCQs
on Calculus content. Each of these MCQs was followed by a confidence of
response question in which a student was asked to indicate their confidence
about the correctness of their answer, where A implies no knowledge (complete
guess), B a partial guess, C almost certain and D indicates complete confidence
or certainty in the knowledge of the principles and laws required to arrive at the
selected answer. Each of the MCQs had 3 distracters and 1 key, indicated by
the letters A, B, C, or D.
Sample MCQ format (answer options followed by the confidence of response scale):
A. 5
B. 10
C. 15
D. 20
A (COMPLETE GUESS)   B (PARTIAL GUESS)   C (ALMOST CERTAIN)   D (CERTAIN)
Tutorial tests were written during the last 20 minutes of one of the 45-minute
compulsory tutorial periods, in the first semester and in the second semester. The
tests were administered by the tutor who handed out the question papers
together with a blank computer card. The instruction to each student was to
shade the correct answers on the computer card to questions 1-8 in the first
column. In these questions there was only one possible answer. There was no
negative marking. In addition, the students had to shade their confidence of
response answers on the computer card corresponding to Questions 1-8 in the
second column, i.e. Questions [26] – [33]. Students were reminded that there was
no correct answer for the confidence of responses. Students were also informed
that marks were not awarded for the confidence of response answers, as these
were used purely for educational research purposes.
Once the tests had been written, the tutor collected both the question paper and
the computer cards. The question papers were kept for reference only should
any queries arise, and not returned to the students. The computer cards were
marked by the Computer and Networking Services (CNS) division of the
University of the Witwatersrand. On completion, CNS provided a print out of the
quantitative statistical analysis of data, including the performance index,
discrimination index and easiness factor per question. CNS also captured the
students’ confidence of responses.
Sample CRQ question:
Question 4.
a. Give the condition that is required to ensure continuity of a function f(x) at the point x = α.
A (COMPLETE GUESS)   B (PARTIAL GUESS)   C (ALMOST CERTAIN)   D (CERTAIN)
For Section A, students were provided with blank computer cards to indicate
their choice of answers and the corresponding confidence of responses. As in
the tutorial tests, students were informed that no marks were awarded for the
confidence of responses. In Sections B and C, students were provided with
space on the question papers to complete their solutions. The computer cards
were used only to indicate the corresponding confidence of responses. On
completion of the tests, all three sections, together with the filled in computer
card, were collected. CNS provided a print out of all the results for Section A,
together with confidence of responses for Sections A, B and C.
Expert opinions
In this study, the term expert refers to content experts. In this case the content
experts were my colleagues who taught the MATH109 course, either Algebra or
Calculus or both, as well as my supervisors from the University of Pretoria who
were familiar with the content. In total, the opinions of eight experts on the level
of difficulty of the questions were obtained, independent of each other. Five of
the experts gave their opinions on Calculus, and six of the experts gave their
opinions on Algebra. Each expert was given a full set of the following tests:
MATH109 August Tutorial Test (2005); March Tutorial Test 1A (2006); March
Tutorial Test 1B (2006); March Section A (2005); May Section A (2005); June
Section A (2005); August Section A (2005); November Section A (2005);
March Section A (2006); May Section A (2006); June Section A (2006); March
Sections B & C (2005); May Sections B & C (2005); June Sections B & C
(2005); August Sections B & C (2005); November Sections B & C (2005);
March Sections B & C (2006); May Sections B & C (2006) and June Sections
B & C (2006). The reader should note that the August Tutorial Test was the same
in both 2005 and 2006. The March Tutorial Test 1A, written during a tutorial
period on a Tuesday, and the March Tutorial Test 1B, written during a tutorial
period on the Wednesday of the same week, tested the same content but were
different tests; these were also the same in 2005 and 2006. The
experts chose to give their opinions on either the Calculus or Algebra questions,
depending on which courses they taught. Hence for Calculus, Section C was
appropriate and for Algebra, Section B was appropriate. In the MCQ Section A,
there was a mixture of both Calculus and Algebra questions. Experts were
asked for their opinions on the level of difficulty of both the PRQs and CRQs,
and were asked to indicate their opinions as follows:
● Use a 1 if your opinion is that the students should find the question easy
● Use a 2 if your opinion is that the question is of average difficulty
● Use a 3 if your opinion is that the students would find the question difficult
or challenging.
Experts were informed that their opinions were completely independent of how
the students performed in the questions. Experts worked independently and did
not collaborate with other experts. In the study, the students’ performance is
referred to as novice performance. Once all the expert opinions were collected,
the data was captured separately for Calculus and Algebra on spreadsheets.
An expert opinion on the level of difficulty of each question (PRQs and CRQs)
was calculated as the average of the eight expert opinions per question.
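As a small illustration of this averaging step (the file name and column layout are hypothetical, not the actual spreadsheets used in the study), the sketch below computes the mean expert difficulty rating per question from a simple CSV file.

```python
import csv
from collections import defaultdict

def average_expert_ratings(path: str) -> dict[str, float]:
    """Average the 1-3 difficulty ratings that each expert assigned to each question.
    Expects a CSV with 'question' and 'rating' columns (layout is hypothetical)."""
    ratings = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ratings[row["question"]].append(float(row["rating"]))
    return {question: sum(r) / len(r) for question, r in ratings.items()}

# Example usage with a hypothetical file of Calculus expert ratings:
# for question, avg in average_expert_ratings("calculus_expert_ratings.csv").items():
#     print(f"{question}: {avg:.2f}")
```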
Schumacher and McMillan (1993) have suggested the following reliability threats
to research. These are:
● the researcher’s role
● the informant selection of the sample
● the social context in which data is collected
● the data collection strategies
● the data analysis strategies
● the analytical premises i.e. the initial theoretical framework of the study.
In this study reliability was enhanced by means of the following:
● The importance of my social relationship with the students in my role as
the co-ordinator and lecturer of the Mathematics 1 Major Course was
carefully described.
● The selection of the population sample of this study and the decision
process used in its selection were described in detail.
● The social context influencing the data collection was described
physically, socially, interpersonally and functionally. Physical descriptions
of the students, the time and the place of the assessment tasks, as well
as of the interviews, assisted in data analysis.
● All data collection techniques were described. The interview method,
how data was recorded and under what circumstances was noted.
● Data analysis strategies were identified.
● The theoretical framework which informs this study and from which
findings from prior research could be integrated was made explicit.
● Stability was achieved by administering the same tutorial tests in March
and August over the period 2004-2006.
● Equivalence was achieved over the period of study, by administering
different tests to the same group of students.
● Internal consistency was achieved by correlating the items in each test to
each other.
● A large number of data items were collected over the period of 2 years,
and were all used in the data analysis.
In the context of research design, the term validity means the degree to which
scientific explanations of phenomena match the realities of the world
(Schumacher & McMillan, 1993). Test validity is the extent to which inferences
made on the basis of numerical scores are appropriate, meaningful and useful.
Validity, in other words, is a situation-specific concept. Validity is assessed
depending on the purpose, population and environmental characteristics in
which measurement takes place.
In quantitative research there are two types of design validity. Internal validity
expresses the extent to which extraneous variables have been controlled or
accounted for. External validity refers to the generalisability of the results i.e.
the extent to which the results and conclusion can be generalised to other
people and settings. In this study, internal validity was addressed as the
population sample of first year mainstream mathematics students were always
fully informed and aware that their confidence of responses, in both the CRQs
and PRQs, were not for assessment purposes, but used purely for this research
study. All students wrote the same test on the same day in a single venue. All
the data collected was used, irrespective of whether the students completed all
of the confidence of responses, or not.
Andrich and Marais (2006) point out that it is now considered standard that
construct validity is the overarching concept, and that the other three so-called
forms of validity are pieces of evidence for construct validity. Construct
validation is addressed to the identification of the dimension in a substantive
sense. The test developer must have a clear idea of what the dimension is
when the items are written.
In order to enhance the validity of this study, the following steps were taken:
● The literature was examined in order to identify and develop the seven
mathematical assessment components.
● The test instrument was validated after implementation by a panel
consisting of my 2 supervisors at the University of Pretoria and 6
mathematics lecturers from the University of the Witwatersrand.
● The questions used for data collection were all moderated by colleagues
and were in line with the theoretical framework. Minor adjustments were
made to a number of test items to avoid ambiguity and to strengthen
weak distracters.
● Expert opinions obtained from colleagues were completely independent
of student performance (novice performance).
● Three measuring criteria were identified in order to develop a model for
addressing the research questions. These criteria were modified and
adapted in collaboration with my supervisors to address the issue of what
constitutes a good mathematical question and how to measure how good
a mathematics question is.
● All marking of PRQs was done by computers using the Augmented
marking scheme. This programme accommodates the fact that not all
questions are equally weighted. There was no negative marking.
● Marking of CRQs was done by the MATH109 team of lecturers, using a
detailed marking memorandum which had been discussed prior to each
marking session. In addition, all marking was moderated by the
researcher, except for the examinations which were moderated by an
external examiner.
Bias is defined by Gall, Gall and Borg (2003) as a set to perceive events in such
a way that certain types of facts are habitually overlooked, distorted or falsified.
In this study, an attempt was made to decrease bias by the following:
● A representative sample of undergraduate students studying tertiary
mathematics
● A comprehensive literature review
● Verified statistical methods and findings.
3.5.4 Ethics
Ethics generally are considered to deal with beliefs about what is right or wrong,
proper or improper, good or bad (Schumacher & McMillan, 1993). Most relevant
for educational research is the set of ethical principles published by the
American Psychological Association in 1963.
The principles of most concern to educators are as follows:
● The primary investigator of a study is responsible for the ethical
standards to which the study adheres.
● The investigator should inform the subjects of all aspects of the research
that might influence willingness to participate.
● The investigator should be as open and honest with the subjects as
possible.
● Subjects must be protected from physical and mental discomfort, harm
and danger.
● The investigator should secure informed consent from the subjects before
they participate in the research.
● In the interview, all respondents were assured of confidentiality.
Respondents were informed that they had been randomly selected,
based on their June class record marks. Permission was obtained from
each candidate to tape-record the interviews. Candidates were informed
that they were free to withdraw from the interview or not to answer any
question, if they wished. Candidates were assured of the confidentiality
and anonymity of their responses and, in particular, that the information
they provided for the research would not be divulged to the University or
their lecturers at any time.
● The researcher assured all participants that all data collected from the
confidence of responses would not affect their overall marks. No person,
except the researcher, supervisors and the data analyst, would be able to
access the raw data. All raw data was used, irrespective of whether the
student indicated a confidence of response or not.
● The research report will be made available to the University of the
Witwatersrand and to the University of Pretoria, should they so desire it.
● Informed consent was achieved by providing the subjects with an
explanation of the research and an opportunity to terminate their
participation at any time with no penalty. Since test data was collected
over the research period to chart performance trends, the research was
quite unobtrusive and had no risks to the subjects. The students were at
no times inconvenienced in the data collection process, as all data was
collected during the test times as set out in the assessment schedule for
MATH109.
● In the data analysis, student names and student numbers were not used.
Thus, confidentiality was ensured by making certain that the data cannot
be linked to individual subjects by name. This was achieved by using the
Rasch model.
● In my role as researcher, I will make every effort to communicate the
results of my study so that misunderstanding and misuse of the research
are minimised.
● To maximise both internal and external validity, research suggests it is
best if the subjects are unaware that they are being studied
(Schumacher & McMillan, 1993). In this regard, the research
methodology was designed in order to collect data from the students
during their normal tutorial times or formal test times. As a result,
students did not feel threatened in any way and the resulting data was
sufficiently objective.
● The methodology section of my study shows how the data was collected
in sufficient detail to allow other researchers to extend the study.
● In my roles as co-ordinator, lecturer and researcher, I was very aware of
ethical responsibilities that accompanied the gathering and reporting of
data. The aims, objectives and methods of my research were described
to all participants in this research study.