Using Bayesian Networks in Computerized Tests¹
Eva Millánᵃ, Luis Descalçoᵇ, Gladys Castilloᵇ, Paula Oliveiraᵇ, Sandra Diogoᶜ
ᵃ ETSI Informática, University of Málaga, Campus de Teatinos, 29080 Málaga, Spain
ᵇ Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal
ᶜ Projecto PMatE, University of Aveiro, 3810-193 Aveiro, Portugal
Abstract
In this paper, we describe the integration and evaluation of an existing generic Bayesian student model (GBSM) into an existing computerized testing system within the Mathematics Education Project (PmatE - Projecto Matemática Ensino) of the University of Aveiro. This generic Bayesian student model had previously been evaluated with simulated students, but a real application was still missing. In the work presented here, we have used the GBSM to define Bayesian Student Models (BSMs) for a concrete domain: first degree equations. In order to test the diagnosis capabilities of such BSMs, an evaluation with 152 students has been performed. Each of the 152 students took both a computerized test within PMatE and a written exam, both designed to measure the students' knowledge of 12 concepts related to first degree equations. The written exam was graded by three experts. Then two BSMs were developed, one for the computer test and another for the written exam. These BSMs were used to obtain estimations of the students' knowledge of the same 12 concepts, and the inter-rater agreement among the different measures was computed. Results show a high degree of agreement among the scores given by the experts, and also between the diagnosis provided by the BSM for the written exam and the experts' average, but a low degree of agreement between the diagnosis provided by the BSM for the computer test and the experts' average.
Keywords:
Computerized testing, Bayesian networks, student modeling
¹ Partially supported by grant TIN2009-14179, Plan Nacional de I+D+i, Gobierno de España.
Preprint submitted to Computers & Education, February 13, 2014
1. Introduction
Preventing school failure and decreasing dropout rates are matters of concern in our society. In our modern world of competitive international markets and rapidly changing technologies, the importance of a good education and, in particular, of good mathematics skills should not be underestimated. For this reason, the identification of the causes of school failure and of ways to overcome them should be a priority for educational boards and authorities.
The Mathematics Education Project (PmatE, Projecto Matemática Ensino) (Sousa et al., 2007) was created at the University of Aveiro. The goal of this project is to invest in new information technologies for teaching and learning as a way to enrich, enhance and boost education in Portugal. To this
end, computer tools and contents for several areas of knowledge (especially for Mathematics) have been developed. The use of computers to improve students' performance and motivation is recognized in the final report of the National Mathematics Advisory Panel² in the USA: “Research on instructional software
has generally shown positive effects on students’ achievement in mathematics as compared with instruction that does not incorporate such technologies.
These studies show that technology-based drill and practice and tutorials can
improve student performance in specific areas of mathematics”.
Since 1990, PMatE has been available on the web. It includes material and contents for all school grades and for different purposes: formative (evaluation, diagnosis and practice) and competition (via computerized tests). Every year, PMatE promotes six National Mathematical Competitions in several science areas: one for each school degree, from primary to higher education, and one for all school degrees in the network. 15,000 students from 400 schools in Portugal attended the latest edition (May 2011) of the National Science Competition at the University of Aveiro, using the computerized testing tools developed by PMatE.
The goal of the research presented in this paper is to study whether or not
the diagnostic capabilities of computerized tests in PMatE can be improved
by using approximate reasoning techniques. To this end, we have used a
Bayesian student model (BSM) based on Bayesian Networks (BNs). The BN
² The Final Report of the National Mathematics Advisory Panel. U.S. Department of Education (2008).
paradigm was chosen because it has proven to be a sound methodology for
the student modeling problem, and it has been used with this purpose in
a number of existing applications (Collins et al., 1996; Conati et al., 1997;
VanLehn et al., 1998; Jameson, 1996). This previous research has shown that
a BSM allows for a sound and detailed evaluation of each student, according
to the granularity level defined by teachers. Instead of having only a final grade to measure students' performance, the system will be able to provide a more detailed model of student knowledge, which contains information about which parts of the curriculum the student is struggling with and which parts he/she has already mastered. This information is essential to provide
feedback, remediation and personalized instruction.
The student model chosen to be integrated into the computerized testing tools of PMatE was a generic BSM previously developed by Millán and Pérez de la Cruz (2002). To refer to this model, we will use the acronym GBSM (Generic Bayesian Student Model). This model was chosen for two reasons: i) the conditional probabilities needed are automatically computed from a reduced set of parameters specified by teachers, and ii) the model
had already been evaluated with simulated students, and showed a good
diagnostic performance both in terms of accuracy and efficiency.
Other authors have also used Bayesian Networks in educational testing.
For example, Vomlel (2004) proposed a test based on Bayesian Networks for
basic operations with fractions. Unlike the model used in this work, that model is domain specific, so it cannot be reused in other contexts. The Bayesian network models relations between skills, misconceptions and test items, but no granularity relationships were established. The empirical results suggested that educational testing can benefit from the application of Bayesian networks in this context.
For the sake of completeness, Section 2 will present a brief description of
the GBSM. Next, in Section 3 we will provide details about its implementation in the computerized tests of PMatE. In Section 4 we will present the results of the evaluation performed with real students. Finally, Section 5 summarizes the main conclusions of this work and provides some indications for
future work.
In this way, the main contributions of the research work presented here
are: a) the introduction of a BSM into PMatE's computerized tests
and a preliminary evaluation of this model with real students; and b) a real
application of the work presented by Millán and Pérez de la Cruz (2002),
which was still missing.
2. The Generic Bayesian Student Model
In this section we present a brief summary of the main features of
the GBSM. The interested reader can consult the original publication (Millán
and Pérez de la Cruz, 2002) for more details. For a detailed tutorial on how
to build Bayesian student models, see (Millán et al., 2010).
2.1. Variables Definition
The GBSM is composed of two different kinds of variables: knowledge
and evidential variables.
• Knowledge variables (K), which represent the student's knowledge (either declarative or procedural knowledge, but also skills, abilities, etc.). These are the variables of interest in adaptive e-learning systems, in order to be able to adapt instruction to each individual student. Their values are not directly observable (i.e., they are hidden variables). Knowledge variables can measure simple (non-decomposable) pieces of knowledge or compound ones (which result from the aggregation of other variables). The level of granularity required will depend on the nature of the application. For example, a knowledge domain can be decomposed hierarchically into three levels: i) basic concepts (C) (lower level), which are non-decomposable; ii) topics (T) (medium level), which represent aggregated concepts; and iii) the subject (S) (higher level), which represents the aggregated topics. In the GBSM, all knowledge variables are modeled as binary, and take two values: 0 (not known) and 1 (known).
• Evidential variables (Q), which represent the student's actions and are directly observable: for example, the results of a test, a question, a problem-solving procedure, etc. The values of such variables will be used to infer
the values of the hidden knowledge variables. In the GBSM, evidential
variables are also considered to be binary, with values 0 (incorrect) or
1 (correct).
2.2. Structure Definition
The next step is to define the causal relationships among these variables. The
relationships modeled in the GBSM are:
• Aggregation relationships between knowledge nodes (basic concepts,
topics and subject): the GBSM considers that knowledge of the parts
has causal influence on the knowledge of the more general topic.
• Relationships between knowledge and evidential nodes: the GBSM considers that having knowledge about certain topics has causal influence
on being able to correctly solve the associated evidential items.
Figure 1 illustrates the structure of the GBSM.
Figure 1: Structure of the Bayesian Student model
With respect to aggregation relationships, for the sake of simplicity in
what follows K will denote either the subject or the topic, while {K1 , ..., Kn }
will denote its parts (topics in the case of the subject, or basic concepts in
the case of the topics).
2.3. Parameter Estimation
The parameters needed for the proposed Bayesian student model are:
1. for each basic concept Ci: its prior probability P(Ci). It can be estimated from available data, if such data exist, or set to uniform otherwise.
2. for each aggregated knowledge variable K: the conditional probabilities of K given its parts, P(K | K1, ..., Kn).
Let {w1, ..., wn} be a normalized set of weights, where each wi represents the relative importance of the sub-item Ki in the aggregated node K. Then, the conditional probabilities are calculated as follows:
P(K = 1 | {Ki = 1, i ∈ I}, {Ki = 0, i ∉ I}) = Σ_{i ∈ I} wi        (1)

where I = {i ∈ {1, 2, ..., n} such that Ki = 1}.
3. for each evidential node Qi: the conditional probabilities of the evidential node Qi (e.g., a test question) given the concepts involved, P(Qi | C1, ..., Cm) (assuming the concepts are renamed so that {C1, ..., Cm} are the concepts related to Qi).
To simplify the specification of the 2^m conditional probabilities needed for each evidential node, the GBSM proposes an approach based on computing the probabilities using an ad hoc function G. To do so, for each question item Qi, four parameters {g, s, d, a} need to be specified by experts:
• g is the guessing factor, which represents the probability that a student
with no knowledge guesses the correct answer to item Qi .
• s is the slip factor, which represents the probability that a student with
all the needed knowledge fails to correctly answer item Qi .
• d is the difficulty level, which estimates how difficult item Qi is.
• a is the discrimination, which represents the degree to which the item
discriminates between students (a commonly used value is 1.2).
Function G is then used to compute the conditional probabilities. This approach reduces the number of parameters needed for each question from 2^m to 4. Function G provides a smooth curve to assign such conditional probabilities, covering the range between g and 1 − s. As illustrated in Figure 2, if the student has no knowledge (none of the required concepts is known), the probability is set to g (guess factor). At the other extreme, if the student has all the necessary knowledge (all required concepts are known), the probability is 1 − s. The remaining probabilities are assigned increasing values between g and 1 − s. In this way, the probability of a correct answer increases with the number of known concepts.
For more details about function G, we refer the interested reader to the
original publication (Millán and Pérez de la Cruz, 2002).
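To make the aggregation step concrete, the following minimal Python sketch (purely illustrative; the original implementation used C# and SMILE, and the function name here is ours) enumerates all 2^n configurations of the parts of an aggregated node K and assigns to each the probability given by Equation (1), i.e., the sum of the weights of the parts that are known.

from itertools import product

def aggregation_cpt(weights):
    """Rows of P(K = 1 | K_1, ..., K_n) as defined by Equation (1).

    weights -- normalized relative importances w_1, ..., w_n of the parts.
    Returns a list of (configuration, probability) pairs, where each
    configuration is a tuple of 0/1 values for (K_1, ..., K_n).
    """
    rows = []
    for config in product([0, 1], repeat=len(weights)):
        # P(K = 1 | config) is the sum of the weights of the known parts.
        p_known = sum(w for w, known in zip(weights, config) if known == 1)
        rows.append((config, p_known))
    return rows

# With two parts weighted 0.25 and 0.75 (the weights used later for node C11),
# this yields (0,0) -> 0.0, (0,1) -> 0.75, (1,0) -> 0.25 and (1,1) -> 1.0.
for config, p in aggregation_cpt([0.25, 0.75]):
    print(config, p)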
In the next section we will explain how the GBSM was implemented in
the computerized tests of PMatE.
Figure 2: Using G(x) to compute the probabilities
3. Bringing the model into practice in PMatE
To use the GBSM, we have implemented a web-based application that makes use of the learning objects (LOs) produced by PMatE. In PMatE, each LO consists of a stem and a set of four associated true/false questions. Both the stem and the related test items are parameterized, so different LOs can be automatically generated. An example of an instance of an LO is shown
in Figure 3.
EQUAmat (Sousa et al., 2007) is one of the computerized testing systems
implemented in PMatE. EQUAmat was developed to evaluate mathematics knowledge of ninth-grade students (14-15 years old). We have chosen the subject
1st Degree Equations included in EQUAmat as the domain for testing our
model.
The use of PMatE has forced us to consider some constraints. For example, each LO is composed of four true-false items, and the student needs to
provide answers to all of them. In our context, this means that we have had
to adapt the GBSM to this operation mode, implying that:
• answers to items need to be processed by the BN in batches of size
four,
• the guessing factor of each item is always 0.5 (true-false items),
• the student needs to provide answers to all items.
Figure 3: An example of a generated multiple-choice question
This high guessing factor and the fact that students need to provide
answers to all test items increase the randomness of the answers and make
the diagnosis process more difficult.
To model 1st Degree Equations using the GBSM, researchers in the PMatE team decomposed the selected domain into topics and concepts, and defined the weights for the aggregation relationships. The resulting network for 1st Degree Equations is shown in Figure 4.
For example, the aggregated topic C11_Resolution is composed of C5_Classification and C10_Equivalence, with weights 0.25 and 0.75, respectively. These weights are used to define the conditional probability table of node C11 given its parents (C5 and C10), as explained in Section 2.3. Table 1 shows the resulting probabilities.
Table 1: The CPT for the variable C11 in the BSM for knowledge diagnosis

knows C5_Classification?   knows C10_Equivalence?   P(knows C11_Resolution? = yes | parents)
yes                        yes                      1
yes                        no                       0.25
no                         yes                      0.75
no                         no                       0
Figure 4: The Bayesian network for knowledge diagnosis
Evidential nodes are defined to be the test items. Each test item will be
linked to several concepts in the network. The conditional probabilities are
computed from the set of parameters {g, s, d, a} we have defined.
Since in this application test items are true/false questions, the guessing
factor g is 0.5. To define the difficulty levels d, domain modelers assigned a
number between 1 and 5 to each test item. The slip factor was set to 0.2 and the discrimination index to 1.4 for all questions in the test.
We now present an example that illustrates how the values of the conditional probabilities are computed using function G. Let Q be a test item that requires knowledge about two concepts, for example C6_P_add (weight 0.25) and C7_P_Multiply (weight 0.75). Then we add this test item Q to the network as shown in Figure 5.
Figure 5: Adding a test item to the network
For this test item Q, domain modelers must provide estimated values for the needed parameters. In this example, we will use the following set of values: guessing factor 0.5; slip 0.01; difficulty level 5; and discrimination factor 1.2.
To obtain the conditional probabilities of Q given C6 and C7 we have
used function G(x) as defined in (Millán and Pérez de la Cruz, 2002). The
resulting conditional probability table for the test item Q is shown in Table
2.
Table 2: Computation of the conditional probabilities using function G

C6   C7   x       G(x) = P(Q = correct | C6, C7)
0    0    0       0.500
0    1    1.778   0.501
1    0    5.171   0.793
1    1    6.908   0.991
Overall we found that this approach is easy to implement and automatically provides all the conditional probabilities needed, which are computed
as a function of the four parameters estimated by teachers.
Once a student answers an LO, the evidence about the four associated test items is used to compute the posterior probabilities. To update the probabilities, we have used a partially dynamic procedure, in which items are processed in batches of four (the number of items presented in each LO). The reason for using a dynamic procedure instead of a static one is that previous studies showed it provides a slightly better performance (see Millán et al., 2003).
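To illustrate what such a batch update computes, the following self-contained Python sketch performs the inference by exact enumeration for the small network of Figure 5: two concepts (C6 and C7) with priors assumed uniform, and four true/false items that are conditionally independent given the concepts and share the CPT of Table 2. This is only an illustration of the underlying computation; in the actual system the update is carried out by SMILE.

from itertools import product

# P(Q = correct | C6, C7), taken from Table 2 (computed with function G).
P_CORRECT = {(0, 0): 0.500, (0, 1): 0.501, (1, 0): 0.793, (1, 1): 0.991}

def posterior_after_batch(answers, prior_c6=0.5, prior_c7=0.5):
    """Return (P(C6 = 1), P(C7 = 1)) after observing a batch of item answers.

    answers -- list of 0/1 results for the four items of one LO.
    """
    joint = {}
    for c6, c7 in product([0, 1], repeat=2):
        # Prior probability of this configuration of the two concepts.
        p = (prior_c6 if c6 else 1 - prior_c6) * (prior_c7 if c7 else 1 - prior_c7)
        # Likelihood of the answers; items are conditionally independent given C6, C7.
        for a in answers:
            p_correct = P_CORRECT[(c6, c7)]
            p *= p_correct if a == 1 else 1 - p_correct
        joint[(c6, c7)] = p
    total = sum(joint.values())
    p_c6 = sum(p for (c6, _), p in joint.items() if c6 == 1) / total
    p_c7 = sum(p for (_, c7), p in joint.items() if c7 == 1) / total
    return p_c6, p_c7

# A student who answers all four items of the LO correctly:
print(posterior_after_batch([1, 1, 1, 1]))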
As an example, Figure 6 shows the state of the BN after the student has answered the four test items of an LO associated with concepts C6 and C7:
All BSMs in this work have been modeled and implemented using GeNIe and SMILE³.
To integrate this model into PMatE, we have developed a web-based application using Microsoft's Visual Studio 2005 and the C# classes for Bayesian networks provided in SMILE. All the information about the LOs needed to generate the questions and to estimate the model parameters, as well as the current state of each student model, is stored and maintained in a Microsoft SQL database.
Next we will discuss how the evaluation was performed, and present some
results.
³ GeNIe (Graphical Network Interface) and SMILE (Structural Modeling, Inference, and Learning Engine) are available at http://genie.sis.pitt.edu.
Figure 6: State of the network after four questions have been answered
4. Evaluation with real students
A very important difference between evaluating the diagnosis capabilities of systems with simulated⁴ and with real students is that in the first case the evaluation of the system can be compared to the real state of knowledge of the student, which has been generated automatically. This method has the advantage that the performance of the diagnosis process can be measured more accurately, as the variable of interest can be isolated from other variables such as the quality of the test, the subjectivity of the teachers' evaluation, etc. In contrast, in the evaluation with real students, the real state of knowledge of the student is a hidden variable. In this case, we are also dealing with all those other variables that are difficult to isolate in real settings and that increase the uncertainty. That is why we believe that a sound evaluation of a diagnosis model must include both kinds of evaluation.
An extensive evaluation of the proposed model with simulated students
was presented in (Millán and Pérez de la Cruz, 2002). More recently, a preliminary but encouraging evaluation with twenty-eight ninth-grade students
was presented in (Castillo et al., 2010).
In this paper we present a more extensive evaluation, with a larger set
⁴ Evaluation with simulated students (VanLehn et al., 1994) is a technique commonly used in student modeling to evaluate a model's predictive performance.
of students: 152 ninth-grade students (14-15 years old) belonging to six different groups from two private schools in the Figueira da Foz district, Portugal.
The first research question we want to answer in the work presented here is:
Question 1. Is it possible to develop and implement a BSM within PMatE that improves the diagnosis capabilities of the computer tests?
To answer this question, we needed a reasonable estimation of the students' knowledge level. To this end, we conducted a written exam that was graded by three different teachers according to previously fixed criteria. If the inter-rater agreement is high, the average of these evaluations can be used to compare against the estimations obtained by our Bayesian student model.
The written exam can also be automatically graded. To this end, we have also developed a BSM for the written exam, based on the GBSM, that allows us to estimate the student's knowledge level at the different levels of granularity given the student's answers. This raises a new research question:
Question 2. Is it possible to develop and implement a BSM for the written exam that provides a reasonable estimation of the student's knowledge level at the different levels of granularity?
In this way we can measure the diagnosis capabilities of the GBSM with real students, independently of the constraints imposed by the use of PMatE.
Sections 4.1 and 4.2 provide details of the written exam and the computer test, respectively, while Section 4.3 presents some evaluation results.
4.1. The written exam
Each of the 152 participants completed a written exam. The exam was composed of 14 questions, some of which had several sections. It included different question types: i) multiple choice (9 items), ii) fill in the blank or matching (4 items), iii) short answer (17 items), and iv) problem solving (9 short problems). We decided to include several different types of exercises to allow for a better estimation of the real state of knowledge of each student.
Figure 7 shows an example of one of the questions, developed to evaluate
concepts C6 (addition), C7 (multiplication) and C10 (equivalence principle).
The example question is composed of two sections: the first one is a short problem-solving item; the second one is a multiple-choice question with three items. The use of this kind of question allows the evaluation of the same basic concepts that were evaluated in the LO illustrated in Figure 3.
Figure 7: An example of a question in the written exam
Figure 8 illustrates the relationships of each concept with the different
questions.
Figure 8: Relationships among concepts and questions in the written exam
Each exam was graded by three different teachers, according to previously fixed criteria: simple concepts were evaluated by assigning them a number between 0 and 1. This number represents the teacher's estimation of how well the student had understood that particular concept (according to his/her answers to the set of related questions). The estimations for compound concepts were then automatically computed as weighted averages (using the BSM weights, as stated in Equation 1).
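As a simple illustration of this automatic aggregation (a sketch under the same weighting scheme; the function name is ours), the compound-concept estimate is just the weighted average of the expert's scores for its parts:

def compound_estimate(part_scores, weights):
    """Weighted average of the expert's scores for the parts of a compound concept.

    part_scores -- scores in [0, 1] assigned to the simple concepts.
    weights     -- the corresponding normalized BSM weights (Equation 1).
    """
    return sum(score * weight for score, weight in zip(part_scores, weights))

# Example with the weights of C11_Resolution (0.25 for C5, 0.75 for C10):
print(compound_estimate([0.6, 0.8], [0.25, 0.75]))   # 0.6*0.25 + 0.8*0.75 = 0.75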
If the inter-rater agreement proves to be high, the experts' average score can be safely used as an estimation of the hidden variables (the student's knowledge level for each concept, at the different levels of granularity). We will discuss this further in Section 4.3.1.
As explained before, given the answers of each student to the questions in the written exam, it is possible to obtain estimations of the student's knowledge level at the different levels of granularity under the framework of the GBSM. To this end, we defined a set of parameters to obtain the conditional probabilities for the graph represented in Figure 8, following a process similar to the one described in Section 3. As in the case of the computer test, the parameters were set by the domain modelers, who assigned varying difficulty factors to each question (ranging from 1 to 5). As the questions are open-answer, they decided to set low values for the slip and guess factors, more precisely guess = 0.05 and slip = 0.01. The discrimination index was set to 1.4 for all questions.
4.2. The computerized test
For the computerized test, we have re-used the LOs available in EQUAmat. As explained before, each LO is parameterized, and composed of a stem and 4 true/false items. The parameters allow for dynamic generation of the LOs for each particular student.
Each of the 152 students answered 14 LOs, i.e., 56 randomized true/false test items (in batches of four). An example of such an LO has already been shown in Figure 3. This LO evaluates the same concepts as the written exam question shown in Figure 7.
Figure 9 illustrates the relationships of each concept with the different
questions.
4.3. Evaluation results
In this section we will present some evaluation results. As mentioned above, the first research question we want to answer is:
Question 1. Is it possible to develop and implement a BSM within PMatE that improves the diagnosis capabilities of the computer tests?
To answer it, we need to find a method to measure the diagnosis capabilities of PMatE. As we are working with real students, the knowledge level of each student (at the different levels of granularity) remains a hidden variable.
Figure 9: Relationships among concepts and questions in the computer test
If the inter-rater agreement is high, we can use the average of the scores given by the experts as our target variables.
We want to test the agreement among the different estimations of the probability of each student knowing each concept (given by the experts, by the average of the experts, or by the computer tests). Therefore the variables we want to compare take continuous values, ranging from 0 to 1. For continuous variables, Bland-Altman plots (Altman and Bland, 1983) provide a means to picture inter-rater agreement and, together with confidence intervals for the mean difference, a good method to measure it (Hamilton and Stamey, 2007). In the next sections we will present some results.
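As an illustration of how these agreement measures can be obtained, the sketch below (not taken from the original study; the names and the toy data are ours) computes, for two paired sets of scores, the quantities used in the next sections: the coordinates of the Bland-Altman plot (mean of the two scores on the x-axis, their difference on the y-axis), the mean difference, its standard deviation, and an approximate 95% confidence interval for the mean difference.

import numpy as np

def bland_altman(scores_a, scores_b, z=1.96):
    """Bland-Altman agreement statistics for two paired sets of scores.

    scores_a, scores_b -- scores in [0, 1] given by two raters (or methods)
    to the same student/concept pairs.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    means = (a + b) / 2.0               # abscissa of each plotted point
    diffs = a - b                       # ordinate of each plotted point
    mean_diff = diffs.mean()
    sd_diff = diffs.std(ddof=1)
    half_width = z * sd_diff / np.sqrt(len(diffs))
    ci = (mean_diff - half_width, mean_diff + half_width)   # CI for the mean difference
    return means, diffs, mean_diff, sd_diff, ci

# Toy usage with made-up scores for five concept evaluations:
expert1 = [0.8, 0.5, 1.0, 0.2, 0.6]
expert2 = [0.7, 0.5, 0.9, 0.2, 0.7]
_, _, m, s, ci = bland_altman(expert1, expert2)
print(f"mean difference {m:.3f}, sd {s:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")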
4.3.1. Measuring inter-rater agreement among experts
In the next experiment, we measured the inter-rater agreement among experts when scoring the students' written exams. As we want to measure the level of agreement at the different granularity levels, concepts are grouped into the levels of granularity shown in Figure 4, namely:
• Elementary concepts (C1 to C8),
• First level of compound concepts (C9 and C10),
• Second level of compound concepts (C11),
• Third level of compound concepts (C12, which corresponds to the overall score).
Overall, there are 152 students, and for each student 12 concepts have been evaluated; the total number of data points available is therefore 1824. A first look at these data shows that the difference between the scores given by the experts is 0 in 958 cases (experts 1 and 2), 919 cases (experts 2 and 3) and 892 cases (experts 1 and 3). That is, the scores given by the experts are equal in 78.78%, 75.58% and 73.36% of the cases, respectively.
Figure 10 shows the Bland-Altman plots for each of these categories. Each point in the Bland-Altman plots represents one of the concepts. The mean of the scores given by the two experts is represented as the abscissa (x-axis) value, while the difference between the two scores is represented as the ordinate (y-axis) value. If the scores tend to agree, the y-axis value will be near 0. The lower and higher values of the abscissa represent concepts with lower and higher levels of knowledge, respectively. Also, Table 3 shows the 95% confidence intervals for the mean difference at each level of granularity.
Figure 10: Bland-Altman plots for inter-agreement among experts
The small size of the confidence intervals at the different levels of granularity shows that the inter-rater agreement is quite high. The worst results are obtained at the second level (C11), for which the highest inter-rater agreement is achieved between experts 2 and 3. The Bland-Altman plots in Figure 10 also show that agreement is higher for students with either low or high levels of knowledge, especially for the elementary concepts. This may be due to the fact that intermediate students are more difficult to diagnose, as their answering behavior is more erratic.
Table 3: Confidence intervals for the mean difference among experts

ELEMENTARY (C1-C8)        MEAN    STANDARD DEVIATION   Confidence Interval
Expert 1 vs Expert 2      0.043   0.105                (0.037, 0.049)
Expert 2 vs Expert 3      0.039   0.087                (0.034, 0.044)
Expert 1 vs Expert 3      0.049   0.105                (0.044, 0.055)

FIRST LEVEL (C9, C10)     MEAN    STANDARD DEVIATION   Confidence Interval
Expert 1 vs Expert 2      0.084   0.113                (0.072, 0.097)
Expert 2 vs Expert 3      0.088   0.123                (0.074, 0.102)
Expert 1 vs Expert 3      0.056   0.072                (0.048, 0.064)

SECOND LEVEL (C11)        MEAN    STANDARD DEVIATION   Confidence Interval
Expert 1 vs Expert 2      0.166   0.176                (0.138, 0.194)
Expert 2 vs Expert 3      0.040   0.067                (0.029, 0.051)
Expert 1 vs Expert 3      0.144   0.146                (0.120, 0.167)

THIRD LEVEL (C12)         MEAN    STANDARD DEVIATION   Confidence Interval
Expert 1 vs Expert 2      0.061   0.053                (0.053, 0.070)
Expert 2 vs Expert 3      0.030   0.030                (0.025, 0.035)
Expert 1 vs Expert 3      0.067   0.056                (0.058, 0.076)
The small size of the confidence intervals for the mean difference allows
us to conclude that the degree of agreement is quite high at the different
granularity levels, so we can safely use the average scores as a reasonable
estimation of student’s knowledge level.
4.3.2. Measuring inter-rater agreement among computer test, written exam,
and expert average
Given that the inter-rater agreement among experts is high, in what follows we will use the experts' average score as a reliable estimation of the students' knowledge level at the different levels of granularity.
Now we want to measure the degree of agreement between this reference and the estimations provided by the BSM defined for the computer test, and also between this reference and the estimations provided by the BSM for the written exam.
Figure 11 shows the Bland-Altman plots for the inter-rater agreement between the average of the experts (Exp.Av.) and the BSM estimation obtained for the test and for the written exam (BN test and BN exam, respectively). Table 4 shows the 95% confidence intervals for the mean difference.
Figure 11: Bland-Altman plots for inter-agreement among expert average, BN test and
BN exam
Table 4: Confidence intervals for the mean difference among expert average, BN test and BN exam

ELEMENTARY (C1-C8)            MEAN    STANDARD DEVIATION   Confidence Interval
Expert Average vs BN test     0.317   0.223                (0.304, 0.329)
Expert Average vs BN exam     0.118   0.130                (0.110, 0.125)

FIRST LEVEL (C9, C10)         MEAN    STANDARD DEVIATION   Confidence Interval
Expert Average vs BN test     0.255   0.175                (0.235, 0.274)
Expert Average vs BN exam     0.096   0.100                (0.084, 0.107)

SECOND LEVEL (C11)            MEAN    STANDARD DEVIATION   Confidence Interval
Expert Average vs BN test     0.294   0.269                (0.252, 0.337)
Expert Average vs BN exam     0.162   0.131                (0.141, 0.183)

THIRD LEVEL (C12)             MEAN    STANDARD DEVIATION   Confidence Interval
Expert Average vs BN test     0.185   0.140                (0.163, 0.208)
Expert Average vs BN exam     0.058   0.043                (0.051, 0.064)
Bland-Altman plots and confidence intervals for the mean difference help
us to answer our research questions:
Question 1. Is it possible to develop and implement a BSM within PMatE that improves the diagnosis capabilities of the computer tests?
Question 2. Is it possible to develop and implement a BSM for the written exam that provides a reasonable estimation of the student's knowledge level at the different levels of granularity?
Regarding the first research question, the Bland-Altman plots and confidence intervals show that the answer is negative. The confidence intervals for the mean difference account for differences between a minimum of 0.163 and a maximum of 0.337, depending on the granularity level. The conclusion is that, in the current settings of PMatE, we failed to produce a BSM that yields a reliable estimation of the students' knowledge level at the different levels of granularity.
However, in our opinion the results have been negatively affected by a set of constraints imposed by the use of PMatE, more precisely a very high guessing factor (g = 0.5) and the fact that students needed to answer all questions even if they were unsure of the right answer. These two constraints certainly increase the degree of randomness in students' answers, thus making the diagnosis process more difficult.
With respect to the second research question, we observe a higher degree of agreement, so the answer is positive. Both the Bland-Altman plots and the confidence intervals for the agreement between the BSM for the written exam and the expert average are very similar to those corresponding to the inter-rater agreement among experts. In particular, the confidence intervals show small mean differences, comparable to the mean differences among experts, except perhaps in the case of the elementary concepts (for which the confidence interval still shows an acceptable rate of agreement) and the second level of the granularity hierarchy (where a comparable disagreement rate among experts was also observed).
Overall, we think that the disagreement between the computer test and the expert average was due to the use of different measurement instruments (written exam vs. computer test) rather than to the diagnosis capabilities of the BSM.
4.3.3. Measuring inter-rater agreement among the computer test with and without the BSM
Finally, to answer the first research question there is still one pending issue, namely the comparison of the diagnosis capabilities of the computer test with and without the BSM for the highest level concept, or overall knowledge, C12. For completeness, we will also include in this analysis the results corresponding to the BSM for the written exam.
To this end, we assigned an overall score to each student, based on the percentage of correct answers (PCA) in the computer test, and used this measure as an estimation of the overall performance of the student in the test. This new estimation is then compared to the diagnosis obtained for the overall concept C12 by both the BSM in the computer test and the BSM in the written exam (BN test and BN exam, respectively), and to the experts' average (Exp.Av.). Figure 12 shows the corresponding Bland-Altman plots.
Figure 12: Bland-Altman plots for BN test, BN exam and Expert Average (in the overall concept C12)
The 95% confidence intervals are shown in Table 5:
Table 5: Confidence intervals for the mean difference among expert's average, BN test and BN exam, at the overall concept C12

C12                           MEAN    STANDARD DEVIATION   Confidence Interval
Expert Average vs BN test     0.185   0.140                (0.022, 0.163)
Expert Average vs BN exam     0.058   0.043                (0.007, 0.051)
Expert Average vs PCA         0.159   0.109                (0.017, 0.142)
BN test vs PCA                0.179   0.139                (0.022, 0.160)
We can see that, compared to the expert average, PCA tends to overestimate. This is not surprising, as there is a 0.5 guessing factor that has not been corrected. In contrast, the BN test tends to underestimate, both compared to the expert average and to PCA. The performance of the computer test is similar in both cases (with both PCA and the BSM the degree of agreement is not very high). The only case in which the confidence intervals show a high degree of agreement is between the BSM for the written exam and the expert average. This again supports our hypothesis that the low degree of agreement when using the computer test is due to the constraints imposed by the use of EQUAmat and not to the BSM.
5. Conclusions and Future Work
The work presented here can be described as the development, integration and evaluation of a Bayesian Student Model, built using an existing GBSM (Millán and Pérez de la Cruz, 2002), within an existing testing system (Sousa et al., 2007). To this end, the major steps have been: a) development of the Bayesian model and its integration into the testing system (as described in Section 3), and b) evaluation of its diagnosis capabilities with 152 real students.
For the evaluation with real students, a computer test was constructed by re-using the LOs available in EQUAmat, and a written exam was also developed. Each of the 152 students participating in the experiment took both the computer test and the written exam. The written exam was graded by three different human experts. For comparison purposes, a BSM for the written exam has also been developed, using the GBSM approach.
The results of the evaluation have shown a high degree of inter-rater agreement among the experts' scores on the written exam, which has allowed us to safely use the average expert score as a reliable estimation of the students' knowledge level at the different levels of granularity.
With respect to the diagnosis performed by the BSMs for the computer test and the written exam, it has been shown that the BSM defined for the computer test fails to provide an acceptable rate of agreement with the experts' average, probably due to the restrictive conditions imposed by the use of PMatE. However, the BSM developed for the written exam is able to provide an estimation of the students' knowledge level at the different levels of granularity with a high inter-rater agreement with the experts' average (comparable to the rate of agreement among experts). This reinforces our hypothesis that the poor results of the computer test are due to the constraints imposed by PMatE, and not to the BSM.
Future work is planned in several different directions:
• Testing the GBSM with multiple-choice tests and without forcing students to answer all questions. We think that this will reduce the randomness of the answers and facilitate a more accurate diagnosis.
• Implementation of parametric learning techniques. Though the GBSM can provide an acceptable initial BSM, the use of standard parametric learning techniques can certainly improve the quality of the network parameters, and therefore allow a more accurate diagnosis.
• Improvements in the theoretical model. The goal is to make the whole approach more sound and applicable to real situations. For example, we plan to include: questions connected to compound concepts, prerequisite relationships (Carmona et al., 2005) and adaptive item selection criteria (Millán and Pérez de la Cruz, 2002) that will allow us to increase the accuracy of the diagnosis while reducing the number of questions needed.
• Increased functionality. Currently, the BSM is inspectable by students. We plan to include interactivity so that, for example, a student can select a topic to be questioned about, and the system automatically generates the questions to be posed. In this way, whenever students feel that their model does not accurately reflect their state of knowledge, they can take a test so that the model is adjusted.
References
Altman, D.G. & Bland, J.M. (1983). Measurement in medicine: the analysis of method comparison studies. The Statistician, 32, 307-317.
Carmona, C., Millán, E., Pérez-de-la-Cruz, J., Trella, M., & Conejo, R.
(2005). Introducing prerequisite relations in a multi-layered Bayesian student model. UM’05. LNAI 3538, (pp. 347-356). Springer Verlag.
Castillo, G., Descalço, L., Diogo, S., Millán, E., Oliveira, P. & Anjo, B. (2010). Computerized Evaluation and Diagnosis of Student's Knowledge based on Bayesian Networks. EC-TEL 2010, 5th European Conference on Technology Enhanced Learning, Barcelona. LNAI 6383 (pp. 494-499). Springer Verlag.
Conati, C., Gertner, A., VanLehn, K. & Druzdzel, M. (1997). On-line Student Modelling for Coached Problem Solving using Bayesian Networks.
Proceedings of UM’97, (pp. 231-242). Springer Verlag.
Collins, J.A., Greer, J.E. & Huang, S.H. (1996). Adaptive Assessment Using
Granularity Hierarchies and Bayesian Nets. ITS’96. LNCS, Vol. 1086 (pp.
569-577). Springer Verlag.
Hamilton, C. & Stamey, J. (2007). Using Bland-Altman to assess agreement between two medical devices: don't forget the confidence intervals. Journal of Clinical Monitoring and Computing, 21, 331-333.
Jameson, A. (1996). Numerical Uncertainty Management in User and Student Modeling: An Overview of Systems and Issues. User Modeling and
User-Adapted Interaction, 5, 193-251.
Millán, E., Loboda, T. & Pérez-de-la-Cruz, J.L. (2010). Bayesian networks for student model engineering. Computers & Education, 55(4), 1663-1683.
Millán, E. & Pérez de la Cruz, J.L. (2002). A Bayesian Diagnostic Algorithm
for Student Modeling. User Modeling and User-Adapted Interaction 12,
281-330.
Millán, E., Pérez de la Cruz, J.L. & García, F. (2003). Dynamic versus Static Student Models Based on Bayesian Networks: An Empirical Study. KES'03. LNCS 2774, (pp. 1337-1344). Springer Verlag.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference, Morgan Kaufmann Publishers, Inc.
Sousa Pinto, J., Oliveira, P., Anjo, B., Vieira, S.I., Isidro, R.O. & Silva, M.H. (2007). TDmat: mathematics diagnosis evaluation test for engineering sciences students. International Journal of Mathematical Education in Science and Technology, 38(3), 283-299.
VanLehn, K., Niu, Z., Siler, S. & Gertner, A.S. (1998). Student Modeling from Conventional Test Data: A Bayesian Approach Without Priors.
ITS’98. LNCS, vol. 1452, (pp. 434-443). Springer Verlag.
VanLehn, K., Ohlsson, S. & Nason, R. (1994). Applications of Simulated Students: an exploration. Journal of Artificial Intelligence in Education, 5(2), 135-175.
Vomlel, J. (2004). Bayesian Networks In Educational Testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
12(Supplement-1): 83-100.