Cognitive Diagnostic Assessment
I Introduction
Cognitive diagnostic assessment (CDA) aims to provide formative diagnostic feedback through fine-grained reporting of test takers' skill mastery profiles (Buck & Tatsuoka, 1998; DiBello, Stout, & Roussos, 1995).
1 When the study was conducted, the term Next Generation TOEFL was used to refer to a new
TOEFL test. More recently, this term was replaced by TOEFL iBT.
2 A multi-year research project, funded by Educational Testing Service (ETS), was undertaken to develop the Fusion Model Skills Diagnosis System (DiBello, Roussos, & Stout, 2007; Hartz, 2002).
3 The earliest statistical model in CDA is Fischer's Linear Logistic Test Model (LLTM, 1973), which decomposes item difficulty parameters into discrete skill-based difficulties. This model presupposes that two test takers with the same ability have the same probability of success on any item, which is unrealistic in practice. The well-known Rule Space Model (Tatsuoka, 1983) is test-taker based, in that it classifies test takers into ability vectors of dichotomous mastery and nonmastery. It uses a pattern-recognition approach based on the distance between observed item response patterns and a set of ideal response patterns. It does not, however, provide the statistics necessary to evaluate the diagnostic properties of the test items.
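To make the decomposition concrete, here is a minimal sketch of the LLTM item response function in conventional notation (the notation is standard in the psychometric literature rather than drawn from this article):

$$P(X_{ij} = 1 \mid \theta_j) = \frac{\exp\!\left(\theta_j - \sum_{k} q_{ik}\,\eta_k\right)}{1 + \exp\!\left(\theta_j - \sum_{k} q_{ik}\,\eta_k\right)},$$

where $\theta_j$ is test taker $j$'s ability, $\eta_k$ is the difficulty contribution of skill $k$, and $q_{ik}$ indicates whether item $i$ requires skill $k$. Because the model contains a single continuous $\theta_j$ and no skill mastery indicators, two test takers with the same ability receive the same success probability on every item, which is the limitation noted above.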
mastery on that skill. Therefore, the r* parameters are crucial for evaluating the diagnostic capacity of the test instrument.
ci: an indicator of the degree to which the item response function relies on skills other than those assigned by the Q matrix. The lower the ci, the more the item depends on the residual ability θj. The ci can therefore provide diagnostic information about the completeness of the Q matrix.
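For reference, a sketch of the Fusion Model item response function as parameterized in the cited literature (Hartz, 2002), in which the parameters discussed above appear:

$$P(X_{ij} = 1 \mid \boldsymbol{\alpha}_j, \theta_j) = \pi_i^{*} \prod_{k=1}^{K} \left(r_{ik}^{*}\right)^{q_{ik}(1 - \alpha_{jk})} P_{c_i}(\theta_j),$$

where $\pi_i^{*}$ is the probability that a test taker who has mastered all skills required by item $i$ applies them correctly, $\alpha_{jk}$ is test taker $j$'s dichotomous mastery status on skill $k$, $q_{ik}$ is the Q-matrix entry, $r_{ik}^{*}$ is the penalty for nonmastery of skill $k$ on item $i$, and $P_{c_i}(\theta_j)$ is a Rasch-type term governed by $c_i$ that absorbs skills not coded in the Q matrix.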
A computer program, Arpeggio, is used to estimate the ability and item parameters using a hierarchical Bayesian Markov chain Monte Carlo (MCMC) estimation procedure. The program simulates Markov chains of posterior distributions for all the model parameters given the observed data (Patz & Junker, 1999). MCMC has recently been used in many CDA applications. The most crucial issue in implementing MCMC successfully is evaluating whether the chain has converged to the desired posterior distribution and whether the model parameters have been estimated reliably. Despite various formal convergence tests, no single approach yet yields reliable statistics for checking convergence.
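To illustrate the kind of check involved, here is a minimal Python sketch that summarizes a single parameter's chain after burn-in. The simulated draws are a stand-in for real Arpeggio output (whose format is not described here), and the burn-in value mirrors the 13,000 used later in this study:

```python
import numpy as np

def autocorr(draws, lag):
    """Sample autocorrelation of a chain at a given lag."""
    d = draws - draws.mean()
    return np.dot(d[:-lag], d[lag:]) / np.dot(d, d)

def summarize_chain(draws, burn_in):
    """Posterior mean, SD, and autocorrelations after discarding burn-in.

    A drifting mean across the chain or high autocorrelation at long lags
    suggests the chain has not converged to the target posterior.
    """
    kept = draws[burn_in:]
    return {
        "mean": kept.mean(),
        "sd": kept.std(ddof=1),
        "acf_lag1": autocorr(kept, 1),
        "acf_lag10": autocorr(kept, 10),
    }

# Hypothetical: 30,000 draws for one parameter, 13,000 burn-in.
rng = np.random.default_rng(0)
draws = rng.beta(8, 2, size=30_000)  # stand-in for real Arpeggio output
print(summarize_chain(draws, burn_in=13_000))
```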
If necessary, diagnostically non-informative Q-matrix entries can be removed from the initial Q matrix through an iterative refining process. For example, a high r* value (> 0.9) requires careful examination of whether a skill is essential for correctly answering a given item. However, any change in the Q matrix requires not only statistical evidence but also a theoretically justifiable rationale.
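As a sketch of that screening step (the r* values below are invented; 0.9 is the threshold discussed above):

```python
import numpy as np

# Hypothetical r* estimates: rows are items, columns are skills;
# np.nan marks item-skill pairs not coded in the Q matrix.
r_star = np.array([
    [0.45, np.nan, 0.93],
    [np.nan, 0.95, 0.60],
    [0.30, 0.88, np.nan],
])

# Flag entries with r* > 0.9: the skill barely distinguishes masters
# from non-masters on that item, so the Q-matrix entry is a candidate
# for removal, pending a substantive rationale.
flagged = np.argwhere(np.nan_to_num(r_star, nan=0.0) > 0.9)
for i, k in flagged:
    print(f"Item {i + 1}, Skill {k + 1}: r* = {r_star[i, k]:.2f} -> review")
```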
Given MCMC convergence, one can obtain the posterior prob-
ability of mastery (ppm) for each skill and for each test taker from the
posterior distribution. For example, if a test measures four skills, each
test taker’s skill profile will include four skill mastery probability
values, each indicating the mastery level for one of the four skills.
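Concretely, the ppm for a skill is just the posterior mean of that skill's 0/1 mastery indicator over the retained MCMC draws. A minimal sketch using the four-skill example above (the draws are simulated stand-ins, not Arpeggio output):

```python
import numpy as np

# Hypothetical retained draws of mastery indicators for one test taker:
# 17,000 post-burn-in draws x 4 skills, each entry 0 or 1.
rng = np.random.default_rng(1)
alpha_draws = rng.binomial(1, [0.9, 0.2, 0.7, 0.5], size=(17_000, 4))

# The ppm for each skill is simply the posterior mean of its indicator.
ppm = alpha_draws.mean(axis=0)
print(np.round(ppm, 2))  # one mastery probability per skill
```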
The estimates of the item parameters and the standard errors of the
item parameters need to be evaluated by examining how reliably
each item classifies the test takers into masters or non-masters on
each skill. If convergence has occurred and the standard errors are
relatively small, the estimates of the item parameters will provide
useful information about each item’s diagnostic capability on its
required skills.
However, the standard notion of reliability cannot be directly applied to evaluate most CDA models, such as the Fusion Model, because that notion assumes a continuous, unidimensional latent trait, whereas these models posit multiple discrete skills. Research on CDA modeling needs to develop appropriate ways to examine reliability and statistical quality.
IV Methodology
1 Overview of the original study
In the main study (Jang, 2005), I used a mixed-methods research
design, comprising both quantitative and qualitative approaches over
three developmental phases, in order to make comprehensive valid-
ity arguments through dialectical triangulation of multiple sources of
empirical evidence (Greene & Caracelli, 1997; Tashakkori &
Teddlie, 2003).
In Phase 1, I identified nine primary reading comprehension skills
by analyzing think-aloud verbal protocols and performing statistical
test and item analyses. The verbal protocol participants included
seven ESL students from a TOEFL preparation course offered by an
Intensive Language Program and four graduate ESL students at a
mid-western university in the USA. Each participant verbalized
reading processes and strategies while completing 12 to 13 reading
comprehension questions per passage sampled from the LanguEdge
RC test forms. I recruited five raters to evaluate the reading skills that I had identified from the think-aloud data and provided to them on a list. I asked them to select the skills needed to correctly answer each item and to rank those skills by importance. I also asked them to list any necessary skills not included on the list. The raters showed a moderate level of agreement on the nine skills. These skills
are presented at the beginning of the data analysis section in this
paper. Full results from Phase 1 are not reported here due to space
limitations.
Using those nine skills, Phase 2 examined the characteristics of
skill profiles estimated by the Fusion Model. The Fusion Model was
fitted to the LanguEdge field test data to estimate skill mastery probabilities for 2,703 test takers. I evaluated the dependability of the
Fusion Model skills diagnosis process, the characteristics of the esti-
mated skill profiles, and the diagnostic capacity of the LanguEdge
RC test items. I also used 2,703 test takers’ self-assessments to
obtain background information and to examine the strength of the
relationship between the estimated skill profiles and the test takers’
self-reported reading abilities. I conducted in-depth analysis of skill
profiles from five cases to substantiate the diagnostic capacity of the
given test items.
In Phase 3, to evaluate the usefulness of the diagnostic information
in a classroom setting, the fitted model was applied to two TOEFL
preparation courses involving 28 ESL students. The students took
pre- and post-instruction diagnostic tests and, after each test, received
2 Participants
a LanguEdge field test takers: A total of 2,703 test takers took the
LanguEdge field tests at 32 domestic and international test sites
across 15 countries in 2002. According to the score interpretation
guide in the LanguEdge courseware, the test takers approximated
general TOEFL populations. The test takers consisted of 1,368 males
and 1,299 females. Their reasons for taking the tests varied: 1,662 test
takers planned to study in undergraduate or graduate degree pro-
grams in the USA or Canada. The remainder had various reasons,
such as: (a) entering a school other than college; (b) getting licensed
to practice a profession in the USA; and (c) demonstrating English
proficiency to their employers.
3 Instruments
a The LanguEdge field tests: I used test takers’ response data from
two forms of the LanguEdge RC test for the Fusion Model skill pro-
filing in Phase 2. I used the same test instruments as pre- and post-
instruction diagnostic assessments in Phase 3. The LanguEdge
courseware was developed by Educational Testing Service to be used
as an instructional tool in the ESL classroom. Its expected benefits
included: (a) preparing students for communicative competence in
the real world; (b) assessing students’ progress in all language skills
and informing them of weaknesses needing improvement; and (c)
using subtests for practice and the assessment of progress.
Table 1 Mean scores and reliability estimates of the two test forms
V Results
1 Dependability of the Fusion Model skill profiling process
a Evaluation of MCMC convergence: I examined three different Markov chain lengths of 5,000, 15,000, and 30,000 to determine chain convergence by inspecting a density plot, a time-series plot, and an autocorrelation plot for each. The plots from the 30,000-length chain indicated that convergence had occurred, because there was very little change in the posterior distribution after the first 1,000 steps.
Figure 2 shows the plots from the 30,000-length chain for r*23_3, the r* parameter estimate of Skill 3 (SSL) for Item 23. I present the plots for this item because it showed relatively high posterior standard deviations and jumpy chain convergence.
This result suggested that a relatively long chain (e.g., 30,000) is
necessary for reliable estimation of a large number of parameters in
the case of complex diagnosis modeling. Subsequently, I used the
chain length of 30,000 with the 13,000 burn-in to estimate skill pro-
files throughout the study.
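For readers who want to reproduce this kind of inspection, a sketch with matplotlib, using a simulated chain as a stand-in for the Arpeggio draws of r*23_3 (the Beta shape parameters are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
chain = rng.beta(8, 2, size=30_000)  # simulated stand-in for the r*23_3 chain

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
kept = chain[13_000:]                      # discard the burn-in draws
ax1.hist(kept, bins=50, density=True)      # density plot of the posterior
ax1.set_title("Density")
ax2.plot(chain, linewidth=0.3)             # time-series (trace) plot
ax2.set_title("Time series")
ax3.acorr(kept - kept.mean(), maxlags=50)  # autocorrelation plot
ax3.set_title("Autocorrelation")
plt.tight_layout()
plt.show()
```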
b Evaluation of the Fusion Model parameter estimates: An initial run of the Arpeggio program resulted in more than 16 rik* parameters that were relatively too high (> 0.9), with relatively large standard errors. However, dropping these r*'s from the Q matrix would drastically alter the item-by-skill specifications represented in the Q matrix and thereby might make the Q matrix theoretically less justifiable. It appeared that the c parameters, which account for skills
Figure 2 Density, time-series, and autocorrelation plots for r*23_3 from the chain length of 30,000
5 Roussos, DiBello, Henson, Jang, & Templin (in press) explain that this problem is associated with a test having a single dominant dimension. In such a case, a continuous residual parameter 'soaks up' most of the variance in the item responses. The researchers note that a good solution to this problem is to drop the ci parameters by setting them equal to 10; this reduced version of the Fusion Model has often been used to analyze real data.
Table 3 Fusion Model item parameter estimates for the final Q matrix (Form 1)
Figure 3 Observed and model-estimated cumulative probabilities across total test scores
Note: RMSE = root mean squared error; MAD = mean absolute difference.
scores for 1,372 test takers in Form 1. Each petal on the scatter plot
represents a test taker. The bigger and darker the sunflower is, the
more test takers there are.
The correlation between the model-estimated probabilities of mastery for the nine skills and the total test scores was 0.94, supporting a strong positive relationship. Nevertheless, the plot also clearly pointed to wider distributions for the low- and high-scoring test takers, which was consistent with the misfits presented in Figure 3.
Figure 4 Scatter plot of the numbers of mastered skills against the total scores (N = 1,372)
The skills SUM, CDV and CIV had relatively large numbers of
masters, while the NEG and MCF skills had fewer. In particular, the
SUM skill showed the largest number of masters. The proportion of
test takers determined as masters for each skill was relatively con-
sistent with the observed skill difficulty levels (see Figure 1).
I examined the consistency rates for classifying test takers into the three categories of mastery status by testing the reliability of the classification of mastery status for each skill. As shown in Table 6, the correct classification rates for both test forms were high, supporting the reliability of the test taker classification.
Table 6 Correct classification rates (Overall CCR, CCR for masters (M), and CCR for non-masters (NM)) by skill for Forms 1 and 2
Skill  CDV  CIV  SSL  TEI  TIM  INF  NEG  SUM  MCF  SSR
CDV    –    .82  .86  .81  .82  .83  .85  .84  .70  .39
CIV         –    .80  .78  .78  .79  .83  .79  .65  .43
SSL              –    .84  .86  .85  .88  .84  .74  .34
TEI                   –    .82  .82  .87  .83  .72  .32
TIM                        –    .85  .89  .81  .75  .35
INF                             –    .88  .82  .74  .34
NEG                                  –    .85  .79  .36
SUM                                       –    .72  .35
MCF                                            –    .23
SSR                                                 –
Figure 5 Proportion correct on each item for masters and non-masters (upper panel: Form 1; lower panel: Form 2)
I examined Cases 1 and 5 because they had the most divergent skill profiles. Case 5's skill profile showed that she had mastered all skills except CIV. When I inspected her actual responses to the CIV-associated items, she had failed to respond correctly to all seven items measuring CIV skills; this partially supports her estimated skill mastery profile. Case 1, despite her 'flat' (zero-mastered) skill profile, achieved proportion-correct subscores ranging from 0.50 to 0.60 on the items assessing her non-mastered skills. For example, even though she correctly answered four out of eight items assessing CIV, her mastery probability for the CIV skill was only 0.1, low relative to her observed performance on those items. This may be related to the presence of items
Table 9 Students' skill mastery probabilities from the pre-instruction diagnostic test

Student  Score  CDV   CIV   SSL   TEI   TIM   INF   NEG   SUM    MCF
Karen    21     0.25  0.00  0.15  0.26  0.17  0.26  0.13  0.90a  0.78
Heather  32     0.99  0.95  0.99  1.00  0.99  1.00  0.89  1.00   0.77
Hail     28     0.95  0.92  0.97  0.88  0.24  0.13  0.51b 0.76   0.05
Yoshi    21     0.16  0.02  0.09  0.09  0.21  0.01  0.10  0.42   0.51
Siree    31     0.96  0.99  0.99  0.99  0.96  0.91  0.73  1.00   1.00
Dongin   32     0.98  0.99  0.94  0.98  1.00  0.99  0.95  0.97   1.00
Chris    20     0.85  0.95  0.17  0.00  0.01  0.36  0.18  0.37   0.00
Gao      30     0.98  0.78  0.90  0.89  0.87  0.86  0.77  0.99   0.99
Kyung    37     1.00  0.99  1.00  1.00  1.00  1.00  0.93  1.00   1.00
Hyung    18     0.06  0.09  0.01  0.00  0.24  0.08  0.12  0.05   0.00
Take     19     0.24  0.00  0.23  0.50  0.01  0.01  0.12  0.78   0.81
Ohmi     16     0.04  0.05  0.02  0.02  0.15  0.00  0.03  0.18   0.00
Gkyung   10     0.01  0.01  0.01  0.00  0.13  0.01  0.03  0.00   0.00
Lee      31     0.96  0.84  0.96  1.00  0.83  0.98  0.96  1.00   0.99
Shu      22     0.10  0.00  0.54  0.66  0.01  0.14  0.22  0.27   0.87

a Bold entries indicate mastered skills.
b Italicized entries indicate that skill mastery is not 'determined'.
The mean scores were 25 and 21 for the students in Courses A and B, respectively. Since the students were placed into the courses on the basis of their placement test results, I expected that the students in Course A would perform better on the test than those in Course B. Further, the Course A students' skill profiles showed that they had mastered 68 skills in total (ppm ≥ 0.6), whereas the Course B students had mastered only 27 skills.
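As a minimal sketch of this count, applying the 0.6 cut-off to the first two rows of Table 9 for illustration:

```python
import numpy as np

# ppm values for the first two students in Table 9 (Karen, Heather),
# in the nine-skill order CDV ... MCF.
ppm = np.array([
    [0.25, 0.00, 0.15, 0.26, 0.17, 0.26, 0.13, 0.90, 0.78],
    [0.99, 0.95, 0.99, 1.00, 0.99, 1.00, 0.89, 1.00, 0.77],
])

# A skill counts as mastered when its ppm reaches the 0.6 cut-off;
# the course total is the sum of mastered skills over its students.
print(int((ppm >= 0.6).sum()))  # 2 for Karen + 9 for Heather = 11
```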
more? But why didn’t I get a perfect score then? What does it mean
to be a master of a certain skill?’ This raised concerns about the impli-
cations of skills diagnosis for future action.
When asked whether the test items measured skills associated
with them (see the second page of DiagnOsis, Appendix A), 18% of
the students agreed that the test items assessed the associated skills.
The rest of the students expressed various views, for example:
Student 1: I think the more questions I have, the more I can be convinced to
know about my reading proficiency. But we don’t have enough questions.
Student 2: I think these questions assess the skills well, but I also think those
skills can’t be divided accurately because most questions need combined
skills anyway.
Student 3: Actually I don’t know how much these questions assess those skills
correctly. If I could understand the whole passage well, it won’t matter.
Note: Positive values indicate increase in the posterior probabilities of mastery for the skills after the post-test.
0.8
Probability
Pre-test
0.6
Post-test
0.4
0.2
0
CDV CIV SSL TEI TIM INF NEG SUM MCF
Skills
0.8
0.6 Pre-test
Post-test
0.4
0.2
0
CDV CIV SSL TEI TIM INF NEG SUM MCF
Skills
0.8
Probability
0.6
Pre-test
0.4 Post-test
0.2
0
CDV CIV SSL TEI TIM INF NEG SUM MCF
Skills
Teacher 1 (male, former): Breaking down reading this way is a good diag-
nostic procedure for the student as well as the teacher. The key is to help the
student gain meta-cognitive awareness of the various categories of reading,
so that he or she can understand the feedback and try to improve.
Teacher 2 (female, current): I think the scoring report helped me as a teacher
to understand what the weaknesses were for my students. After examining
the scoring report, I gave extra assignments to my students to help them do
more exercises on their weak skills. Students’ test scores at the end of the
semester showed some improvement.
However, the teachers also raised some important issues concerning
the use of diagnostic feedback. The former teacher, Teacher 1, pointed
out that the use of diagnosis may depend on the context of learning:
Teacher 1: We need to consider differences that lie between EAP (English for
Academic Purpose) courses and test preparation courses that we are talking
about now. In the test preparation courses, there might be more ‘teaching to
the test’ than in an EAP class. Such difference could be an important variable
for evaluating the use of diagnostic feedback.
Teacher 3 also expressed concern about a mismatch between the skills diagnostic approach and her own pedagogical beliefs:
Teacher 3 (female, current): Knowing my students’ strengths and weaknesses
was very useful even though most of them needed to improve almost all skills
after all. But I don’t teach reading separately. I try to encourage students to
study listening, reading, and structure simultaneously. So, I don’t teach the
reading skills included in this scoring report. I can’t teach all these skills in
my class. I don’t have enough time and it’s just not how I teach my students.
In sum, the interviews with the teachers indicated that the usefulness of diagnostic feedback depends on the degree to which it is compatible with teachers' pedagogical approaches and the extent to which the curriculum addresses the content of the feedback.
VI Discussion
Question 1: How dependably does the Fusion Model estimate L2
reading skill profiles?
Because CDA modeling is complex and involves estimating a large number of item parameters, evidence of MCMC convergence is essential for evaluating the quality of diagnosis. Overall, the study results provided multiple lines of positive evidence substantiating the claims about the statistical quality of the Fusion Model. Visual representations of the posterior distributions estimated by Arpeggio clearly indicated that convergence had occurred, and that
can be calculated as (1 − rik*)πi* for all the Q-matrix entries. For example, the DiagnOsis report in Appendix A lists a subset of items for each skill, ordered from highest to lowest DII values. Reporting these weights can be useful when test takers review their performance on the test items. It would be particularly useful for test takers like Case 1 from the case analysis, who showed a large discrepancy between her total test score and her skill profile.
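A sketch of that ordering computation, following the (1 − r*)π* expression quoted above (the parameter values are invented for illustration):

```python
import numpy as np

# Invented final-Q-matrix estimates for five items on one skill.
pi_star = np.array([0.85, 0.90, 0.70, 0.95, 0.80])  # pi* per item
r_star = np.array([0.40, 0.75, 0.55, 0.90, 0.30])   # r* for this skill

# The weight (1 - r*) * pi* is larger when the item separates masters
# from non-masters on the skill more sharply.
dii = (1 - r_star) * pi_star

# List items from most to least informative, as in the DiagnOsis report.
for idx in np.argsort(dii)[::-1]:
    print(f"Item {idx + 1}: DII = {dii[idx]:.2f}")
```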
VII Conclusion
The study investigated the validity of applying the CDA Fusion Modeling approach to an existing large-scale reading comprehension test. This
paper focused on the characteristics and the dependability of the skill
profiles estimated by the Fusion Model and on the usefulness of this
diagnostic feedback in ESL classrooms.
The study results suggest that the CDA approach can provide more
fine-grained diagnostic information about the level of competency in
reading skills than the traditional scoring approach can. The estimates
of skill mastery profiles were reasonably reliable on various measures. The model-estimated statistics approximated the observed test statistics, evidencing a good model fit to the data. On the other hand, the results also point to some concerns. When a CDA model is retrofitted to a test developed for non-diagnostic purposes, it may lose some diagnostic capacity, largely due to test items with extreme difficulty levels. These items do not necessarily contribute much to evaluating test takers' skill competencies. Model misfit with observed data from the high- and low-scoring test takers seems to be associated with the presence of such items. This finding needs to be
Appendix A: A Sample Skill Profile Report (Part I)

DiagnOsis scoring report — Student Name: Margo; LanguEdge Reading Comprehension Test 1. The report charts each reading skill (Skill 1, Skill 2, Skill 3, ...) as a bar on a probability scale from 0 to 1, with each skill classified as 'Needs improvement', 'Not determined', or 'Mastered'. The report notes that this diagnostic information can be more useful when used in combination with your teacher's and your own evaluation of your reading skills.
Appendix A: A Sample Skill Profile Report (Part II)

DiagnOsis scoring report: Primary Skill Descriptions and Example Questions

Skill 1: Understand the meanings of words or phrases by searching and analyzing surrounding text and using contextual clues appearing in the text. With this skill, you can determine which option word or phrase has the closest meaning to the author's intended meaning by making use of clues appearing in the text. Example questions: 33, 14, 32, 4, 3, 11.

Skill 2: Determine word meanings by identifying which option word or phrase has the closest meaning to the referenced word or phrase in the text. Textual clues often do not appear explicitly in the text. With this skill, you can comprehend the text using your prior knowledge of vocabulary. Example questions: 9, 27, 10, 29, 19, 21, 7.

Skill 3: Comprehend grammatical relationships of words or phrases across successive sentences. With this skill, you can identify words or phrases that particular pronouns refer to OR determine where a new sentence can be inserted without logical or grammatical problems in the text. Example questions: 3, 26, 12, 36, 4, 2, 22, 33, 24.

Skill 4: Search across sentences within a paragraph and locate relevant information that is clearly stated in the text. Words or phrases in the options often match literally with words or phrases appearing in the relevant section of the text. Example questions: 22, 18, 30, 17, 8, 24, 36, 20, 12, 25, 14.

Skill 5: Search across paragraphs and locate relevant information that is not clearly stated in the text. Words or phrases in the options do not match literally with those in the text, but they are paraphrased in different words or phrases. With this skill, you can determine which option most accurately preserves the author's intended meaning. Example questions: 6, 34, 26, 4, 5, 35.

? Skill 6: Comprehend an argument or an idea that is implied but not explicitly stated in the text OR the author's rhetorical purpose of mentioning particular phrases in the text. With this skill, you can infer information that is implied from the text or can determine the author's underlying purpose in using particular expressions. Example questions: 31, 16, 23, 15, 28, 2, 11, 7, 32.

Skill 7: Search and locate relevant information across the text and determine what information is true or not true. With this skill, you can verify which option is true or false based on information stated in the passage or identify key information from negatively stated options (or questions). Example questions: 22, 7, 28, 5.

Skill 8: Identify major ideas by distinguishing them from nonessential information across paragraphs. With this skill, you can distinguish major ideas from ideas that are not mentioned or are less important in the text. Example questions: 13, 5, 17, 25, 20.

Skill 9: Recognize major contrasts and arguments presented across paragraphs. With this skill, you can comprehend the organization of the text, which often contains relationships such as compare/contrast, cause/effect, or alternative arguments, and can determine major contrasting ideas or arguments. Example questions: 37, 23, 35.

• Not all example questions are equally informative in assessing related skills. Questions are listed in order from most informative to least informative for your review.
• '•' indicates that these skills are weak areas you need to improve; '?' indicates that your mastery is not determined.