
Language Testing 2009 26 (1) 031–073

Cognitive diagnostic assessment


of L2 reading comprehension ability:
Validity arguments for Fusion Model
application to LanguEdge assessment
Eunice Eunhee Jang University of Toronto, Canada

With recent statistical advances in cognitive diagnostic assessment (CDA), the


CDA approach has been increasingly applied to non-diagnostic tests partly to
meet accountability demands for student achievement. The study aimed to
evaluate critically the validity of the CDA application to an existing non-diag-
nostic L2 reading comprehension test and to provide information about chal-
lenges and conditions for the CDA approach. Based on Jang’s study (2005),
this paper focuses on the dependability of the Fusion Model’s skill profiling,
the characteristics of resulting L2 skill profiles, and the diagnostic capacity of
LanguEdge™ test items. In addition, the paper examines the validity argu-
ments from the users’ perspective by focusing on the usefulness of the diag-
nostic feedback. The results suggest that the CDA approach can provide more
fine-grained diagnostic information about the level of competency in reading
skills than traditional aggregated-test scoring can. While various empirical
evidence supported the dependability of the skill profiling process, the results
also raised some concerns about the application of the CDA approach to a test
developed for non-diagnostic purposes, most significantly, a lack of diagnos-
tic capacity of some of the test items with extremely easy or difficult levels.
The results offer useful information about the potential challenges and condi-
tions for future application of cognitive diagnostic assessment.

Keywords: cognitive diagnostic assessment (CDA), diagnostic feedback,


Fusion Model, language testing, LanguEdge assessment, reading compre-
hension, skill profiles, validity arguments

I Introduction
Cognitive diagnostic assessment (CDA) aims to provide formative diag-
nostic feedback through fine-grained reporting of test takers’ skill mas-
tery profiles (Buck & Tatsuoka, 1998; DiBello, Stout, & Roussos, 1995;

Address for correspondence: Eunice Eunhee Jang, Ontario Institute for Studies in Education, University
of Toronto, 10–260, 252 Bloor Street West, Toronto, ON M5S 1V6; email: [email protected]

© 2009 SAGE Publications (Los Angeles, London, New Delhi and Singapore) DOI:10.1177/0265532208097336

Embretson, 1991, 1998; Frederiksen, Glaser, Lesgold & Shafto, 1990;


Hartz, 2002; Nichols, Chipman, & Brennan, 1995; Shohamy, 1992;
Tatsuoka, 1983). The CDA approach combines theories of cognition of
interest with statistical models intended to make inferences about test
takers’ mastery of tested skills. Yet, we lack clear distinctions among
terms such as ‘processes,’ ‘skills,’ and ‘strategies’ (Grabe, 2000; Weir,
Huizhong, & Yan, 2000). In this paper, I use ‘cognitive skills’ to refer to
test takers’ abilities that influence item performance through conscious
and subconscious processing of information. The term ‘skill profile’
refers to diagnostic feedback specifying an individual test taker’s com-
petencies in tested skills.
In general, the primary goal of traditional educational tests is to
make inferences about an individual test taker’s general ability with
reference to other test takers in the normative group (Brown &
Hudson, 2002). Such traditional testing has been criticized for not
providing diagnostic information to inform students of their
strengths and weaknesses in a specific academic domain (Nichols,
1994; Snow & Lohman, 1989). As standardized tests are thus being
increasingly recognized as unsatisfactory (Mislevy, Almond &
Lukas, 2004), testing communities have called for more diagnostic
information for guiding learning, improving instruction, and evalu-
ating students’ progress.
Different purposes of assessments strongly affect how to interpret
test takers’ competencies in the tested skills. For example, if the pur-
pose of assessment is to discriminate test takers by locating them on
a continuous ability scale, an aggregated total test score or subscores
based on modality (e.g., listening, reading, and writing) should
suffice for measuring the test takers’ overall competencies.
On the other hand, an assessment designed to evaluate learners’
competencies in micro-level skills requires a much finer-grained rep-
resentation of the construct of interest. The CDA approach aims to
achieve this by evaluating individual test takers’ competencies in a
set of user-specified skills. The approach thus needs to be based on
a substantive theory of the construct that describes the processes
which a learner uses to perform on tasks. It also requires clear speci-
fications that delineate the tasks in terms of how they elicit cognitive
processes.
Few existing assessments are specifically designed for provid-
ing diagnostic feedback (Alderson, 2005; Gorin, 2007). Therefore,
there is a great need for a diagnostic test that includes cognitive tasks
suited for diagnosing learners’ strengths and weaknesses in the tested

skills. Such a diagnostic test requires a systematic design framework


involving multiple steps (Davidson & Lynch, 2002; Mislevy, Steinberg,
& Almond, 2003; Pellegrino, Chudowsky, & Glaser, 2001). The design
framework of CDA can include: (1) defining the learning and instruc-
tional goals that serve as criteria for the diagnosis; (2) designing spe-
cific tasks that are diagnostically informative in evaluating a learner’s
competency in light of these criteria; (3) developing a scoring system
that provides fine-grained diagnostic information; and (4) optimizing
the reporting of diagnosis to maximize its usefulness.
Due to a lack of existing diagnostic tests, CDA has been retrofit-
ted to existing achievement or proficiency tests in the hope of pro-
viding fine-grained diagnostic feedback beyond the aggregated test
scores. For example, CDA models such as the Rule Space Model
(Tatsuoka, 1983, 1990) and Sheehan’s tree-based regression approach
(1997) were applied to existing language tests. Buck and Tatsuoka
(1998) and Kasai (1997) applied the Rule Space Model to a short-
answer listening comprehension test administered to 412 Japanese
college students and to the Test of English as a Foreign Language
(TOEFL) reading subtests respectively. Sheehan (1997) applied the
tree-based regression approach to the Scholastic Aptitude Test (SAT)
I Verbal Reasoning test as well as the TOEFL reading test.

II Purpose of the study


When the Internet-based (iBT) Test of English as a Foreign Language
(TOEFL) was launched, one of its aims was to provide better infor-
mation to institutions about students’ ability to communicate in an
academic setting and their readiness for academic coursework. To
assist teachers with instruction, the LanguEdge courseware was
developed to serve as an instructional tool for English as a second lan-
guage (ESL) classrooms. It was designed to provide teachers and stu-
dents with a computer-based assessment and to provide students,
teachers, and institutions with useful information about the learner’s
competency in English. The courseware included a prototype of Next
Generation TOEFL1 in two test forms, which were field-tested on
more than 2700 students at 32 test sites in 15 countries in 2002.

1 When the study was conducted, the term Next Generation TOEFL was used to refer to a new

TOEFL test. More recently, this term was replaced by TOEFL iBT.

The purpose of the present study (Jang, 2005), part of a large-scale


research project,2 was to examine critically the validity of the applica-
tion of the Fusion Model to the L2 reading comprehension assessment
from the LanguEdge courseware. Because the courseware was
intended to serve as an instructional and learning tool for ESL teachers
and students, using the courseware for diagnostic purposes was deemed
justifiable. However, the test included in the courseware was not
designed specifically to provide diagnostic feedback on micro-level
skills. The validity of the CDA’s application to the non-diagnostic L2
reading comprehension test required multiple lines of evidence. The
following validity assumptions were prioritized:
1. the construct of L2 reading comprehension ability measured by
the LanguEdge reading comprehension test can be decomposed
into a set of distinguishable reading skills;
2. diagnostic feedback generated from the application of the CDA
Fusion model to the LanguEdge L2 reading comprehension test
can produce dependable diagnostic information about an indi-
vidual learner’s strengths and weaknesses in reading ability; and
3. teachers and students will make use of diagnostic information to
gauge learning and enhance instruction.

III Literature review


1 Reading as a construct of multi-divisible skills
The application of CDA to an L2 reading comprehension assessment
presupposes that the construct of reading comprehension ability can
be decomposed into a set of multiple cognitive skills. However, the
extent to which such skills are independently identifiable remains
controversial (Alderson, 2000; Alderson & Lukmani, 1989; Lumley,
1993; Weir et al., 2000). For example, some researchers contended
that reading consists of a single global construct (Carver, 1992; Rost,
1993) and questioned the extent to which reading skills can be iden-
tified in a hierarchical manner (Alderson & Lukmani, 1989).
Other researchers agree that the reading construct is composed of
multi-componential skills but differ on the number, scope and nature
of reading skills (Anderson, Bachman, Perkins, & Cohen, 1991;

2 A multi-year research project, funded by Educational Testing Service (ETS), was undertaken to

develop the Fusion Model Skills Diagnosis System (DiBello, Roussos, & Stout, 2007; Hartz, 2002).

Bachman, Davidson, & Milanovic, 1996; Grabe, 1991; Lumley, 1993;


Nevo, 1989; Weir et al., 2000). For example, one view suggested that
reading is bi-componential, comprising two separately identifiable fac-
tors: vocabulary and general reading comprehension (Berkoff, 1979;
Carver, 1992). Weir and Porter (1996) proposed a fourfold categoriz-
ation of reading skills and strategies for an English for Academic
Purposes reading test: expeditious reading at the global and local
levels and careful reading at both levels.
Besides the kinds and scope of reading skills, reading researchers
have examined reading processes and strategies by engaging students
and teachers in think-aloud verbal protocols and interviews
(Afflerbach & Johnston, 1984; Anderson et al., 1991; Cohen, 1987;
Cohen & Hosenfeld, 1981; Cohen & Upton, 2007; Faerch & Kasper,
1987; Nevo, 1989). In particular, research on reading processes and
skills in a testing situation as compared to a non-testing situation
suggested that the strategies used depended on the types of test ques-
tions (Farr, Pritchard, & Smitten, 1990) and that students in the test-
ing situation used strategies more frequently than students who read
in the non-testing situation (Cordon & Day, 1996). Li (1992)
observed that readers make use of a variety of skills to successfully
complete a task and process textual information, some of which the
test developers did not anticipate. Similarly,
Anderson et al. (1991) found no statistically significant relationship
between learner-reported skills and the intended purposes of the test
questions. These findings illuminate the potential challenges in identi-
fying reading skills for cognitive diagnostic assessment. Due to space
limitation, I do not discuss this validity concern further here (see
Jang, 2005, for a full report).

2 Statistical models in CDA: Application of the Fusion Model


Recent advances in probability-based statistical modeling3 have
allowed us to make more complex inferences from data than stand-
ard test theories can (Mislevy, 1995). One of the new models, the

3 The earliest statistical model in CDA is Fischer’s Linear Logistic Test Model (LLTM, 1973),

which mainly uses item difficulty parameters by decomposing them into discrete skill-based diffi-
culties. This model presupposes that two test takers with the same ability have the same probabil-
ity of success on any item, which is unrealistic in practice. The well-known Rule Space Model
(Tatsuoka, 1983) is test-taker based, in that it classifies test takers into ability vectors of dichoto-
mous mastery and nonmastery. It uses a pattern-recognition approach based on the distance
between observed item response patterns and a set of ideal response patterns. It does not provide
the necessary statistics to evaluate the diagnostic property of the test items.

Fusion Model, serves two major purposes (DiBello et al., 1995;


Hartz & Roussos, 2002, 2005). First, it evaluates a learner’s compe-
tency on an array of cognitive skills; second, it can be used to evalu-
ate the diagnostic capacity of the test and items (Hartz, 2002; Hartz &
Roussos, 2005).
Like all Item Response Theory (IRT) models, the Fusion Model
defines the probability of observing a test taker’s response to an item
in terms of ability and item parameters. It uses item response prob-
abilities linked to a set of user-defined skills to determine the test
taker’s mastery on each of the skills. The relationship between the
test items and cognitive skills is expressed by entries of 1s and 0s in
a Q matrix (Tatsuoka, 1983). In creating a Q matrix, one needs to
consider the item-to-skill ratio in terms of the number of skills per
item and the congruence of a skill’s cognitive complexity with the
difficulty level of the associated item.
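
To make the item-by-skill coding concrete, a minimal sketch follows (Python, with an entirely hypothetical four-item, three-skill Q matrix; none of the values come from the LanguEdge study). It also illustrates the item-to-skill ratio checks mentioned above.

import numpy as np

# Hypothetical Q matrix: 4 items (rows) by 3 skills (columns).
# Q[i, k] = 1 means item i requires skill k; 0 means it does not.
Q = np.array([
    [1, 0, 0],   # item 1 requires only skill 1
    [1, 1, 0],   # item 2 requires skills 1 and 2
    [0, 1, 1],   # item 3 requires skills 2 and 3
    [0, 0, 1],   # item 4 requires only skill 3
])

skills_per_item = Q.sum(axis=1)   # how many skills each item requires
items_per_skill = Q.sum(axis=0)   # how many items measure each skill
print(skills_per_item)            # [1 2 2 1]
print(items_per_skill)            # [2 2 2]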
As to the nature of the interrelationships between skills and items,
the Fusion Model assumes the conjunctive interaction of skills
required to solve an item; that is, successful performance on the item
requires successful application of all the required skills. Thus, a high
probability of a correct response to an item depends on a high mas-
tery level for all the required skills for that item: a very strong
requirement, raising concerns about its congruence with theories of
the L2 reading process.
The Fusion Model item response function is as follows:

$$P(X_{ij} = 1 \mid \boldsymbol{\alpha}_j, \theta_j) \;=\; \pi_i^{*} \prod_{k=1}^{K} r_{ik}^{*\,(1-\alpha_{jk}) \times q_{ik}} \; P_{c_i}(\theta_j)$$

The Fusion Model includes two ability parameters, αj and θj,
where αj refers to a vector of skill mastery parameters and θj repre-
sents a residual ability parameter of potentially important skills
unspecified in the Q matrix. In addition to the ability parameters, the
Fusion Model has three item parameters, πi*, rik*, and ci:
πi*: the probability that a test taker, having mastered all the Q-matrix
required skills for item i, will correctly apply all of the skills when solv-
ing the item. It is related to item difficulty given the Q matrix. It ranges
from 0 to 1.
rik*: an indicator of the diagnostic capacity of item i for skill k.
The more strongly the item requires mastery of skill k, the lower rik*.
The closer rik* is to 0, the more item i discriminates a master of the
skill from a non-master of the skill. If the r parameters for a skill are
closer to 0, they indicate that a test is well designed for diagnosing

mastery on that skill. Therefore, the r parameters are crucial for evalu-
ating the diagnostic capacity of the test instrument.
ci: an indicator of the degree to which the item response function
relies on skills other than those assigned by the Q matrix. The lower
the ci, the more the item depends on θj. The ci can provide diagnos-
tic information about the completeness of the Q matrix.
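
The following sketch (again with hypothetical values, not the study's estimates) illustrates how the item response function combines these parameters. The residual term P_ci(θj) is written here as a Rasch-type logistic probability in θj + ci, which is one common formulation of the Fusion Model; see Hartz (2002) for the exact parameterization used by Arpeggio.

import numpy as np

def residual_term(theta, c):
    # Rasch-type residual ability term P_c(theta); with a large c
    # (e.g., c = 10) this term is close to 1, effectively removing
    # the residual ability from the model (see footnote 5).
    return 1.0 / (1.0 + np.exp(-1.7 * (theta + c)))

def fusion_prob(alpha, theta, pi_star, r_star, c, q):
    # P(X_ij = 1 | alpha_j, theta_j) for a single item.
    # alpha  : 0/1 vector of the test taker's skill mastery
    # pi_star: probability of applying all required skills when all are mastered
    # r_star : vector of r*_ik values, one per skill
    # c      : the item's c parameter
    # q      : 0/1 Q-matrix row for the item
    penalty = np.prod(r_star ** ((1 - alpha) * q))  # conjunctive skill penalty
    return pi_star * penalty * residual_term(theta, c)

q = np.array([1, 1, 0])               # item requires skills 1 and 2
r_star = np.array([0.5, 0.7, 1.0])    # low r* = strong diagnostic requirement
print(fusion_prob(np.array([1, 1, 0]), 0.0, 0.9, r_star, 2.0, q))  # all required skills mastered
print(fusion_prob(np.array([1, 0, 0]), 0.0, 0.9, r_star, 2.0, q))  # skill 2 not mastered

Because every non-mastered required skill multiplies the success probability by its rik* < 1, the sketch reflects the conjunctive assumption described above.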
A computer program, Arpeggio, is used to estimate the ability
and item parameters using the hierarchical Bayesian Markov
Chain Monte Carlo (MCMC) parameter estimation procedure. The
program simulates Markov chains of posterior distributions for
all the model parameters given the observed data (Patz & Junker,
1999). Recently MCMC has been used in many CDA applica-
tions. The most crucial issue in the successful implementation of
MCMC is to evaluate whether the chain has converged to the
desired posterior distribution and whether the model parameters
were reliably estimated. Despite various formal conver-
gence tests, no single approach yet yields reliable statistics for
checking convergence.
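
Although Arpeggio's internal diagnostics are not reproduced here, a rough sense of such convergence checks can be given with a small sketch: the lag-k autocorrelation of a single parameter's chain and a comparison of the means of the two halves of the post-burn-in draws (the simulated chain below is only a stand-in).

import numpy as np

def autocorr(chain, lag):
    # Sample autocorrelation at a given lag; values near 0 at moderate
    # lags suggest the chain is mixing well.
    x = chain - chain.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def half_chain_means(chain):
    # Crude stability check: a large gap between the means of the two
    # halves of the post-burn-in draws suggests non-convergence.
    half = len(chain) // 2
    return chain[:half].mean(), chain[half:].mean()

rng = np.random.default_rng(0)
draws = rng.normal(0.8, 0.05, size=30_000)  # stand-in for one parameter's chain
post = draws[13_000:]                       # discard burn-in, as in the study
print(autocorr(post, lag=50))
print(half_chain_means(post))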
If necessary, diagnostically non-informative Q matrix entries can
be removed from the initial Q matrix through an iterative Q-matrix
refining process. For example, a high r* value (> 0.9) requires care-
ful examination as to whether or not a skill is essentially important
for correctly answering a given item. However, any change in the Q
matrix requires not only statistical evidence but also a theoretically
justifiable rationale.
Given MCMC convergence, one can obtain the posterior prob-
ability of mastery (ppm) for each skill and for each test taker from the
posterior distribution. For example, if a test measures four skills, each
test taker’s skill profile will include four skill mastery probability
values, each indicating the mastery level for one of the four skills.
The estimates of the item parameters and the standard errors of the
item parameters need to be evaluated by examining how reliably
each item classifies the test takers into masters or non-masters on
each skill. If convergence has occurred and the standard errors are
relatively small, the estimates of the item parameters will provide
useful information about each item’s diagnostic capability on its
required skills.
However, the standard notion of reliability cannot be directly
applied to evaluate most CDA models, such as the Fusion Model,
because it usually assumes a continuous, unidimensional latent trait.
Research on CDA modeling needs to develop appropriate ways to
examine its reliability and statistical quality.

3 Profile reporting and use of diagnostic reports


Renewed attention to the impact of testing on curriculum, people,
and society has led the testing community to call for more descriptive
test information in order to improve instructional design and guide
students’ learning. Bailey (1999) rightly pointed out that the practice
of score reporting had not received proper attention before. Spolsky
(1990) suggested ‘profiles,’ which show multiple skills tested in
more than one way, as a more valuable reporting method. He further
argued that testers should be responsible for ensuring the inter-
pretability of test information as well as measurement accuracy.
Shohamy (1992) proposed a collaborative diagnostic feedback
model that allows for utilization of test results through a ‘detailed,
innovative, and diagnostic’ feedback system, providing useful infor-
mation for advancing teaching and learning (p. 515).
Alderson, Clapham, and Wall (1995) provided extensive practical
discussions about how to prepare instructionally useful score reports,
noting that commonly a simple letter or a number grade is used to
report a test result to a student. They highlighted the importance of
providing a profile score on a scale, stating that:
Individuals may reach the same overall score in a variety of different ways,
and thus be awarded a pass, although their profiles are different. This is one
of the main reasons why profiling of test results is considered by many to be
superior to reporting one overall result, whether it is a pass/fail result or a
score to be interpreted by test users. (p. 155)

One common reporting practice is to provide a test score based on


percentile ranks, but this approach provides little diagnostic information
other than a test taker’s position relative to the other test takers.
Another way to report test scores is proficiency scaling, using
standards; that is, deciding whether students perform at ‘proficient’
levels in terms of a cut-off score determined by content experts.
Though it uses input from teachers and content experts, this method
allows arbitrariness in setting standards or pass marks for profi-
ciency levels.
Another common method, the subscoring approach, usually rests
on proportion-correct scores from items coded to skills or subtests.
Though not requiring a sophisticated statistical approach, such sub-
scoring becomes problematic when a subscore per skill is based on a
small number of items or when a subtest involves more than one test
dimension.
The CDA approach, applied to a test with items intended to assess
multiple skills, provides diagnostic information about each test taker’s

mastery level by linking item response probabilities to user-specified


skills. For example, the College Board’s ‘Score Report Plus™’,
based on Tatsuoka’s (1983) Rule Space Model, provides diagnostic
feedback for students who take the Preliminary SAT (PSAT) or the
National Merit Scholarship Qualifying Test (NMSQT). It identifies
three weak skills per major content domain, without assigning any
scores to the reported skills.
Despite the necessity, few empirical studies have examined the
utility of diagnostic reports in the contexts of different theories of
learning, different pedagogical approaches, and different types of
learners, though all these may influence the effectiveness of diag-
nostic feedback (Kunnan & Jang, in press). Diagnostic feedback
needs to be appropriately descriptive and interpretable so as to help
learners improve their learning and teachers to improve their teach-
ing (Black & Wiliam, 1998).
In this regard, learners’ self-assessment is important for facilitat-
ing the use of diagnostic feedback. Self-assessment for diagnostic
purposes has been well researched through DIALANG (Alderson &
Huhta, 2005), a multilingual computerized assessment system, which
reports users’ test performance levels in relation to their self-assessed
linguistic competencies, along with advice and information to help
further language learning. DIALANG has demonstrated that inte-
grating self-assessment into diagnosis can facilitate students’ involve-
ment in the assessment process and enhance their metacognitive
abilities to evaluate their learning outcomes, monitor their own
progress and plan remedial actions (Alderson, 2005; Black & Wiliam,
1998).
The present paper (based on a larger study, Jang, 2005) focuses on
two key issues: the characteristics of reading skill profiles estimated
by the Fusion Model and the usefulness of the resulting diagnostic
information for teachers and students. Specifically, three questions
address the characteristics of the skill profiles, while a fourth evalu-
ates the usefulness of the diagnostic information:

(a) How dependably does the Fusion Model estimate L2 reading


skill profiles?
(b) What are the characteristics of the skill profiles estimated by the
Fusion Model?
(c) To what degree can the LanguEdge reading comprehension test
items provide diagnostically useful information?
(d) How is the diagnostic feedback evaluated and used by teachers
and students?

IV Methodology
1 Overview of the original study
In the main study (Jang, 2005), I used a mixed-methods research
design, comprising both quantitative and qualitative approaches over
three developmental phases, in order to make comprehensive valid-
ity arguments through dialectical triangulation of multiple sources of
empirical evidence (Greene & Caracelli, 1997; Tashakkori &
Teddlie, 2003).
In Phase 1, I identified nine primary reading comprehension skills
by analyzing think-aloud verbal protocols and performing statistical
test and item analyses. The verbal protocol participants included
seven ESL students from a TOEFL preparation course offered by an
Intensive Language Program and four graduate ESL students at a
mid-western university in the USA. Each participant verbalized
reading processes and strategies while completing 12 to 13 reading
comprehension questions per passage sampled from the LanguEdge
RC test forms. I recruited five raters to evaluate the reading skills
that I had identified from the think-aloud data and which I provided
to them on a list. I asked them to select the skills needed to correctly
answer each item and to rank them in order of importance. I also
asked them to list any necessary skills not included on the list. The
nine skills received a moderate level of agreement among the raters. These skills
are presented at the beginning of the data analysis section in this
paper. Full results from Phase 1 are not reported here due to space
limitations.
Using those nine skills, Phase 2 examined the characteristics of
skill profiles estimated by the Fusion Model. The Fusion Model was
fitted to the LanguEdge field test data to estimate skill mastery prob-
abilities for 2703 test takers. I evaluated the dependability of the
Fusion Model skills diagnosis process, the characteristics of the esti-
mated skill profiles, and the diagnostic capacity of the LanguEdge
RC test items. I also used 2,703 test takers’ self-assessments to
obtain background information and to examine the strength of the
relationship between the estimated skill profiles and the test takers’
self-reported reading abilities. I conducted in-depth analysis of skill
profiles from five cases to substantiate the diagnostic capacity of the
given test items.
In Phase 3, to evaluate the usefulness of the diagnostic information
in a classroom setting, the fitted model was applied to two TOEFL
preparation courses involving 28 ESL students. The students took
pre- and post-instruction diagnostic tests and, after each test, received

individualized diagnosis report cards (DiagnOsis I and II) that I had


developed at both junctures. Using interviews, classroom observations
and surveys, I evaluated the usefulness of the diagnostic feedback.

2 Participants
a LanguEdge field test takers: A total of 2,703 test takers took the
LanguEdge field tests at 32 domestic and international test sites
across 15 countries in 2002. According to the score interpretation
guide in the LanguEdge courseware, the test takers approximated
general TOEFL populations. The test takers consisted of 1,368 males
and 1,299 females. Their reasons for taking the tests varied: 1,662 test
takers planned to study in undergraduate or graduate degree pro-
grams in the USA or Canada. The remainder had various reasons,
such as: (a) entering a school other than college; (b) getting licensed
to practice a profession in the USA; and (c) demonstrating English
proficiency to their employers.

b Participants from two TOEFL preparation courses: I recruited


28 students and 3 teachers (including two female current teachers
and one former male teacher) to evaluate the diagnostic reports in
Phase 3. The students were from two TOEFL preparation summer
courses offered by the Intensive English Institute at a mid-western
university as elective courses to students whose proficiency was
Level 3 out of 5 or higher. The classes included 17 female and 11
male students, the majority from Asian countries such as Korea (11),
China (3), Japan (4), and Thailand (8). The students’ initial TOEFL
scores ranged from a low of 433 to a high of 607. The students evalu-
ated the diagnostic reports via questionnaires and interviews. I inter-
viewed the three teachers about the diagnostic reports.

3 Instruments
a The LanguEdge field tests: I used test takers’ response data from
two forms of the LanguEdge RC test for the Fusion Model skill pro-
filing in Phase 2. I used the same test instruments as pre- and post-
instruction diagnostic assessments in Phase 3. The LanguEdge
courseware was developed by Educational Testing Service to be used
as an instructional tool in the ESL classroom. Its expected benefits
included: (a) preparing students for communicative competence in
the real world; (b) assessing students’ progress in all language skills
and informing them of weaknesses needing improvement; and (c)
using subtests for practice and the assessment of progress.

The courseware included a prototype for Next Generation TOEFL


in two test forms (Forms 1 and 2). Each RC test form included three
different passages and three different types of items: traditional mul-
tiple-choice items; partial credit items requiring more than one cor-
rect choice; and constructed-response items (e.g., integrated writing
and speaking tasks). I used only the multiple-choice and partial
credit items in the study. The RC test is structured as three 25-minute
reading blocks, each a timed section of the test containing one stimu-
lus passage and a set of comprehension questions. The six passages
included in the two tests covered various topics from the humanities
to the natural sciences.
Raw score totals for the RC tests were 41 points for 37 items in
Form 1 and 42 points for 39 items in Form 2. The multiple-choice
items were worth 1 point each, while total points for the partial-credit
items varied. The two test forms were equated using a classical
equipercentile equating method. The equated test scores ranged from
a minimum of 1 to a maximum of 25. Table 1 presents descriptive
statistics for the two test forms and estimates of internal consistency.
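
For readers unfamiliar with classical equipercentile equating, the idea is to map each raw score on one form to the score on the other form that has the same percentile rank. A minimal, unsmoothed sketch with simulated raw scores (the sample sizes echo the study, but the data are fabricated placeholders) is shown below; operational equating adds smoothing and rounding steps omitted here.

import numpy as np

def equipercentile_equate(scores_x, scores_y, x_points):
    # For each raw score on Form X, find its percentile rank in the
    # Form X sample, then return the Form Y score at that percentile.
    pr = np.array([np.mean(scores_x <= x) * 100 for x in x_points])
    return np.percentile(scores_y, pr)

rng = np.random.default_rng(1)
form1 = rng.binomial(41, 0.35, size=1372)  # placeholder Form 1 raw scores
form2 = rng.binomial(42, 0.34, size=1331)  # placeholder Form 2 raw scores
raw_points = np.arange(0, 42)
print(equipercentile_equate(form1, form2, raw_points)[:5])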

b LanguEdge test takers’ self-assessment questionnaire: The


LanguEdge test takers completed a self-assessment questionnaire.
The self-assessment data (N = 2703) were used to examine the
strength of association between the model-estimated skill profiles
and the test takers’ self-assessed skill ratings. The questionnaire had
64 items (Cronbach’s α = .95) asking the test takers to self-rate their
language skills in reading, writing, and listening on two different
Likert-type scales. I selected five items associated with reading skills
rated on a degree scale ranging from 1 (not at all) to 5 (extremely
well): ‘I can understand vocabulary and grammar’; ‘I can understand
major ideas’; ‘I can understand how ideas relate to each other in
text’; ‘I can understand the relative importance of ideas’; and ‘I can
organize or outline important ideas.’

c Student questionnaires on diagnostic reports: I developed two


questionnaires to gather feedback from the 28 students in Phase 3.
I administered the first questionnaire to the students right after they

Table 1 Mean scores and reliability estimates of the two test forms

Form            M       SD      Cronbach’s α    SEM

1 (n = 1,372)   14.41   5.83    .89             1.96

2 (n = 1,331)   14.39   5.86    .87             2.14

received their diagnostic report cards (DiagnOsis I) from the pre-


instruction test. The questionnaire probed the students’ opinions
about the elements of the diagnostic report and the usefulness of the
diagnostic feedback. I administered the second questionnaire at the
end of the term. It asked whether the students had paid attention to
the diagnostic feedback. I compared the kinds of skills that the stu-
dents reported they had paid attention to since the start of the course
with their skill profiles from DiagnOsis I.

4 Data collection and analysis procedure


The nine primary reading comprehension skills identified in Phase 1
are shown in Table 2. Raters showed a moderate level of agreement
on the nine skills.
I calculated the difficulty levels of the nine skills by averaging the
item difficulty parameter estimates of the items for each skill from
BILOG (see Figure 1). Items assessing the summarizing skill (SUM)
were relatively easier than items assessing textual comprehension
skills although I expected them to be more difficult (see Jang, 2005,
for a more detailed description of the Q matrix construction and
evaluation process).
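
The skill-level difficulties plotted in Figure 1 can be obtained by averaging the item difficulty (b) estimates over the items coded to each skill in the Q matrix. A small sketch with invented values:

import numpy as np

# Hypothetical IRT b (difficulty) estimates for five items, and a Q matrix
# coding them to three skills (1 = the item is coded to the skill).
b = np.array([-1.2, 0.3, 0.8, -0.4, 1.1])
Q = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

# Mean difficulty of the items assigned to each skill.
skill_difficulty = (Q * b[:, None]).sum(axis=0) / Q.sum(axis=0)
print(skill_difficulty)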
In Phase 2, I used the Fusion Model program called Arpeggio
software to estimate the probabilities of mastery for the nine skills
for 2703 LanguEdge test takers. In Phase 3, the 28 students in two
TOEFL preparation courses took pre- and post-instruction diagnos-
tic tests using the LanguEdge RC test Forms 1 and 2, respectively.

Figure 1 Skill difficulty



Table 2 Descriptions of the primary processing skills

Skill Description Item

CDV Deduce the meaning of a word or a phrase by 1, 4, 11, 14, 32, 33


searching and analyzing text and by using
contextual clues appearing in the text.
CIV Determine word meaning out of context with 2, 7, 9, 10, 19, 21, 27, 29
recourse to background knowledge
SSL Comprehend relations between parts of text 1, 2, 3, 4, 12, 22, 24, 26,
through lexical and grammatical cohesion 33, 36
devices within and across successive
sentences without logical problems
TEI Read expeditiously across sentences within 8, 12, 13, 14, 17, 20, 22,
a paragraph for literal meaning of 24, 25, 30, 36, 37
portions of text.
TIM Read selectively a paragraph or across 4, 5, 6, 18, 26, 34, 35
paragraphs to recognize salient ideas
paraphrased based on implicit
information in text.
INF Skim through paragraphs and make 2, 7, 11, 15, 16, 17, 23,
propositional inferences about arguments 28, 31, 32
or a writer’s purpose with recourse to
implicitly stated information or prior
knowledge
NEG Read carefully or expeditiously to locate 5, 6, 7, 22, 28
relevant information in text and to determine
which information is true or
not true.
SUM Analyze and evaluate relative importance 5, 13, 17, 20, 25
of information in the text by
distinguishing major ideas from
supporting details.
MCF Recognize major contrasts and arguments 23, 30, 35, 37
in the text whose rhetorical structure
contains the relationships such as
compare/contrast, cause/effect or
alternative arguments and map them into
mental framework

The students and teachers received DiagnOsis report cards at the


beginning and end of the instructional term. I used the following ana-
lytic approaches to answer three questions related to the Fusion
Model’s skill profiling and one question concerning the usefulness of
the diagnostic information.

a The dependability of the Fusion Model’s estimation process of the


test takers’ skill profiles: Because it is often difficult to determine
when to conclude that MCMC convergence has occurred and that
a Markov chain has reached its steady state, the evaluation of chain
convergence was crucial for determining the stability of the parame-
ter estimation. Three different plots including a density plot, a time-
series plot, and an autocorrelation plot were inspected for
convergence of three different Markov chains of size 5000, 15,000,
and 30,000, including initial burn-ins of 1000, 5000, and 13,000,
respectively.
An initial Q matrix included nine average skill mastery probabil-
ity estimates (pk), and 67 rik*, 37 πi*, and 37 ci item parameters to
be estimated by the Arpeggio program. I carefully inspected the item
parameter estimates. Values of πi* below 0.6 indicate that items are
difficult even for masters of the associated skills. High rik* parameter
estimates (> 0.9) indicate a lack of diagnostic capacity for discrim-
inating the masters from the non-masters for skill k for item i. A large
number of high-parameter values from the initial run suggested
revisiting the Q matrix in order to improve the reliability of the estimation of
skill competency. Instead of dropping all the entries with high r* values
from the Q matrix, I examined various aspects such as item-by-skill
ratios, the importance of skills for a particular item, and interactions
of the r parameters with c parameters. I made minimal changes to
the original Q matrix and reran the Arpeggio program to estimate the
item parameters associated with this refined Q matrix. I evaluated the
degree of the model fit to data by comparing the score distribution pre-
dicted by the Fusion Model with the observed score distribution and
by correlating the model-predicted estimates with the observed ones.
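
The fit indices used in this step, the mean absolute difference (MAD) and root mean square error (RMSE) between observed and model-predicted item statistics, are straightforward to compute; the arrays below are placeholders rather than the study's data.

import numpy as np

def mad(observed, predicted):
    # Mean absolute difference between observed and predicted statistics.
    return np.mean(np.abs(observed - predicted))

def rmse(observed, predicted):
    # Root mean squared error between observed and predicted statistics.
    return np.sqrt(np.mean((observed - predicted) ** 2))

obs_p = np.array([0.62, 0.48, 0.81, 0.35])   # observed item proportion-correct
pred_p = np.array([0.60, 0.50, 0.80, 0.37])  # model-predicted values
print(mad(obs_p, pred_p), rmse(obs_p, pred_p))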

b Characteristics of the skill profiles estimated by the Fusion Model:


I created the skill profiles of 2703 LanguEdge test takers based on
the skill mastery probability estimates obtained from the Fusion
Modeling. Individual skill profiles were determined by applying two
cut-off points to the posterior probability of mastery (ppm) for each
skill, resulting in three categories of mastery: a master (ppm > 0.6),
a non-master (ppm < 0.4), and undetermined (0.4 ≤ ppm ≤ 0.6).
I then examined the reliability of the classification of mastery status.
Traditional reliability standards are not applicable to diagnostic
testing, because they assume a continuous, unidimensional latent
trait while the latter uses latent class models (DiBello et al., 2007).
Therefore, I estimated classification consistency rates as follows.
Parallel sets of simulated data from the calibrated model were

generated to calculate correct classification rates (the proportion of


times a test taker is correctly classified on a skill) and test-retest
consistency rates (the proportion of times a test taker is classified the
same on the two parallel tests).
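
In outline, the classification rule and the consistency check can be sketched as follows. The cut-offs (0.4 and 0.6) follow the study, but the simulated statuses and ppm values are toy data, and the agreement logic is deliberately simplified.

import numpy as np

def classify(ppm, lo=0.4, hi=0.6):
    # Map a posterior probability of mastery to one of three statuses.
    if ppm < lo:
        return "non-master"
    if ppm > hi:
        return "master"
    return "undetermined"

def correct_classification_rate(true_status, estimated_ppm):
    # Proportion of simulated test takers whose generating mastery
    # status is recovered by the classification rule.
    estimated = np.array([classify(p) for p in estimated_ppm])
    return np.mean(estimated == np.array(true_status))

truth = ["master", "non-master", "master", "non-master"]
ppms = [0.85, 0.20, 0.55, 0.10]
print([classify(p) for p in ppms])
print(correct_classification_rate(truth, ppms))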
I then correlated the resulting skill profiles with the test takers’
self-assessment of their reading skills. I calculated Pearson product-
moment correlations between the ppms for the nine skills and the
students’ self-assessed ratings on the five self-assessment question-
naire items in order to triangulate the skill profile results.

c The diagnostic capacity of the LanguEdge RC test items: To


determine the extent to which the LanguEdge RC test items were
diagnostically informative, I examined whether the masters of the
skills required for an item would perform better than the non-masters.
I used the IMstats (Item Mastery Statistics) program (Hartz &
Roussos, 2005) to compare proportion-correct scores between the
masters and the non-masters for each item.
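
The underlying comparison is simple: for each item, contrast the proportion-correct of test takers classified as masters of all the item's required skills with that of the remaining test takers. A schematic version (not the IMstats implementation) is given below with toy data.

import numpy as np

def master_nonmaster_gap(responses, mastery, q_row):
    # Mean-score difference on one item between masters and non-masters.
    # responses: 0/1 item responses for all test takers
    # mastery  : test-taker-by-skill matrix of 0/1 mastery classifications
    # q_row    : 0/1 Q-matrix row for the item
    required = q_row == 1
    is_master = mastery[:, required].all(axis=1)  # mastered every required skill
    return responses[is_master].mean() - responses[~is_master].mean()

# Toy data: six test takers, three skills, one item requiring skills 1 and 2.
resp = np.array([1, 1, 0, 1, 0, 0])
mast = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 0],
                 [1, 1, 0], [0, 0, 1], [1, 0, 0]])
print(master_nonmaster_gap(resp, mast, np.array([1, 1, 0])))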
I further examined how individual skill profiles were affected by
differential diagnostic capacity of the LanguEdge test items. I selected
five test takers, from those who took Form 1 of the LanguEdge test,
for case analyses. The five test takers were similar in terms of their
total test scores (ranging from 24 to 26) but differed in their skill pro-
files. I matched their background information from the questionnaire
data with their skill mastery profiles. The five cases were four female
students and one male, all from different countries. Three test takers
reported that they planned to study at undergraduate or graduate
programs in the USA or Canada while the other two test takers
took the test to demonstrate English proficiency to their employers.
I examined all their skill profiles in detail, in comparison with their
self-assessments.

d Usefulness of diagnostic reports: Arguing comprehensively for the
validity of the CDA practice necessitated engaging teachers and stu-
dents in evaluating the usefulness of the diagnostic feedback. Thus,
I created diagnostic report cards for the 28 students based on their
pre- and post-instruction diagnostic test results. Summary reports
were created for the teachers. They received the reports at the begin-
ning and end of the two-month instructional term. The 28 students
and two current teachers and one former teacher provided their opinions
about the usefulness of the diagnostic feedback. The partici-
pants’ feedback on the diagnostic reports was gathered using ques-
tionnaires, interviews and classroom observations.

V Results
1 Dependability of the Fusion Model skill profiling process
a Evaluation of MCMC convergence: I examined three different
Markov chain lengths of 5000, 15,000, and 30,000 to determine chain
convergence by inspecting a density plot, a time-series plot, and an
autocorrelation plot for each. The plots from the 30,000-chain length
indicated that convergence had occurred, because there was very lit-
tle change in the posterior distribution after the first 1000 steps.
Figure 2 shows the plots from chain length 30,000 for r*23_3 for Item
23, which refers to the r* parameter estimate of Skill 3 (SSL) for Item
23. I present the plots from this item because it showed relatively high
posterior standard deviations and jumpy chain convergence.
This result suggested that a relatively long chain (e.g., 30,000) is
necessary for reliable estimation of a large number of parameters in
the case of complex diagnosis modeling. Subsequently, I used the
chain length of 30,000 with the 13,000 burn-in to estimate skill pro-
files throughout the study.
b Evaluation of the Fusion Model parameter estimates: An initial
run of the Arpeggio program resulted in more than 16 rik* parameters
that were too high (> 0.9) and had relatively large standard
errors. However, dropping the 16 r*’s from the Q matrix would dras-
tically alter the item-by-skill specifications represented in the Q
matrix and thereby might make the Q matrix theoretically less justi-
fiable. It appeared that the c parameters, which account for skills

Figure 2 Density, time-series, and autocorrelation plots for r*23_3 from the chain
length 30,000

unspecified by the Q matrix, tended to account for much of the vari-
ance in the item responses. In order to have the Q matrix entries such
as rik*’s have more influence on estimating skill profiles, I decided to
keep the influence of the ci parameters minimal by keeping them con-
stant rather than by removing high rik*’s from the Q matrix.5
I anticipated that more variance would be accounted for by the Q-
matrix-specified skills. This decision resulted in decreases in 38 rik*
values. That is, the rik*’s came to have more influence on the item
response function.
The final Q matrix included 60 rik* and 37 πi* parameters after
removing six rik*’ s from the initial Q matrix. The values of the item
parameters for Form 1 estimated from the final model are provided
in Table 3 along with the item statistics estimated through the appli-
cation of the IRT 3-PL logistic model.

c Evaluation of the model fit to data: I evaluated the degree of


model fit to data by comparing the model-estimated score distribu-
tion with the observed one as shown in Figure 3.
While the predicted score distribution approximated the observed
score distribution, the misfits at the lowest and highest parts of the
distribution were noticeable. The plot clearly showed that the model-
estimated score distribution underestimated the lowest-scoring test
takers and overestimated the highest-scoring test takers.
I further evaluated the degree of the model fit by comparing
observed and predicted statistics. As shown in Table 4, the Mean
Absolute Difference (MAD) between observed and predicted item
proportion-correct scores was 0.002, and the MAD for the correla-
tions was 0.049. The Root Mean Square Error (RMSE) for the item
proportion-correct scores was 0.002, and the RMSE for the correla-
tions was 0.058. Relatively small discrepancies of item difficulty and
correlations between the observed and the model-predicted values
supported the claim of a good fit.
Another intuitive approach to evaluating the model fit to data was
to examine whether there is a monotonic relationship between the
ppms and the observed total test scores. Figure 4 shows the scatter
plot of the number of mastered skills against the total observed

5 Roussos, DiBello, Henson, Jang, & Templin (in press) explain that this problem is associated with

a test having a single dominant dimension. In such a case, a continuous residual parameter ‘soaks
up’ most of the variance in the item responses. The researchers note that a good solution to this
problem is to drop the ci parameters by setting them equal to 10; this reduced version of the Fusion
Model has been often used to analyze real data.
Table 3 Fusion Model item parameter estimates for the final Q matrix (Form 1)

Item πi* CDVa CIV SSL TEI TIM INF NEG SUM MCF ab b

1 0.92 0.59 2.69


2 0.83 0.64 0.76 0.64 0.23
3 0.91 0.80 0.76 0.59 1.08
4 0.63 0.60 0.51 0.48 1.09 0.82
5 0.99 0.82 0.83 0.61 1.46 0.44
6 0.84 0.36 1.18 0.27
7 0.67 0.73 0.79 0.55 0.99 0.88
8 0.61 0.39 1.19 0.84
9 0.87 0.34 0.98 0.22
10 0.86 0.48 0.68 0.49
11 0.56 0.76 0.70 0.60 1.32
12 0.78 0.54 0.69 0.81 0.25
13 0.73 0.46 0.49 0.24
14 0.93 0.60 0.78 0.88 0.57
15 0.88 0.55 0.79 0.29
16 0.87 0.50 1.15 0.14
17 0.96 0.56 0.6 1.44 0.32
18 0.95 0.53 1.02 0.62
19 0.93 0.63 0.72 1.18
20 0.96 0.70 0.71 1.01 0.67
21 0.97 0.75 0.71 1.79
22 0.74 0.8 0.65 0.55 1.12 0.69
23 0.79 0.46 0.71 0.96 0.39
24 0.87 0.84 0.59 0.74 0.33
25 0.73 0.72 0.58 0.52 0.11
26 0.92 0.4 0.59 1.18 0.11
27 0.94 0.5 1.03 0.56


28 0.66 0.65 0.73 0.85 1.12
29 0.77 0.43 0.76 0.04
30 0.96 0.53 1.45 0.53
31 0.94 0.52 0.92 0.72
32 0.92 0.65 0.83 0.68 0.89
33 0.93 0.52 0.84 0.89 0.74
34 0.91 0.48 1.01 0.09
35 0.46 0.7 0.54 1.09 1.49
36 0.91 0.61 0.66 1.07 0.18
37 0.77 0.13 0.97 0.50
aNumeric values reported under the skills refer to rik* parameter estimates.
bThe IRT-3PL model was applied to the data to estimate discrimination (a) and difficulty (b) parameters. Black cells indicate the entry
was removed from the initial Q matrix.

Figure 3 Comparison of the observed and predicted score distributions (cumulative probability by total score; observed vs. model-estimated)
Note: The predicted score distribution is based on the application of the Fusion Model after controlling for ci parameters. When ci parameters were retained, misfits occurred throughout the distribution.

Table 4 Comparison of the observed and predicted statistics

Item proportion–correct scores Correlation coefficients

RMSE MAD RMSE MAD


0.002 0.002 0.058 0.049

Note: RMSE: root mean squared error, MAD: mean absolute difference.

scores for 1,372 test takers in Form 1. Each petal on the scatter plot
represents a test taker. The bigger and darker the sunflower is, the
more test takers there are.
The correlation between the model-estimated probabilities of mas-
tery for the nine skills and the total test scores was 0.94, supporting the
the expected positive relationship. Nevertheless, the plot also clearly
pointed to wider distributions for the low- and high-scoring test takers,
which was consistent with the misfits presented in Figure 3.

2 Characteristics of the skill profiles estimated by the Fusion Model


a Evaluation of classification of test takers’ skill mastery: An indi-
vidual test taker’s skill profile was determined by classifying the
posterior probability of mastery for each skill into the three categories
of mastery status: a master (ppm > 0.6), a non-master (ppm < 0.4),
and undetermined (0.4 ≤ ppm ≤ 0.6). Table 5 shows the classification
results.

Figure 4 Scatter plot of the number of mastered skills against the total observed scores (N = 1372)

Table 5 Proportions of masters for nine skills for Form 1 (N = 1372)

Skill No. of masters No. of non-masters Undetermined

CDV 766 427 179


CIV 742 514 116
SSL 678 535 159
TEI 688 588 96
TIM 618 634 120
INF 627 603 142
NEG 528 644 200
SUM 790 437 145
MCF 589 703 80

The skills SUM, CDV and CIV had relatively large numbers of
masters, while the NEG and MCF skills had fewer. In particular, the
SUM skill showed the largest number of masters. The proportion of
test takers determined as masters for each skill was relatively con-
sistent with the observed skill difficulty levels (see Figure 1).
I examined the consistency rates for classifying test takers into the
three categories of mastery status in order to evaluate the reliability of
the classification of mastery status for each skill. As shown in Table 6,
the correct classification rates for both test forms were high,
supporting the reliability of the test taker classification. Form 1

Table 6 Classification consistency rates (CCR) for forms 1 and 2

Form 1 Form 2

Skill Overall CCR CCR for M CCR for NM Overall CCR CCR for M CCR for NM

CDV .87 .88 .85 .92 .94 .88


CIV .93 .94 .92 .84 .86 .80
SSL .89 .89 .89 .89 .90 .88
TEI .94 .94 .94 .88 .88 .87
TIM .91 .90 .91 .95 .96 .94
INF .91 .91 .91 .91 .91 .90
NEG .84 .83 .84 .81 .83 .78
SUM .90 .90 .88 .89 .90 .87
MCF .95 .94 .95 .85 .87 .83
M .90 .90 .90 .88 .90 .86

Note: M indicates masters, and NM indicates non-masters.

exhibited a slightly higher overall classification rate than Form 2: an
average rate of 90% in comparison with 88%. The NEG
skill showed the lowest correct classification rate among all the skills
for both forms, probably due to an insufficient number of items
assigned to this skill.

b Association between the model-estimated ppm’s and test takers’


self-assessment: All five questionnaire items were statistically
significantly and positively correlated with the model-estimated skill
mastery probabilities at α = 0.01, as shown in Table 7.
The test takers’ self-assessed ratings of their reading skills were
statistically significantly and positively correlated with the model-
estimated probabilities of mastery for all of the nine skills. The corre-
lation coefficients ranged from weak (0.23) to moderate (0.43). Note
that the inter-skill correlations were much higher, ranging from 0.70
to 0.89.

3 Diagnostic capacity of the LanguEdge RC test items


a Discrimination of masters from non-masters: I examined the
diagnostic capacity of the LanguEdge RC tests by comparing the
proportion-correct scores of the masters and the non-masters for
each item. The results for Forms 1 and 2 are presented in Figure 5.
The mean score differences between the masters and the non-masters
are quite large for most of the items, supporting strong diagnostic
capabilities of the test items. Average mean score differences between
the masters and the non-masters were 0.49 for Form 1 and 0.41 for Form

Table 7 Correlations between ppms and student self-assessed ratings on reading


skills

CDV CIV SSL TEI TIM INF NEG SUM MCF SSRs

CDV – .82 .86 .81 .82 .83 .85 .84 .70 .39
CIV – .80 .78 .78 .79 .83 .79 .65 .43
SSL – .84 .86 .85 .88 .84 .74 .34
TEI – .82 .82 .87 .83 .72 .32
TIM – .85 .89 .81 .75 .35
INF – .88 .82 .74 .34
NEG – .85 .79 .36
SUM – .72 .35
MCF – .23
SSR –

Note: All correlations are significant at the 0.01 level.


The students’ self-ratings (SSR) are based on the average responses to the follow-
ing five questionnaire items:
I can understand vocabulary and grammar;
I can understand major ideas;
I can understand how ideas relate to each other in text;
I can understand the relative importance of ideas;
I can organize or outline important ideas.

2. Nevertheless, the figures also clearly pointed to some problematic


items that did not distinguish the masters from the non-masters. There
were eight items in Form 1 and 12 items in Form 2 whose mean score
difference between masters and non-masters was less than 0.4. An in-depth
analysis of these items’ statistical characteristics suggested that they tended
to be either extremely difficult or extremely easy.

b Case analysis: To examine whether and how individual skill pro-
files are affected by the differential diagnostic capacity of some of
the LanguEdge test items, I selected five test takers, with test scores
ranging from 24 to 26, for case analyses. Table 8 summarizes their
linguistic backgrounds, gender, reasons for taking the test, their self-
ratings of their reading skills and their skill profiles, as estimated by
the Fusion Model.
Although all five had similar test scores, their skill mastery pro-
files were strikingly different. For example, Case 1’s skill profile
showed that she did not master any skill. Her self-assessed ratings
were very low as well. Case 2 was similar to Case 1, except that he
mastered one skill. Cases 3, 4, and 5 rated their reading skills rela-
tively higher than the first two cases. Case 3 mastered only two skills
despite his high self-assessed ratings. Cases 4 and 5’s skill profiles
showed mastery of six to eight skills.

Figure 5 Comparison of performance differences (proportion correct) between item masters and non-masters, Forms 1 and 2

I examined Cases 1 and 5 because they had the most different skill
profiles. Case 5’s skill profile showed that she mastered all skills except
CIV. When I inspected her actual responses to the CIV-associated items,
she failed to respond correctly to all of the seven items measuring CIV
skills; this partially supports her estimated skill mastery profile. Case 1,
despite her ‘flat’ (zero-mastered) skill profile, achieved proportion-correct
subscores ranging from 0.50 to 0.60 on the items assessing her non-
mastered skills. For example, even though she correctly answered four
out of eight items assessing CIV, her mastery probability of the CIV
skill was only 0.1, considerably lower than her observed per-
formance on those items. This may be related to the presence of items

Table 8 Summary of the five cases’ skill profiles

Cases Reason Model-estimated Self-assessment of


skill profiles reading

1 (Female, To study abroad None ‘Reading is not as


Venezuela) (undergraduate) good as other skills.’
‘I have
some difficulty
taking courses
taught in English
due to problems
with reading.’
2 (Male, To study abroad 1 skill (SUM) ‘Reading is not as
Indonesia) (graduate) good as other skills.’
‘I am very good at
understanding graphs
and charts in academic
text.’
3 (Female, To study abroad 2 skills ‘I have no difficulty with
Colombia) (undergraduate) (CDV & SUM) reading in English.’
‘Reading is my best
skill.’
4 (Female, To demonstrate 6 skills except ‘Reading is my best skill.’
Vietnam) English proficiency NEG & INF ‘My weak areas are
to company understanding charts
and graphs in
academic text and how
to relate different ideas
to each other.’
5 (Female, To demonstrate 8 skills except CIV ‘Reading is my best skill.’
Lebanon) English proficiency ‘I am good at
to company vocabulary and
understanding relative
importance of ideas.’

that lack diagnostic capacity. Such discrepancies between test scores


and skill mastery profiles could cause confusion for test takers when
they receive total test scores as part of their skill profile reports.
4 Teachers’ and students’ evaluation of diagnostic feedback
a Skills diagnosis report cards, DiagnOsis: I estimated 28 students’
skill profiles based on their responses to Form 1 of the LanguEdge RC
test. I prepared and distributed two report cards, DiagnOsis I and II
(see Appendix A for the sample report card), to the students and sum-
mary reports to the two teachers. Table 9 presents the posterior prob-
abilities of mastery (ppm’s) along with their total test scores for the 15
students in Course A (Course B is omitted due to space limitations).

Table 9 Students’ skill mastery probabilities from the pre-instruction diagnostic test

Student Score Probabilities of mastery for nine skills

CDV CIV SSL TEI TIM INF NEG SUM MCF

Karen 21 0.25 0.00 0.15 0.26 0.17 0.26 0.13 0.90a 0.78
Heather 32 0.99 0.95 0.99 1.00 0.99 1.00 0.89 1.00 0.77
Hail 28 0.95 0.92 0.97 0.88 0.24 0.13 0.51b 0.76 0.05
Yoshi 21 0.16 0.02 0.09 0.09 0.21 0.01 0.10 0.42 0.51
Siree 31 0.96 0.99 0.99 0.99 0.96 0.91 0.73 1.00 1.00
Dongin 32 0.98 0.99 0.94 0.98 1.00 0.99 0.95 0.97 1.00
Chris 20 0.85 0.95 0.17 0.00 0.01 0.36 0.18 0.37 0.00
Gao 30 0.98 0.78 0.90 0.89 0.87 0.86 0.77 0.99 0.99
Kyung 37 1.00 0.99 1.00 1.00 1.00 1.00 0.93 1.00 1.00
Hyung 18 0.06 0.09 0.01 0.00 0.24 0.08 0.12 0.05 0.00
Take 19 0.24 0.00 0.23 0.50 0.01 0.01 0.12 0.78 0.81
Ohmi 16 0.04 0.05 0.02 0.02 0.15 0.00 0.03 0.18 0.00
Gkyung 10 0.01 0.01 0.01 0.00 0.13 0.01 0.03 0.00 0.00
Lee 31 0.96 0.84 0.96 1.00 0.83 0.98 0.96 1.00 0.99
Shu 22 0.10 0.00 0.54 0.66 0.01 0.14 0.22 0.27 0.87
a Bold entries indicate mastered skills.
b Italicized entries indicate that skill mastery is not ‘determined’.

The mean scores for all the students were 25 and 21 in Courses A
and B respectively. Since the students were placed into the courses
on the basis of their placement test results, I expected that the stu-
dents in Course A would perform better on the test than those in
Course B. Further, the Course A students’ skill profiles showed that
they had mastered 68 skills (ppm > 0.6), whereas the Course B stu-
dents’ profiles showed only 27 mastered skills.
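To make this tally concrete, the following minimal sketch classifies each skill from its ppm. The illustrative data are the first two rows of Table 9, and the 0.6/0.4 cut-offs for mastery and non-mastery are assumptions about the classification rule rather than values reported by the model.

```python
import numpy as np

# Posterior probabilities of mastery (ppm's) for two students from Table 9;
# columns follow the nine skills CDV, CIV, SSL, TEI, TIM, INF, NEG, SUM, MCF.
ppm = np.array([
    [0.25, 0.00, 0.15, 0.26, 0.17, 0.26, 0.13, 0.90, 0.78],  # Karen
    [0.99, 0.95, 0.99, 1.00, 0.99, 1.00, 0.89, 1.00, 0.77],  # Heather
])

MASTERY_CUT = 0.6       # assumed cut-off: ppm above this counts as 'mastered'
NON_MASTERY_CUT = 0.4   # assumed cut-off: ppm below this counts as 'non-mastered'

mastered = int((ppm > MASTERY_CUT).sum())
non_mastered = int((ppm < NON_MASTERY_CUT).sum())
not_determined = ppm.size - mastered - non_mastered

print(f"mastered: {mastered}, not determined: {not_determined}, non-mastered: {non_mastered}")
```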

b Students’ feedback on the diagnostic feedback before the instruction: The open-ended student questionnaire indicated that all 28
students found the bar graph of skill mastery probabilities on the report
card most useful. Approximately 40% of the students responded that
the reported skill profiles accurately reflected their reading skills. The
questionnaire also assessed the students’ affective reactions to the
report card. Approximately 28% of the students expressed disappoint-
ment about their skill profiles; they were ‘embarrassed’, ‘disappointed’,
or felt ‘terrible’ about their profiles.
Yet, these students expressed their desire to study harder to
improve weak skills and requested more guidance on how to improve
them. A follow-up interview with a high-performing male student
drew my attention. He asked me, ‘My report says that I have mastered
all the skills. Does it mean that I don’t have to study these skills any
more? But why didn’t I get a perfect score then? What does it mean
to be a master of a certain skill?’ This raised concerns about the impli-
cations of skills diagnosis for future action.
When asked whether the test items measured skills associated
with them (see the second page of DiagnOsis, Appendix A), 18% of
the students agreed that the test items assessed the associated skills.
The rest of the students expressed various views, for example:
Student 1: I think the more questions I have, the more I can be convinced to
know about my reading proficiency. But we don’t have enough questions.
Student 2: I think these questions assess the skills well, but I also think those
skills can’t be divided accurately because most questions need combined
skills anyway.
Student 3: Actually I don’t know how much these questions assess those skills
correctly. If I could understand the whole passage well, it won’t matter.

c Students’ skill profiles after the instruction: I prepared DiagnOsis II, the post-instruction report card, on the basis of the students’ per-
formance on Form 2 of the LanguEdge RC test. In addition to pro-
viding the same information as DiagnOsis I, DiagnOsis II focused on
the progress that the students had made. Table 10 summarizes the
changes in the posterior probability of mastery for each skill and for
each student in Course A.
Out of the 117 entries in Table 10, 53 (45%) showed positive
changes in the ppm’s, 32 (27%) no changes and 30 (26%) negative
changes. In addition, 12 (10%) entries showed a change in the mas-
tery status, from non-mastery to mastery. The students achieved the
largest improvement on the vocabulary skills (CDV and CIV).
The changes in the ppm’s from before to after the instruction fell
into three distinct skill mastery trajectories, as shown in Figure 6. In
the first pattern, high-performing students’ skill profiles exhibited
stability over time. The second pattern highlighted a significant
improvement of skill mastery over time. The third pattern, unstable
skill mastery, was typical of low-proficient students. Their trajectory
appeared unstable partly because their responses to items may be
inconsistent and less predictable.

Table 10 Differences in the ppm’s between the tests

Student Pretest Posttest Difference in ppm’s between post-test and pre-test
CDV CIV SSL TEI TIM INF NEG SUM MCF
Karen 21 23 0.7 0.2 0.5 0 0 0 0 0.6 0.3
Hail 28 24 0.1 0.1 0.1 0 0 0.2 0.1 0.1 0.6
Yoshi 21 22 0.7 0.4 0.7 0 0.5 0 0.4 0.3 0.3
Siree 31 25 0.6 0.3 0.5 0.3 0.5 0.1 0.3 0.2 0.6
Dongin 32 29 0.1 0 0.2 0 0 0 0.1 0 0.1
Chris 20 23 0.5 0.4 0.1 0.1 0 0.5 0.1 0.2 0.3
Kyung 37 36 0 0.1 0 0 0 0 0 0 0
Hyung 18 27 0.9 0.7 0.9 0.2 0.2 0.9 0.6 0.9 0.5
Take 19 22 0.6 0.3 0.2 0.1 0 0.8 0.4 0.4 0.6
Ohmi 16 26 0.9 0.7 0.9 0.9 0.4 0.9 0.8 0.8 0.6
Gkyung 10 21 0 0 0.1 0.5 0.1 0.4 0.3 0.8 0.1
Lee 31 31 0 0.1 0 0 0.2 0 0.1 0 0
Shu 22 19 0.1 0.1 0.4 0.1 0 0.3 0.1 0.5 0.8
Mean 23.5 25.2 2.7 1.6 1.6 1.4 0.3 3.7 2.3 2.4 0.6

Note: Positive values indicate an increase in the posterior probabilities of mastery for the skills after the post-test.

d Students’ feedback on the post-instruction diagnostic feedback: The
post-instruction questionnaire asked the students how well the skill
profiles reflected their reading ability. Approximately 39% of the
students responded ‘quite a lot’ or ‘absolutely,’ while about 50%
responded ‘a little bit.’ Approximately 46% of the students reported
that the diagnostic feedback was ‘a little bit’ useful for improving
reading skills, and about 32% considered it ‘quite a lot’ or ‘always’
useful. Three students reported that they did not consider the feed-
back at all. Two of these three students did not master any skills, but
the third mastered all of them, according to their report cards.

Figure 6 Three skill mastery patterns over time (pre-test and post-test probabilities of mastery across the nine skills, CDV–MCF, for three students: Lee – stable skill profile; Ohmi – improved skill profile; Karen – unstable skill profile)

e Teachers’ feedback on diagnostic feedback: Interviews with the
two current teachers and one former teacher indicated that the teach-
ers found the diagnostic feedback useful for raising students’ aware-
ness of their strengths and weaknesses in reading skills and for
guiding their teaching.

Teacher 1 (male, former): Breaking down reading this way is a good diag-
nostic procedure for the student as well as the teacher. The key is to help the
student gain meta-cognitive awareness of the various categories of reading,
so that he or she can understand the feedback and try to improve.
Teacher 2 (female, current): I think the scoring report helped me as a teacher
to understand what the weaknesses were for my students. After examining
the scoring report, I gave extra assignments to my students to help them do
more exercises on their weak skills. Students’ test scores at the end of the
semester showed some improvement.
However, the teachers also raised some important issues concerning
the use of diagnostic feedback. The former teacher, Teacher 1, pointed
out that the use of diagnosis may depend on the context of learning:
Teacher 1: We need to consider differences that lie between EAP (English for
Academic Purpose) courses and test preparation courses that we are talking
about now. In the test preparation courses, there might be more ‘teaching to
the test’ than in an EAP class. Such difference could be an important variable
for evaluating the use of diagnostic feedback.
In addition, Teacher 3 also expressed her own concern about a mis-
match between the skills diagnostic approach and her own peda-
gogical beliefs:
Teacher 3 (female, current): Knowing my students’ strengths and weaknesses
was very useful even though most of them needed to improve almost all skills
after all. But I don’t teach reading separately. I try to encourage students to
study listening, reading, and structure simultaneously. So, I don’t teach the
reading skills included in this scoring report. I can’t teach all these skills in
my class. I don’t have enough time and it’s just not how I teach my students.
In sum, the interviews with the teachers indicated that the usefulness
of diagnostic feedback depends on the degree to which it is compat-
ible with the teachers’ pedagogical approaches and the extent to
which the curriculum addresses the content of the feedback.

VI Discussion
Question 1: How dependably does the Fusion Model estimate L2
reading skill profiles?
Due to the complex CDA modeling, which involves estimating a
large number of item parameters, evidence of the MCMC conver-
gence is essential for evaluating the quality of diagnosis. Overall, the
study results provided multiple lines of positive evidence substanti-
ating the claims about the statistical quality of the Fusion Model.
Visual representations of the posterior distribution estimated by
Arpeggio clearly indicated that convergence had occurred, and that
the parameters from the posterior distribution were reasonably reli-
able. Nevertheless, the convergence was very slow, thus requiring a
very long chain. In general, if the convergence does not occur, one
needs to revise the Q matrix so that the MCMC can converge to a
single difficulty level for each skill.
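Where the posterior draws are available as plain numbers, such visual checks can be supplemented with standard numerical diagnostics. The sketch below is a generic split-chain Gelman–Rubin statistic written with NumPy; it is illustrative only and is not part of the Arpeggio output.

```python
import numpy as np

def split_rhat(draws: np.ndarray) -> float:
    """Split-chain Gelman-Rubin R-hat for a single parameter's MCMC draws.

    Values close to 1.0 are conventionally read as evidence that the chain
    has mixed well; values clearly above 1.0 suggest non-convergence.
    """
    half = len(draws) // 2
    chains = np.stack([draws[:half], draws[half:2 * half]])  # two half-chains
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)            # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * w + b / n          # pooled variance estimate
    return float(np.sqrt(var_hat / w))

# Illustrative use on simulated draws for one item parameter (post burn-in):
rng = np.random.default_rng(0)
draws = rng.normal(0.8, 0.05, size=20_000)
print(split_rhat(draws))  # ~1.00 for a well-mixed chain
```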
The initial outcomes of the Fusion Modeling suggested eliminat-
ing approximately 26% of the rik parameter estimates, namely those with
values higher than 0.9. In such a case, the entries of the Q matrix
would need to be adjusted. However, the elimination of 26% of the
Q matrix entries solely on the basis of the statistical results was not
desirable, because it might result in serious changes in the item-by-
skill relationships reflected in the Q matrix. As an alternative, I kept
ci parameter estimates constant, because they appeared to account
for a large amount of variance in the item response. This decision led
to an elimination of only 10% of the rik parameters instead of 26%.
If ci parameters in the Fusion Model have too much influence on the
item response function, one needs to take care in interpreting the
Fusion Model’s item parameter estimates before modifying a Q
matrix on the basis of the statistical estimates.
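The screening rule sketched above can be expressed in a few lines. The example below uses invented rik* values (the real estimates come from the Fusion Model output) and simply flags Q-matrix entries whose estimates exceed 0.9.

```python
import numpy as np

# Hypothetical r*ik estimates: rows = items, columns = skills.
# np.nan marks item-skill pairs that are not in the Q matrix.
r_star = np.array([
    [0.45, np.nan, 0.92],
    [np.nan, 0.88, 0.30],
    [0.95, 0.60, np.nan],
])

CUT = 0.9  # r*ik near 1 means lacking skill k barely hurts performance on item i

entries = np.argwhere(np.isfinite(r_star))                 # valid Q-matrix entries
flagged = [(i, k) for i, k in entries if r_star[i, k] > CUT]

for i, k in flagged:
    print(f"candidate for removal: item {i + 1}, skill {k + 1} (r*ik = {r_star[i, k]:.2f})")
print(f"{len(flagged) / len(entries):.0%} of Q-matrix entries flagged")
```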
To examine the degree of model fit, I compared the observed sta-
tistics with the model-predicted statistics simulated from the MCMC
posterior distribution. The MAD and RMSE values between the pre-
dicted and observed item difficulty and correlation were rela-
tively small, supporting the claim of a good fit. I compared the
observed score distribution with the model-predicted score distribu-
tion. The plot revealed a misfit at the low and high ends of the dis-
tribution because of underestimation of the scores of the low-scoring
test takers and overestimation of those of the high-scorers. A similar
picture emerged from the plot that correlated the observed total test
scores with ppm’s. The score distributions were wider for both the
low- and high-scoring test takers.
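As an illustration of this kind of absolute-fit check, the sketch below computes the MAD, RMSE, and correlation between observed and model-predicted item statistics. The numbers are invented; in the study the predicted statistics came from the MCMC posterior predictive simulation.

```python
import numpy as np

def fit_statistics(observed: np.ndarray, predicted: np.ndarray) -> dict:
    """Absolute-fit summaries between observed and model-predicted item statistics."""
    diff = observed - predicted
    return {
        "MAD": float(np.mean(np.abs(diff))),         # mean absolute difference
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),  # root mean squared error
        "r": float(np.corrcoef(observed, predicted)[0, 1]),
    }

# Invented proportion-correct difficulties for a handful of items:
observed_p = np.array([0.81, 0.55, 0.43, 0.72, 0.36])
predicted_p = np.array([0.78, 0.58, 0.40, 0.70, 0.41])
print(fit_statistics(observed_p, predicted_p))
```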
Note that the observed-score distribution of a typical norm-referenced
test approximates a normal Gaussian distribution, whereas cognitive
diagnosis models such as the Fusion Model have a bimodal distribu-
tion (e.g., mastery, non-mastery). Overall, this misfit may not have a
significant effect on the classification of mastery and non-mastery
for most of the test takers. However, considering that the goal of
CDA is to provide fine-grained diagnostic feedback for individual
learners, even a slight misfit should not be overlooked because the
individual skill profiles for the low- and high-scoring test takers may
not be as accurate as those for the others and may confuse those
learners with inaccurate diagnostic feedback.

Question 2: What are the characteristics of the skill profiles estimated by the Fusion Model?
The proportions of test takers determined as masters (see Table 5)
were congruent with the difficulty levels of the nine skills as shown in
Figure 1. The SUM skill turned out to have the largest number of mas-
ters, which is consistent with the statistical analysis of the prose sum-
mary items. Cohen and Upton (2007) likewise reported that the
‘Reading to Learn’ items (i.e., prose summary items) were not neces-
sarily more difficult than the basic comprehension questions (i.e., sen-
tence simplification and factual information items). They stated that:
One of the ‘Reading to Learn’ formats intends to measure the extent to which
L2 readers can complete a ‘prose summary’ through questions which are
referred to as ‘multiple-selection multiple-choice responses’. It entails the
dragging and dropping of the best descriptive statements about a text. One
can argue whether or not this is truly a summarization task as no writing is
called for, and even the set of possible ‘main points’ is provided for the
respondents so that they only need to select those which they are to drag into
a box (whether astutely or by guessing) – they do not need to find main
statements in the text nor generate them. (pp. 239–240)

On the one hand, the converging evidence from Cohen and
Upton’s study results and the Fusion Model’s estimates of mastery
for the SUM skill partially supports the dependability of the Model’s
estimates. On the other hand, that the prose summary items may not
necessarily measure the summarizing skill may point to potentially
misleading diagnostic information for test takers classified as either
a master or a non-master of that particular skill. This issue can be
resolved by a carefully designed diagnostic assessment that system-
atically assesses the necessary cognitive skills by incorporating vari-
ous formats beyond the multiple-choice one.
Another potential problem that emerged from the Fusion
Modeling is that the cognitive skills were greatly constrained by the
task types. When the test is not developed with diagnostic purposes
in mind, the test often fails to include sufficient numbers of items for
assessing all necessary skills. For example, while approximately 21%
of the test items are vocabulary items for Form 1 of the LanguEdge
RC test, the test does not have sufficient numbers of items that elicit
skills such as inferring authors’ intention, comprehending negations,
or summarizing the main ideas. The NEG skill showed the lowest
classification accuracy rate among the nine skills, 84% for Form 1
and 81% for Form 2 (see Table 6). Therefore, adequate specification
of cognitive skills becomes quite a challenging task when the CDA is
applied to existing non-diagnostic tests.

The model-estimated probabilities of mastery for the nine skills
were moderately and positively correlated with the test takers’ self-
assessments of their own reading skills (p < 0.01). The micro-analysis
of the five cases also indicated that the test takers’ self-assessed ratings
were generally congruent with the estimated skill profiles, except
for Case 3. Both results suggest that self-assessment can be used
for diagnostic purposes in combination with statistical diagnostic
feedback.

Question 3: What is the diagnostic capacity of the LanguEdge RC test items?
The mean score differences in the proportion-correct scores between
masters and non-masters for each item were sufficiently large for
most of the items. However, the results also revealed some problem-
atic items, which failed to differentiate the masters from the non-
masters. Approximately 22% and 32% of the items for Forms 1 and
2 respectively had mean score differences between the masters and
non-masters of less than 0.4.
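This contrast statistic is easy to reproduce once examinees have been classified as masters or non-masters of the skill an item is assumed to measure. The sketch below uses made-up responses; the 0.4 flagging rule mirrors the criterion used above, and the third item illustrates how an extremely easy item shows no contrast.

```python
import numpy as np

def item_diagnostic_contrast(responses: np.ndarray, is_master: np.ndarray) -> np.ndarray:
    """Difference in proportion correct between estimated masters and non-masters.

    responses: (n_examinees, n_items) matrix scored 0/1.
    is_master: (n_examinees,) boolean mastery classification for the relevant skill.
    """
    p_master = responses[is_master].mean(axis=0)
    p_nonmaster = responses[~is_master].mean(axis=0)
    return p_master - p_nonmaster

# Illustrative data: 6 examinees, 3 items; the first three examinees are masters.
resp = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 1],
])
masters = np.array([True, True, True, False, False, False])

contrast = item_diagnostic_contrast(resp, masters)
print(contrast)                     # the last item shows no contrast: everyone answers it correctly
print(np.where(contrast < 0.4)[0])  # items flagged as weakly diagnostic under the 0.4 rule
```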
This latter result could indicate a weak association between the
items and skills, which requires a revision of the Q matrix. However,
further analysis of these items indicated that they tended to be either
extremely easy or extremely hard. The LanguEdge assessment
courseware used in this study includes a prototype of the New
TOEFL. The TOEFL is a norm-referenced test with the purpose of
placing test takers on a continuous scale; it thus requires a wide
range of item difficulty. The psychometric properties of items in
norm-referenced testing may not be relevant to a diagnostic test
where task/item difficulty is supposed to rest on the instructional
coverage and cognitive complexity that tasks require. As shown in
the five case analyses, test takers’ skill profiles can be influenced by
their responses to test items which do not have the capacity to dif-
ferentiate masters from nonmasters. The possible presence of non-
diagnostic test items can raise serious questions about the validity of
any inferences about the test takers’ skill competencies.
When applying CDA to a non-diagnostic test, one needs to scrutinize
carefully the diagnostic capacity of such test items. One way of
examining the degree to which test items are informative about skill
competency is to use a weighting system based on item parameter val-
ues, because these reflect the diagnostic capacity of the test items. In
the case of the Fusion Model, since an item’s diagnostic power is
reflected in the πi* and rik* values, the Diagnostic Information Index (DII)
can be calculated as (1 – rik*)πi* for each Q-matrix entry. For
example, the DiagnOsis report in Appendix A lists a subset of items
for each skill in order of highest to lowest DII values. Reporting the
weights can be useful when test takers review their performance on the
test items. It would be particularly useful for test takers like Case 1,
from the case analysis, who shows a large discrepancy between total
test score and skill profiles.
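Because the DII is a simple function of the two item parameters, the item ordering shown on the report card can be reproduced directly from the estimates. The sketch below uses invented πi* and rik* values for four items linked to a single skill.

```python
import numpy as np

# Hypothetical Fusion Model item-parameter estimates for items measuring one skill:
# pi_star[i] -> probability of a correct answer for an examinee who has mastered
#               the skills the item requires
# r_star[i]  -> penalty for lacking this particular skill (closer to 0 = more diagnostic)
pi_star = np.array([0.90, 0.75, 0.85, 0.60])
r_star = np.array([0.20, 0.85, 0.40, 0.95])

# Diagnostic Information Index for each item-by-skill entry of the Q matrix
dii = (1 - r_star) * pi_star

# Items listed from most to least informative, as on the DiagnOsis report card
order = np.argsort(dii)[::-1]
for rank, i in enumerate(order, start=1):
    print(f"{rank}. item {i + 1}: DII = {dii[i]:.2f}")
```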

Question 4: How is the diagnostic feedback evaluated and used by teachers and students?
The study’s purpose was not to examine whether the diagnostic feed-
back improved learning, but rather to gather teachers’ and students’
views on the diagnostic feedback when it was made available to
them. Overall, the students welcomed the diagnostic feedback. The
majority found it very useful for understanding their weaknesses and
strengths in reading skills. Students with weak skill profiles
expressed emotional frustration and asked for more information for
remediation. About 39% of the students reported that they had used
the diagnostic information ‘quite a lot’ while 50% of the students
reported ‘a little bit.’ All three teachers agreed that the diagnostic
feedback would be useful for raising students’ awareness of their
strengths and weaknesses in reading skills and guiding their teaching.
The students and teachers did raise some concerns. One student’s
question about the meaning of a ‘master’ requires some thought
about definition, particularly because classifying someone as a ‘mas-
ter’ or ‘non-master’ of a skill implies some kind of future action.
Students who receive a diagnosis of weak skills are expected to show
interest in how to improve them and to desire specific guides to that
end. Positive diagnostic results can frustrate students because the
actions they should take are less apparent. As one student said, stu-
dents may interpret being labeled a ‘master’ of a certain skill as
requiring no further action. We need to be very clear that providing
diagnostic information does not complete the act of diagnosis but
calls for future action related to learning goals.
Although the students praised its usefulness, the diagnostic feed-
back’s effects on learning remain uncertain. Thus the study has
insufficient evidence to claim that the diagnostic feedback directly
improved students’ learning, especially because it was conducted
before the new TOEFL was launched. In-depth case studies or con-
trolled experimental research may illuminate whether and how effect-
ively diagnostic feedback can be used to improve students’ learning.

The teachers’ perspectives on skills diagnosis were rather ambigu-
ous. While they appreciated the diagnostic feedback, the teachers
noted that the usefulness of skills diagnosis depends on the teacher’s
pedagogical approach, the purpose of learning and the context of
learning. They implied that feedback may be more beneficial in pro-
ficiency- or achievement-testing situations, as long as there are low
stakes attached to the tests, and the content of diagnosis reflects the
curricular detail. Collins (1990) states this concern clearly:
One notion afoot is that because we can diagnose the precise errors students
are making, we can then teach directly to counter these errors. Such diagnosis
might indeed be useful in a system where diagnosis and remediation are tightly
coupled…. But if diagnosis becomes an objective in nationwide tests, then it
would drive education to the lower-order skills for which we can do the kind
of fine diagnosis possible for arithmetic. Such an outcome would be truly
disastrous. It is precisely the kinds of skills for which we can do fine
diagnosis, that are becoming obsolete in the computational world of today.
(pp. 76–77)
Assessment that only serves accountability systems makes teachers
teach to the test and makes students pay attention only to the tested
subjects. Given the ever-increasing demands for high-stakes, large-
scale testing, such warnings are timely.

VII Conclusion
The study investigated the validity of the CDA Fusion Modeling
approach to an existing large-scale reading comprehension test. This
paper focused on the characteristics and the dependability of the skill
profiles estimated by the Fusion Model and on the usefulness of this
diagnostic feedback in ESL classrooms.
The study results suggest that the CDA approach can provide more
fine-grained diagnostic information about the level of competency in
reading skills than the traditional scoring approach can. The estimates
of skill mastery profiles were reasonably reliable on various meas-
ures. The model-estimated statistics approximated the observed test
statistics, evidencing a good model fit to data. On the other hand, the
results also point to some concerns. When the CDA model is retro-
fitted to a test developed for non-diagnostic purposes, it may lose
some diagnostic capacity largely due to test items with extreme dif-
ficulty levels. These items do not necessarily contribute much to
evaluating test takers’ skill competencies. A model misfit with
observed data from the high- and low-scoring test takers seems to be
associated with the presence of such items. This finding needs to be
further researched in order to ensure accurate evaluation of skill
competency, especially considering that accurate diagnosis would be
most beneficial for low-scoring learners.
This study was limited to examining the characteristics of skill pro-
files using the Fusion Model; it did not examine how these skill pro-
files might have differed from those based on other statistical CDA
models. CDA statistical models have different assumptions about
how cognitive skills or combinations of the skills would influence
students’ test performance. Further research can investigate which
CDA model best reflects the theories of second language acquisition
in different learning contexts. This study was also limited to examin-
ing the application of the CDA approach to non-diagnostic tests.
Considering current CDA applications to existing non-diagnostic
tests that serve the accountability system under the No Child Left
Behind act in the USA, the results offer some critical information
about the potential challenges to the validity of such CDA applica-
tions. They also illuminate conditions for future CDA applications to
diagnostic assessment.
The study highlights the importance of developing a diagnostic
test rather than retrofitting the CDA model to a non-diagnostic test.
Indeed, developing and validating a diagnostic test is the most-
needed area of future CDA research. This future research could also
examine ways in which computer technologies can be integrated into
a diagnostic assessment system. Advances in computer interfacing
could allow teachers to design and administer a diagnostic test in
their classrooms and provide learners with immediate diagnostic
feedback (Jang, 2008). In this way, CDA can achieve its fullest util-
ity for a welcoming educator clientele.

VIII References
Afflerbach, P., & Johnston, P. (1984). Research methodology: On the use of
verbal reports in reading research. Journal of Reading Behavior, 16(4),
307–322.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University
Press.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: the inter-
face between learning and assessment. London: Continuum.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language testing construc-
tion and evaluation. Cambridge: Cambridge University Press.
Alderson, J. C., & Huhta, A. (2005). The development of a suite of computer-
based diagnostic tests based on the Common European Framework.
Language Testing, 22, 301–320.
Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive
levels as embodied in test questions. Reading in a Foreign Language,
5(2), 253–270.
Anderson, N. J., Bachman, L., Perkins, K., & Cohen, A. (1991). An
exploratory study into the construct validity of a reading comprehension
test: Triangulation of data sources. Language Testing, 8, 41–66.
Bachman, L. F., Davidson, F., & Milanovic, M. (1996). The use of test method
characteristics in the content analysis and design of EFL proficiency tests.
Language Testing, 13(2), 125–150.
Bailey, K. M. (1999). Washback in language testing. (TOEFL Monograph
Series Report. No. 15). Princeton, NJ: Educational Testing Service.
Berkoff, N. A. (1979). Reading skills in extended discourse in English as a
Foreign Language. Journal of Research in Reading, 2(2), 95–107.
Black, P. J., & Wiliam, D. (1998). Assessment and classroom learning.
Assessment in Education, 5, 7–74.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing.
Cambridge: Cambridge University Press.
Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to
language testing: Examining attributes of a free response listening test.
Language Testing, 15(2), 119–157.
Carver, R. P. (1992). What do standardized tests of reading comprehension
measure in terms of efficiency, accuracy, and rate? Reading Research
Quarterly, 27(4), 347–359.
Cohen, A. D. (1987). Studying language learning strategies: How do we get
the information? In A. L. Wenden & J. Rubin (Eds.), Learner strategies
in language learning (pp. 31–40). Englewood Cliffs, NJ: Prentice-Hall.
Cohen, A. D., & Hosenfeld, C. (1981). Some uses of mentalistic data in
second-language research. Language Learning, 31(2), 285–313.
Cohen, A. D., & Upton, T. A. (2007). ‘I want to go back to the text’: Response
strategies on the reading subtest of the new TOEFL. Language Testing,
24, 209–50.
Collins, A. (1990). Reformulating testing to measure learning and thinking.
In N. Frederiksen, R. Glaser, A. Lesgold, & M. Shafto (Eds.), Diagnostic
monitoring of skill and knowledge acquisition (pp. 75–88). Hillsdale,
NJ: Lawrence Erlbaum.
Cordon, L. A., & Day, J. D. (1996). Strategy use on standardized reading
comprehension tests. Journal of Educational Psychology, 88(2),
288–295.
Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing
and using language test specifications. New Haven and London: Yale
University Press.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psy-
chometric diagnostic assessment likelihood-based classification tech-
niques. In N. Frederiksen, R. Glaser, A. Lesgold, & M. Shafto (Eds.),
Diagnostic monitoring of skill and knowledge acquisition (pp. 361–390).
Hillsdale, NJ: Lawrence Erlbaum.
DiBello, L. V., Roussos, L. A., & Stout, W. (2007). Review of cognitive diag-
nostic assessment and a summary of psychometric models. In C. R. Rao
& S. Sinharay (Eds.), Handbook of statistics, Vol. 26: Psychometrics
(pp. 45–79). Elsevier Science B.V.: The Netherlands.
Embretson, S. (1991). A multidimensional latent trait model for measuring
learning and change. Psychometrika, 37, 359–374.
Embretson, S. (1998). A cognitive design system approach to generating valid
tests: Application to abstract reasoning. Psychological Methods, 3(3),
380–396.
Faerch, C., & Kasper, G. (Eds.). (1987). Introspection in second language
research. Philadelphia, PA: Multilingual Matters.
Farr, R., Pritchard, R., & Smitten, B. (1990). A description of what happens
when an examinee takes a multiple-choice reading comprehension test.
Journal of Educational Measurement, 27(3), 209–226.
Fischer, G. H. (1973). The linear logistic test model as an instrument in edu-
cational research. Acta Psychologica, 37, 359–374.
Frederiksen, N., Glaser, R., Lesgold, A., & Shafto, M. (Eds.). (1990).
Diagnostic monitoring of skill and knowledge acquisition. Hillsdale,
NJ: Lawrence Erlbaum.
Gorin, J. S. (2007). Test construction and diagnostic testing. In J. P. Leighton &
M. J. Gierl (Eds.), Cognitive diagnostic assessment for education:
Theory and applications (pp. 173–201). New York: Cambridge University
Press.
Grabe, W. (1991). Current developments in second language reading research.
TESOL Quarterly, 25(3), 375–406.
Grabe, W. (Ed.). (2000). Developments in reading research and their implica-
tions for computer-adaptive tests of reading. Cambridge: Cambridge
University Press.
Greene, J. C., & Caracelli, V. J. (1997). Defining and describing the paradigm
issue in mixed method evaluation. In J. C. Greene & V. J. Caracelli
(Eds.), Advances in mixed methods evaluation: The challenges and bene-
fits of integrating diverse paradigms. New directions for evaluation No.
74 (pp. 5–17). San Francisco, CA: Jossey-Bass.
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing
cognitive abilities: Blending theory with practicality. Unpublished doc-
toral dissertation, University of Illinois at Urbana-Champaign.
Hartz, S. M., & Roussos, L. A. (2005). The Fusion Model for skills diagnosis:
Blending theory with practice. ETS Research Report, Educational Testing
Service, Princeton, NJ.
Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on
teaching and learning in the context of NG TOEFL. Unpublished doc-
toral dissertation, University of Illinois at Urbana-Champaign.
Jang, E. E. (2008). A framework for cognitive diagnostic assessment. In
C. A. Chapelle, Y.-R. Chung, & J. Xu (Eds.), Towards adaptive CALL:
Natural Language Processing for Diagnostic Language Assessment
(pp. 117–131). Ames, IA: Iowa State University.
Kasai, M. (1997). Application of the rule space model to the reading compre-
hension section of the test of English as a foreign language (TOEFL).
Unpublished doctoral dissertation. University of Illinois, Urbana
Champaign.
Kunnan, A. J., & Jang, E. E. (forthcoming). Diagnostic feedback in language
assessment. In M. Long & C. Doughty (Eds.), Handbook of second and
foreign language teaching. Malden, MA: Wiley-Blackwell.
Li, W. (1992). What is a test testing? An investigation of the agreement
between students’ test-taking processes and test constructors’ presump-
tion. Unpublished MA thesis. Lancaster University.
Lumley, T. (1993). The notion of subskills in reading comprehension tests: An
EAP example. Language Testing, 10(3), 211–234.
Mislevy, R. J. (1995). Probability-based inference in cognitive diagnosis.
In P. Nichols, S. Chipman, & R. L. Brennan (Eds.), Cognitively diagnos-
tic assessment (pp. 43–71). Hillsdale, NJ: Lawrence Erlbaum.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of
educational assessments. Measurement: Interdisciplinary Research and
Perspectives, 1, 3–67.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2004). A brief introduction to
Evidence-Centered Design. CSE Technical Report 632, The National
Center for Research on Evaluation, Standards, and Student Testing
(CRESST), Center for the Study of Evaluation (CSE). LA, CA: University
of California, Los Angeles.
Nevo, N. (1989). Test-taking strategies on a multiple-choice test of reading
comprehension. Language Testing, 6, 199–215.
Nichols, P. D. (1994). A framework for developing cognitively diagnostic
assessments. Review of Educational Research, 64(4), 575–603.
Nichols, P. D., Chipman, S. F., & Brennan, R. L. (Eds.). (1995). Cognitively
diagnostic assessment. Hillsdale, NJ: Lawrence Erlbaum.
Patz, R. J. & Junker, B. W. (1999). A straightforward approach to Markov
Chain Monte Carlo methods for item response models. Journal of
Educational and Behavioral Statistics, 24(2), 146–178.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students
know: The science and design of educational assessment. Washington, DC:
National Academy Press.
Rost, D. (1993). Assessing the different components of reading comprehen-
sion: Fact or fiction? Language Testing, 10(1), 79–82.
Roussos, L., DiBello, L., Henson, R., Jang, E. E., & Templin, J. (in press).
Skills diagnosis for education and psychology with IRT-based parametric
latent class models. In S. E. Embretson & J. Roberts (Eds.), New direc-
tions in psychological measurement with model-based approaches.
Washington, DC: American Psychological Association.
Sheehan, K. M. (1997). A tree-based approach to proficiency scaling and diag-
nostic assessment. Journal of Educational Measurement, 34(4), 333–352.
Shohamy, E. (1992). Beyond performance testing: A diagnostic feedback test-
ing model for assessing foreign language learning. Modern Language
Journal, 76(4), 513–521.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology
for educational measurement. In R. L. Linn (Ed.), Educational measure-
ment (3rd ed., pp. 263–332). New York: Macmillan.
Spolsky, B. (1990). Social aspects of individual assessment. In J. de Jong &
D. K. Stevenson (Eds.), Individualizing the assessment of language abil-
ities (pp. 3–15). Avon: Multilingual Matters.
Tashakkori, A., & Teddlie, C. (Eds.). (2003). Handbook of mixed methods in
social and behavioral research. Thousand Oaks, CA: Sage Publications.
Tatsuoka, K. (1983). Rule space: An approach for dealing with misconcep-
tions based on item response theory. Journal of Educational
Measurement, 20(4), 345–354.
Tatsuoka, K. (1990). Toward an integration of Item-Response Theory and cog-
nitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold &
M. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisi-
tion (pp. 453–488). Hillsdale, NJ: Lawrence Erlbaum.
Weir, C. J., & Porter, D. (1996). The multi-divisible or unitary nature of read-
ing: The language tester between Scylla and Charybdis. Reading in a
Foreign Language, 10, 1–19.
Weir, C. J., Huizhong, Y., & Yan, J. (2000). An empirical investigation of the
componentiality of L2 reading in English for academic purposes.
Cambridge: Cambridge University Press.
Appendix A: A Sample Skill Profile Report (Part I)

DiagnOsis scoring report
Student Name: Margo – LanguEdge Reading Comprehension Test 1

Review Your Answers
[Question-by-question grid for questions 1–37, showing the student’s answer, the correct answer, and each item’s difficulty (e = easy, m = medium, h = hard; difficulty is based on 1372 students’ performance on this test).]

Scoring key: √ = correct; o = omitted; + = partial points. A correct answer to a question with four choices earns 1 point; a wrong or omitted answer earns no points. Q13 & Q25: 3 correct = 2 points, 2 correct = 1 point. Q37: 5 correct = 3 points, 4 correct = 2 points, 3 correct = 1 point.

Score: You earned 20 out of a maximum of 41 points – 10 points from 12 easy questions, 7 points from 17 medium questions, and 3 points from 8 hard questions. You omitted 1 question.

Improve Your Skills
[Bar graph of the probable mastery standing (probability 0–1) for each of the nine reading skills, with regions labelled ‘Needs improvement’, ‘Not determined’, and ‘Mastered’.]

How to interpret skill mastery:
• Nine primary reading skills are assessed in this reading comprehension test. Please review the skill descriptions and example questions attached to this scoring report.
• The graph on the left side shows your probable mastery standing for each skill.
• The grey region indicates that your probable mastery standing cannot be determined.
• There may be some measurement error associated with the classification.
• This diagnostic information can be more useful when used in combination with your teacher’s and your own evaluation of your reading skills.
Appendix A: A Sample Skill Profile Report (Part II)

DiagnOsis scoring report: Primary skill descriptions and example questions

Skill 1 (example questions 33, 14, 32, 4, 3, 11): Understand the meanings of words or phrases by searching and analyzing surrounding text and using contextual clues appearing in the text. With this skill, you can determine which option word or phrase has the closest meaning to the author’s intended meaning by making use of clues appearing in the text.

Skill 2 (example questions 9, 27, 10, 29, 19, 21, 7): Determine word meanings by identifying which option word or phrase has the closest meaning to the referenced word or phrase in the text. Textual clues often do not appear explicitly in the text. With this skill, you can comprehend the text using your prior knowledge of vocabulary.

Skill 3 (example questions 3, 26, 12, 36, 4, 2, 22, 33, 24): Comprehend grammatical relationships of words or phrases across successive sentences. With this skill, you can identify words or phrases that particular pronouns refer to OR determine where a new sentence can be inserted without logical or grammatical problems in the text.

Skill 4 (example questions 22, 18, 30, 17, 8, 24, 36, 20, 12, 25, 14): Search across sentences within a paragraph and locate relevant information that is clearly stated in the text. Words or phrases in the options often match literally with words or phrases appearing in the relevant section of the text.

Skill 5 (example questions 6, 34, 26, 4, 5, 35): Search across paragraphs and locate relevant information that is not clearly stated in the text. Words or phrases in the options do not match literally with those in the text, but they are paraphrased in different words or phrases. With this skill, you can determine which option most accurately preserves the author’s intended meaning.

Skill 6 (example questions 31, 16, 23, 15, 28, 2, 11, 7, 32; marked ‘?’ on this sample report): Comprehend an argument or an idea that is implied but not explicitly stated in the text OR the author’s rhetorical purpose of mentioning particular phrases in the text. With this skill, you can infer information that is implied from the text or can determine the author’s underlying purpose of using particular expressions.

Skill 7 (example questions 22, 7, 28, 5): Search and locate relevant information across the text and determine what information is true or not true. With this skill, you can verify which option is true or false based on information stated in the passage or identify key information from negatively stated options (or questions).

Skill 8 (example questions 13, 5, 17, 25, 20): Identify major ideas by distinguishing them from nonessential information across paragraphs. With this skill, you can distinguish major ideas from ideas that are not mentioned or less important in the text.

Skill 9 (example questions 37, 23, 35): Recognize major contrasts and arguments presented across paragraphs. With this skill, you can comprehend the organization of the text (which often contains relationships such as compare/contrast, cause/effect, or alternative arguments) and can determine major contrasting ideas or arguments.

• Not all example questions are equally informative in assessing related skills. Questions are listed in order from most informative to least informative for your review.
• The ‘needs improvement’ marker indicates that a skill is a weak area you need to improve; ‘?’ indicates that your mastery of the skill is not determined.
