Alderson (2005) suggests that diagnostic tests should identify strengths and
weaknesses in learners’ use of language and focus on specific elements
rather than global abilities. However, rating scales used in performance
assessment have been repeatedly criticized for being imprecise and therefore
often resulting in holistic marking by raters (Weigle, 2002). The aim of this
study is to compare two rating scales for writing in an EAP context: one ‘a
priori’ developed scale with less specific descriptors of the kind commonly
used in proficiency tests and one empirically developed scale with detailed
level descriptors. The validation process involved 10 trained raters applying
both sets of descriptors to the rating of 100 writing scripts drawn from a
large-scale diagnostic assessment administered to both native and non-native
speakers of English at a large university. A quantitative comparison of rater
behaviour was undertaken using FACETS. Questionnaires and interviews
were administered to elicit the raters’ perceptions of the efficacy of the two
types of scales. The results indicate that rater reliability was substantially
higher and that raters were able to better distinguish between different
aspects of writing when the more detailed descriptors were used. Rater
feedback also showed a preference for the more detailed scale. The findings
are discussed in terms of their implications for rater training and rating scale
development.
Alderson (2005) argues that diagnostic tests are often confused with
placement or proficiency tests. He lists several specific features
which distinguish diagnostic tests from other types of tests. Among
these, he writes that diagnostic tests should be designed to identify
strengths and weaknesses in the learner’s knowledge and use of lan-
guage and that diagnostic tests usually focus on specific rather than
global abilities.
I Rating scales
Several classifications of rating scales have been proposed in the
literature. The most commonly cited categorization is that of holi-
stic and analytic scales (Hamp-Lyons, 1991; Weigle, 2002). Weigle
summarizes the differences between these two scales in terms of six
qualities of test usefulness (p. 121), showing that analytic scales are
generally accepted to result in higher reliability, have higher con-
struct validity for second language writers, but are time-consuming
to construct and therefore expensive. Because analytic scales meas-
ure writing on several different aspects, better diagnostic informa-
tion can be expected.
Another possible classification of rating scales relates to the way
the scales are constructed. Fulcher (2003) distinguishes between
two main approaches to scale development: intuitive and
empirical methods. Intuitively developed scales are based on
existing scales or on what scale developers think might be
common features at various levels of proficiency. Typical examples of
these scales are the FSI family of scales. In recent years, a number
of researchers have proposed that scales should be developed based
III Method
1 Context of the research
DELNA (Diagnostic English Language Needs Assessment) is a
university-funded procedure designed to identify the English lan-
guage needs of undergraduate students following their admission to
the University of Auckland, so that the most appropriate language
support can be offered (Elder, 2003; Elder & Von Randow, 2008).
The assessment includes a screening component which is made up
of a speed-reading and a vocabulary task. This is used to quickly
identify highly proficient users of English and exempt them from
the time-consuming and resource-intensive diagnostic procedure.
2 Instruments
a The rating scales: The DELNA rating scale: The existing
DELNA rating scale is an analytic rating scale with nine traits
(Organization, Coherence, Style, Data description, Interpretation,
Development of ideas, Sentence structure, Grammatical accuracy,
Vocabulary & Spelling) each consisting of six band levels ranging
from four to nine. The scale reflects common practice in language
testing in that the descriptors are graded using adjectives like ‘ade-
quate’, ‘sufficient’ or ‘severe’. In some trait scales, different features
of writing are conflated into one category (e.g. the vocabulary and
spelling scale). An abridged version of the DELNA scale can be
found in Appendix 1.
The new scale: The new scale was developed based on an
analysis of 600 DELNA writing samples. The scripts were ana-
lyzed using discourse analytic measures in the following categories
(for a more detailed description of each measure refer to Knoch,
2007a):
• accuracy (percentage error-free t-units);
• fluency (number of self-corrections as measured by cross-outs);
• complexity (number of words from Academic Wordlist);
• style (number of hedging devices – see Knoch (2008));
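As an illustration of how counts of this kind can be operationalized, the sketch below shows simplified versions of two of the measures listed above (words from the Academic Word List and hedging devices). It is not the analysis used in the study; the word lists and the tokenizer are deliberately minimal placeholders.

```python
# Illustrative sketch only: simplified counts of two of the discourse
# measures listed above. The AWL_WORDS and HEDGES sets are tiny
# placeholder samples, not the full lists used in the study.
import re

AWL_WORDS = {"analyse", "concept", "data", "environment", "factor"}  # sample AWL items
HEDGES = {"may", "might", "perhaps", "possibly", "appears", "seems"}  # sample hedging devices

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def awl_count(text: str) -> int:
    """Number of tokens that appear on the (sample) Academic Word List."""
    return sum(1 for token in tokenize(text) if token in AWL_WORDS)

def hedge_count(text: str) -> int:
    """Number of (sample) hedging devices in the script."""
    return sum(1 for token in tokenize(text) if token in HEDGES)

script = "The data may suggest that environmental factors are possibly relevant."
print(awl_count(script), hedge_count(script))
```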
Table 1 Comparison of traits and band levels in existing and new scale
3 Participants
Ten DELNA raters were drawn from a larger pool of raters based
on their availability at the time of the study. All raters have several
years of experience as DELNA raters and take part in regular training
and moderation sessions either face-to-face or online (Elder et al.,
2007; Knoch et al., 2007).
4 Procedures
a Rater training: The rater training sessions for both the existing
rating scale and the new scale were conducted very similarly. In
each case, the raters met in plenary for a face-to-face session. In both
cases they rated 12 scripts as a group. The raters discussed their own
ratings and then compared these to benchmark ratings.
IV Results
Research question 1: Do the individual trait scales on the two
rating scales differ in terms of (a) the discrimination between
candidates, (b) rater spread and agreement, (c) variability in the
ratings and (d) what the different traits measure?
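These aspects were examined through a many-facet Rasch (FACETS) analysis. For reference, the formulation below is the standard rating-scale version of the many-facet Rasch model implemented in FACETS (Linacre, 2006), not a restatement of the author's own specification. The probability of candidate n receiving category k rather than k-1 from rater i on trait j is modelled as

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_i - D_j - F_k,$$

where $B_n$ is the candidate's ability, $C_i$ the rater's severity, $D_j$ the difficulty of the trait scale, and $F_k$ the difficulty of the step from category k-1 to k.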
[Figure: FACETS vertical rulers plotting candidate ability, rater severity and trait (item) difficulty on a common logit scale (Measr), together with the band-level structure of each trait scale (columns S.1–S.9 for the existing scale and S.1–S.10 for the new scale).]
Table 2 Rating scale statistics for entire existing and new rating scales
[Figure: scree plots of eigenvalues by factor number for the principal factor analyses of the existing scale (factors 1–9) and the new scale (factors 1–10).]
however, the results were different. The PFA resulted in six compo-
nents with eigenvalues over 0.7.
The next step in the PFA was to identify which variables load
onto which component. For this, a rotation of the data was necessary.
However, because only one component was identified for the exist-
ing scale, no factor loadings can be displayed. A varimax rotation
was chosen to facilitate the interpretation of the factors of the new
scale. A trait was considered to be loading on a factor if the loading
was higher than .4 (as indicated in bold font). The six factor loadings
for the new scale can be seen in Table 5.
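By way of illustration (a sketch under assumed data structures, not the author's actual analysis scripts), a principal factor analysis with varimax rotation and the .4 loading threshold could be run as follows, assuming the ten trait ratings for the new scale are held in a pandas DataFrame and that the file name below is hypothetical:

```python
# Illustrative sketch: principal-axis factoring with varimax rotation,
# assuming `ratings` holds one column per trait of the new scale.
import pandas as pd
from factor_analyzer import FactorAnalyzer

ratings = pd.read_csv("new_scale_ratings.csv")  # hypothetical file name

# Eigenvalues, as used for the scree plot / factor retention decision
fa_unrotated = FactorAnalyzer(rotation=None, method="principal")
fa_unrotated.fit(ratings)
eigenvalues, _ = fa_unrotated.get_eigenvalues()
print("Eigenvalues:", eigenvalues.round(2))

# Six-factor solution with varimax rotation
fa = FactorAnalyzer(n_factors=6, rotation="varimax", method="principal")
fa.fit(ratings)

loadings = pd.DataFrame(fa.loadings_, index=ratings.columns,
                        columns=[f"F{i+1}" for i in range(6)])
# Show only loadings above the .4 threshold used in the paper
print(loadings.where(loadings.abs() > 0.4).round(2))
```

The unrotated eigenvalues correspond to the kind of scree plot shown above, and the thresholded rotated loadings to the kind of pattern reported in Table 5.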
The largest factor, accounting for 34% of the variance, was made
up of accuracy, lexical complexity, coherence and cohesion. This
factor can be described as a general writing ability factor. The second
factor, which accounted for a further 13% of the variance, was made
up of hedging and interpretation of data. This is, at first glance, an
Table 5 Loadings for principal factor analysis (components 1–6)
might otherwise get a six or a seven and yes, if they can write completely
error-free then I can give them a nine. I have no problems with that.
The idea of being able to arrive precisely at a score was also echoed
in the following comment by Rater 7:
Rater 7: It is interesting, I found that it [the new scale] is quite different to
the DELNA one and it is quite amazing to be able to count things and say,
I know exactly which score to use now.
Whilst the comments about the new scale reported above shed a
positive light on the scale, less positive comments were also made
by some of the raters.
Three raters criticized the fact that some information was lost
because the descriptors in the new scale were too specific. Rater 5,
for example, argued that a simple count of hedging devices could not
capture variety and appropriateness:
Researcher: You said that, other than hedging, style wasn’t really
considered.
Rater 5: Yeah, it does seem a bit limited. […] I suppose that is similar [to
the DELNA scale] it sort of relies on the marker’s knowledge of English in
a more kind of global way sort of. But maybe that is the inter-rater reliability
issue coming up.
Above, the results for research questions 1 and 2 were presented.
The following section aims to discuss these results in light of the
overarching research question:
To what extent is an empirically developed rating scale of
academic writing with level descriptors based on discourse
analytic measures more valid for diagnostic writing assessment
than an existing rating scale?
V Discussion
DELNA is a diagnostic assessment system. To establish construct
validity for a rating scale used for diagnostic assessment, we need
to turn to the limited literature on diagnostic assessment. Alderson
(2005) compiled a list of features which distinguish diagnostic tests
from other types of tests. Four of Alderson’s 18 statements are cen-
tral to rating scales and rating scale development. These are shown
in Table 6.
This section will discuss each of Alderson’s four statements in turn
and then focus on the raters’ perceptions of the two scales.
Table 6 Extract from Alderson’s (2005) features of diagnostic tests
the halo effect is not necessarily only a rater effect, but also a rating
scale effect.
However, what was not established in this study was whether the
raters rated analytically because they were unfamiliar with the new
scale. It is possible that extended use of the new scale might also
result in more holistic rating behavior. Further longitudinal research
is needed to determine whether this is indeed the case.
Statement 2. Diagnostic tests should enable a detailed analysis and
report of responses to items or tasks and
Statement 3. Diagnostic tests thus give detailed feedback which can
be acted upon.
Alderson’s (2005) second and third statements assert that diag-
nostic assessments should enable a detailed analysis and report of
responses to tasks and that this feedback should be in a form that can
be acted upon. Both rating scales lend themselves to a detailed report
of a candidate’s performance. However, as evident in the quantita-
tive analysis, if the raters at times resort to a holistic impression to
guide their marking when using the DELNA scale, this will reduce
the amount of detail that can be provided to students. If most scores
are, for example, centred around the middle of the scale range, then
this information is likely to be less useful to students than if they
are presented with a more jagged profile of some higher and some
lower scores.
A score report card based on the new scale could be designed to
make clearer suggestions to students. For example, the section on
academic style could suggest the use of more hedging devices or
students could be told how they could improve the coherence of their
essays rather than just being told that their writing ‘lacks academic
style’ or is ‘incoherent’. More detailed suggestions on what score
report cards could look like are beyond the scope of this paper, but
can be found in Knoch (2007a).
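To make this concrete, a score report generator of the kind suggested here could map band scores on individual traits to targeted advice. The sketch below is purely illustrative; the trait names, threshold and feedback wording are hypothetical rather than taken from DELNA materials.

```python
# Hypothetical sketch of a diagnostic score report: trait-level scores are
# turned into actionable feedback instead of a single global judgement.
ADVICE = {
    "hedges": "Try using more hedging devices (e.g. 'may', 'appears to') "
              "to soften claims and achieve a more academic style.",
    "coherence": "Link each paragraph back to your main argument and avoid "
                 "introducing unrelated ideas mid-paragraph.",
}

def report(scores: dict[str, int], threshold: int = 6) -> list[str]:
    """Return feedback lines for every trait scored below the threshold."""
    return [f"{trait} (band {score}): "
            f"{ADVICE.get(trait, 'See your adviser for detailed feedback.')}"
            for trait, score in scores.items() if score < threshold]

print("\n".join(report({"hedges": 5, "coherence": 7, "accuracy": 4})))
```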
Statement 4. Diagnostic tests are more likely to be [...] focussed on
specific elements than on global abilities.
Alderson’s fourth statement asserts that diagnostic tests are more
likely to be focussed on specific elements rather than on global
abilities. If a diagnostic test of writing is to focus on specific
elements, then this needs to be reflected in the rating scale.
Therefore, the descriptors need to lend themselves to isolating
more detailed aspects of a writing performance. The descriptors
VI Conclusion
The findings of this study have a number of implications. The first
refers to the classification of rating scale types commonly found
in the literature. Weigle (2002), as well as many other authors,
distinguishes between holistic and analytic rating scales. However,
this study seems to suggest that although these two types of scales
are distinct, it is also necessary to distinguish two types of analytic
scales: less detailed, a priori developed scales and more detailed,
empirically developed scales. Therefore, Weigle’s summary table
can be expanded in the following manner (Table 7).
Researchers and practitioners need to be made aware of the differ-
ences between analytic scales and need to be careful when making
decisions about the type of scale to use or the development method
to adopt.
Table 7 Extension of Weigle’s (2002) table to include empirically developed
analytic scales
Table 8 Rating scales and score reporting for different types of writing
assessment
VII References
Alderson, C. (2005). Diagnosing foreign language proficiency: The interface
between learning and assessment. London: Continuum.
Bachman, L. F. (1990). Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford:
Oxford University Press.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters
online: How does it compare with face-to-face training? Assessing
Writing, 12, 26–43.
Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago,
IL: Winsteps.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do
they really mean to the raters? Language Testing, 19(3), 246–276.
McNamara, T. (1996). Measuring second language performance. Harlow,
Essex: Pearson Education.
McNamara, T. (2002). Discourse and assessment. Annual Review of Applied
Linguistics, 22, 221–242.
Mickan, P. (2003). ‘What’s your score?’ An investigation into language
descriptors for rating written performance. Canberra: IELTS Australia.
Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability
within the Test of Spoken English assessment system. Princeton, NJ:
Educational Testing Service.
North, B., & Schneider, G. (1998). Scaling descriptors for language profi-
ciency scales. Language Testing, 15(2), 217–263.
Turner, C. E. (2000). Listening to the voices of rating scale developers:
Identifying salient features for second language performance assessment.
The Canadian Modern Language Review, 56(4), 555–584.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student
samples: Effects of the scale maker and the student sample on scale con-
tent and student scores. TESOL Quarterly, 36(1), 49–70.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second
language tests. ELT Journal, 49(1), 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of
second-language speaking ability: Test method and learner discourse.
Language Testing, 16(1), 82–111.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater’s mind? In
L. Hamp-Lyons (Ed.), Assessing second language writing in academic
contexts. Norwood, NJ: Ablex.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring
the coherence of writing using topic-based analysis. Assessing Writing,
9, 85–104.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University
Press.
White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-
Bass.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Appendix 1: Abridged DELNA scale
(Band levels 9–4; each trait is described here at three levels, from strongest to weakest performance.)

FLUENCY
Organization: Essay fluent – well organised – logical paragraphing | Evidence of organization – paragraphing may not be entirely logical | Little organization – possibly no paragraphing
Cohesion: Appropriate use of cohesive devices – message able to be followed throughout | Lack / inappropriate use of cohesive devices causes some strain for reader | Cohesive devices absent / inadequate / inappropriate – considerable strain for reader
Style: Generally academic – may be slight awkwardness | Some understanding of academic style | Style not appropriate to task

CONTENT
Description of data: Data described accurately | Data described adequately / may be overemphasis on figures | Data (partially) described / may be inaccuracies / very brief / inappropriate
Interpretation of data: Interpretation sufficient / appropriate | Interpretation may be brief / inappropriate | Interpretation often inaccurate / very brief / inappropriate
Development / extension of ideas: Ideas sufficient and supported. Some may lack obvious relevance | Ideas may not be expressed clearly or supported appropriately – essay may be short | Few appropriate ideas expressed – inadequate supporting evidence – essay may be short

FORM
Sentence structure: Controlled and varied sentence structure | Adequate range – errors in complex sentences may be frequent | Limited control of sentence structure
Grammatical accuracy: No significant errors in syntax | Errors intrusive / may cause problems with expression of ideas | Frequent errors in syntax cause significant strain
Vocabulary & spelling: Vocab. appropriate / may be few minor spelling errors | Limited, possibly inaccurate / inappropriate vocab / spelling errors | Range and use of vocabulary inadequate. Errors in word formation & spelling cause strain
Appendix 2: Abridged new scale
(Band levels 9–4; each trait is described here from strongest to weakest performance; some traits are abridged to two levels.)

ACCURACY
Accuracy: Error-free | Nearly error-free | Nearly no or no error-free sentences

FLUENCY
Repair fluency: No self-corrections | No more than 5 self-corrections | More than 20 self-corrections

COMPLEXITY
Lexical complexity: Large number of words from academic wordlist (more than 20) / vocabulary extensive – makes use of large number of sophisticated words | Less than 5 words from AWL / uses only very basic vocabulary

MECHANICS
Paragraphing: 5 paragraphs | 4 paragraphs | 1 paragraph

READER-WRITER INTERACTION
Hedges: More than 9 hedging devices | 7–8 hedging devices | No hedging devices

CONTENT
Data description: All data described (all trends and relevant figures) | Most data described (all trends, some figures / most trends, most figures) | Data description not attempted or incomprehensible
Interpretation of data: Five or more relevant reasons and/or supporting ideas | No reasons provided
Part 3 of task: Four or more relevant ideas | No ideas provided

COHERENCE
Coherence: Writer makes regular use of superstructures, sequential progression and possibly indirect progression; few incidences of unrelated progression; no coherence breaks | Frequent: unrelated progression, coherence breaks and some extended progression; infrequent: sequential progression and superstructure

COHESION
Cohesion: Connectives used sparingly but skilfully (not mechanically) compared to text length, and often describe a relationship between ideas; writer might use this/these to refer to ideas more than four times | Writer uses few connectives, there is little cohesion; this/these not or very rarely used