
Language Testing 2009 26 (2) 275–304

Diagnostic assessment of writing:
A comparison of two rating scales

Ute Knoch, University of Melbourne, Australia

Alderson (2005) suggests that diagnostic tests should identify strengths and
weaknesses in learners’ use of language and focus on specific elements
rather than global abilities. However, rating scales used in performance
assessment have been repeatedly criticized for being imprecise and therefore
often resulting in holistic marking by raters (Weigle, 2002). The aim of this
study is to compare two rating scales for writing in an EAP context; one ‘a
priori’ developed scale with less specific descriptors of the kind commonly
used in proficiency tests and one empirically developed scale with detailed
level descriptors. The validation process involved 10 trained raters applying
both sets of descriptors to the rating of 100 writing scripts yielded from a
large-scale diagnostic assessment administered to both native and non-native
speakers of English at a large university. A quantitative comparison of rater
behaviour was undertaken using FACETS. Questionnaires and interviews
were administered to elicit the raters’ perceptions of the efficacy of the two
types of scales. The results indicate that rater reliability was substantially
higher and that raters were able to better distinguish between different
aspects of writing when the more detailed descriptors were used. Rater
feedback also showed a preference for the more detailed scale. The findings
are discussed in terms of their implications for rater training and rating scale
development.

Keywords: diagnostic writing assessment, second language writing, second language writing assessment, rating scale development, rating scale validation, rating scales

Alderson (2005) argues that diagnostic tests are often confused with
placement or proficiency tests. He lists several specific features
which distinguish diagnostic tests from other types of tests. Among
these, he writes that diagnostic tests should be designed to identify
strengths and weaknesses in the learner’s knowledge and use of lan-
guage and that diagnostic tests usually focus on specific rather than
global abilities.

Address for correspondence: Ute Knoch, Language Testing Research Centre, University of Melbourne,
Room 521, Level 5, Arts Centre, Carlton, Victoria, 3052, Australia; email: [email protected]

© The Author(s), 2009. Reprints and Permissions: http://www.sagepub.co.uk/journalsPermissions.nav


DOI:10.1177/0265532208101008

When discussing the diagnostic assessment of writing, Alderson
(2005) describes the use of indirect tests (in this case the DIALANG
test) rather than the use of performance tests. However, indirect tests
are used less and less to assess writing ability in the current era of
performance testing because they are not considered to be adequate
and valid measures of the multi-faceted nature of writing (Weigle,
2002) and therefore an argument can be made that diagnostic tests
of writing should be direct rather than indirect.
The question, then, is how direct diagnostic tests of writing should
differ from proficiency or placement tests. One central aspect in the
performance assessment of writing is the rating scale. McNamara
(2002) and Turner (2000), for example, have argued that the rating
scale (and the way raters interpret the rating scale) represents the
de-facto test construct. Accordingly, careful attention needs to be
paid not only to the formulation of a rating scale but also to the man-
ner in which it is used. This is all the more important in diagnostic
contexts, where it is incumbent upon raters to provide valid, reliable
and detailed feedback on the features of learner performance that
require further work.

I Rating scales
Several classifications of rating scales have been proposed in the
literature. The most commonly cited categorization is that of holi-
stic and analytic scales (Hamp-Lyons, 1991; Weigle, 2002). Weigle
summarizes the differences between these two scales in terms of six
qualities of test usefulness (p. 121), showing that analytic scales are
generally accepted to result in higher reliability, have higher con-
struct validity for second language writers, but are time-consuming
to construct and therefore expensive. Because analytic scales meas-
ure writing on several different aspects, better diagnostic informa-
tion can be expected.
Another possible classification of rating scales reflects the way
the scales are constructed. Fulcher (2003) distinguishes between
two main approaches to scale development: intuitive methods and
empirical methods. Intuitively developed scales are developed based
on existing scales or what scale developers think might be com-
mon features at various levels of proficiency. Typical examples of
these scales are the FSI family of scales. In recent years, a number
of researchers have proposed that scales should be developed based

on empirical methods. Examples of such scales are those produced
by North and Schneider (1998) who proposed the method of scaling
descriptors, Fulcher’s data-based scale (1996) as well as Upshur and
Turner (1999) and Turner and Upshur’s (2002) empirically derived,
binary-choice, boundary definition (EBB) scales.
Rating scales commonly used in the assessment of writing
have been criticized for a number of reasons. The first criticism
is that they are usually intuitively designed and therefore often do
not closely enough represent the features of candidate discourse.
Furthermore, Brindley (1998) and others have pointed out that the
criteria often use impressionistic terminology which is open to sub-
jective interpretations (Upshur & Turner, 1995; Watson Todd et al.,
2004). The band levels have moreover been criticized for often using
relativistic wording to differentiate between levels (Mickan, 2003),
rather than offering precise and detailed descriptions of the nature of
performance at each level.
The problems with intuitively developed rating scales described
above might affect the raters’ ability to make fine-grained distinc-
tions between different traits on a rating scale. This might result in
important diagnostic information being lost. Similarly, if raters resort
to letting an overall, global impression guide their ratings, even when
using an analytic rating scale, the resulting scoring profile would be
less useful to candidates. It is therefore doubtful whether intuitively
developed rating scales are suitable in a diagnostic context.

II The current study


The purpose of this study was to establish whether an empirically
developed rating scale for writing assessment with band descriptors
based on discourse analytic measures would result in more valid and
reliable ratings for a diagnostic context than a rating scale typical of
proficiency testing.
The study was conducted in two main phases. During the first
phase, the analysis phase, 600 DELNA writing scripts at five pro-
ficiency levels were analysed using a range of discourse analytic
measures. These discourse analytic measures were selected because
they were able to distinguish between writing scripts at different
proficiency levels and because they represented a range of aspects
of writing. Based on the findings in Phase 1, a new rating scale was
developed.

During the second phase of this study, the validation phase, 10
raters rated 100 writing scripts using first the existing descriptors
and then the new rating scale. Afterwards, detailed interviews were
conducted with seven of the ten raters to elicit their opinions of the
efficacy of the two scales.
This paper reports on the findings from the second phase.
Because both qualitative and quantitative data were collected
to support the findings, this study is situated in the paradigm of
mixed methods research (Creswell & Plano Clark, 2007). More spe-
cifically, an embedded mixed methods research model was chosen,
where qualitative data are used to supplement quantitative data.
The overarching research question for the whole study is as fol-
lows:
To what extent is an empirically developed rating scale of
academic writing with level descriptors based on discourse
analytic measures more valid and useful for diagnostic writing
assessment than an existing rating scale?
To guide the data collection and analysis of Phase 2, two more
specific research questions were formulated:
Research question 1: Do the ratings produced using the two rating
scales differ in terms of (a) the discrimination between candidates,
(b) rater spread and agreement, (c) variability in the ratings and
(d) what the different traits measure?
Research question 2: What are raters’ perceptions of the two differ-
ent rating scales for writing?

III Method
1 Context of the research
DELNA (Diagnostic English Language Needs Assessment) is a
university-funded procedure designed to identify the English lan-
guage needs of undergraduate students following their admission to
the University of Auckland, so that the most appropriate language
support can be offered (Elder, 2003; Elder & Von Randow, 2008).
The assessment includes a screening component which is made up
of a speed-reading and a vocabulary task. This is used to quickly
eliminate highly proficient users of English and exempt these from
the time-consuming and resource-intensive diagnostic procedure.
The diagnostic component comprises objectively scored reading
and listening tasks and a subjectively scored writing task.
The results of the DELNA assessment are not only made available
to students, but also to their academic departments as well as tutors
working in the English Language Self-Access Centre, the Student
Learning Centre and on English as a second language credit courses.
Based on their results on DELNA, students will be asked to attend
language tutorials set up within their specific disciplines, take ESOL
credit courses, see tutors in the English Language Self-Access
Centre, the Student Learning Centre or take a specific writing course
designed for English-speaking background students.
The writing section of the DELNA assessment is an expository
writing task in which students are given a table or graph of infor-
mation which they are asked to describe and interpret. Candidates
are given a time limit of 30 minutes. The writing task is routinely
double-marked using an analytic rating scale. The scale is described
in more detail below.

2 Instruments
a The rating scales: The DELNA rating scale: The existing
DELNA rating scale is an analytic rating scale with nine traits
(Organization, Coherence, Style, Data description, Interpretation,
Development of ideas, Sentence structure, Grammatical accuracy,
Vocabulary & Spelling) each consisting of six band levels ranging
from four to nine. The scale reflects common practice in language
testing in that the descriptors are graded using adjectives like ‘ade-
quate’, ‘sufficient’ or ‘severe’. In some trait scales, different features
of writing are conflated into one category (e.g. the vocabulary and
spelling scale). An abridged version of the DELNA scale can be
found in Appendix 1.
The new scale: The new scale was developed based on an
analysis of 600 DELNA writing samples. The scripts were ana-
lyzed using discourse analytic measures in the following categories
(for a more detailed description of each measure refer to Knoch,
2007a):
• accuracy (percentage error-free t-units);
• fluency (number of self-corrections as measured by cross-outs);
• complexity (number of words from Academic Wordlist);
• style (number of hedging devices – see Knoch (2008));
• paragraphing (number of logical paragraphs from the five-paragraph model);
• content (number of ideas and supporting ideas);
• cohesion (types of linking devices; number of anaphoric pro-
nominals ‘this/these’);
• coherence (based on topical structure analysis – see Knoch
(2007c)).
To ground the selection of the discourse-analytic measures in
theory, several possible models were reviewed as part of the larger
study (Knoch, 2007a). These included: models of communicative
competence (e.g. Bachman, 1990; Bachman & Palmer, 1996), mod-
els of rater decision-making (e.g. Cumming et al., 2001) and models
of writing (e.g. Grabe & Kaplan, 1996). As none of these were found
to be satisfactory by themselves, a taxonomy of all three was used to
select the constructs to include in the scale. Measures to represent
these constructs were then chosen based on a requirement to fulfil
the following criteria: measures had to (a) be able to discriminate
successfully between different levels of writing, (b) be practical in
the rating process, and (c) occur in most writing samples.
The new scale differs from the existing DELNA scale in that
it provides more explicit descriptors. Where possible, raters are
given features of writing which they can count. For example, for
accuracy, raters are required to estimate the percentage of error-
free sentences. The new scale does not make use of any adverbials
in the level descriptors, nor does it ask raters to focus on more than
one aspect of writing in one trait scale. For a detailed account of
the development of the new scale, please refer to Knoch (2007a;
2007b; 2007c). An abridged version of the scale can be found in
Appendix 2.
In addition to the qualitative differences of the level descriptors
of the two rating scales described above, the two scales differ in the
number of band levels of the trait scales. The DELNA scale has the
same number of levels for each trait, whilst the new scale has vary-
ing numbers of levels for different traits. The reason for this is that
the scale was developed on an empirical basis. The number of levels
reflects the findings of the empirical investigation. A comparison
of the number of band levels of the two trait scales can be seen in
Table 1.

Table 1 Comparison of traits and band levels in existing and new scale

DELNA scale               Band levels    New scale             Band levels
Grammatical accuracy      6              Accuracy              6
Sentence structure        6
Vocabulary and spelling   6              Lexical complexity    4
Data description          6              Data description      6
Data interpretation       6              Data interpretation   5
Data – Part 3             6              Data – Part 3         5
Style                     6              Hedging               6
Organization              6              Paragraphing          5
                                         Coherence             5
Cohesion                  6              Cohesion              4

b The writing samples: The one hundred writing scripts used in the second phase of the study were randomly selected from the scripts produced during the 2004 administration of the DELNA assessment.

c The training manual: To help the raters become familiar with the new scale, a training manual was produced which the
raters were asked to study at home before the rater training ses-
sion. The idea behind the development of this manual was that
because the raters were all very familiar with the existing rating
scale, a very lengthy training session would have had to be held
to introduce them to the new scale. This was, however, not pos-
sible because of time constraints on the part of the raters. In the
manual, clear instructions are provided on how each trait is to
be rated. Each trait scale is further illustrated with examples and
practice exercises.

d The interview questions: Because the interviews were semi-structured, the exact interview questions varied from participant to
participant. The raters were asked what they thought about the two
rating scales, what they would change about the two scales in terms
of the wording, categories and number of levels and if they found
any categories difficult to apply.

3 Participants
Ten DELNA raters were drawn from a larger pool of raters based
on their availability at the time of the study. All raters have several
years of experience as DELNA raters and take part in regular train-
ing moderation sessions either face-to-face or online (Elder et al.,
2007; Knoch et al., 2007).


4 Procedures
a Rater training: The rater training sessions for both the existing
rating scale and the new scale were conducted very similarly. In
each case, the raters met in plenary for a face-to-face session. In both
cases they rated 12 scripts as a group. The raters discussed their own
ratings and then compared these to benchmark ratings.

b Data collection: The ratings based on the existing DELNA scale were collected over a period of eight weeks. Two months
after the raters had completed their ratings based on the DELNA
scale, they rated the same 100 scripts using the new scale. A coun-
terbalanced design was not possible because for practicality reasons
the ratings using the existing DELNA scale had to be completed
before the new scale was designed. However, because of the large
number of scripts in the study, the two months between rating
rounds and feedback received from the raters, it can be contended
that the raters were not able to remember any of the scripts in the
sample from one rating round to the next. To avoid the effect of
all raters being previously familiar with the DELNA rating scale, a
completely new group of raters could have been recruited for this
study. However, as being familiar with the context of the assess-
ment was important for the interviews, and for practicality reasons,
this was not done.
After the ratings were completed, all raters were invited to par-
ticipate in semi-structured interviews. Seven of the ten raters agreed
to participate. The interviews were conducted in a quiet room and
lasted for 30–45 minutes.

c Data analysis: Three types of data analysis were undertaken: the analysis of the multi-faceted Rasch data, a factor analysis of the
rating data and the analysis of the interviews with the raters. Each of
these is discussed below.
First, the results of the two rating rounds were analyzed using
the multi-faceted Rasch measurement program Facets (Linacre,
2006). FACETS is a generalization of Wright and Masters’ (1982)
partial credit model that makes possible the analysis of data from
assessments that have more than the traditional two facets associ-
ated with multiple-choice tests (i.e. items and examinees). In the
many-faceted Rasch model, each facet of the assessment situation
(e.g. candidates, raters, trait) is represented by one parameter. The
model states that the likelihood of a particular rating on a given
rating scale from a particular rater for a particular student can be
predicted mathematically from the proficiency of the student and
the severity of the rater.
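For reference, the many-facet Rasch model implemented in FACETS is commonly written as follows (after Linacre); the formulation below is a standard one added here for illustration and is not reproduced from the original article:

\[ \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k \]

where \(P_{nijk}\) is the probability of candidate \(n\) being awarded category \(k\) by rater \(j\) on trait \(i\), \(B_n\) is the proficiency of the candidate, \(C_j\) the severity of the rater, \(D_i\) the difficulty of the trait, and \(F_k\) the difficulty of the step up to category \(k\).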
To interpret the results of the multi-faceted Rasch analysis, a
number of hypotheses were developed for comparing the two rating
scales:
1) Discrimination of the rating scale:
The first hypothesis was that a well functioning rating scale
would result in a high candidate discrimination. When a
rating scale is analyzed, the candidate separation ratio is an
excellent indicator of the discrimination of the rating scale.
The higher the separation ratio, the more discriminating the
rating scale is.
2) Rater separation:
The next hypothesis made was that a well functioning rating
scale would result in small differences between raters in terms
of their leniency and harshness as a group. Therefore, a rating
scale resulting in a smaller rater separation ratio is seen to be
functioning better.
3) Rater reliability:
The third hypothesis was that a necessary condition for the valid-
ity of a rating scale is rater reliability (Davies & Elder, 2005).
FACETS provides two measures of rater reliability: (a) the rater
point biserial correlation index, which is a measure of how
similarly the raters are ranking the candidates and (b) the per-
centage of exact rater agreement, which indicates, in percentage
terms, how many times raters awarded exactly the same score
as another rater in the sample. Higher values on both of these
indices point to a better-functioning rating scale.
4) Variation in ratings:
Because rating behaviour is directly influenced by the rating
scale used, it was further contended that a better functioning
rating scale would result in fewer raters rating either inconsist-
ently or overly consistently (by overusing the central categories
of the rating scale). The measure indicating variability in raters’
scores is the rater infit mean square value. Rater infit mean
square values have an expected value of 1 and can range from
0 to infinity. The closer the calculated value is to 1, the closer
the rater’s ratings are to the expected ratings. High infit mean
square values (in this case 1.3 was chosen as the cut-off level
following McNamara (1996) and Myford and Wolfe (2000))
denote ratings that are further away from the expected ratings
than the model predicts. This is a sign that the rater in question
is rating inconsistently, showing too much variation. Similarly,
low values (.7 was chosen for this study as the lower limit) indi-
cate that the observed ratings are closer to the expected ratings
than the Rasch model predicts. This could indicate that a rater is
rating very consistently; however it is more likely that the rater
concerned is overusing certain categories of the rating scale,
normally the inside values.
Each of the four criteria discussed above will be used to compare the
two rating scales.
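By way of illustration, the separation ratios (criteria 1 and 2) and the rater infit mean square (criterion 4) can be computed along the following lines once estimated measures, their standard errors, and model-expected scores are available, for example exported from a Rasch analysis. This is a minimal sketch with illustrative names; it is not the FACETS implementation used in the study.

```python
import numpy as np

def separation_ratio(measures, standard_errors):
    """Separation ratio: 'true' spread of the estimated measures divided by
    the root mean square measurement error (the usual Rasch definition)."""
    measures = np.asarray(measures, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    error_variance = np.mean(se ** 2)
    true_variance = max(np.var(measures, ddof=1) - error_variance, 0.0)
    return np.sqrt(true_variance) / np.sqrt(error_variance)

def infit_mean_square(observed, expected, model_variance):
    """Information-weighted (infit) mean square for one rater: the sum of
    squared score residuals divided by the sum of model variances."""
    residuals = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    return np.sum(residuals ** 2) / np.sum(model_variance)

def classify_rater_fit(infit, low=0.7, high=1.3):
    """Apply the 0.7 to 1.3 band used in this study to flag misfitting raters."""
    if infit > high:
        return "inconsistent (too much variation)"
    if infit < low:
        return "overly consistent (overusing central categories)"
    return "acceptable"
```

Under the hypotheses above, a higher candidate separation ratio, a lower rater separation ratio, and fewer raters falling outside the 0.7 to 1.3 infit band would all count in favour of a scale.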
Apart from the multi-faceted Rasch analysis, it was further
of interest to ascertain how many different aspects of writing
ability the raters were able to discern when using the two rating
scales. For this, principal axis factoring was used. This analysis
is designed to uncover the latent structure of interrelationships
of a set of observed variables. Before this analysis, both the
determinant of the R-matrix and the Kaiser-Meyer-Olkin measure
of sampling adequacy were calculated to ensure the suitability of the
data for this type of analysis. To determine the number of factors
to be retained in the analysis, scree plots and Jolliffe’s (1986)
criterion of retaining eigenvalues over .7 were used. Varimax
rotation was chosen to make the output of the factor analysis more
comprehensible.
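To make these pre-checks and the retention rule concrete, the sketch below shows how they could be computed, assuming the ratings are arranged as a candidates-by-traits array. It is an illustration only, not the SPSS procedure used in the study, which may differ in detail (for example in how eigenvalues are extracted for a principal axis solution).

```python
import numpy as np

def pfa_prechecks_and_retention(scores, jolliffe_cutoff=0.7):
    """Determinant of the correlation (R) matrix, Kaiser-Meyer-Olkin (KMO)
    measure of sampling adequacy, and the number of factors retained under
    Jolliffe's eigenvalue-over-0.7 criterion.

    `scores` is a 2-D array with one row per candidate and one column per trait.
    """
    R = np.corrcoef(scores, rowvar=False)          # trait intercorrelation matrix
    det_R = np.linalg.det(R)

    # KMO compares zero-order correlations with anti-image (partial) correlations.
    inv_R = np.linalg.inv(R)
    norm = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
    partial = -inv_R / norm
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    kmo = np.sum(R[off_diag] ** 2) / (
        np.sum(R[off_diag] ** 2) + np.sum(partial[off_diag] ** 2)
    )

    # Eigenvalues of R, sorted for a scree plot; count those above the cutoff.
    eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
    n_retained = int(np.sum(eigenvalues > jolliffe_cutoff))
    return det_R, kmo, eigenvalues, n_retained
```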
The interview data were transcribed and then subjected to a quali-
tative analysis via a hermeneutic process of reading, analysing and
re-reading. The coding themes that emerged during this process were
then grouped into categories, including positive and negative com-
ments about each of the two scales.

IV Results
Research question 1: Do the individual trait scales on the two
rating scales differ in terms of (a) the discrimination between
candidates, (b) rater spread and agreement, (c) variability in the
ratings and (d) what the different traits measure?


1 Comparison of individual trait scales


The first step was to compare individual trait scales wherever
possible (following Table 1). In the interest of space, only a sum-
mary of the findings will be presented with regard to the compari-
son of individual scales. For the full results please refer to Knoch
(2007a).
The findings for the comparison of the individual trait scales
generally showed that the trait scales on the new scales resulted in
a higher candidate discrimination, smaller differences between rat-
ers in terms of leniency and harshness, greater rater reliability, and
fewer raters rating with too much or too little variation.

2 Comparison of whole scales


After the individual trait scales were analysed and compared, it was
of further interest to see how the two scales performed as a whole. Figures
1 and 2 present the Wright maps of the two rating scales.
The left-hand column of each Wright map displays the logit values
ranging from positive values to negative values. The second column
shows the candidates in the sample. Higher ability candidates are
plotted higher on the logit scale, whilst lower ability candidates can
be found lower on the logit scale. The next column in the Wright
map represents the raters. Raters plotted higher on this map are more
severe than those plotted lower on the map. Next, the Wright map
shows the traits in each rating scale. More difficult traits are plotted
higher on the map than easier traits. Finally, the narrow columns on
the right of each Wright map represent the trait scales (with band
levels) in the order they were entered into FACETS. For the exist-
ing scale, these are from left to right: organization (S1), cohesion
(S2), style (S3), data description (S4), data interpretation (S5), part
three of prompt (S6), sentence structure (S7), grammatical accuracy
(S8) and vocabulary/spelling (S9). For the new scale, these are from
left to right: accuracy (S1), repair fluency (S2), lexical complexity
(S3), paragraphing (S4), hedging (S5), data description (S6), data
interpretation (S7), part three of prompt (S8), coherence (S9) and
cohesion (S10).
Figure 1 Wright map of DELNA scale
[Wright map not reproduced: logit measure (Measr), candidates, raters, traits, and the band levels of the nine trait scales S.1–S.9]

Figure 2 Wright map of new rating scale
[Wright map not reproduced: logit measure (Measr), candidates, raters, traits, and the band levels of the ten trait scales S.1–S.10]

When the two Wright maps were compared, the following observations could be made. First of all, when the raters used the existing scale, the candidates were more spread out, ranging over five logits. When the raters employed the new scale, the candidates were only spread over three logits. Therefore, although most individual trait scales on the new scale were more discriminating, it seems that as a
whole, the existing scale was more discriminating. This is also con-
firmed by the first of the rating scale statistics for the whole scale,
the candidate separation ratio, displayed in Table 2 below.
It also became apparent that the raters were a lot less spread out
when using the new scale. Their severity measures (in logits) ranged
from .25 (for the harshest rater) to −.21 (for the most lenient rater),
a range of less than half a logit. When employing the existing scale,
the raters were spread from .64 to −.74 logits, a range of nearly one
and a half logits. That the raters rated more similarly in terms of
severity could also be seen by the inter-rater reliability statistics in
Table 2, which showed that the exact agreement was higher when
the new scale was used (51.2%) than when the existing scale was
applied (37.9%). The rater point biserial correlation coefficient,
however, was lower when the new scale was used.
Next, the number of raters displaying too much or too little vari-
ability in their ratings was scrutinized. For the existing scale, half
the raters fell into one of these categories whilst no raters did for the
new scale.
When the different traits were examined on the two Wright maps,
it became clear that the traits on the new scale were slightly more
spread out in terms of difficulty, ranging from .78 on the logit scale
for repair fluency to −.74 for cohesion, a difference of one and a
half logits. On the DELNA scale the traits spread from .78 (for Data
– part three) to −.37 (for style), a difference of just over one logit.

Table 2 Rating scale statistics for entire existing and new rating scales

DELNA scale
  Candidate discrimination: candidate separation ratio 8.15
  Rater separation and reliability: rater separation ratio 8.67; rater point biserial 0.47; exact agreement 37.9%
  Variation in ratings: raters with high infit 40%; raters with low infit 10%
  Trait statistics: spread of trait measures 0.78 to −0.37; trait separation 9.14; trait fit values: data and part three much over 1.3, no low values

New scale
  Candidate discrimination: candidate separation ratio 5.34
  Rater separation and reliability: rater separation ratio 4.19; rater point biserial 0.38; exact agreement 51.2%
  Variation in ratings: raters with high infit 0%; raters with low infit 0%
  Trait statistics: spread of trait measures 0.53 to −0.76; trait separation 12.47; trait fit values: repair fluency and data slightly high, lexis and coherence low

In a criterion-referenced situation as was the case when these rating
scales were used, it is not necessarily a problem to have a bunching
up of traits around the zero logit point, as is found in Figure 1 with
the traits on the existing rating scale. However, it indicates that raters
had difficulty distinguishing between the different traits or that the
traits were related or dependent on each other (Carol Myford, per-
sonal communication). The fact that the traits in Figure 2 (new scale)
were more spread out shows that the different traits were measuring
different aspects.
If the traits were not measuring the same underlying construct,
this would explain why both the candidate separation and the rater
point biserial of the new scale were lower than those of the existing scale.
Because the results above only indicate that the traits were meas-
uring different underlying abilities, but not how many different
groups of traits the data was measuring, a principal axis factor analy-
sis (or principal factor analysis – PFA) was performed on the rating
data. PFA reduces the data at hand to a number of components,
each with an eigenvalue representing the amount of variance that
component accounts for. Components with low eigenvalues are discarded from
the analysis, as they are not seen to be contributing enough to the
overall variance. Table 3 (DELNA scale) and Table 4 (new scale)
below show the results from the principal factor analysis.
Both the scree plots and the tables displaying the results from the
PFA show that when the existing rating scale was analyzed, only one
major component was found. This component had an eigenvalue of
5.8 and accounted for about 64% of the entire variance. All other
eigenvalues were clearly below 1 (following Kaiser, 1960) and
below .7 (following Jolliffe, 1986) and there was no further level-
ing off point on the scree plot.

Table 3 Principal factor analysis: Existing DELNA scale

Component   Eigenvalue   % of variance   Cumulative %
1           5.803        64.472           64.472
2            .694         7.983           72.455
3            .657         7.691           80.146
4            .549         6.341           86.487
5            .415         4.756           91.243
6            .275         3.076           94.319
7            .209         2.326           96.645
8            .168         1.827           98.472
9            .138         1.528          100.000

[Scree plots not reproduced: eigenvalue against factor number for the existing DELNA scale (left) and the new scale (right)]
Figure 3 Scree plots of principal factor analysis

Table 4 Principal factor analysis: New scale

Component   Eigenvalue   % of variance   Cumulative %
1           3.434        34.34            34.34
2           1.276        12.76            47.10
3           1.154        11.54            58.64
4            .863         8.63            67.28
5            .817         8.17            75.44
6            .763         7.63            83.07
7            .577         5.77            88.85
8            .491         4.91            93.75
9            .389         3.89            97.64
10           .236         2.36           100.00

When the new scale was analyzed, however, the results were different. The PFA resulted in six compo-
nents with eigenvalues over 0.7.
The next step in the PFA was to identify which variables load
onto which component. For this, a rotation of the data was necessary.
However, because only one component was identified for the exist-
ing scale, no factor loadings can be displayed. A varimax rotation
was chosen to facilitate the interpretation of the factors of the new
scale. A trait was considered to be loading on a factor if the loading
was higher than .4 (as indicated in bold font). The six factor loadings
for the new scale can be seen in Table 5.
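To make the rotation step concrete, the following is a small self-contained sketch of a varimax rotation (the standard Kaiser algorithm) and of flagging salient loadings; it is written for this summary, not the SPSS routine used in the study, and it assumes the absolute value of a loading is what is compared with the .4 threshold.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Plain varimax rotation of an unrotated loading matrix (p traits x k factors)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag(np.sum(Lr ** 2, axis=0)) / p)
        )
        R = u @ vt
        d_new = np.sum(s)
        if d_new < d_old * (1 + tol):   # stop once the criterion no longer improves
            break
        d_old = d_new
    return L @ R

def salient_loadings(rotated, trait_names, threshold=0.4):
    """For each trait, list the factors (1-based) on which it loads above |threshold|."""
    return {
        name: [f + 1 for f in np.where(np.abs(row) > threshold)[0]]
        for name, row in zip(trait_names, np.asarray(rotated))
    }
```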
Table 5 Loadings for principal factor analysis

                         Component
                         1        2        3        4        5        6
Accuracy                 0.796*   0.119   −0.045    0.037    0.168   −0.035
Repair fluency           0.155   −0.005    0.067   −0.062    0.959*  −0.064
Lexical complexity       0.731*   0.112    0.288    0.116    0.106    0.004
Paragraphing             0.009   −0.037    0.072    0.029   −0.061    0.992*
Hedging                  0.174    0.945*   0.056   −0.009   −0.030   −0.041
Data description         0.141    0.017    0.025    0.971*  −0.062    0.030
Interpretation of data   0.338    0.448*  −0.340    0.253    0.241    0.009
Content – Part 3         0.269    0.105    0.850*   0.090    0.139    0.092
Coherence                0.867*   0.097    0.009    0.083    0.016    0.030
Cohesion                 0.875*   0.089    0.064    0.064    0.039    0.021

The largest factor, accounting for 34% of the variance, was made up of accuracy, lexical complexity, coherence and cohesion. This factor can be described as a general writing ability factor. The second factor, which accounted for a further 13% of the variance, was made up of hedging and interpretation of data. This is, at first glance, an unusual factor. However, it can be argued that writers need to make
use of hedging in the section where the data is interpreted since the
writer is speculating rather than stating facts. For this reason, a writer
who scored high on hedging might also have put forward more ideas
in this section of the essay. The third factor, which accounted for
12% of the variance, consisted of Part Three of the content, the sec-
tion in which writers are required to extend their ideas. The fourth
factor, which accounted for 9% of the variance, was another content
factor, the description of data. That all three parts of the content load
on separate factors shows that they were all measuring different
aspects of content. Repair fluency was the only measure that loaded
on the fifth factor, which accounted for another 8% of the variance.
The last factor, which also accounted for 8% of the variance, only
had paragraphing loading on it. The six factors together accounted
for 83% of the entire variance of the score, whilst the single factor
found in the analysis of the existing rating scale only accounted for
64% of the data.
It can therefore be argued that the ratings based on the new scale
not only accounted for more aspects of writing ability but also
accounted for a larger amount of the variation in the scores. In other
words, there was less unaccounted variance when the new scale was
used.
Research question 2: What are raters’ perceptions of the two
different rating scales for writing?
The most commonly emerging themes in the interviews were
grouped into the following sections:

• themes emerging about DELNA scale
• themes emerging about new scale
1 Themes emerging about DELNA scale
The most regularly emerging theme in the interviews was that raters
often experienced problems when using the DELNA scale. One of
the most commonly mentioned problems was that the raters thought
the descriptors were often too vague to arrive easily at a score. In the
extract below, for example, Rater 4 talked about the problems she
encountered when deciding on a score for Content:
Rater 4: [...] And here relevant and supported, I find that tricky support,
what exactly is support. Because sometimes it is actually, sometimes you
have a number of ideas but there is not much support for them and what
is sufficient. […] You just can’t, there is nothing specific there to hang
things on.
Problems with the vagueness of the DELNA descriptors were also
reflected in the comments by Rater 5 below:
Rater 5: [...] Sometimes I look at it [the descriptors] I’m going ‘what do you
mean by that?’ [...] You just kind of have to find a way around it cause it’s
not really descriptive enough, yeah.
A number of raters pointed directly to the adjectives used as being
the problem. In the example below, Rater 10 talked about the
descriptors for vocabulary and spelling:
Rater 10: Well there’s always a bit of a problem with vocabulary and spelling
anyway in deciding you know the difference between extensive, appropriate,
adequate, limited and inadequate. So there’s sort of adverbial [sic]. Yeah,
it’s really just a sort of adverbial thingy anyway isn’t it so I think I just go
with gut instinct on that one.
Although most raters reported having problems deciding on band
levels with the DELNA scale, the methods of coping were quite dif-
ferent for different raters. A variety of strategies (both conscious and
subconscious) emerged from the interviews. These were as follows:
• assigning a global score
• rating with a halo effect
• disregarding the DELNA descriptors.
The first strategy that a number of raters referred to in their inter-
views was assigning a global score to a script, usually after the first
reading.
Rater 5 below describes his rating process, which is more holistic
than analytic:

Rater 5: Mmh, yeah, I always automatically think, this is a native speaker,
this is a non-native speaker. How well will this come across, will it be suf-
ficient for academic writing and then that is sort of borderline between six
and seven quite often and then is it a better seven or is it an eight or is it less
than a six, or is it five.
This overall, holistic type rating often results in a halo effect, where
a rater awards the same score for a number of categories on the
scale. Below, Rater 10 talked about awarding scores for the three
categories grouped under fluency in the DELNA scale, organization,
cohesion and style:
Rater 10: For style, again, I just tend to go with the gut instinct. And I suspect
I often tend to give the same grade or similar grades for cohesion and style.
Probably for the whole of fluency. [...] So in a way, it is almost like giving
a global mark for the three things in consideration. With, if someone had no
paragraphing, but everything else was good, maybe a bit of variation.
Some raters seemed to clearly disregard the DELNA descriptors and
override them with their own general impression of a script. Rater 10
(below) was talking about the score she would award for organiza-
tion to a script that had no clear paragraph breaks but was otherwise
well organized. The DELNA descriptors recommend awarding
either a five or a six.
Rater 10: Mmh [...] well, I think I would, (sighs), looking at this it ought
to be a six, but it is possible particularly if I suspected that it was a native
speaker, and that it was someone that wasn’t so strong in academic writing
but actually had very good English, I might even go up to a seven, but I [...]
yeah, if I had other reservations about the language and stuff, then I would
give it a six or even a five if it is really bad. But if I was sort of convinced
by the writer in every other way, I might well push the score up in a way not
to pull them down. Just for the paragraphing.

2 Themes emerging about new scale


The most commonly emerging theme about the new scale was that
the raters liked the fact that the descriptors in the new scale were
more explicit. This is evident in the following extracts from the
interviews:
Researcher: Do you feel you used the whole range there [accuracy in the
new scale]?
Rater 10: Yes, yes, I did. Yeah. I think I would be more likely to. Because
I thought I had something to actually back it up with, it had a clearer guideline
for what I was actually doing, so I was more confident for giving nines and
fours. And I think also because I didn’t, I let go of the sense of this is a seven,
so I have to make it come out as a seven and I’d say, well, sorry, if they have
no error-free sentences they get a four and I don't care if it is something that
might otherwise get a six or a seven and yes, if they can write completely
error-free then I can give them a nine. I have no problems with that.
The idea of being able to arrive precisely at a score was also echoed
in the following comment by Rater 7:
Rater 7: It is interesting, I found that it [the new scale] is quite different to
the DELNA one and it is quite amazing to be able to count things and say,
I know exactly which score to use now.
Whilst the comments about the new scale reported above shed a
positive light on the scale, less positive comments were also made
by some of the raters.
Three raters criticized the fact that some information was lost
because the descriptors in the new scale were too specific. Rater 5,
for example, argued that a simple count of hedging devices could not
capture variety and appropriateness:
Researcher: You said that, other than hedging, style wasn’t really
considered.
Rater 5: Yeah, it does seem a bit limited. […] I suppose that is similar [to
the DELNA scale] it sort of relies on the marker’s knowledge of English in
a more kind of global way sort of. But maybe that is the inter-rater reliability
issue coming up.
Above, the results for research questions 1 and 2 were presented.
The following section aims to discuss these results in light of the
overarching research question:
To what extent is an empirically developed rating scale of
academic writing with level descriptors based on discourse
analytic measures more valid for diagnostic writing assessment
than an existing rating scale?

V Discussion
DELNA is a diagnostic assessment system. To establish construct
validity for a rating scale used for diagnostic assessment, we need
to turn to the limited literature on diagnostic assessment. Alderson
(2005) compiled a list of features which distinguish diagnostic tests
from other types of tests. Four of Alderson’s 18 statements are cen-
tral to rating scales and rating scale development. These are shown
in Table 6.
This section will discuss each of Alderson's four statements in turn
and then focus on the raters' perceptions of the two scales.

Table 6 Extract from Alderson’s (2005) features of diagnostic tests

1. Diagnostic tests are designed to identify strengths and weaknesses in a learner's knowledge and use of language.
2. Diagnostic tests should enable a detailed analysis and report of responses to items or tasks.
3. Diagnostic tests thus give detailed feedback which can be acted upon.
4. Diagnostic tests are more likely to be ... focussed on specific elements than on global abilities.

Statement 1. Diagnostic tests are designed to identify strengths and weaknesses in a learner's knowledge and use of language.
Alderson’s first statement calls for diagnostic assessments to
identify strengths and weaknesses in a learner’s knowledge and use
of language. Both rating scales compared in this study were analytic
scales and were designed to identify strengths and weaknesses in
the learners’ writing ability. However, the PFA showed that the new
scale distinguished six different writing factors, whilst the current
DELNA scale resulted in one large factor. Therefore, it could be
argued that the new scale was more successful in identifying differ-
ent strengths and weaknesses.
The main reason that the ratings based on the DELNA scale
resulted in only one factor was the halo effect displayed by most
raters. Although developed as an analytic scale, the existing
scale seemed to lend itself to a more holistic approach to rating.
It is possible that, as hypothesized in this study, the rating scale
descriptors do not offer raters sufficient information on which
to base their decisions and so raters resort to a global impres-
sion when awarding scores. This then would explain why, when
using the empirically developed new scale with its more detailed
descriptors, the raters were able to discern distinct aspects of a
candidate’s writing ability.
Some studies have in fact found that raters display halo effects
only when encountering problems in the rating process (e.g. Lumley,
2002; Vaughan, 1991). Lumley, for example, found that when rat-
ers could not identify certain features in the descriptors, they would
resort to more global, impressionistic type rating. This study sug-
gests that the halo effect and impressionistic type marking might be
more widespread than has so far been reported. It was possible to
show that simply providing raters with more explicit scoring criteria
can significantly reduce this effect. It could therefore be argued that
the halo effect is not necessarily only a rater effect, but also a rating
scale effect.
However, what was not established in this study was whether the
raters rated analytically because they were unfamiliar with the new
scale. It is possible that extended use of the new scale might also
result in more holistic rating behavior. Further longitudinal research
is needed to determine whether this is indeed the case.
Statement 2. Diagnostic tests should enable a detailed analysis and
report of responses to items or tasks and
Statement 3. Diagnostic tests thus give detailed feedback which can
be acted upon.
Alderson’s (2005) second and third statements assert that diag-
nostic assessments should enable a detailed analysis and report of
responses to tasks and that this feedback should be in a form that can
be acted upon. Both rating scales lend themselves to a detailed report
of a candidate’s performance. However, as evident in the quantita-
tive analysis, if the raters at times resort to a holistic impression to
guide their marking when using the DELNA scale, this will reduce
the amount of detail that can be provided to students. If most scores
are, for example, centred around the middle of the scale range, then
this information is likely to be less useful to students than if they
are presented with a more jagged profile of some higher and some
lower scores.
A score report card based on the new scale could be designed to
make clearer suggestions to students. For example, the section on
academic style could suggest the use of more hedging devices or
students could be told how they could improve the coherence of their
essays rather than just being told that their writing ‘lacks academic
style’ or is ‘incoherent’. More detailed suggestions on what score
report cards could look like are beyond the scope of this paper, but
can be found in Knoch (2007a).
Statement 4. Diagnostic tests are more likely to be [...] focussed on
specific elements than on global abilities.
Alderson’s fourth statement states that diagnostic tests are more
likely to be focussed on specific elements rather than on global
abilities. If a diagnostic test of writing should focus on specific
elements, then this needs to be reflected in the rating scale.
Therefore, the descriptors need to lend themselves to isolating
more detailed aspects of a writing performance. The descriptors
of the new scale were more focussed on specific elements of writ-
ing because they were based on discourse analytic measures which
had, in an earlier phase of this research, been found to discriminate
between texts at different levels of proficiency. To arrive at a truly
diagnostic assessment of writing, all categories in an analytic scale
need to be reported back to stakeholders individually, otherwise the
diagnostic power of the assessment is lost.

1 Raters’ perceptions of the two scales


Finally, it was also important to establish the raters’ perceptions
of the efficacy of the two scales for diagnostic assessment. Raters'
perceptions of the scales' usefulness are important as they provide
one perspective on the construct validity of the scale. As language
experts they are well qualified to judge whether the writing construct
is adequately represented by the scale.
In the course of the interviews it became apparent that most
raters treated DELNA as a proficiency or placement test rather
than a diagnostic assessment. For example, Rater 10 commented:
I think I prefer the existing DELNA scale because I like to mark
on ‘gut instinct’ and to feel that a script is ‘probably a six’ etc. It
was a little disconcerting with the ‘new’ scale to feel that scores
were varying widely for different categories for the same script.
Similarly, Rater 5 mentioned in his interview: 'I notice these things
[features of form] as I am reading through, but I try not to focus
too much on them. I try to go for broad ideas and are they answer-
ing the question. Are they communicating to me what they need to
communicate first of all. And how well do they do that.’ It seems
therefore that the diagnostic purpose of the assessment was not
clear to them and/or that their experience of rating different kinds
of tests, such as IELTS, was influencing their behaviour. The
findings of this study suggest that raters need to be made aware
of the purpose of the assessment in their training sessions, so that
they recognize the importance of rating each aspect of writing
separately. This might result in raters displaying less of the halo or
central tendency effects.
Returning to the overall purpose of this study, the following
conclusions can be drawn. Bachman and Palmer (1996) suggest that
‘the most important consideration in designing and developing a
language test is the use for which it was intended’ (p. 17). We need,
therefore, to remember that the purpose of this test is to provide
detailed diagnostic information to the stakeholders on test takers'
writing ability. Most evidence speaks in favour of the new scale.
One aspect not reported on previously has to do with practicality.
Two aspects of practicality have to be taken into consideration:
(1) practicality of scale development and (2) practicality of scale
use. It is clear that the scale development process for an empirically
developed scale is more laborious. In terms of practicality of use, the
new scale proved only minimally more time-consuming. Some raters
even reported being able to use the new scale faster. Overall, it can
be argued, however, that the new scale is less practical.
But, as Bachman and Palmer (1996) and Weigle (2002) argue, it
is impossible to maximise all aspects of test usefulness. The task of
the test developer is to determine an appropriate balance among the
qualities in a specific situation. Since each context is different, the
importance of each quality of test usefulness varies from situation
to situation. Test developers should therefore strive to maximize
overall usefulness given the constraints of a particular context, rather
than try to maximize all qualities. In the context of DELNA, a diag-
nostic assessment, it could be argued that construct validity is central
(as is the case in most assessment situations). Practicality is always
a crucial consideration, but wherever possible, construct validity
should not be sacrificed simply to ensure practicality.
Overall, the new scale has been shown generally to function more
validly and reliably than the pre-existing scale in the diagnostic
context in which it was trialled.

VI Conclusion
The findings of this study have a number of implications. The first
refers to the classification of rating scale types commonly found
in the literature. Weigle (2002), as well as many other authors,
distinguishes between holistic and analytic rating scales. However,
this study seems to suggest that although these two types of scales
are distinct, it is also necessary to distinguish two types of analytic
scales: less detailed, a priori developed scales and more detailed,
empirically developed scales. Therefore, Weigle’s summary table
can be expanded in the following manner (Table 7).
Researchers and practitioners need to be made aware of the differ-
ences between analytic scales and need to be careful when making
decisions about the type of scale to use or the development method
to adopt.

Table 7 Extension of Weigle's (2002) table to include empirically developed analytic scales

Quality | Holistic scale | Analytic scale (intuitively developed) | Analytic scale (empirically developed)

Reliability | Lower than analytic but still acceptable. | Higher than holistic. | Higher than intuitively developed analytic scales.

Construct validity | Holistic scale assumes that all relevant aspects of writing develop at the same rate and can thus be captured in a single score; holistic scores correlate with superficial aspects such as length and handwriting. | Analytic scales more appropriate as different aspects of writing ability develop at different rates, but raters might rate with a halo effect. | Higher construct validity as based on real student performance; assumes that different aspects of writing ability develop at different speeds.

Practicality | Relatively fast and easy. | Time-consuming; expensive. | Time-consuming; most expensive.

Impact | Single score may mask an uneven writing profile, may be misleading for placement and may not provide enough relevant information for diagnostic purposes. | More scales can provide useful diagnostic information for placement, instruction and diagnosis, but might be used holistically by raters; useful for rater training. | Provides even more diagnostic information than intuitively developed analytic scale; especially useful for rater training.

Authenticity | White (1985) argues that reading holistically is a more natural process than reading analytically. | Raters may read holistically and adjust analytical scores to match holistic impression. | Raters assess each aspect individually.

Another implication relates to score reporting. Table 8 presents
different purposes for which writing tests are administered. The table
provides a short description of the purpose of each test, what type
of rating scale might be used and how the score should be reported.
For example, whilst for a proficiency test it might be less important
whether the rating scale in use is holistic or analytic (as long as it results
in reliable ratings), the rating scale used in diagnostic assessment
would need to be analytic and at the same time should provide a dif-
ferentiated score profile. The need for these different types of scales
is a consequence of the way the scores are reported. Results of tests

Downloaded from ltj.sagepub.com at UCSF LIBRARY & CKM on March 10, 2015
300 Diagnostic assessment of writing
Table 8 Rating scales and score reporting for different types of writing
assessment

Purpose of Definition Rating scale Score reporting


writing test

Proficiency test Designed to test Holistic or One averaged score.


general writing analytic.
ability of students.
Diagnostic test Designed to identify Needs to be In detail; separate
strengths and analytic and for each trait
weaknesses in needs to result rating scale.
writing ability; in differentiated
designed to provide scores across
detailed feedback traits.
which students can
act upon; designed
to focus on specific
rather than global
abilities.

of writing proficiency are usually only reported as one averaged


score but, as Alderson (2005) suggests, the score profiles of diagnos-
tic tests should be as detailed as possible and therefore any averaging
of scores is not desirable.
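To make the contrast in reporting formats concrete, the following sketch (with hypothetical trait names and band scores; not DELNA's actual reporting routine) shows how a proficiency-style report collapses a trait profile into one averaged band, whereas a diagnostic report keeps the differentiated profile visible.

```python
# Sketch of the two reporting formats contrasted in Table 8.
# Trait names and band scores are hypothetical; DELNA's actual
# reporting procedure may differ.
trait_scores = {
    "organization": 7,
    "cohesion": 6,
    "data description": 8,
    "grammatical accuracy": 5,
}

# Proficiency-style reporting: one averaged score (the profile is lost).
averaged = sum(trait_scores.values()) / len(trait_scores)
print(f"Proficiency report: overall band {averaged:.1f}")

# Diagnostic reporting: the full differentiated profile is retained,
# so an uneven profile (here, grammar lagging behind content) stays visible.
print("Diagnostic report:")
for trait, band in trait_scores.items():
    print(f"  {trait}: band {band}")
```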
Overall, this study has been able to show that a rating scale
with descriptors based on discourse-analytic measures is more
valid and useful for diagnostic writing assessment purposes. The
uniqueness of diagnostic assessment has recently been highlighted
again by Alderson's (2005) book Diagnosing foreign language
proficiency, and this study is therefore an important contribution
to the writing assessment literature. The author has attempted to
show that not all analytic rating scales can be assumed to function
diagnostically and that scale developers and users need to be
careful when selecting or developing a scale for diagnostic
assessment.

VII References
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface
between learning and assessment. London: Continuum.
Bachman, L. F. (1990). Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford:
Oxford University Press.

Brindley, G. (1998). Describing language development? Rating scales and
SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between sec-
ond language acquisition and language testing research (pp. 112–140).
Cambridge: Cambridge University Press.
Creswell, J., & Plano Clark, V. L. (2007). Designing and conducting mixed
methods research. Thousand Oaks, CA: Sage.
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and
TOEFL 2000 prototype writing tasks: An investigation into raters’ decision
making and development of a preliminary analytic framework. TOEFL
Monograph Series 22. Princeton, NJ: Educational Testing Service.
Davies, A., & Elder, C. (2005). Validity and validation in language testing. In
E. Hinkel (Ed.), Handbook of research in second language teaching and
learning. Mahwah, NJ: Lawrence Erlbaum.
Elder, C. (2003). The DELNA initiative at the University of Auckland.
TESOLANZ Newsletter, 12(1), 15–16.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating
rater responses to an online rater training program. Language Testing,
24(1), 37–64.
Elder, C., & von Randow, J. (2008). Exploring the utility of a web-based
English language screening tool. Language Assessment Quarterly, 5(3),
173–194.
Fulcher, G. (1996). Does thick description lead to smart tests? A data-
based approach to rating scale construction. Language Testing, 13(2),
208–238.
Fulcher, G. (2003). Testing second language speaking. London: Longman/
Pearson Education.
Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing. New York:
Longman.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-
Lyons (Ed.), Assessing second language writing in academic contexts.
Norwood, NJ: Ablex.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-
Verlag.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis.
Educational and Psychological Measurement, 20, 141–151.
Knoch, U. (2007a). Diagnostic writing assessment: The development and vali-
dation of a rating scale. Unpublished PhD thesis, University of Auckland.
Knoch, U. (2007b). Do empirically developed rating scales function differ-
ently to conventional rating scales for academic writing? Spaan Fellow
Working Papers in Second or Foreign Language Assessment, 5, 1–36.
Knoch, U. (2007c). ‘Little coherence, considerable strain for reader’:
A comparison between two rating scales for the assessment of coher-
ence. Assessing Writing, 12(2), 108–128.
Knoch, U. (2008). The assessment of academic style in EAP writing: The case of
the rating scale. Melbourne Papers in Language Testing, 13(1), 34–67.

Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters
online: How does it compare with face-to-face training? Assessing
Writing, 12, 26–43.
Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago,
IL: Winsteps.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do
they really mean to the raters? Language Testing, 19(3), 246–276.
McNamara, T. (1996). Measuring second language performance. Harlow,
Essex: Pearson Education.
McNamara, T. (2002). Discourse and assessment. Annual Review of Applied
Linguistics, 22, 221–242.
Mickan, P. (2003). ‘What’s your score?’ An investigation into language
descriptors for rating written performance. Canberra: IELTS Australia.
Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability
within the Test of Spoken English assessment system. Princeton, NJ:
Educational Testing Service.
North, B., & Schneider, G. (1998). Scaling descriptors for language profi-
ciency scales. Language Testing, 15(2), 217–263.
Turner, C. E. (2000). Listening to the voices of rating scale developers:
Identifying salient features for second language performance assessment.
The Canadian Modern Language Review, 56(4), 555–584.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student
samples: Effects of the scale maker and the student sample on scale con-
tent and student scores. TESOL Quarterly, 36(1), 49–70.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second
language tests. ELT Journal, 49(1), 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of
second-language speaking ability: Test method and learner discourse.
Language Testing, 16(1), 82–111.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater’s mind? In
L. Hamp-Lyons (Ed.), Assessing second language writing in academic
contexts. Norwood, NJ: Ablex.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring
the coherence of writing using topic-based analysis. Assessing Writing,
9, 85–104.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University
Press.
White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-
Bass.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.

Appendix 1: Abridged DELNA scale
9 8 7 6 5 4

FLUENCY
Organization: Essay fluent – well organised – logical paragraphing | Evidence of organization – paragraphing may not be entirely logical | Little organization – possibly no paragraphing
Cohesion: Appropriate use of cohesive devices – message able to be followed throughout | Lack / inappropriate use of cohesive devices causes some strain for reader | Cohesive devices absent / inadequate / inappropriate – considerable strain for reader
Style: Generally academic – may be slight awkwardness | Some understanding of academic style | Style not appropriate to task

CONTENT
Description of data: Data described accurately | Data described adequately / may be overemphasis on figures | Data (partially) described / may be inaccuracies / very brief / inappropriate
Interpretation of data: Interpretation sufficient / appropriate | Interpretation may be brief / inappropriate | Interpretation often inaccurate / very brief / inappropriate
Development / extension of ideas: Ideas sufficient and supported. Some may lack obvious relevance | Ideas may not be expressed clearly or supported appropriately – essay may be short | Few appropriate ideas expressed – inadequate supporting evidence – essay may be short

FORM
Sentence structure: Controlled and varied sentence structure | Adequate range – errors in complex sentences may be frequent | Limited control of sentence structure
Grammatical accuracy: No significant errors in syntax | Errors intrusive / may cause problems with expression of ideas | Frequent errors in syntax cause significant strain
Vocabulary & spelling: Vocab. appropriate / may be few minor spelling errors | Limited, possibly inaccurate / inappropriate vocab / spelling errors | Range and use of vocabulary inadequate. Errors in word formation & spelling cause strain
Appendix 2: Abridged new scale
9 8 7 6 5 4

Accuracy (Accuracy): Error-free | Nearly error-free | Nearly no or no error-free sentences
Fluency (Repair fluency): No self-corrections | No more than 5 self-corrections | More than 20 self-corrections
Complexity (Lexical complexity): Large number of words from academic wordlist (more than 20) / vocabulary extensive – makes use of large number of sophisticated words | Less than 5 words from AWL / uses only very basic vocabulary
Mechanics (Paragraphing): 5 paragraphs | 4 paragraphs | 1 paragraph
Reader–Writer Interaction (Hedges): More than 9 hedging devices | 7–8 hedging devices | No hedging devices
Content (Data description): All data described (all trends and relevant figures) | Most data described (all trends, some figures / most trends, most figures) | Data description not attempted or incomprehensible
Content (Interpretation of data): Five or more relevant reasons and/or supporting ideas | No reasons provided
Content (Part 3 of task): Four or more relevant ideas | No ideas provided
Coherence (Coherence): Writer makes regular use of superstructures, sequential progression and possibly indirect progression. Few incidences of unrelated progression. No coherence breaks | Frequent: unrelated progression, coherence breaks and some extended progression. Infrequent: sequential progression and superstructure
Cohesion (Cohesion): Connectives used sparingly but skilfully (not mechanically) compared to text length, and often describe a relationship between ideas. Writer might use this/these to refer to ideas more than four times | Writer uses few connectives, there is little cohesion. This/these not or very rarely used.
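Several descriptors in the new scale are expressed in countable terms (words from the Academic Word List, hedging devices, paragraphs). The sketch below illustrates how such counts could in principle be derived from a script; the word sets are tiny hypothetical stand-ins rather than the actual AWL or the hedge inventory underlying the scale, and the study itself did not use this code.

```python
# Minimal sketch: deriving the countable features referred to in the
# new scale's descriptors. AWL_SAMPLE and HEDGES are tiny hypothetical
# stand-ins for the real word lists used by the scale.
import re

AWL_SAMPLE = {"analyse", "data", "significant", "indicate", "factor"}
HEDGES = {"may", "might", "perhaps", "possibly", "appears", "suggests"}

def descriptor_counts(script: str) -> dict:
    """Count AWL tokens, hedging devices and paragraphs in a script."""
    tokens = re.findall(r"[a-z']+", script.lower())
    paragraphs = [p for p in script.split("\n\n") if p.strip()]
    return {
        "awl_tokens": sum(1 for t in tokens if t in AWL_SAMPLE),
        "hedging_devices": sum(1 for t in tokens if t in HEDGES),
        "paragraphs": len(paragraphs),
    }

sample = "The data may indicate a rise.\n\nThis is perhaps a significant factor."
print(descriptor_counts(sample))
# {'awl_tokens': 4, 'hedging_devices': 2, 'paragraphs': 2}
```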
