Marianne Nikolov Editor
Assessing Young Learners of English: Global and Local Perspectives
Educational Linguistics
Volume 25
Series Editor
Francis M. Hult, Lund University, Sweden
Editorial Board
Marilda C. Cavalcanti, Universidade Estadual de Campinas, Brazil
Jasone Cenoz, University of the Basque Country, Spain
Angela Creese, University of Birmingham, United Kingdom
Ingrid Gogolin, Universität Hamburg, Germany
Christine Hélot, Université de Strasbourg, France
Hilary Janks, University of Witwatersrand, South Africa
Claire Kramsch, University of California, Berkeley, U.S.A.
Constant Leung, King’s College London, United Kingdom
Angel Lin, University of Hong Kong, Hong Kong
Alastair Pennycook, University of Technology, Sydney, Australia
Educational Linguistics is dedicated to innovative studies of language use and lan-
guage learning. The series is based on the idea that there is a need for studies that break
barriers. Accordingly, it provides a space for research that crosses traditional disciplin-
ary, theoretical, and/or methodological boundaries in ways that advance knowledge
about language (in) education. The series focuses on critical and contextualized work
that offers alternatives to current approaches as well as practical, substantive ways
forward. Contributions explore the dynamic and multi-layered nature of theory-
practice relationships, creative applications of linguistic and symbolic resources,
individual and societal considerations, and diverse social spaces related to language
learning.
The series publishes in-depth studies of educational innovation in contexts throughout
the world: issues of linguistic equity and diversity; educational language policy;
revalorization of indigenous languages; socially responsible (additional) language
teaching; language assessment; first- and additional language literacy; language
teacher education; language development and socialization in non-traditional set-
tings; the integration of language across academic subjects; language and technol-
ogy; and other relevant topics.
The Educational Linguistics series invites authors to contact the general editor with sug-
gestions and/or proposals for new monographs or edited volumes. For more informa-
tion, please contact the publishing editor: Jolanda Voogd, Associate Publishing
Editor, Springer, Van Godewijckstraat 30, 3300 AA Dordrecht, the Netherlands.
The editor and the authors of the book are grateful to two anonymous reviewers for
their insights and comments on the first drafts. Their recommendations helped us
tailor the text to our readers’ needs. We would also like to thank Dr. Francis M. Hult,
editor of the Educational Linguistics book series at Springer, for his helpful guidance and
Mrs. Jolanda Voogd, senior editor, and Helen van der Stelt, her senior assistant, for
their support on this project.
List of Acronyms
AO age of onset
AoA age of arrival for immigrants
ACTFL American Council on the Teaching of Foreign Languages
AfL assessment for learning
CB computer-based
CEFR Common European Framework of Reference for Languages
CLIL content and language integrated learning
CPH critical period hypothesis
CVIs content validity indices
EAL English as an additional language
EAP English for academic purposes
EFL English as a foreign language
ELL early language learning
ELP English language portfolio
ESL English as a second language
ETS Educational Testing Service
EYL English to young learners
GSE Global Scale of English
IT information technology
KET Cambridge English: Key
KSAs knowledge, skills, and abilities
L2 second language
LoE length of exposure
LoR length of residence
PA peer assessment
PB paper-based
PET Cambridge English: Preliminary
PPVT Peabody Picture Vocabulary Test
SA self-assessment
SBA standards-based assessment
Trends, Issues, and Challenges in Assessing Young Language Learners
Marianne Nikolov
Abstract This introductory chapter aims to achieve multiple goals. The first part
outlines the most important recent trends in early language learning, teaching, and
assessment and frames the main issues. The second part discusses the most
frequent challenges policy makers, materials designers, test developers, researchers
and teachers face. The third part introduces the chapters in the volume and explains
how they are embedded in the trends. The last part suggests ideas for further research
and points out some implications for educational and assessment practice.
1 Introduction
The aim of this chapter is to offer insights into recent trends, emerging issues and
challenges in the field of teaching and assessing young language learners and to
outline which aspects the chapters in this volume highlight in various educational
contexts. Recent developments are best viewed from a perspective of innovation
(Davison, 2013; Davison & Leung, 2009; Kennedy, 2013). This approach to early
language learning and assessment as a larger system (Markee, 2013) may allow us
to understand how innovation works at various levels and how the classroom, insti-
tutional, educational, administrative, political and cultural level subsystems inter-
act. A narrow focus on certain aspects of assessment practice is limited; innovation
and change are necessary in the whole of assessment culture (Davison, 2013). The
chapters in the book explore global issues and how they are embedded in local con-
texts. The findings may not directly translate into other situations; therefore, readers
are expected to critically reflect on them and analyze how the lessons learnt can be
relevant.
M. Nikolov (*)
Institute of English Studies, University of Pécs, Pécs, Hungary
e-mail: [email protected]
Some of the studies included in the book fall into the narrow field of language
testing and share information on frameworks and the time-consuming test design
and validation processes of test development. Other chapters go beyond these
domains and discuss results of large-scale national studies and smaller-scale class-
room projects. The common denominator in these explorations reflects stakeholders’
local needs. Alternative approaches to assessment, for example, peer and self-
assessment, diagnostic testing, assessment for learning, and ways in which young
learners’ individual differences interact with test results are also discussed in depth.
It is hoped that a wide readership will find food for thought in the book.
Specific uses of terms are clarified in the chapters and a list of acronyms is also
included at the beginning of the volume. The ages covered by the term young learn-
ers in the chapters range from 6 to 12 or so; children in the projects learn a foreign
language in the first 6 years of their studies. The use of key terms needs clarification.
In this volume we follow the widely accepted tradition of using assessment and test-
ing interchangeably, although we are aware that assessment is often used “as a
superordinate term covering all forms of evaluation” (Clapham, 1997, p. xiv). The
majority of sources on young learners tends to follow this tradition and this is what
authors of this volume also do.
These days, millions of children learn a foreign language (FL), most often English
(EFL), in public and private schools around the globe. The recent dynamic increase
in the number of young language learners in early language programs is embedded
in larger trends. Firstly, more and more people learn English as a lingua franca, aim-
ing to achieve useful levels of proficiency in English, the means of international
communication. Today, English is increasingly perceived as a basic competence
and an asset for non-native speakers of English to succeed in life. Since access to
English as a commodity is often limited, early language learning has a special social
dimension. Proficiency in English can empower learners and early English may
offer better access to empowerment over time.
These trends have important implications for curricula, assessment and equity.
On the one hand, in many countries not all children have access to equal opportuni-
ties to start learning English at a young age. It has been widely observed that par-
ents’ socio-economic status plays an important role in access to English and choices
of programs. In many places around the world parents empower their children by
finding earlier, more intensive and better quality programs for their offspring. For
example, an article in The Economist (December 20th 2014, p. 83) reported that
80% of students at international schools around the world are locals because their
parents want them to study later in an English speaking country and they believe
that earlier and better quality English learning opportunities allow them to do so.
“When people make money, they want their children to learn English, when they
make some more money, they want them to learn in English.” As a result of high
investment in children’s learning of English, highly motivated parents make sure
that their children learn English in the very best programs, as is documented by the
recent interest in content and language integrated learning (CLIL). This new devel-
opment poses new opportunities and challenges for assessment.
Parents would like to have evidence of their children’s proficiency in English at
the earliest possible stage. This need has resulted in several internationally acknowl-
edged external proficiency examinations offering young learners opportunities to
take age-appropriate exams and document their level of proficiency. How these test
results are used and why may vary (see e.g., Chik & Besser, 2011). Parents who
want their children to get language certificates assume that the proficiency achieved
at an early stage of language learning will be automatically maintained and built on
over time.
Another line of test development is documented by national and international
projects implemented in more and more countries as early language learning is
becoming more the norm than the exception. Certain phases and steps of the ardu-
ous process of test development are discussed in five chapters in this book. Needs
vary to a large extent, as the studies indicate and the uses of test results are also very
different. Some projects are initiated by policy makers in order to establish a base-
line or for gatekeeping purposes; others result from more bottom-up initiatives
based on local needs.
The boom in early language learning is due to more and more parents’ and decision
makers’ belief in ‘the younger the better’ slogan; young children are expected to
outsmart older starters simply by starting at a younger age. The overwhelming opti-
mism and overconfidence characterizing early language programs are well known in
research in the social sciences and behavioral economics (Kahneman, 2011).
Wishful thinking is sustained by selectively attending to evidence in favor of one’s beliefs. The approaches
to interpreting data on how young learners develop and what realistic expectations
are after several years of exposure to L2 can be explained by two metaphors: an
inkblot test and a puzzle (Nikolov, 2013). In the first approach, interpretations are
projected onto the data and are biased by emotions, expecta-
tions, beliefs, etc. In the second approach, all data contribute to a better understand-
ing of the whole as well as the small components of the larger picture. Although the
puzzle metaphor is also limited, as it supposes a single correct outcome, it repre-
sents a more objective, scientific, and interactionist approach. The chapters in this
volume hopefully add meaningful pieces to the picture of early language learning.
In recent years, concerns have been voiced about early learning of a foreign lan-
guage both in national and local programs, as evidence on ‘the younger the slower’
has emerged (e.g., de Bot, 2014; García Mayo & García Lecumberri, 2003; Muñoz,
2006; Nikolov & Mihaljević Djigunović, 2006, 2011). Many experts have empha-
sized that focusing on starting age as the key variable is misleading in foreign lan-
guage contexts. The age factor is not the main issue. There is a lot more to success
over time. The quality and quantity of early provision, teachers, programs, and con-
tinuity are more important (Nikolov, 2000; Singleton, 2014). Also, it is now widely
acknowledged and documented that maintaining young learners’ motivation over
many years is an unexpected challenge emerging in most contexts: the earlier L2
learning is introduced, the sooner typical classroom activities and topics become
boring for young learners. This is one of the reasons why there is a growing interest
in integrating content areas and moving towards content-based curricula, which, in
turn, pose further challenges in both teaching and assessment.
More and more stakeholders realize that offering early language learning oppor-
tunities is only the starting point. Issues related to curricula, teacher education,
monitoring progress and outcomes over the years, and transition across different
stages of education persist and pose new challenges (e.g., Nikolov 2009a, 2009b,
2009c; Rixon, 2013). In fact, the same old challenges are reemerging in a cyclic
manner, as was implicitly predicted by Johnstone (2009).
An important shift can be observed from an emphasis on the ‘fun and ease’ of
early language learning to standards-based measurement of the outcomes in the
target language (L2; e.g., Johnstone, 2009; Rixon, 2013, 2016 in this book). The
shift towards standards is not limited to foreign language programs; it is an interna-
tional trend in educational assessment for accountability in public educational poli-
cies in all subjects and competences.
Test results indicating how children progress and what levels they achieve in
their L2 at the end of milestones in education are often used as one of several key
variables interacting in the process of early foreign language learning and teach-
ing. In other words, it has been realized that early language learning is not at all
a simpler construct than language learning of older learners. Recent research proj-
ects apply all kinds of L2 tests as one of many data collection instruments in
order to answer larger research questions, as they aim to build and test models of
early foreign language learning. An important area of explorations concerns how
young learners’ individual differences, including attitudes, motivation, aptitude,
anxiety, self-perceptions, self-confidence, strategies, etc., contribute to their
development in their L2 (Bacsa & Csíkos, 2016; Mihaljević Djigunović, 2016;
Nikolov, 2016 all in this book). Another important avenue of explorations
gaining ground looks into how learners’ first (L1) and other languages interact
with one another over time (e.g., Nikolov & Csapó, 2010; Wilden & Porsch,
2016 in this volume).
Yet another important line of research examines how different types of curricula
contribute to early language learning. Traditional FL programs are often supple-
mented or substituted by early content and language integrated learning curricula
(CLIL). Overall, these research studies aim to find out not only what level of profi-
ciency children achieve in their L2, but they also want to offer explanations as to
how and why. The type of curriculum has important implications for the construct
as well as for the way the curriculum is implemented in the classroom. On the one
hand, some recent studies focus on the relationships between contextual factors and
classroom processes. Highly age-appropriate innovative approaches, including
assessment for learning (AfL, Black & Wiliam, 1998), diagnostic (Alderson, 2005;
Nikolov, 2016), peer and self-assessment are examined in ELL contexts (Butler,
2016; Hung, Samuelson & Chen, 2016 in this volume). On the other hand, some
research projects aim to find out how and to what extent different curricula contrib-
ute to L2 development.
In recent years, the field of early language learning research has grown remark-
ably. Many new studies have been published in refereed journals. (See for example
Special Issues of English Language Teaching Journal, 2014 (3) edited by Copland
and Garton; International Journal of Bilingualism, 2010 (3) edited by Nikolov; and
Studies in Second Language Learning and Teaching, 2014 (3) edited by Singleton.)
A range of books and research studies are available on the early teaching and learn-
ing of modern foreign languages offering food for thought for decision makers,
teachers, teacher educators and researchers. (For critical overviews see e.g., Murphy,
2014; Nikolov & Mihaljević Djigunović, 2006, 2011.) Publications on large scale
surveys give insights into the big picture (e.g., Edelenbos, Johnstone, & Kubanek,
2007; Emery, 2012; Garton, Copland & Burns, 2011; Rhodes & Pufahl, 2008;
Rixon, 2013, 2016 in this volume). Excellent handbooks offer classroom teachers
guidance on age-appropriate methodology and assessment (e.g., Cameron, 2001;
Curtain & Dahlberg, 2010; Jang, 2014; McKay, 2006; Pinter, 2006, 2011).
The growing body of empirical studies (e.g., Enever, 2011; Enever, Moon, &
Raman, 2009; García Mayo & García Lecumberri, 2003; Muñoz, 2006; Nikolov
2009a, 2009b) applies some kinds of tests, as they implement quantitative or mixed
research methods (Nikolov, 2009c) and analyze test results in interaction with
other variables. Testing young language learners’ progress over time in their class-
rooms and their proficiency at the end of certain periods are often the focus of
studies. Thus, the assessment of young learners has become a central issue in early
language learning research and daily practice (Butler, 2009; Inbar-Lourie &
Shohamy, 2009; Johnstone, 2009; McKay, 2006; Nikolov & Mihaljević Djigunović,
2011; Rixon, 2013), as chapters in the present volume indicate. As Rixon (2016)
put it in the title of her chapter, these developments in assessment represent the
‘Coming of Age’.
The trends outlined above have important implications for the construct. Assessment
of young language learners in early learning contexts was first brought to the atten-
tion of the testing community as a bona fide domain in a special issue of Language
Testing edited by Pauline Rea-Dickins (2000). In her editorial she emphasized an
array of issues: processes and procedures teachers used in their classrooms to moni-
tor their learners’ development and their own practice, the assessment of young
learners’ achievement at the end of their primary education, and teachers’ profes-
sional development. At that time high hopes were typical in publications on early
language programs and hardly any comparative studies were available on younger
and older EFL learners. However, the field was characterized by variability and
diversity, as Rea-Dickins pointed out (p. 119).
Over the past 15 years, the picture has become even more complex for several reasons:
(1) The constructs (Inbar-Lourie & Shohamy, 2009; Johnstone, 2009) cover various types
of curricula;
(2) More evidence has been found on young learners’ varied achievements and on how
their individual differences and contextual variables, including teacher-related ones,
contribute to outcomes over time (for an overview see Nikolov & Mihaljević
Djigunović, 2011);
(3) Accountability poses a recent challenge as standards-based assessment in early lan-
guage programs has been introduced in many educational contexts.
The emergence of accountability in early language learning is not an unexpected
phenomenon. As Johnstone (2009, p. 33) pointed out, the third phase of early learn-
ing became a “truly global phenomenon and … possibly the world’s biggest policy
development in education. Thus, meeting ‘the conditions for generalized success’
becomes an awesome challenge.” The task is to establish to what extent and in what
conditions early language learning can be claimed to be successful in a range of
very different situations where conditions vary a lot. Stakeholders are interested in
seeing results. What can young learners actually do after many years of learning
their new language? An important challenge for researchers concerns what curricu-
lum is best and what realistic age-appropriate achievement targets are included in
language policy documents. Once curricula are defined, and frameworks are in
place, the construct and expected outcomes have to be in line with how young learn-
ers develop and how their motivation can be maintained over years.
Although early language learning is often seen as a simple proposition (start
learning early), a lot of variation characterizes models according to when programs
start, how much time they allocate, what type of curriculum and method they apply,
who the teachers are, and how they implement the program. In the European con-
texts (Edelenbos, Kubanek, & Johnstone, 2007; Johnstone, 2009), three types of
curricula are popular: (1) awareness raising about languages; (2) traditional FL programs
offering one to a few classes per week; and (3) content and language integrated
learning (CLIL) curricula where up to 50% of the curriculum is taught in the L2.
The first type does not aim to develop proficiency in an L2; the other two usually
define L2 achievement targets. CLIL programs have become popular in Europe,
Asia and South America. CLIL is typically taught by non-native teachers of English,
and ‘could be interpreted as a foreign language enrichment measure packaged into
content teaching’ (Dalton-Puffer, 2011, p. 184). In most schools ‘CLIL students
nearly always continue with their regular foreign language program alongside their
CLIL content lessons’ (p. 186). What the construct is in these two programs is one
of the main challenges in early language learning research. As has been indicated,
the increased interest in early CLIL programs is due to growing evidence that in
traditional (type 2) programs children develop at a very slow rate and many of the
motivating activities lose their appeal and soon become boring. Therefore, integrat-
ing not only topics from the main curriculum (as in type 2 programs), but also teach-
ing subjects in the target language is supposed to solve two problems at once: a
focus on intrinsically motivating content also offers opportunities to develop all
four L2 skills. This means that both content and language have
to be assessed.
As for the construct of early language learning, Inbar-Lourie and Shohamy
(2009) suggest that different types of curricula should be seen along a continuum
between programs focusing on language and content. Awareness raising is at one
end, FL programs somewhere in the middle, and CLIL and immersion at the other
end. They propose that in early language programs language should be “a tool for
gaining knowledge and meaning making and for developing cognitive processing
skills” (p. 91). In this framework, L2 is closely linked to the overall curriculum and
learners’ L1, and to a larger view of assessment culture in which assessment is a means
to improve learning. Their proposed framework integrates widely accepted principles of
age-appropriate classroom methodology as well as assessment. The challenges con-
cern how curricula define the aims set for language and content knowledge, and
cognitive and other abilities and skills.
Achievement targets in L2 tend to be modest in early language programs. Young
learners are not expected to achieve native level (e.g., Curtain, 2009; Haenni Hoti,
Heinzmann, & Müller, 2009; Inbar-Lourie & Shohamy, 2009). Frameworks tend
to build on developmental stages in early language programs and reflect how young
learners move from chunks to analyzed language use (Johnstone, 2009). Most cur-
ricula include not only L2 achievement targets, but comprise further aims. Early
learning is meant to contribute to young learners’ positive attitudes towards lan-
guages, language learning, speakers of other languages, and towards learners’ own
culture and identity (e.g., Prabhu, 2009). In addition to linguistic and affective
aims, they often include aims related to cognition, metacognition and learning
strategies. The multiplicity of aims involves a contradiction: testing in most con-
texts focuses on L2 achievements, while the other aims are not assessed at all or
are discussed only in a few research projects. Testing in early language learning
programs is most often concerned with: (1) how learners progress in their L2 over
time and (2) what levels of proficiency they achieve in some or all of the four skills
by the end of certain periods. In addition to these areas, there is a need to explore
how teachers assess YLs and how classroom practices interact with children’s atti-
tudes, motivation, willingness to communicate, anxiety, self-confidence and self-
perception over time.
Early language learning assessment frameworks define the main principles of
teaching and assessing young learners and aim to describe and quantify what chil-
dren are expected to be able to do at certain stages of their L2 development (e.g.,
Curtain, 2009; Jang, 2014; McKay, 2006; Nikolov, 2016 in this volume). Frameworks
developed in Europe tend to use the Common European Framework of Reference
for Languages (CEFR, Council of Europe, 2001) as a point of departure, despite the
fact that it was not designed for young learners (e.g., Hasselgren, 2005; Pižorn,
2009; Papp & Salamoura, 2009; Papp & Walczak, 2016 in this volume). In contrast,
research projects on early CLIL tend to follow a different tradition unrelated to test-
ing children or standards-based testing. They frame CLIL as an add-on to FL
instruction and analyze young learners’ performances along three criteria (complex-
ity, accuracy, and fluency) used in second language acquisition research (e.g.,
Housen & Kuiken, 2009). Such a framework, however, is hardly suited to document
very slow development (see e.g., Bret-Blasco, 2014).
Tests for young learners have been developed for various purposes. Standards-
based tests are used in national and international projects and external examinations
as well as in smaller-scale research studies. The majority of national and interna-
tional projects tend to apply standards aligned to levels in CEFR. Test construction
and validation is a long and complex process. Some important work has been pub-
lished on the process of developing frameworks and can-do statements, and on designing and
validating tests for various purposes, for example, for large-scale proficiency tests,
research projects and teacher-based assessments. These areas are discussed in five
chapters.
One example is the longitudinal ELLiE study (Enever, 2011), conducted in seven
European countries including Sweden and Croatia. In addition to L2, other factors were also included to find out
how they contributed to processes and outcomes in the target languages as well as
in the affective domain (Enever, 2011; Mihaljević Djigunović, 2012). Researchers
faced challenges similar to those in previous longitudinal studies on early language
learning (Enever, 2011; García Mayo & García Lecumberri, 2003; Muñoz, 2006).
The same tests were used over the years to collect valid and reliable results on par-
ticipants’ L2 development and a single task was used for each skill.
Assessment projects are often narrowly focused and seek answers to
research questions emerging from practice. For example, how achievement tests are
applied by teachers (Peng & Zheng, 2016), and how innovative assessment tech-
niques can change classroom processes (Butler, 2016; Hung, Samuelson & Chen,
2016, both in this volume). Other projects use tests in order to build new models or
to test existing ones to find out to what extent they can reflect realities in early FL
classrooms (Mihaljević Djigunović, 2016; Bacsa & Csíkos, 2016; see chapters in
this volume).
In recent years, several international examinations have been developed and made
available to young language learners whose parents want them and can afford them.
Three widely known exams offer certificates on children’s proficiency in English:
(1) Cambridge Young Learners English Tests (www.cambridgeesol.org/exams/young-learners),
(2) Pearson Test of English Young Learners (www.pearsonpte.com/PTEYoungLearners), and
(3) TOEFL Primary (https://www.ets.org/toefl_primary). These examinations fall
somewhere in the middle of the language–content
continuum with a focus on some typically taught topics young language learners
can be realistically expected to know. The levels cover A1 and A2 in the CEFR
(Council of Europe, 2001). Besides aural/oral skills, literacy skills are also included.
How much work is devoted to developing and validating exams is discussed in three
of the chapters (Benigno & de Jong, 2016; Hsieh, 2016; Papp & Walczak, 2016).
Unfortunately, hardly any studies explore how these proficiency exams impact
classroom processes or how children taking them benefit from their experiences in
the long run. It would also be important to know how they maintain and further
develop their proficiency after taking examinations.
Recent research on early language learning assessment has focused on how teacher-
based assessment can scaffold children’s development in their L2 knowledge and
skills so that they can apply their learning potential (Sternberg & Grigorenko, 2002).
Researching and documenting how certain tests work with young learners is time-
consuming and this is an area where there is a need and a lot of room for further
work. Like the most brilliant age-appropriate teaching materials and tasks,
the most valid and reliable tests can also be misused or abused. The chapters in this
volume offer insights into some actual tests and how researchers and teachers
applied them. One interesting trend needs pointing out: most of the tests discussed
in the early language learning assessment literature and these chapters are similar to
language tests widely used and accepted in the L2 testing literature. However, some
tests and criteria for assessment are borrowed from other traditions: for example,
oral production was assessed along complexity, accuracy, and fluency in Bret
Blasco’s (2014) study on CLIL.
As these are key issues in assessment, a detailed and critical analysis should
focus on what tests are used in assessment projects involving young learners. Often
a single task is used to tap into a skill and the same test is used over the years to
document development (e.g., Bret Blasco, 2014; Enever, 2011). Recently, elicited
repetition has also been used to assess speaking. It is important to approach these
questions from the learners’ and teachers’ perspectives as well and to explore how
tests can be linked to offer more reliable insights into young learners’ development
(e.g., Nikolov & Szabó, 2012; Szpotowicz & Campfield, 2016 in this volume).
There is a lot of potential in learning about the traditions in the fields of second
language acquisition and language testing, and most probably both areas would
benefit from a comparative analysis.
The last two chapters provide insights into how peer-, self-assessment and
teacher assessment interact with one another. Yuko Goto Butler, in chapter “Self-
Assessment of and for Young Learners’ Foreign Language Learning”, offers a criti-
cal overview of research into self-assessment of and for young learners’ foreign
language learning and proposes five dimensions for developing further research
instruments, thus linking teaching, assessment and learning. The context of the final
chapter is Taiwan. Yu-ju Hung, Beth Lewis Samuelson and Shu-cheng Chen explore
the relationships between peer- and self-assessment and teacher assessment of
young EFL learners’ oral presentations by applying both the teacher’s and her stu-
dents’ reflections for triangulation purposes.
This volume outlines some of the key areas where research has been conducted.
Similar inquiries would allow us to find out how results would compare in other
contexts. Researchers, including classroom teachers, should consider how replica-
tion studies could offer useful information on learners’ achievements in their coun-
tries and classrooms. Data collection instruments can be of invaluable help with
instructions on how to apply them. Such data repositories, for example at
http://iris-database.org/iris/app/home/index, are available. Test development is an
extremely challenging and expensive process. Questionnaires, interviews, etc. also
require special expertise to develop and validate. Sharing them would allow the
early language learning field to advance more rapidly.
It is also important to note which key areas are not discussed in this book in full
detail or at all, and where more research is needed.
(1) In order to answer research questions related to the larger picture on early start pro-
grams, studies should aim to find out in what domains younger learners excel over time
and why this is the case. This kind of research should work towards testing models of
early language learning. Studies should include proficiency tests on learners’ aural/oral
and literacy skills in their L1, L2, L3. Other instruments should tap into individual dif-
ferences of young learners and their teachers, and contextual variables (including char-
acteristics of programs, materials, methods, the quality of teaching) interacting in
children’s development over several years. The main benefits of an early start are most
probably not in higher L2 proficiency over time, and this hypothesis may have impor-
tant implications for language policy, curriculum design, teacher education and class-
room practice.
(2) Hardly any studies look into the relationships between access to early foreign language
learning opportunities, assessment, and equity. Do all children have equal opportuni-
ties? Research is necessary to examine how parents’ motivation, learners’ socio-eco-
nomic status, and achievements on tests interact and how test results are used.
(3) A recurring theme in early language teaching programs concerns transition and conti-
nuity. Studies should go beyond the early years and focus on how teachers build on
what learners can do in later years and what role assessment practices play in the pro-
cess. In other words, research is necessary into how children are taught and assessed,
and how teachers can apply diagnostic information in their teaching.
(4) The impact of different kinds of assessment on young language learners, their teachers,
and the teaching-learning process should be explored in depth. Teachers’ and learners’
emic perspectives are hardly ever integrated into studies. Exploring teachers’ and their
learners’ beliefs and lived experiences could reveal why implementing innovation
often poses a major challenge. Case studies could offer insights on what it means to a
child to take an external examination, what challenges learners and their teachers face
due to parental pressure to produce results, and why teachers may resist change in their
teaching and testing practices.
(5) It would be essential to learn more about the ways in which achievement targets defined
in curricula are assessed by teachers on a daily basis. How do they balance giving children
feedback on their progress in test results with maintaining their motivation and keeping
their debilitating anxiety low?
(6) Yet another avenue for classroom research for practicing teachers should explore how
teachers apply traditional (assessment of learning) and innovative assessment tech-
niques (assessment for learning, peer and self-assessment). How do they use criteria
for assessing speaking and writing and keys on closed items and students’ responses to
open items? How do they integrate other aspects of students’ behavior into their assess-
ments, for example, their willingness to communicate, attitudes, motivation, aptitude,
anxiety?
(7) Very little is known about testing learners’ knowledge and skills in CLIL programs.
Exploratory classroom studies are needed to find out how teachers tease out the two
domains and how they can diagnose if learners’ weaknesses are in their L2 or in the
subject matter.
The studies in this volume discuss various aspects of test development, outcomes
of large-scale surveys, national assessment projects, and innovative smaller-scale
studies. The ideas shared and the frameworks and instruments used for data collec-
tion should be of interest to both novice and experienced teachers, materials and test
developers, as well as for researchers. Readers should bear in mind which of the
main points are worth further exploration. It is hoped that the volume offers excit-
ing new ideas and results in innovation and change.
References
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning
and assessment. London: Continuum.
Bacsa, É., & Csíkos, C. (2016). The role of individual differences in the development of listening
comprehension in the early stages of language learning. In M. Nikolov (Ed.), Assessing young
learners of English: Global and local perspectives. New York: Springer.
Benigno, V., & de Jong, J. (2016). A CEFR-based inventory of YL descriptors: Principles and chal-
lenges. In M. Nikolov (Ed.), Assessing young learners of English: Global and local perspec-
tives. New York: Springer.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education,
5(1), 7–71.
Bret Blasco, A. (2014). L2 English learners’ oral production skills in CLIL and EFL settings: A
longitudinal study. Doctoral dissertation, Universitat Autonoma de Barcelona.
Butler, Y. G. (2009). Issues in the assessment and evaluation of English language education at the
elementary school level: Implications for policies in South Korea, Taiwan, and Japan. The
Journal of Asia TEFL, 6, 1–31.
Butler, Y. G. (2016). Self-assessment of and for young learners’ foreign language learning. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Cameron, L. (2001). Teaching languages to young learners. Cambridge: Cambridge University
Press.
Chik, A., & Besser, S. (2011). International language test taking among young learners: A Hong
Kong case study. Language Assessment Quarterly, 8, 73–91.
Clapham, C. (1997). Introduction. In C. Clapham & D. Corson (Eds.), Encyclopedia of language
and education. Volume 7. Language testing and assessment (pp. xiii–xix). Dordrecht: Kluwer
Academic Publisher.
Copland, F., & Garton, S. (Eds.). (2014). English Language Teaching Journal. Special Issue.
Curtain, H. (2009). Assessment of early learning of foreign languages in the USA. In M. Nikolov
(Ed.), The age factor and early language learning (pp. 59–82). Berlin, Germany: Mouton de
Gruyter.
Curtain, H. A., & Dahlberg, C. A. (2010). Languages and children – Making the match: New lan-
guages for young learners (4th ed.). Needham Heights, MA: Pearson Allyn & Bacon.
Dalton-Puffer, C. (2011). Content and language integrated learning: From practice to principles?
Annual Review of Applied Linguistics, 31, 182–204.
Davison, C. (2013). Innovation in assessment: Common misconceptions. In K. Hyland & L. L.
C. Wong (Eds.), Innovation and change in English language education (pp. 263–277).
New York: Routledge.
Davison, C., & Leung, C. (2009). Current issues in English language teacher-based assessment.
TESOL Quarterly, 43, 393–415.
de Bot, K. (2014). The effectiveness of early foreign language learning in the Netherlands. Studies
in Second Language Learning and Teaching. doi:10.14746/ssllt.2014.4.3.2
Edelenbos, P., Johnstone, R., & Kubanek, A. (2007). Languages for the children in Europe:
Published research, good practice and main principles. Retrieved from http://ec.europa.eu/
education/policies/lang/doc/youngsum_en.pdf
Edelenbos, P., & Kubanek-German, A. (2004). Teacher assessment: The concept of ‘diagnostic
competence’. Language Testing, 21, 259–283.
Emery, H. (2012). A global study of primary English teachers’ qualifications, training and career
development. London: British Council.
Enever, J. (Ed.). (2011). ELLiE: Early language learning in Europe. London: British Council.
Enever, J., Moon, J., & Raman, U. (Eds.). (2009). Young learner English language policy and
implementation: International perspectives. Reading, UK: Garnet Education Publishing.
García Mayo, M. P., & García Lecumberri, M. L. (Eds.). (2003). Age and the acquisition of English
as a foreign language. Clevedon, UK: Multilingual Matters.
Garton, S., Copland, F., & Burns, A. (2011). Investigating global practices in teaching English to
young learners. London: British Council.
Haenni Hoti, A., Heinzmann, S., & Müller, M. (2009). “I can you help?” Assessing speaking skills
and interaction strategies of young learners. In M. Nikolov (Ed.), The age factor and early
language learning (pp. 119–140). Berlin, Germany: Mouton de Gruyter.
Hasselgren, A. (2005). Assessing the language of young learners. Language Testing, 22,
337–354.
Housen, A., & Kuiken, F. (2009). Complexity, accuracy and fluency in second language acquisi-
tion. Applied Linguistics, 30(4), 461–473.
Hsieh, C. (2016). Examining content representativeness of a young learner language assessment:
EFL teachers’ perspectives. In M. Nikolov (Ed.), Assessing young learners of English: Global
and local perspectives. New York: Springer.
Hung, Y.-J., Samuelson, B. L., & Chen, S.-C. (2016). The relationships between peer- and self-
assessment and teacher assessment of young EFL learners’ oral presentations. In M. Nikolov
(Ed.), Assessing young learners of English: Global and local perspectives. New York: Springer.
Inbar-Lourie, O., & Shohamy, E. (2009). Assessing young language learners: What is the con-
struct? In M. Nikolov (Ed.), The age factor and early language learning (pp. 83–96). Berlin,
Germany: Mouton de Gruyter.
International schools: The new local. (2014, December 20). The Economist, pp. 83–84.
Jang, E. E. (2014). Focus on assessment. Oxford: Oxford University Press.
Johnstone, R. (2009). An early start: What are the key conditions for generalized success? In
J. Enever, J. Moon, & U. Raman (Eds.), Young learner English language policy and implemen-
tation: International perspectives (pp. 31–42). Reading: Garnet Education Publishing Ltd.
Kahneman, D. (2011). Thinking, fast and slow. New York: Allen Lane/Penguin Books.
Kennedy, C. (2013). Models of change and innovation. In K. Hyland & L. L. C. Wong (Eds.),
Innovation and change in English language education (pp. 13–27). New York: Routledge.
Markee, N. (2013). Contexts of change. In K. Hyland & L. L. C. Wong (Eds.), Innovation and
change in English language education (pp. 28–43). New York: Routledge.
McKay, P. (2006). Assessing young language learners. Cambridge: Cambridge University Press.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Oxford: Blackwell
Publishing.
Mihaljević Djigunović, J. (2012). Early EFL learning in context – Evidence from a country case
study. London: The British Council.
Mihaljević Djigunović, J. (2016). Individual differences and young learners’ performance on L2
speaking tests. In M. Nikolov (Ed.), Assessing young learners of English: Global and local
perspectives. New York: Springer.
Muñoz, C. (Ed.). (2006). Age and the rate of foreign language learning. Clevedon, UK:
Multilingual Matters.
Murphy, V. A. (2014). Second language learning in the early school years: Trends and contexts.
Oxford: Oxford University Press.
Nikolov, M. (2000). Issues in research into early second language acquisition. In J. Moon &
M. Nikolov (Eds.), Research into teaching English to young learners: International perspec-
tives (pp. 21–48). Pécs: University Press Pécs.
Nikolov, M. (Ed.). (2009a). The age factor and early language learning. Berlin/New York: Mouton
de Gruyter.
Nikolov, M. (Ed.). (2009b). Early learning of modern foreign languages: Processes and outcomes.
Clevedon, UK: Multilingual Matters.
Nikolov, M. (2009c). The age factor in context. In M. Nikolov (Ed.), The age factor and early
language learning (pp. 1–38). Berlin, Germany/New York, NY: Mouton de Gruyter.
Nikolov, M. (Ed.). (2010). International Journal of Bilingualism. Special Issue.
Nikolov, M. (2011). Az angol nyelvtudás fejlesztésének és értékelésének keretei az általános iskola
első hat évfolyamán [A framework for developing and assessing proficiency in English as a
foreign language in the first six years of primary school]. Modern Nyelvoktatás, XVII(1), 9–31.
Nikolov, M. (2013, August). Early language learning: Is it child’s play? Plenary talk. EUROSLA
Conference, Amsterdam. Retrieved from http://webcolleges.uva.nl/Mediasite/Play/7883cb9b1cb34f98a21fb37534fc1ec61d
Nikolov, M. (2016). A framework for young EFL learners’ diagnostic assessment: Can do state-
ments and task types. In M. Nikolov (Ed.), Assessing young learners of English: Global and
local perspectives. New York: Springer.
Nikolov, M., & Csapó, B. (2010). The relationship between reading skills in early English as a
foreign language and Hungarian as a first language. International Journal of Bilingualism, 14,
315–329.
Nikolov, M., & Mihaljević Djigunović, J. (2006). Recent research on age, second language acqui-
sition, and early foreign language learning. Annual Review of Applied Linguistics, 26,
234–260.
Nikolov, M., & Mihaljević Djigunović, J. (2011). All shades of every color: An overview of early
teaching and learning of foreign languages. Annual Review of Applied Linguistics, 31, 95–119.
Nikolov, M., & Szabó, G. (2012). Developing diagnostic tests for young learners of EFL in grades
1 to 6. In E. D. Galaczi & C. J. Weir (Eds.), Voices in language assessment: Exploring the
impact of language frameworks on learning, teaching and assessment – Policies, procedures
and challenges, Proceedings of the ALTE Krakow Conference, July 2011 (pp. 347–363).
Cambridge: UCLES/Cambridge University Press.
Nikolov, M., & Szabó, G. (in press). A study on Hungarian 6th and 8th graders’ proficiency in
English and German at dual-language schools. In D. Holló & K. Károly (Eds.), Inspirations in
foreign language teaching: Studies in applied linguistics, language pedagogy and language
teaching in honour of Peter Medgyes. Harlow: Pearson Education.
Papp, S., & Salamoura, A. (2009). An exploratory study into linking young learners’ examinations
to the CEFR (Research Notes, 37, pp. 15–22). Cambridge: Cambridge ESOL.
Papp, S., & Walczak, A. (2016). The development and validation of a computer-based test of
English for young learners: Cambridge English Young Learners. In M. Nikolov (Ed.), Assessing
young learners of English: Global and local perspectives. New York: Springer.
Peng, J., & Zheng, S. (2016). A longitudinal study of a school’s assessment project in Chongqing,
China. In M. Nikolov (Ed.), Assessing young learners of English: Global and local perspec-
tives. New York: Springer.
Pinter, A. (2006). Teaching young language learners. Oxford: Oxford University Press.
Pinter, A. (2011). Children learning second languages. Basingstoke: Palgrave Macmillan.
Pižorn, K. (2009). Designing proficiency levels for English for primary and secondary school
students and the impact of the CEFR. In N. Figueras & J. Noijons (Eds.), Linking to the CEFR
levels: Research perspectives (pp. 87–102). Arnhem, The Netherlands: Cito/EALTA.
Prabhu, N. S. (2009). Teaching English to young learners: The promise and the threat. In J. Enever,
J. Moon, & U. Raman (Eds.), Young learner English language policy and implementation:
international perspectives (pp. 43–44). Reading, UK: Garnet Education Publishing.
Rea-Dickins, P. (2000). Assessment in early years language learning contexts. Language Testing,
17(2), 115–122.
Rhodes, N. C., & Pufahl, I. (2008). Foreign language teaching in U.S. Schools: Results of a
national survey. Retrieved from http://www.cal.org/projects/Exec%20Summary_111009.pdf
Rixon, S. (2013). British Council survey of policy and practice in primary English language teach-
ing worldwide. London: British Council.
Rixon, S. (2016). Do developments in assessment represent the ‘coming of age’ of young learners
English language teaching initiatives? The international picture. In M. Nikolov (Ed.), Assessing
young learners of English: Global and local perspectives. New York: Springer.
Singleton, D. (2014). Apt to change: The problematic of language awareness and language apti-
tude in age-related research. Studies in Second Language Learning and Teaching. doi:10.14746/
ssllt.2014.4.3.9.
Sternberg, R. J., & Grigorenko, E. L. (2002). Dynamic testing: The nature and measurement of
learning potential. Cambridge: Cambridge University Press.
Szpotowicz, M., & Campfield, D. E. (2016). Developing and piloting proficiency tests for Polish
young learners. In M. Nikolov (Ed.), Assessing young learners of English: Global and local
perspectives. New York: Springer.
Wilden, E., & Porsch, R. (2016). Learning EFL from year 1 or year 3? A comparative study on
children’s EFL listening and reading comprehension at the end of primary education. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Do Developments in Assessment Represent
the ‘Coming of Age’ of Young Learners
English Language Teaching Initiatives?
The International Picture
Shelagh Rixon
Abstract This chapter draws upon two pieces of recent research undertaken for the
British Council and in co-operation with Cambridge English concerning the state of
the art of the teaching of English as a Foreign Language at primary school level, and
of assessment of children’s English in particular. It is shown that, while some
advances have been made in curricular planning over the past 15 years in different
parts of the world and hence in target level-setting, the actual practices applied in
assessment are not well-conceived in all places. In addition, the use of assessment
data to improve continuity and coherence in English Language Teaching after tran-
sition from one level of schooling to another remains in most cases an opportunity
missed.
1 Introduction
The age range of learners discussed in this chapter is from 5 to 12 years old, corre-
sponding with the ages between which children attend primary/elementary school
in many countries. The focus is on the teaching of English to young learners (TEYL)
in state rather than private schools.
The teaching of languages to primary school aged children has been described
as one of the greatest areas of educational policy change worldwide in the last
30 years.
S. Rixon (*)
Centre for Applied Linguistics, University of Warwick, Coventry, UK
e-mail: [email protected]
Indeed EYL is often not just an educational project, but also a political and economic one.
A remarkable number of governments talk not only about the need to learn a foreign lan-
guage but of an ambition to make their country bilingual. (Graddol, 2006, pp. 88–91)
It is very well accepted, almost a truism, that attitudes to and practices within
assessment are a strong determinant of how teaching and learning take
place. Many authorities (e.g., Andrews, 2004; Black & Wiliam, 1998a, 1998b;
Henry, Bettinger & Braun, 2006) have suggested that an indispensable way to pro-
mote and sustain an intended educational innovation or improvement, whether at
curriculum or methodological level, is to adjust the assessment system so that it is
coherent with the teaching and its content. Conversely, the best way to thwart
change is to take no accommodating action with regard to assessment. In earlier
times, this was often seen as applying principally to the formal, high-stakes, testing/
examination system. See Rea-Dickins and Scott (2007) for a discussion with regard
to language testing. However, attention to assessment at the classroom level, par-
ticularly “assessment for learning” or AfL (Black & Wiliam, 1998a) has more
recently been shown to have an enormous influence on developing learners’ capac-
ity for self-direction and more autonomous learning. Consideration of the range of
assessment practices in the developing field of teaching English to primary school
aged children is therefore surely of high relevance.
This chapter investigates the stated policies of regional and national educational
authorities as well as the practices and perceptions of selected young learners’ prac-
titioners with regard to the different roles that assessment currently plays in primary
school level English Language Teaching. The focal areas concern its potential roles
regarding quality of teaching and learning, in setting and checking targets and stan-
dards, for coherence between different levels of schooling and, in some contexts, for
justice in allocating scarce educational opportunities. The argument is that a cur-
ricular/teaching innovation in a given context cannot be said to have ‘come of age’
until assessment is well understood and appropriately used at the classroom, local
education authority and national education levels to support the intentions behind
the innovation.
It might be hoped that, near the end of a 30-year or more ‘new wave’ of interest in
the teaching of languages to young children, much would have fallen into place at
the level of a range of recommended practices as well as generally agreed theory.
This, however, cannot be taken for granted. The history of TEYL initiatives over the
past 30 years has often been one of enthusiasm followed by some turbulence and
often disappointment. There has often been more rhetoric on the part of educational
authorities than willingness to put in place tangible support in terms of money and
time for training opportunities for teachers and for the supply or creation of suitable
materials. Planning efforts have also frequently not been equal in energy to the con-
tent of ministerial decrees. See Enever and Moon (2009, pp. 5–20) for a fuller dis-
cussion of these issues. Surveys made near the beginning of the ‘boom’ and in the
more recent past have shown that, time after time, compromises have been made
with EYL initiatives, often, it seems, for the sake of speed of implementation for
narrowly political motives. The main points of strain have frequently been found to
be in the fundamental area of provision and preparation of teachers so that they are
professionally well equipped to carry through the innovation. Rixon, summarising
a survey of the decade from 1990 to 1999, found the following pattern in numerous
state school systems. There was either:
… a relaxation of the official criteria or qualifications for eligibility as a teacher of English
in the primary school system.
or
… an adequate supply of officially qualified teachers but considerable controversy
about whether those teachers were adequately prepared in terms of language and methodol-
ogy. (Rixon, 2000, p. 161)
This uncertainty over teacher supply and quality came in addition to consider-
able fluidity in, or, in some cases, absence of, specifications of syllabus content for
primary-aged learners of English. Such fluidity was not in itself a bad thing, but was
clearly inimical to any attempt to specify and promote assessment instruments
which might, for example, support ongoing monitoring or lead to coherent and
usable summative information on what had been learned at different stages of pri-
mary schooling.
As we have seen above, there was evidence even in 2000, nearly 20 years after the
first stirrings of interest internationally in teaching English to younger children, that
in many contexts EYL was still finding its feet in terms of decisions on curriculum
and methodology and in recruiting or preparing teachers who were confident in the
skills and knowledge they would need to function well in the classroom. Meanwhile,
several strands of practice and thinking in the assessment area had been developing
both in the English language teaching (ELT) world and the general mainstream
educational world. These offered potentially useful approaches that could help tie
together teaching and assessment in order to create more robust and coherent expe-
riences for children learning English in school. However, these developments in
themselves could also be seen as presenting yet more to be taken on board by Young
Learners teachers still developing their new professional roles.
It was only in the late 1990s (e.g., Rea-Dickins & Rixon, 1997, 1999) that the
assessment of the English language learning of primary school aged children
started to be raised by researchers as an area of particular concern, with the different purposes which assessment might serve being spelled out and discussed. Among these, the purposes of monitoring learning, allowing formative
development and providing information to facilitate transition between one level of
schooling and another were highlighted by writers who often had the improvement
of pedagogy high amongst their priorities. For example, the models for assessment
of children’s language development that were deemed by Rea-Dickins and Rixon
in their 1997 chapter to be the most interesting and likely to influence children’s
language learning for the better were mostly derived from work in mainstream UK
schools with children with English as a second language (ESL, now known as
EAL – English as an Additional Language). The techniques used in the main
emphasised classroom assessment, continuous observation and record-keeping,
with concern always for the development of the individual child and thus with a
largely formative purpose.
It was recognised that this mainly classroom based tracking and record-keeping
approach might not be familiar (and might hold little appeal) in contexts and edu-
cational systems which, for selection or other administrative purposes, required
more speedily arrived-at summative results for large numbers of learners. The
assessment events of this latter type might take place at the end of a term or school
year or near the end of primary education. However, it was striking that this sum-
mative style of assessment also seemed to predominate in day-to-day
classroom assessment in many of the EYL contexts that Rea-Dickins and Rixon
were at that time researching. In an international survey involving 122 primary
school teachers of English (Rea-Dickins & Rixon, 1999) 100 % of teachers’ self-
reports gave an account of classroom assessment practices which were exclusively
based on ‘paper and pencil’ written tests and quizzes. This was in spite of the fact
that they also claimed to be focusing mostly on developing speaking and listening
skills.
From the late 1990s to the early 2000s, new editions of standard textbooks on
language testing (e.g., Hughes, 2003) inserted new chapters on assessing chil-
dren. However, the discussion tended to remain at the generic level of principle, with the hunt for ‘child-friendly’ items confined largely to the familiar formats used with older learners. In the early 2000s, there came a welcome departure with the
publication of an account of EYL assessment (Ioannou-Georgiou & Pavlou,
2003) which seemed to consider the area in a completely new way. Refreshingly,
this book started with a persuasive discussion of portfolio-based evidence as a
feasible norm for young learners’ (YL) assessment and only then worked its way
through to child-friendly versions of gradually more conventional and familiar
assessment practices by the end of the book. This was a bold reversal of more
timid accounts.
None of these works, however, included research into specific local understand-
ings and practice in Young Learners assessment. A special issue of the journal
Language Testing (Rea-Dickins, 2000) had addressed this area, albeit with a mainly
European focus. Recently, research into specific contexts has increased. See, for
example, Brumen, Cagran and Rixon (2009) on Slovenia, Croatia and the Czech
Republic and other chapters in this present volume. This type of research serves to
throw light on many of the issues covered in this chapter, in particular the under-
standings and actual practices of teachers regarding assessment compared with the
ideals or the rhetoric to be found at an official level.
The growing interest in EYL assessment by academics and teacher educators
such as those above roughly coincided with interest in younger learners from inter-
national providers of tests and exams aimed at a large-scale market (see Taylor &
Saville, 2002). The aim of providers such as Cambridge English (then Cambridge
ESOL), whose YLE tests were launched in May 1997, was to find ‘child-friendly’
yet practicable ways of assessing large numbers of youngsters and assigning them a
summative grade that was reliable yet meaningful and informative.
However, more influential still in some contexts have been movements in general
educational assessment which affect the whole curriculum and may thereby also
affect what takes place with regard to English. It is worth discussing three recent
major movements in mainstream educational assessment at this point since overall
educational reforms in some contexts may have been influenced by or directly
adopted a version of one of these. In these cases it is likely that the assessment of
English as one curricular subject amongst many will be affected by the general
reform.
The driving force of standards-based assessment is the attempt to ensure that schools
and teachers strive to bring all learners to an acceptable minimum standard of learn-
ing (or beyond) and are held accountable for doing so. The No Child Left Behind
movement in the USA is a striking early example of this as is the UK National
Curriculum with its accompanying standard assessment tasks at the end of primary
schooling. In educational systems using standards-based assessment, local or
national tests aim to reveal the proportion of pupils attaining specified required standards. In this tradition, attainment is typically inferred from test scores and not necessarily directly approached by articulating what a
learner ‘can do’ and setting up a challenge which gives them the opportunity to
demonstrate it by performing using the required skills and functions. Links may be
drawn from test scores to inferred skills and abilities, but this is a controversial area.
Assessment techniques within the performance-based tradition concerning lan-
guage learning typically involve holistic tasks rather than responses to discrete test
items. Role play, challenges involving information gaps and other requirements to
simulate real language use as far as is possible are very common. Assessment judgements are made through observation and scrutiny of output, such as written work in a required genre, and are based on criteria derived from carefully written performance descriptors. Self-assessment and reflection may be involved, and collections of evidence of learning in portfolios may also play a part. The European Language
Portfolio (ELP) in versions available for both older and younger learners (see http://
www.coe.int/t/dg4/education/elp/) is a widely used device not only for collecting
examples of work but for structuring self-assessment. It is directly linked with the
performance descriptors set up by the Common European Framework of Reference
(CEFR, Council of Europe, 2001).
The CEFR (Council of Europe, 2001) is the most prominent example of a frame-
work which can support a performance-based approach to assessment. It has been
pointed out, however (e.g., Jones, 2002), that the descriptors do not in themselves
provide direct specifications for tasks which could form part of an assessment. An
assessment-deviser would need to bring further detail to its “can do” statements and
overall descriptions in order to set up appropriate assessment challenges to elicit a
required performance that will demonstrate what the learner can do. There is also
the issue that the judgement is not a stark ‘yes’ or ‘no’: there is scope for judgements of a candidate’s performance concerning ‘how well’ and ‘how much’ they manage within a specified level.
Although the lower levels of the CEFR may seem to offer appropriate levels of
language challenge for young children, there are some fundamental problems. As
discussed by Hasselgren (2005), we do not currently have a CEFR designed for use
with children, involving domains that are appropriate for them and including
skills and topics that are suited to their cognitive and social development and range
of interests and experiences. Papp and Salamoura (2009) discuss attempts to cali-
brate the Cambridge YLE examinations against the CEFR. An additional issue is
that, in cases where an A1 or A2 level is specified as the end-point for primary
school learning and the children in fact learn English for a number of years, there is
probably a need to subdivide these already modest levels of attainment in order to
be able to give sub-grades for levels of attainment arrived at before the final year of
learning.
The Cambridge English survey (Papp, Chambers, Galaczi & Howden, 2011) was questionnaire-based and covered the area of classroom teaching and assessment in great detail, with responses concerning their own practices from numerous individuals directly involved
professionally with Young Learners of English. The results of this survey are not
publicly available in their entirety, although they will be drawn on in a future volume
on assessing Young Learners in the Cambridge University Press Studies in Language
Testing series (Papp and Rixon, forthcoming). Many thanks therefore go to Cambridge
English for permission to publish summaries of key sections here. Because of the
unavailability of the original document, page references will not be given.
The research interest was on individual perceptions as well as trends in the area
of English language assessment. Much use was made of open response questions to
which individual teachers gave detailed answers.
Respondents worked in private as well as state institutions, a number of them
working in both. In all, 726 valid responses were returned from 55 different coun-
tries, the majority of respondents being from Greece, Italy, Mexico, Romania and
Spain. Of the total sample, about 300 respondents worked mainly with learners in
the 6–11 year old age range which is the subject of the present chapter. The rest
worked with secondary school-aged learners. See Appendix A for the list of coun-
tries covered.
The British Council survey (Rixon, 2013) was undertaken as a follow-up to an ear-
lier survey on the same topic already quoted above (Rixon, 2000). The research
scope of this survey was broader than that of the Cambridge survey in that it took in
overall developments in policy and practice such as starting ages for English, avail-
ability of pre-school English, numbers of hours of English per year and over a whole
primary school career, teaching materials and teacher qualifications and eligibility
as well as relations between the public and private sectors. Because of the growing
importance of assessment in Young Learners teaching, a special section of the sur-
vey questionnaire was devoted to policies regarding assessment.
Returns were mostly via an on-line questionnaire. The purpose of the survey was
to collect data on policy and officially-supported practices in as many countries and
regions as possible worldwide. In contrast with the Cambridge survey, this was a
global ‘facts and figures’ exercise rather than an investigation into individual views
and practices. It was thus felt appropriate not to make use of the questionnaire with
a large number of individuals but to identify one, or at most two, well-informed
sources for each context. Authoritative informants on local policy and practice in a
country or region were identified via the local British Council Offices. Responses
were received from 64 separate countries or regions. See Appendix B for the list of
contexts covered.
In many countries and regions, thanks to an increase in on-line information,
much of the statistical data requested could be obtained and checked by reference to
official websites. In cases where the answers were based not on official data but on
an estimate or on the respondent’s personal experience, the respondents were asked
to state the degree of confidence with which they were answering. The data reported are thus claimed to be of as high quality and as reliable as possible; in cases where they are not independently verified, this is made transparent to the reader.
There is evidence in the 2013 British Council survey that the tensions noted in 2000, between enthusiasm for innovation and limited concern for practical provision, have continued. Teacher supply and/or quality was judged adequate in only 17 (27 %) of the contexts surveyed. In spite of the difficulties in teacher supply, the most frequently-reported
contexts. In spite of the difficulties in teacher supply, the most frequently-reported
recent policy change was the lowering of the age at which English was to be taught
compulsorily in the primary school. Some verbatim comments from respondents
illustrate issues encountered with keeping up or catching up with current demand
for adequately trained teachers, with, for example, both the Taiwanese and the
Israeli respondents complaining that teachers of English to primary school children
often needed to be drawn from teachers specialising in other subjects.
However, the survey also revealed some cases in which, in spite of continued
enthusiasm for lowering the age at which English could be begun, more realistic
attitudes were evident.
There was a change introduced in the Regulations of the Cabinet of Ministers as to the age
of starting the 1st foreign language – moving it to Year 1 (age 7), but it has been decided to
wait with this change for a couple of years due to lack of funding (Latvia). (Rixon, 2013,
p. 148)
In addition, there were cases where, in spite of problems reported at the time of
response, planning was in place and attempts to improve teacher preparation for the
future were evident. For example, in France, teaching personnel from numerous
different backgrounds were still being used at the time of response. This had been
an issue highlighted for France even as far back as the earlier, 2000, survey. However,
the comment in the more recent survey showed that steps had been taken to ensure
the supply of better qualified teachers in the future.
This is temporary as it is now compulsory for all new teachers to graduate from teacher
training college (IUFM) with the required level of the foreign language. They will receive
a certificate called CLES (Certificat de Langue de l’Enseignement Supérieur). This certifi-
cate certifies language competence only not methodology (France). (Rixon, 2013, p. 108)
It is notable that here the emphasis is on the language levels of the graduating
teachers rather than on the need also to cater for their preparation in appropriate
language teaching methodology. However, when resources are stretched this seems
a pragmatic if not ideal priority. It is one which remains widespread across other
contexts. In a climate in which even language teaching methodology is rarely the
subject of teacher preparation, one has then to ask how likely it is for new recruits
to receive specialist training in appropriate reasons for, and means of assessment of,
children’s language learning.
One contention of this chapter is that the degree to which teachers are confident
all-round ELT professionals, in ways which go beyond their own language profi-
ciency, has huge implications for the nature and quality of language learning assess-
ment. If elementary school teachers in many contexts are still learning to become
fully skilled teachers of English, they might reasonably be expected still to be find-
ing their way as implementers, informed critics or devisers of English language
assessment approaches.
The Rea-Dickins and Rixon survey (1999) cited above showed teachers implement-
ing class tests in a way that did not chime with their stated teaching priorities: 100 %
of the sample of 122 teachers from nearly 20 countries stated that their main aims
were to promote listening and speaking but none of them used class tests involving
these skills. The Cambridge English survey of 12 years later involved more coun-
tries and teachers who came from private school as well as state school teaching
backgrounds (although many had more than one job and some taught in both types
of institution). Their self-reports concerning knowledge about and use of different
assessment formats suggested that there was an awareness of a much broader variety
of possibilities for assessment and of the different purposes it might serve.
Amongst the nearly 300 teachers of 6–11 year olds who responded to the
Cambridge survey, the following types of assessment were selected as significant
and actually used. These are listed in rank order according to the number of
responses for each one:
1. Tests produced by the class teacher
2. Tests given in the textbook used in class
3. Collection of students’ work in a file or portfolio
4. Observation and written description of learner performance
5. Standardised tests and examinations
6. Self-assessment
7. Peer-assessment
The picture presented by the data from these teachers is of a good spread of
actually-used assessment types per teacher. Out of just under 300 teachers, the numbers choosing these top seven assessment types ranged narrowly between around 200 and 125. The top two choices of teacher-produced or textbook-supplied tests –
similar perhaps to the written classroom tests used by the teachers in the Rea-
Dickins and Rixon survey of 1999 – were made by approximately 200 respondents
each, with the rest, apart from peer-assessment, at nearly the same level. Peer-
assessment received the lowest number of selections, being chosen by approxi-
mately 125.
The British Council survey (Rixon, 2013) supports the discussion of assessment
from the more top-down perspective of national or regional policy. Officially-
endorsed assessment principles and skills may or may not already have percolated down to the classroom level in a given context, but first signs of coming of age at a national or regional level may be traced when officialdom puts in place an assessment policy that is likely to add clarity about the standards expected, or that is presented as intended to bring about a positive impact on classroom teaching.
The following themes regarding assessment explored by the survey will be
discussed:
1. Standards-setting and the growing role of the CEFR
2. Assessment as an official requirement in EYL teaching in primary school
3. Means of assessing if standards are reached
4. Consequences and lack of consequences of assessment
5. The role of assessment in facilitating transitions between school levels.
However, in many contexts, once the first cohorts to start English in the early years
of schooling began to reach the end of primary school, attitudes and policies often
changed. Assessment of attainment at the end of primary school became required in
Germany in the later 1990s in the same way as it had in the early 1990s in France,
another context in which in the early years of the innovation no assessment had been
required. A similar change took place in Italy in 1997 when a section was added to
the school report form concerning the child’s attainments and progress in learning a
foreign language (in which English was the most popular choice). The present author
remembers attending a number of in-service courses in France, Germany and Italy
designed to support teachers in their new assessment responsibilities.
In the British Council survey, there were reports from 11 out of 64 (17 %) of the
locations surveyed that there had been recent policy changes concerning the intro-
duction of assessment. In addition to these 11 cases, we should not forget that in a
number of other countries, such as those mentioned immediately above, assessment
of English had already been well established some years previously. A later ques-
tion in the survey allowed for respondents to make comments and explain more
about how assessment was carried out.
Before discussing means by which end of primary school assessment is carried out,
it should be remembered, as noted above, that in a large number of contexts (28;
44 % of the sample) it was stated that there was no requirement for formal assess-
ment of English language learning at the end of primary school. This involves a
number of contexts in which standards have been set but there are no formal means
by which their attainment is ascertained.
Where assessment at the end of primary school takes place, this may be by formal
testing but it may also be by a means devised within the school or following a frame-
work supplied from outside but implemented by teachers. France provides an exam-
ple of a recently introduced highly systematic application of this latter practice:
In France, there is continuous assessment from Year 3 to Year 5. At the end of year 5, teach-
ers complete an evaluation (Palier 2 CM2 La pratique d’une langue vivante étrangère)
which covers five skills areas: oral interaction, understanding, individual speaking with no
interaction, e.g. reproducing a model, a song, a rhyme, a phrase, reading aloud, giving a
short presentation e.g. saying who you are and what you like. (France). (Rixon, 2013,
p. 107)
Other contexts favour the more conservative means of formal testing which may
also be linked with official evaluation of school success. This is usually as a result
of English, as one curricular subject among many, being included in a wider educa-
tional policy. Russia and Bahrain, for example, were reported as having instituted
new systems of formal assessment across the curriculum at the end of primary
schooling. The stated purpose in both cases was to monitor and evaluate school performance.
In Taiwan, assessment specific to English is being implemented at a local level
with, it seems, a diagnostic purpose as well as a school evaluation purpose.
Cities and counties in Taiwan are now developing and administering their own English
proficiency tests at the primary level. The purpose is to assess the effectiveness of English
instruction and to identify those in need of remedial teaching. Assessment is mid-term and
final, starting from the third grade. (Taiwan) (Rixon, 2013, p. 224)
We have seen above that, in contexts in which assessment takes place, it serves different purposes, including the wish to monitor and evaluate school performance. In many of these cases it was not clear how draconian the consequences of failing to reach adequate standards would be.
Table 1 Responses in the British Council Survey concerning transition from primary school to the
next level of education

                                           Always  Often  Quite  Sometimes  Rarely  Never  I don't know/
                                                          often                            no info
Teachers from the two levels of               2      1      2       8         14     24        13
schooling meet to discuss the transition
Information on children's levels from         7      0      3       4          8     25        17
externally provided formal testing at
the end of Primary School is passed to
the new school
Information on children's levels from        14      4      3       4          5     21        13
school-based assessment is passed to
the new school
The British Council survey (Rixon, 2013, pp. 39–40) aimed to investigate ways
in which assessment data is used or fails to be used in order to promote coordination
between primary/elementary school and secondary school level language learning.
Table 1 shows the numbers of responses of each type to the three questions below
regarding assessment and transition.
1. Do primary and secondary school English teachers meet to discuss pupils mov-
ing to secondary school?
2. Is school-based assessment information passed to the next school?
3. Is information from externally provided formal testing passed to the next school?
The three questions covered three levels of possible formality with which information might be passed from one school to the next. It seems from the results of this part of the survey that the opportunity to make good use of information on children’s attainments in English, whether through assessment results or informal data, was often missed (yet again).
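To make concrete how often this opportunity was missed, a short calculation over the Table 1 counts (a minimal sketch; the row labels abbreviate the table's wording) gives the share of the 64 contexts reporting each practice as rare, absent, or unknown:

```python
# Counts per row of Table 1, in column order: Always, Often, Quite often,
# Sometimes, Rarely, Never, I don't know/no info.
table1 = {
    "teachers from the two levels meet": [2, 1, 2, 8, 14, 24, 13],
    "external test results passed on": [7, 0, 3, 4, 8, 25, 17],
    "school-based assessment passed on": [14, 4, 3, 4, 5, 21, 13],
}

for practice, counts in table1.items():
    seldom = sum(counts[-3:])  # Rarely + Never + don't know/no info
    total = sum(counts)        # each row sums to the 64 contexts surveyed
    print(f"{practice}: {seldom}/{total} = {seldom / total:.0%}")

# teachers from the two levels meet: 51/64 = 80%
# external test results passed on: 50/64 = 78%
# school-based assessment passed on: 39/64 = 61%
```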
6 Limitations
The data in this chapter come mostly from surveys in which summaries of prevail-
ing practices are given by experts and experienced teachers, and there has been no
opportunity for analysis or discussion of materials used in assessment or of the
experiences and understandings of ordinary teachers and their pupils. Although
some practices may be shared or imitated across national boundaries and instru-
ments such as the CEFR may be influential, it does not make sense to seek trends
on an international scale.
As pointed out above, a chapter based mainly on survey data cannot make detailed
recommendations for assessment practice in a given context. However, from the
discussion it may be seen that the signs that EYL initiatives are on their way to com-
ing of age with regard to assessment are rather few. As with much in the field of the
teaching of languages to young learners, statements of the ideal in good practice in
the learning/assessment bond often outstrip the reality. It was to be expected that,
given the global nature of the two main surveys quoted, there would be a wide range
of practice found, much of which would be affected by the beliefs and traditions of
local teaching and assessment cultures. However, in some contexts, local authorities
and experts are introducing new approaches which may require a considerable revi-
sion of mind-set on the part of teachers and public alike. The research reported on
in this chapter also suggests a wide range of technical assessment expertise, from
contexts in which assessment practices may be haphazard or occasionally diametri-
cally at odds with the stated pedagogic aims of the teaching programme to those in
which assessment seems to be well understood at both an official and a classroom
practitioner level.
The following key points seem to have emerged:
1. Teachers who in many contexts are not yet fully established as language teachers may be expected to lag a little in classroom language assessment practices. More, and higher quality, pre- and in-service teacher education on the topic is needed.
2. There is a notable increase in the setting of target levels, but not always provision of means to ascertain whether those levels are in fact attained. There is an urgent need for assessment instruments that are a good match with the targets.
3. Assessment instruments provided by specialists for regional/national use have
increased since 1999/2000 in terms of quantity. This is a positive development
provided that these instruments do in fact match the stated aims.
4. Sharing of assessment information at school transition remains patchy. This is an
area in which all but a few countries need to take serious stock and devise means
to improve continuity and coherence.
There seems to be much that could be learned now and in the near future from
detailed qualitative accounts of the development in assessment of children’s English
language learning in some of the contexts from which the information in this chap-
ter was collected. It is to be hoped that publication of close-up, localised studies of
assessment practices with young learners will be on the increase.
Appendices
References
Andrews, S. (2004). Washback and curriculum innovation. In L. Cheng, Y. Watanabe, & A. Curtis
(Eds.), Washback in language testing: Research contexts and methods (pp. 37–52). Mahwah, NJ: Lawrence Erlbaum Associates.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003). Assessment for learning:
Putting it into practice. Buckingham: Open University Press.
Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in Education,
5(1), 7–74.
Black, P., & Wiliam, D. (1998b). Inside the black box: Raising standards through classroom
assessment. London: King’s College.
Brumen, M., Cagran, B., & Rixon, S. (2009). Comparative assessment of young learners’ foreign
language competence in three Eastern European countries. Educational Studies, 35(3),
269–295.
Burstall, C., Jamieson, M., Cohen, S., & Hargreaves, M. (1974). Primary French in the balance.
Slough: NFER Publishing Company.
Butler, Y. G., & Lee, J. (2010). The effects of self-assessment among young learners of English. Language Testing, 27(1), 5–31.
Council of Europe. (2001). Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge: Cambridge University Press.
Davison, C. (2007). Views from the chalkface: English language school-based assessment in Hong
Kong. Language Assessment Quarterly, 4(1), 37–68.
Education, Audiovisual & Culture Executive Agency. (2008). Pri-sec-co. Primary and secondary
continuity in foreign language teaching. Project no. 134029-LLP-1-2007-1-DE-COMENIUS-
CMP. Retrieved from http://eacea.ec.europa.eu/llp/project_reports/documents/comenius/all/
com_mp_134029_prisecco.pdf
Enever, J., & Moon, J. (2009). New global contexts for teaching primary ELT: Change and chal-
lenge. In J. Enever, J. Moon, & U. Raman (Eds.), Young learner English language policy and
implementation: International perspectives (pp. 5–20). Reading: Garnet.
European Language Portfolio. Retrieved from http://www.coe.int/t/dg4/education/elp/
Graddol, D. (2006). English next. London, UK: British Council.
Hasselgren, A. (2005). Assessing the language of young learners. Language Testing, 22(3),
337–354.
Henry, A. K., Bettinger, E., & Braun, M. K. (2006). Improving education through assessment,
innovation, and evaluation. Cambridge, MA: American Academy of Arts and Sciences.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University
Press.
Hunt, M. J., Barnes, A., Powell, B., & Martin, C. (2008). Moving on: The challenges for foreign
language learning on transition from primary to secondary school. Teaching and Teacher
Education, 24(4), 915–926.
Ioannou-Georgiou, S., & Pavlou, P. (2003). Assessing young learners. Oxford: Oxford University
Press.
Jones, N. (2002). Relating the ALTE framework to the common European framework of reference.
In J. C. Alderson (Ed.), Case studies on the use of the common European framework of refer-
ence (pp. 167–183). Strasbourg: Council of Europe Publishing.
Kubanek-German, A. (2000). Early language programmes in Germany. In M. Nikolov & H. Curtain
(Eds.), An early start: Young learners and modern languages in Europe and beyond (pp. 59–70).
Strasbourg: Council of Europe Publishing.
McKay, P. (2005). Research into the assessment of school-age language learners. Annual Review
of Applied Linguistics, 25, 243–263.
Papp, S., Chambers, L., Galaczi, E., & Howden, D. (2011). Results of Cambridge ESOL 2010
survey on YL assessment. University of Cambridge ESOL Examinations: Cambridge ESOL
internal document VR1310.
Papp, S., & Salamoura, A. (2009). An exploratory study into linking young learners’ examinations
to the CEFR. Research Notes, 37, 15–22.
Rea-Dickins, P. (Ed.). (2000). Assessing young language learners [special issue]. Language
Testing, 17(2).
Rea-Dickins, P., & Rixon, S. (1997). The assessment of young learners of English as a foreign
language. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education
(Language testing and assessment, Vol. 7, pp. 151–161). Dordrecht, The Netherlands: Kluwer
Academic Publishers.
Rea-Dickins, P., & Rixon, S. (1999). Assessment of young learners: Reasons and means. In
S. Rixon (Ed.), Young learners of English: Some research perspectives (pp. 89–101). Harlow:
Pearson Education.
Rea-Dickins, P., & Scott, C. (2007). Washback from language tests on teaching, learning and pol-
icy: Evidence from diverse settings. Investigating washback in language testing and assess-
ment [Special Issue]. Assessment in Education: Principles, Policy and Practice, 14(1), 1–7.
Rixon, S. (2000). Collecting eagle’s eye data on the teaching of English to young learners: The
British Council overview. In J. Moon & M. Nikolov (Eds.), Research into teaching English to
young learners (pp. 153–167). Pécs: University of Pécs Press.
Rixon, S. (2013). British Council survey of policy and practice in primary English Language
Teaching worldwide. Retrieved from http://www.teachingenglish.org.uk/article/
british-council-survey-policy-practice-primary-english-language-teaching-worldwide
Taylor, L., & Saville, N. (2002). Developing English language tests for young learners. Research
Notes, 7, 2–5.
The “Global Scale of English Learning Objectives for Young Learners”: A CEFR-Based Inventory of Descriptors
Abstract This chapter presents an ongoing project to create the “Global Scale of
English Learning Objectives for Young Learners” – CEFR-based functional descrip-
tors ranging from below A1 to high B1 which are tailored to the linguistic and com-
municative needs of young learners aged 6–14. Building on the CEFR principles, a
first set of 120 learning objectives was developed by drawing on a number of ELT
sources such as ministry curricula and textbooks. The learning objectives were then
assigned a level of difficulty in relation to the CEFR and the Global Scale of English
and calibrated by a team of psychometricians using the Rasch model. The objectives
were created and validated with the help of thousands of teachers, ELT authors, and
language experts worldwide, with the aim of providing a framework to guide learning, teaching, and assessment practice at primary and lower-secondary levels.
1 Introduction
awareness and helps to convey societal values such as openness to diversity and
respect” (European Commission, 2011, p. 7).
It is generally believed that early foreign language (FL) introduction provides
substantial benefit to both individuals (in terms of linguistic development, social
status, and opportunities) and governments (as a symbol of prestige and economic
drive). However, some concerns have been raised about the dangers of inadequate
preparation and limited knowledge about who young learners are, how they develop,
and what they need. This has led some researchers to argue against the validity of
“the earlier the better” hypothesis. Among the most common arguments against this
principle are: (a) learning is not exclusively determined by age but also by many
other factors, e.g., the effectiveness of teaching; and (b) younger learners have an
imprecise mastery of their L1 and poorer cognitive skills in comparison to older
learners. Studies on the age factor (e.g., Lightbown & Spada, 2008) have shown
that, at least in the early stages of second language development, older learners
progress faster than younger ones, which calls into question the benefit of the early introduction
of an FL in the curriculum. Other studies (e.g., Singleton, 1989), however, have
argued that early language learning involves implicit learning and leads to higher
proficiency in the long run. There is indeed some evidence to support the hypothesis
that those who begin learning a second language in childhood in the long run
generally achieve higher levels of proficiency than those who begin in later life
(Singleton, 1989, p. 137), whereas there is no actual counter-evidence to disprove
the hypothesis.
It is worth highlighting that “the earlier the better” principle is mainly questioned
in FL contexts, whereas several studies on bilingual acquisition show great benefits
for children who learn two linguistic systems simultaneously (Cummins, 2001).
Another major concern among TEYL educators and stakeholders is the lack of
globally (or widely) accepted guidelines to serve as a reference for standard setting.
Although there is some consensus on who young learners are and how their profi-
ciency develops at different cognitive stages, there seems to be a lack of consistency
in practices around the world. According to Inbar-Lourie and Shohamy (2009,
pp. 93–94, cited in Nikolov & Szabó, 2012, p. 348), early programmes range from
awareness raising to language focus programmes and from content-based curricula
to immersion. It appears to be particularly problematic to develop a global assess-
ment which fits the richness of content aimed at young learners of different ages and
with different learning needs worldwide. While the CEFR has become accepted as
the reference for teaching and assessment of adults in Europe, different language
institutions have produced different, and sometimes conflicting, interpretations of
what the different levels mean. Moreover, there is no single document establishing
a common standard for younger learners, but rather several stand-alone projects that
try to align content to the CEFR or to national guidelines (e.g., Hasselgren,
Kaledaité, Maldonado-Martin, & Pizorn, 2011). Pearson’s decision to develop a
CEFR-based inventory of age-appropriate functional descriptors was motivated by
the awareness of (1) the lack of consensus on standards for young learners and
(2) the consequent need for a more transparent link between instructional and
assessment materials, on the one hand, and teaching practices, on the other.
Although it is not the purpose of the present study to provide a detailed picture
of all aspects of TEYL, we will briefly touch upon some of the main issues related
to its implementation (see Nikolov & Curtain, 2000 for further details). In the first
section of this chapter we present the heterogeneous and multifaceted reality of
TEYL and discuss the need for standardisation. We outline the linguistic, affective
and cognitive needs which characterize young learners. This brief overview is
intended to provide the reader with some background on the current situation of
TEYL and to support our arguments for the need of a set of descriptors for young
learners. In the second section we discuss the limitations of the CEFR as a tool to
assess young learners. We also describe the reporting scale used at Pearson, the
Global Scale of English (henceforth GSE; Pearson, 2015a), which is aligned to the
CEFR. Then, we move to the main focus of our paper and explain how we devel-
oped the learning objectives by extending the CEFR functional descriptors and how
we adapted them to the specific needs of a younger audience. Our descriptor set is
intended to guide content development at primary and lower-secondary levels and
to serve as a framework for assessment for EFL learners aged 6–14, at CEFR levels from below A1 to high B1. The last section discusses the contribution of our
paper to the research on young learners and briefly mentions some issues related to
assessment.
One of the major concerns related to TEYL is the absence of globally agreed and
applied standards for measuring and comparing the quality of teaching and assess-
ment programmes. Nikolov and Szabó (2012) mention a few initiatives aimed at
adapting the CEFR to young learners’ needs and examinations, along with their
many challenges. According to Hasselgren (2005), the wide diffusion of the
European Language Portfolio checklists developed by the Council of Europe (2014)
for young learners has shown the impact of the CEFR on primary education.
However, a glimpse into the different local realities around the world reveals a cha-
otic picture. Consider the obvious variety of foreign language programmes across
Europe in terms of starting age, hours of instruction, teachers’ proficiency in the
foreign language, teachers’ knowledge of TEYL, and support available to them
(McKay, 2006; Nikolov & Curtain, 2000). Although there may be arguments for
using different methods, approaches, and practices, a problem arises when no or
little effort is made to work toward a common goal. Because of the absence of
agreed standards, even within national education systems, existing learning, teach-
ing and assessment resources are extremely diverse, leading to a lack of connected-
ness and resulting inefficacy. The implementation of a standard is therefore needed
According to McKay (2006), young language learners are those who are learning a
foreign or second language and who are doing so during the first 6 or 7 years of
formal schooling. In our work we extend the definition to cover the age range from
6 to 14, the age at which learners are expected to have attained cognitive maturity.
In our current definition, the pre-primary segment is excluded and age ranges are
not differentiated. In the future, however, we may find it appropriate to split learners
into three groups:
1. Entry years age, usually 5- or 6-year-olds: teaching often emphasizes oral skills
and sometimes also focuses on literacy skills in the children’s first and foreign
language
2. Lower primary age, 7–9: approach to teaching tends to be communicative with
little focus on form
3. Upper primary/lower secondary age, 10–14: teaching becomes more formal and
analytical.
In order to develop a set of learning objectives for young learners, a number of
considerations have been taken into account.
– Young learners are expected to learn a new linguistic and conceptual system
before they have a firm grasp of their own mother tongue. McKay (2006) points
out that, in contrast to their native peers who learn literacy with well-developed
oral skills, non-native speaker children may bring their L1 literacy background
but with little or no oral knowledge of the foreign language. Knowledge of L1
literacy can facilitate or hinder learning the foreign language: whilst it helps
learners handle writing and reading in the new language, a different script may
indeed represent a disadvantage. In order to favour the activation of the same
mechanisms that occur when learning one’s mother tongue, EFL programmes
generally focus on the development of listening and speaking first and then on
reading and writing. The initial focus is on helping children familiarize them-
selves with the L2’s alphabet and speech sounds, which will require more or less
effort depending on the learners’ L1 skills and on the similarity between the
target language and their mother tongue. The approach is communicative and
tends to minimize attention to form. Children’s ability to use English will be
affected by factors such as the consistency and quality of the teaching approach,
the number of hours of instruction, the amount of exposure to L2, and the oppor-
tunity to use the new language. EFL young learners mainly use the target lan-
guage in the school context and have a minimal amount of exposure to the foreign
language. Their linguistic needs are usually biased towards one specific experi-
ential domain, i.e. interaction in the classroom. In contrast, adolescents and adult
learners are likely to encounter language use in domains outside the classroom.
– The essentials for children’s daily communication are not the same as for adults.
Young children often use the FL in a playful and exploratory way (Cazden, 1974
cited in Philp, Oliver & Mackey, 2008, p. 8). What constitutes general English
for adults might be irrelevant for children (particularly the youngest learners)
who talk more about topics related to the here and now, to games, to imagination
(as in fairy tales) or to their particular daily activities. The CEFR (Council of Europe, 2001, p. 55)
states that children use language not only to get things done but also to play and
cites examples of playful language use in social games and word puzzles.
– The extent to which personal and extra-linguistic features influence the way chil-
dren approach the new language and the impact of these factors are often under-
estimated (in this regard, see Mihaljević Djigunović, 2016, in this volume):
learning and teaching materials rarely make an explicit link between linguistic
and cognitive, emotional, social and physical skills.
Children experience continuous growth and have different skills and needs at
different developmental stages. The affective and cognitive dimensions, in particu-
lar, play a more important role for young learners than for adults, implying a greater
degree of responsibility on the part of parents, educators, schools, and ministries of
education. One should keep in mind that, because of their limited life experience, young learners are more individual in their interests and preferences than older learners are. Familiar and enjoyable contexts and topics associated with children’s
daily experience foster confidence in the new language and help prevent them from
feeling bored or tired; activities which are not contextualised and not motivating
inhibit young learners’ attention and interest. From a cognitive point of view, teach-
ers should not expect young learners to be able to do a task beyond their level. Tasks
requiring metalanguage or manipulation of abstract ideas should not come until a
child reaches a more mature cognitive stage. Young learners may easily understand
words related to concrete objects but have difficulties when dealing with abstract
ideas (Cameron, 2001, p. 81). Scaffolding can support children during their growth
to improve their cognition-in-language and to function independently. In fact, children are dependent upon the support of a teacher or other adult, not only to reformulate the language used, but also to guide them through a task in the most effective
way. Vygotsky’s (1978) notion of the teacher or “more knowledgeable other” as a
guide to help children go beyond their current understanding to a new level of
understanding has become a foundational principle of child education: “what a
child can do with some assistance today she will be able to do by herself tomorrow”
(p. 87). The implication of this for assessing what young learners can do in a new
language has been well expressed by Cameron (2001, p. 119):
Vygotsky turned ideas of assessment around by insisting that we do not get a true assess-
ment of a child’s ability by measuring what he can do alone and without help; instead he
suggested that what a child can do with helpful others both predicts the next stage in learn-
ing and gives a better assessment of learning.
The above brief overview of the main characteristics of young learners shows the
need for learning objectives that are specifically appropriate for young learners.
Following the principles laid out in the CEFR, we created such a new, age-
appropriate set of functional descriptors. Although adult and young learners share a
common learning core, only a few of the original CEFR descriptors are suitable for
young learners.
Below we first discuss the limitations of the CEFR as a tool to describe young
learners’ proficiency and present our arguments for the need to complement it with
more descriptors across the different skills and levels. Then, we present the Global
Scale of English, a scale of English proficiency developed at Pearson (Pearson,
2015a). This scale, which is linearly aligned to the CEFR scale, is the descriptive
reporting scale for all Pearson English learning, teaching, and assessment
products.
The CEFR (Council of Europe, 2001) has acquired the status of the standard refer-
ence document for learning, teaching, and assessment practices in Europe (Little,
2006) and many other parts of the world. It is based on a model of communicative
language use and offers reference levels of language proficiency on a six-level scale
distinguishing two “Basic” levels (A1 and A2), two “Independent” levels (B1 and
B2), and two “Proficient” levels (C1 and C2). The original Swiss project (North,
2000) produced a scale of nine levels, adding the “plus” levels: A2+, B1+ and B2+.
The reference levels should be viewed as a non-prescriptive portrayal of a learner’s
language proficiency development. A section of the original document published in
2001 explains how to implement the framework in different educational contexts
and introduces the European Language Portfolio, the personal document of a
learner, used as a self-assessment instrument, the content of which changes accord-
ing to the target groups’ language and age (Council of Europe, 2001).
The CEFR has been widely adopted in language education (Little, 2007) acting
as a driving force for rigorous validation of learning, teaching, and assessment prac-
tices in Europe and beyond (e.g., CEFR-J, Negishi, Takada & Tono, 2012). It has
been successful in stimulating a fruitful debate about how to define what learners
can do. However, since the framework was developed to provide a common basis to
describe language proficiency in general, it exhibits a number of limitations when
implemented to develop syllabuses for learning in specific contexts. The CEFR
provides guidelines only. We have used it as a starting point to create learning
objectives for young learners, in line with the recommendations made in the
original CEFR publication:
In accordance with the basic principles of pluralist democracy, the Framework aims to be
not only comprehensive, transparent and coherent, but also open, dynamic and non-
dogmatic. For that reason it cannot take up a position on one side or another of current theo-
retical disputes on the nature of language acquisition and its relation to language learning,
nor should it embody any one particular approach to language teaching to the exclusion of
all others. Its proper role is to encourage all those involved as partners to the language learn-
ing/teaching process to state as explicitly and transparently as possible their own theoretical
basis and their practical procedures. (Council of Europe, 2001, p. 18)
The CEFR, however, has some limitations. Its levels are intended as a general,
language-independent system to describe proficiency in terms of communicative
language tasks. As such, the CEFR is not a prescriptive document but a framework
for developing specifications, for example the Profile Deutsch (Glaboniat, Müller,
Rusch, Schmitz & Wertenschlag, 2005). The CEFR has received some criticism for
its generic character (Fulcher, 2004) and some have warned that a non-unanimous
interpretation has led to its misuse and to the proliferation of too many different
practical applications of its intentions (De Jong, 2009). According to Weir (2005,
p. 297), for example, “the CEFR is not sufficiently comprehensive, coherent or
transparent for uncritical use in language testing”. In this respect, we acknowledge
the invaluable contribution of the CEFR as a reference document to develop specific
syllabuses and make use of the CEFR guidelines as the basis on which to develop a
set of descriptors for young learners.
A second limitation in the context of YL is that the framework is adult-centric
and does not really take into account learners in primary and lower-secondary edu-
cation. For example, many of the communicative acts performed by children at the
lower primary level lie at or below A1, but the CEFR contains no descriptors below
A1 and only a few at A1. Whilst the CEFR is widely accepted as the standard for
adults, its usefulness for teaching and assessing young learners is more limited and presents more challenges. We therefore regard the framework as not entirely suitable for describing
young learners’ skills and the aim of this project is to develop a set of age-appropriate
descriptors.
Thirdly, the CEFR levels provide the means to describe achievement in general
terms, but are too wide to track progress over limited periods of time within any
learning context. Furthermore, the number of descriptors in the original CEFR
framework is rather limited in three of the four modes of language use (listening,
reading, and writing), particularly outside of the range from A2 to B2. In order to
describe proficiency at the level of precision required to observe progress realisti-
cally achievable within, for example, a semester, a larger set of descriptors, covering
all language modes, is needed.
[Fig. 2: “A learner at 25 on GSE” – likelihood of success (0–100 %) plotted against GSE task difficulty (10–90)]
The GSE thus provides a more fine-grained description of learners’ levels, and it offers the potential of more precise measurement of progress than is possible with the CEFR itself. The CEFR, by contrast, consists of six main levels to describe increasing proficiency and defines clear cut-offs between levels.
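To make the relationship concrete, the sketch below represents the GSE as a granular 10–90 scale partitioned into CEFR bands. The band boundaries are an assumption for illustration only: they are consistent with the GSE values quoted later in this chapter (11 and 19 labelled below A1, 24 labelled A1), but they are not taken from the chapter itself.

```python
# Assumed, illustrative alignment of GSE values (10-90) to CEFR bands.
CEFR_BANDS = [
    ("below A1", 10, 21), ("A1", 22, 29), ("A2", 30, 35), ("A2+", 36, 42),
    ("B1", 43, 50), ("B1+", 51, 58), ("B2", 59, 66), ("B2+", 67, 75),
    ("C1", 76, 84), ("C2", 85, 90),
]

def cefr_for(gse: int) -> str:
    """Return the (assumed) CEFR band containing a given GSE value."""
    for band, low, high in CEFR_BANDS:
        if low <= gse <= high:
            return band
    raise ValueError("GSE values run from 10 to 90")

print(cefr_for(24))  # 'A1', matching a listening descriptor quoted later
```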
We should point out that learning a language is not a sequential process since
learners might be strong in one area and weak in another. But what does it mean
then to be, say, 25 on the GSE? It does not mean that learners have mastered every
single learning objective for every skill up to that point. Neither does it mean that
they have mastered no objectives at a higher GSE value. The definition of what it
means to be at a given point of proficiency is based on probabilities. If learners are
considered to be 25 on the GSE, they have a 50 % likelihood of being capable of
performing all learning objectives of equal difficulty (25), a greater probability of
being able to perform learning objectives at a lower GSE point, such as 10 or 15,
and a lower probability of being able to cope with more complex learning objec-
tives. The graphs below show the probability of success along the GSE of a learner
at 25 and another learner at 61 (Figs. 2 and 3).
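A minimal sketch of this probability model follows, assuming the Rasch-style logistic mentioned in the abstract. The scale factor converting GSE points into logits is an illustrative assumption, since the chapter does not give it, but the defining property holds regardless: at zero gap between learner and objective, the probability is exactly 50 %.

```python
import math

def p_success(learner: float, objective: float, scale: float = 0.1) -> float:
    """Rasch-style probability that a learner at one GSE point succeeds on a
    learning objective calibrated at another. `scale` is an assumed constant
    converting GSE points into logits."""
    return 1.0 / (1.0 + math.exp(-scale * (learner - objective)))

# The learner at 25 from Fig. 2:
for difficulty in (10, 15, 25, 40, 61):
    print(f"objective at GSE {difficulty}: {p_success(25, difficulty):.0%}")

# Objectives at the learner's own level (25) come out at exactly 50 %; easier
# ones (10, 15) at higher probabilities; harder ones (40, 61) at lower ones.
```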
Pearson’s learning objectives for young learners were created with the intention of
describing what language tasks learners who are aged 6–14 can perform. Our inven-
tory describes what learners can do at each level of proficiency in the same way as
a framework, i.e. expressing communicative skills in terms of descriptors. In the
next section we explain how we created YL descriptors sourcing them from different
inputs. Then, we describe the rating exercise and the psychometric analysis carried
out to validate and scale the descriptor set. Our work is overseen by a Technical
Advisory Group (TAG) including academics, researchers, and practitioners
working with young learners who provide critical feedback on our methodology
and evaluate the quality and appropriateness of our descriptor set and our rating and
scaling exercises.
The learning objectives were developed with the aim of describing early stages of
developing ELT competencies. Accordingly, descriptors are intended to cover areas
connected with personal identity such as the child’s family, home, animals, posses-
sions, and free-time activities like computer games, sports and hobbies. Social inter-
action descriptors refer to the ‘here and now’ of interaction face to face with others.
Descriptors also acknowledge that children are apprentice learners of social interac-
tion; activities are in effect role-plays preparing for later real world interaction, such
as ordering food from a menu at a restaurant. The present document is a report on
the creation of the first batch: 120 learning objectives were created in two phases, described below, of drawing learning objectives from various sources and editing them. In the next descriptor batches we are planning to refer to contexts of language
use applicable particularly to the 6- to 9-year-old age range, including ludic lan-
guage in songs, rhymes, fairy tales, and games.
Phase 1 started in September 2013 and lasted until February 2014. A number of
materials were consulted to identify learning objectives for young learners:
European Language Portfolio (ELP) documents, curriculum documents and exams
(e.g., Pearson Test of English Young Learners, Cambridge Young Learners, Trinity
exams, national exams), and Primary, Upper Primary and Lower Secondary course
books. This database of learning objectives was our starting point to identify lin-
guistic and communicative needs of young learners.
Phase 2 started in February 2014 and is still in progress: we are currently (summer
2014) working on two new batches of learning objectives (batch 2 and batch 3).
With regard to batch 1, 120 new descriptors were created by qualified and experi-
enced authors on the basis of the learning objectives previously identified. Authors
followed specific guidelines and worked independently on developing their own
learning objectives. Once a pool of learning objectives was finalised, they were vali-
dated for conformity to the guidelines and for how easy it was to evaluate their dif-
ficulty and to assign a proficiency level to them. We held in-house workshops to
validate descriptors with editorial teams. Authors assessed one another’s work. If
learning objectives appeared to be unfit for purpose or no consensus was reached
among the authors, they were amended or eliminated.
The set of 120 learning objectives included 30 for each of the four skills.
Additionally, twelve learning objectives were used as anchor items with known val-
ues on the GSE, bringing the total number of learning objectives to 132. Among the
anchors, eight learning objectives were descriptors taken verbatim from the CEFR
(North, 2000) and four were adapted from the CEFR: they had been rewritten, rated
and calibrated in a previous rating exercise for general English learning objectives.
In our rating exercises for the GSE, the same anchors are used in different sets of
learning objectives in order to link the data. The level of the anchors brackets the
target CEFR level of the set of learning objectives to be rated: for example, if a set
of learning objectives contains learning objectives targeted at the A1 to B2 levels,
anchors are required from below A1 up to C1. A selection of the most YL-appropriate
learning objectives from the CEFR was used as anchors.
A number of basic principles are applied in editing learning objectives. Learning
objectives need to be relatively generic, describing performance in general, yet
referring to a specific skill. In order to reflect the CEFR model, all learning objec-
tives need to refer to the quantity dimension, i.e., what are the language actions a
learner can perform, and to the quality dimension, i.e., how well (in terms of effi-
cacy and efficiency) a learner is expected to perform these at a particular level. Each
descriptor refers to one language action. The quantity dimension refers to the type
and context of communicative activity (e.g., listening as a member of an audience),
while the quality dimension typically refers to the linguistic competences determin-
ing efficiency and effectiveness in language use, and is frequently expressed as a
condition or constraint (e.g., if the speech is slow and clear). Take, for example, one
of our learning objectives for writing below:
• Can copy short familiar words presented in standard printed form (below
A1 – GSE value 11).
The language operation itself is copying, the intrinsic quality of the performance
is that words are short and familiar, and the extrinsic condition is that they are pre-
sented in standard printed form.
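To make this anatomy concrete, the following minimal sketch models a descriptor as a record with one field per dimension; the field names and the decomposition of the example are our own illustration, not Pearson's internal representation.

```python
from dataclasses import dataclass

@dataclass
class LearningObjective:
    """One descriptor: a single language action plus its qualifiers."""
    skill: str      # e.g., "writing"
    action: str     # the language operation (quantity dimension)
    quality: str    # intrinsic quality of the performance
    condition: str  # extrinsic condition or constraint
    cefr: str       # CEFR level label
    gse: int        # GSE value (the scale runs from 10 to 90)

# The writing descriptor discussed above, decomposed into its parts:
copy_words = LearningObjective(
    skill="writing",
    action="copy short familiar words",
    quality="words are short and familiar",
    condition="presented in standard printed form",
    cefr="below A1",
    gse=11,
)
```

Keeping the quality and condition fields separate from the action mirrors the requirement that every descriptor states both what is done and how well, or under which constraint, it is done.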
The same communicative act often occurs at different proficiency levels with a
different level of quality.
See, for example, the progression in these two listening learning objectives
developed by Pearson:
• Can recognise familiar words in short, clearly articulated utterances, with visual
support. (below A1; GSE value 19)
• Can recognise familiar key words and phrases in short, basic descriptions
(e.g., of objects, animals or people), when spoken slowly and clearly. (A1; GSE
value 24)
The first descriptor describes short inputs embedded in a visual context, provided that the words are familiar to the listener and clearly articulated by the speaker. The listener needs to recognize only specific vocabulary items to get the meaning. The second descriptor shows that as children progress in their proficiency, they gradually become able to cope with descriptions that require more than isolated word recognition, including the ability to hold a sequence in memory.
Similarly, for speaking, the earliest level of development is mastery of some
vocabulary items and fixed expressions such as greetings. Social exchanges develop
in predictable situations until the point where children can produce unscripted utter-
ances. See, for example, the difference between a learner at below A1 and another
learner at A1:
• Can use basic informal expressions for greeting and leave-taking, e.g., Hello, Hi,
Bye. (below A1; GSE value 11).
• Can say how they feel at the moment, using a limited range of common adjec-
tives, e.g., happy, cold. (A1; GSE value 22).
For writing, the following learning objectives show a progression from very simple copying (below A1) to more elaborate writing, such as describing future plans (B1):
• Can copy the letters of the alphabet in lower case (below A1; GSE value 10).
• Can write a few basic sentences introducing themselves and giving basic per-
sonal information, with support (A1; GSE value 26).
• Can link two simple sentences using “but” to express basic contrast, with sup-
port. (A2; GSE value 33).
• Can write short, simple personal emails describing future plans, with support.
(B1; GSE value 43).
The third example above shows that ‘support’ (from an interlocutor, e.g., the teacher) is recognized in the learning objectives as a facilitating condition. Support can be realized in the form of a speaker’s gestures or facial expressions, or of pictures, as well as through the use of adapted language (by the teacher or an adult interlocutor).
Similarly, the following reading descriptors show the progression from basic
written receptive skills to the ability to read simple texts with support:
• Can recognise the letters of the Latin alphabet in upper and lower case. (below
A1; GSE value 10).
• Can recognise some very familiar words by sight-reading. (A1; GSE value 21)
• Can understand some details in short, simple formulaic dialogues on familiar
everyday topics, with visual support. (A2; GSE value 29)
A number of other secondary criteria were applied. North (2000, pp. 386–389)
lists five criteria learning objectives should meet in order to be scalable.
• Positiveness: Learning objectives should be positive, referring to abilities rather
than inabilities.
• Definiteness: Learning objectives should describe concrete features of perfor-
mance, concrete tasks and/or concrete degrees of skill in performing tasks. North
(2000, p. 387) points out that this means that learning objectives should avoid
vagueness (“a range of”, “some degree of”) and in addition should not be depen-
dent for their scaling on replacement of words (“a few” by “many”; “moderate”
by “good”).
• Clarity: Learning objectives should be transparent, not dense, verbose or
jargon-ridden.
• Brevity: North (2000, p. 389) reports that teachers in his rating workshops tended
to reject or split learning objectives longer than about 20 words and refers to
Oppenheim (1966/1992, p. 128) who recommended up to approximately 20
words for opinion polling and market research. We have used the criterion of
approximately 10–20 words.
• Independence: Learning objectives should be interpretable without reference to
other learning objectives on the scale.
Based on our experience in previous rating projects (Pearson, 2015b), we added
the following requirements.
• Each descriptor needs to have a functional focus, i.e., be action-oriented, refer to
the real-world language skills (not to grammar or vocabulary), refer to classes of
real life tasks (not to discrete assessment tasks), and be applicable to a variety of
everyday situations. E.g. “Can describe their daily routines in a basic way”
(A1, GSE 29).
• Learning objectives need to refer to gradable “families” of tasks, i.e., allow for
qualitative or level differentiations of similar tasks (basic/simple, adequate/
standard, etc.), e.g., “Can follow short, basic classroom instructions, if supported
by gestures” (Listening, below A1, GSE 14).
To ensure that this does not conflict with North’s (2000) ‘Definiteness’ require-
ment, we have added two further stipulations:
• Learning objectives should use qualifiers such as “short”, “simple”, etc. in a
sparing and consistent way as defined in an accompanying glossary.
• Learning objectives must have a single focus so as to avoid multiple tasks which
might each require different performance levels.
In order to reduce interdependency between learning objectives we produced a
glossary defining commonly used terms such as “identify” (i.e., pick out specific
information or relevant details even when never seen or heard before), “recognize”
(i.e., pick out specific information or relevant details when previously seen or
heard), “follow” (i.e., understand sufficiently to carry out instructions or directions,
or to keep up with a conversation, etc. without getting lost). The glossary also
provides definitions of qualifiers such as “short”, “basic”, and “simple”.
Once the pool of new learning objectives had been signed off internally, the descriptors were validated and scaled through rating exercises modelled on the methodology used in the original CEFR work by North (2000). The ratings had three goals: (1) to establish
whether the learning objectives were sufficiently clear and unambiguous to be inter-
pretable by teachers and language experts worldwide; (2) to determine their posi-
tion on the CEFR and the GSE scales; and (3) to determine the degree of agreement
reached by teachers and experts in assigning a position on the GSE to learning
objectives.
The Council of Europe (2009) states that to align materials (tests, items, and
learning objectives) to the CEFR, people are required to have knowledge of (be
familiar with) policy definitions, learning objectives, and test scores. As it is diffi-
cult to find people with knowledge of all three, multiple sources are required
(Figueras & Noijons, 2009, p. 14). The setting of the rating exercise for each group
was a workshop, a survey or a combination of both workshop and online survey for
teachers. Training sessions for Batch 1 were held between March and April 2014 for
two groups accounting for a total of 1,460 raters: (1) A group of 58 expert raters
who were knowledgeable about the CEFR, curricula, writing materials, etc. This
group included Pearson English editorial staff and ELT teachers. (2) A group of
1,402 YL teachers worldwide who claimed to have some familiarity with the
CEFR. The first group took part in a live webinar where they were given information about the CEFR, the GSE, and the YL project, and were then trained to rate individual learning objectives. They were asked to rate the learning objectives first according to CEFR levels, and then to decide whether they thought the descriptor would be taught at the top, middle, or bottom of the level. Based on this estimate, they were asked to select a GSE value corresponding to a sub-section of the CEFR level.
The second group participated in online surveys, in which teachers were asked to
rate the learning objectives according to CEFR levels only (without being trained on
the GSE).
All raters were asked to provide information about their knowledge of the CEFR,
the number of years of teaching experience and the age groups of learners they
taught (choosing from a range of options between lower primary and young adult/
adult – academic English). We did not ask the teachers to provide information on
their own level of English, as the survey was self-selecting; if they were familiar
with the CEFR and able to complete the familiarisation training, we assumed their
level of English was high enough to be able to perform the rating task. They
answered the following questions:
• How familiar are you with the CEFR levels and descriptors?
• How long have you been teaching?
• Which of the following students do you mostly teach? If you teach more than one
group, please select the one you have most experience with – and think about this
group of students when completing the ratings exercise.
• What is your first language?
• In which country do you currently teach?
After all ratings had been gathered, they were analysed and each descriptor was assigned a CEFR/GSE value. The data consisted of ratings assigned to a total of 132 learning objectives by
58 language experts and 1,402 teachers. Below we describe the steps we followed
to assign a GSE value to each descriptor and to measure certainty values of the
individuals’ ratings.
As the GSE is a linear transformation of North’s (2000) original logit-based
reporting scale, the GSE values obtained for the anchor learning objectives can be
used as evidence for the alignment of the new set of learning objectives with the
original CEFR scale. Three anchor learning objectives were removed from the data
set. One anchor descriptor had accidentally been used as an example (with a GSE
value assigned to it) in the expert training. Independence of the expert ratings could
therefore not be ascertained. Another anchor did not obtain a GSE value in align-
ment with the North (2000) reported logit value. For the third descriptor no original
logit value was available in North (2000), although it was used as an illustrative
descriptor in the CEFR (Council of Europe, 2001). Therefore, the number of valid
anchors was nine and the total number of rated learning objectives was 129.
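As a reminder of what the linear transformation implies, the relation between the two scales can be written in the general form below; the scaling constants c and m are placeholders, as the exact values are not reported in this chapter.

$$\mathrm{GSE} = c + m\,\theta_{\mathrm{logit}}, \qquad m > 0$$

Because a linear transformation preserves order and relative distances, a high correlation between the anchors' newly obtained GSE values and their original logit values is direct evidence that the new ratings reproduce the original scale.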
The values of the anchors found in the current project were compared to those
obtained for the same anchors used in preceding research rating adult-oriented learning objectives (Pearson, 2015b). The correlation (Pearson’s r) between ratings
assigned to anchors in the two research projects was 0.95. The anchors had a
correlation of 0.94 (Pearson’s r) with the logit values reported by North (2000),
indicating a remarkable stability of these original estimates, especially when taking
into account that the North data were gathered from teachers in Switzerland more
than 15 years ago.
The rating quality of each rater was evaluated according to a number of criteria. As previously explained, the original number of 1,460 raters (recruited at the start of the project) was reduced to 274 after psychometric analysis of all the data. Raters were removed if (1) the standard deviation of their ratings was close to zero, as this indicated a lack of variety in their ratings; (2) they rated fewer than 50 % of the learning objectives; (3) the correlation between their ratings on the set of learning objectives and the average rating from all raters was lower than 0.7; or (4) they showed a deviant mean rating (z of the mean beyond p < .05). As a result, of the total of 1,460 raters, only 274 (37 of the 58 expert raters and 237 of the 1,402 teachers) passed these filtering criteria. The selected teachers came from 42 different countries worldwide.
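The four filtering criteria can be summarised in a short computational sketch. This is a minimal illustration, not the authors' actual pipeline: the function name, the data layout (one row per rater, one column per learning objective, NaN for missing ratings), the use of pandas/scipy, and the near-zero cut-off of 0.1 for criterion (1) are all our assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

def filter_raters(ratings: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Keep only raters who pass the four quality filters described above."""
    consensus = ratings.mean(axis=0)                  # average rating per objective
    # (1) near-zero spread signals a lack of variety (0.1 is an assumed cut-off)
    sd_ok = ratings.std(axis=1) > 0.1
    # (2) rated at least 50 % of the learning objectives
    coverage_ok = ratings.notna().mean(axis=1) >= 0.5
    # (3) correlation with the average rating from all raters at least 0.7
    corr_ok = ratings.apply(lambda row: row.corr(consensus), axis=1) >= 0.7
    # (4) no deviant mean rating: |z| within the two-tailed p < .05 bound
    z = stats.zscore(ratings.mean(axis=1).to_numpy())
    mean_ok = pd.Series(np.abs(z) <= stats.norm.ppf(1 - alpha / 2),
                        index=ratings.index)
    return ratings[sd_ok & coverage_ok & corr_ok & mean_ok]
```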
Table 1 shows the distribution of the learning objectives along CEFR levels
according to the combined ratings of the two groups. It was found to peak at the A2
and B1 levels, indicating the need to focus more on low-level learning objectives in
the following batches.
Table 2 shows the certainty index distribution based on the two groups’ ratings. Certainty is computed as the proportion of ratings falling within the two adjacent most often selected CEFR levels. Take, for example, a descriptor rated as A1 by .26 of the raters, as A2 by .65, and as B1 by .09. The degree of certainty in rating this descriptor is the sum of the proportions observed for the two largest adjacent categories, i.e., A1 and A2 with .26 and .65 respectively, which yields a value of .91. Only 4 % of the data set showed certainty values below .70, and only 7 % of the learning objectives showed certainty below .75. At this stage we take low certainty as an indication of possible issues with a descriptor, but we will not reject any descriptor yet. At a later stage, we will combine the set reported on here with all other available descriptor sets and evaluate the resulting total set using the one-parameter Rasch model (Rasch, 1960/1980) to estimate final GSE values. This will increase the precision of the GSE estimates and reduce the dependency on the raters involved in the individual projects. At that time the certainty of ratings will be re-evaluated and learning objectives with certainty below a certain threshold will be removed.
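The certainty index can be reproduced directly from the worked example above: take the pair of adjacent CEFR levels with the largest combined proportion of ratings. A minimal sketch, with the function name and the input format (proportions per level) as our assumptions:

```python
CEFR_ORDER = ["<A1", "A1", "A2", "B1", "B2", "C1", "C2"]

def certainty(proportions: dict) -> float:
    """Largest sum of rating proportions over two adjacent CEFR levels."""
    p = [proportions.get(level, 0.0) for level in CEFR_ORDER]
    return max(p[i] + p[i + 1] for i in range(len(p) - 1))

# The example from the text: .26 rated A1, .65 rated A2, .09 rated B1
print(round(certainty({"A1": 0.26, "A2": 0.65, "B1": 0.09}), 2))  # 0.91
```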
5 Final Considerations
Appendices
References
Lightbown, P. M., & Spada, N. (2008). How languages are learned. New York: Oxford University
Press.
Little, D. (2006). The common European framework of reference for languages: Content, purpose,
origin, reception and impact. Language Teaching, 39(3), 167–190.
Little, D. (2007). The common European framework of references for languages: Perspectives on
the making of supranational language education policy. The Modern Language Journal, 91(4),
645–655.
McKay, P. (2006). Assessing young language learners. Cambridge: Cambridge University Press.
Mihaljević Djigunović, J. (2016). Individual differences and young learners’ performance on
L2 speaking tests. In M. Nikolov (Ed.), Assessing young learners of English: Global and local
perspectives. New York: Springer.
Negishi, M., Takada, T., & Tono, Y. (2012). A progress report on the development of the CEFR-J
(Studies of Language Testing, 36, pp. 137–165). Cambridge: Cambridge University Press.
Nikolov, M. (2016). A framework for young EFL learners’ diagnostic assessment: Can do state-
ments and task types. In M. Nikolov (Ed.), Assessing young learners of English: Global and
local perspectives. New York: Springer.
Nikolov, M., & Curtain, H. (Eds.). (2000). An early start: Young learners and modern languages
in Europe and beyond. Strasbourg: Council of Europe.
Nikolov, M., & Mihaljević Djigunović, J. (2006). Recent research on age, second language acqui-
sition, and early foreign language learning. Annual Review of Applied Linguistics, 26, 234–260.
Nikolov, M., & Mihaljević Djigunović, J. (2011). All shades of every color: An overview of early
teaching and learning of foreign languages. Annual Review of Applied Linguistics, 31, 95–119.
Nikolov, M., & Szabó, G. (2012). Developing diagnostic tests for young learners of EFL in grades
1 to 6. In E. D. Galaczi & C. J. Weir (Eds.), Voices in language assessment: Exploring the
impact of language frameworks on learning, teaching and assessment: Policies, procedures
and challenges (pp. 347–363). Proceedings of the ALTE Krakow Conference, July 2011.
Cambridge: UCLES/Cambridge University Press.
North, B. (2000). The development of a common framework scale of language proficiency.
New York: Peter Lang.
Oppenheim, A. N. (1966/1992). Questionnaire design, interviewing and attitude measurement (2nd ed.). London: Pinter Publishers.
Pearson. (2010). Aligning PTE Academic Test Scores to the common European framework of refer-
ence for languages. Retrieved June 2, 2014, from http://pearsonpte.com/research/Documents/
Aligning_PTEA_Scores_CEF.pdf.
Pearson. (2015a). The Global Scale of English. Retrieved May 25, 2015, from http://www.english.
com/gse.
Pearson. (2015b). The Global Scale of English Learning Objectives for Adults. Retrieved May 25,
2015, from http://www.english.com/blog/gse-learning-objectives-for-adults.
Philp, J., Oliver, R., & Mackey, A. (Eds.). (2008). Second language acquisition and the young
learner: Child’s play? Amsterdam: John Benjamins.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research. Expanded edition (1980) with fore-
word and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Singleton, D. (1989). Language acquisition: The age factor. Clevedon: Multilingual Matters.
Speitz, H. (2012). Experiences with an earlier start to modern foreign languages other than English
in Norway. In A. Hasselgren, I. Drew, & B. Sørheim (Eds.), The young language learner:
Research-based insights into teaching and learning (pp. 11–22). Bergen: Fagbokforlaget.
Vygotsky, L. (1978). Mind in society: The development of higher psychological processes.
Cambridge, MA: Harvard University Press.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Oxford, UK:
Palgrave.
Zheng, Y., & De Jong, J. (2011). Establishing construct and concurrent validity of Pearson Test of
English Academic (1–47). Retrieved May 20, 2014, from http://pearsonpte.com/research/
Pages/ResearchSummaries.aspx.
A Framework for Young EFL Learners’
Diagnostic Assessment: ‘Can Do Statements’
and Task Types
Marianne Nikolov
Abstract The aim of this chapter is to present a framework for assessing young
learners of foreign languages for diagnostic purposes. The first section outlines the
most important trends in language assessment and describes the educational context
where the project was implemented. Then, the chapter discusses how children
between the ages of 6 and 12 develop in a foreign language and outlines the most
important principles of assessing young language learners. The actual framework
was designed for the four skills; it aimed to cover the first 6 years of primary educa-
tion in Hungarian public schools. The document uses the Common European Framework of Reference (CEFR, 2001) as a point of departure and includes age-specific ‘can do statements’ and task types corresponding to them. Readers are
encouraged to critically reflect on how the findings could be adopted in their own
contexts.
1 Introduction
The chapter presents some of the results of a national project conducted in Hungary
in the hope that readers may find them useful in their own contexts. The first part of the chapter embeds the project in recent trends in educational and language assessment and in the educational context where the project was implemented. In order to
develop age-appropriate diagnostic tests for learners of English as a foreign lan-
guage (EFL) in the first 6 years of primary school (ages 6–12) in the four skills, a
framework was designed in line with the Common European Framework of
Reference (CEFR, 2001), including ‘can do statements’ and task types correspond-
ing to them (Nikolov, 2011). As a next step, diagnostic tests were developed and
validated (Nikolov & Szabó, 2012a, 2012b; Szabó & Nikolov, 2013). These
M. Nikolov (*)
Institute of English Studies, University of Pécs, Pécs, Hungary
e-mail: [email protected]
calibrated tasks are meant to be available to teachers for their classroom use. This chapter focuses on the main features of the framework, on the ‘can do statements’ and task types that were specified, and on the lessons learned from the various phases of the project.
Recent trends in educational research are highly relevant to early language learning, since they have opened new avenues on how different approaches to assessment (diagnostic and dynamic testing as well as peer and self-assessment) can boost learners’ learning potential (Alderson, 2005; Rixon, 2016, and Hung, Samuelson, & Chen, 2016, both in this volume; Sternberg & Grigorenko, 2002) and also offer teachers feedback on their own work and on where students are in their development. Besides traditional assessment of learning, the need to focus on assessment for learning has been widely emphasized not only in language learning but in other domains as well (Assessment Reform Group, 2002; Black & Wiliam, 1998; Davison & Leung, 2009; Leung & Scott, 2009; McKay, 2006; Teasdale & Leung, 2000; also see Rixon, 2016 in this volume). These shifts in emphasis on how children can benefit from
classroom testing, and how teachers can scaffold their development have resulted in
new studies. Assessment should be sensitive to the issue of readiness to develop
(McNamara & Roever, 2006, pp. 251–252); this is an area where more research is
needed to find out how learners can benefit from different kinds of interaction
(Nikolov & Mihaljević Djigunović, 2011, p. 111) and how their teachers can use
diagnostic information. These points are crucial for young learners, as their prog-
ress in their new language depends on their classroom experiences and feedback
from their teachers and peers. Techniques applied in diagnostic assessment may
also open new avenues for developing learner autonomy by involving students in their own development.
Before moving on we need to discuss how diagnostic assessment is defined, what
the key characteristics are, and how the concept fits the picture outlined so far.
Definitions of diagnostic assessment share the following features:
(1) “diagnostic tests seek to identify those areas in which a student needs further help”
(Alderson, Clapham & Wall, 1995, p. 12);
(2) records on diagnostic assessments indicate “specific areas of strengths and weaknesses
in language ability” (Bachman & Palmer, 2010, p. 196);
(3) diagnostic tests can be theory or syllabus-based (Bachman, 1990, p. 60);
(4) tests developed for other purposes, for example, for progress, proficiency or placement,
can be and are often used diagnostically (Alderson, 2005; Bachman & Palmer, 2010);
(5) information on learners’ strengths and weaknesses can lead to action: teachers can use
results to tune their teaching to learners’ needs and learners may seek out more oppor-
tunities to practice in the problem areas;
(6) diagnostic tests are hard to develop and are rarely investigated (Alderson, 2005, p. 6).
consider the quality of foreign language instruction at schools when they choose an
institution. Since the 1990s, some important changes have emerged in foreign lan-
guage education: (1) the demand for English as a lingua franca has dynamically
increased and, in contrast, German has lost some of its appeal (Medgyes & Nikolov,
2014); (2) due to parental pressure, the age when children start learning a FL has
decreased, despite the fact that language policy documents have maintained grade 4
as the mandatory start of FL learning (Nikolov, 2009). As a result of this controversial regulation, parents who are keen on their children’s early learning of a foreign language press schools to lower the starting age of FL instruction. Schools receive per capita funding from the ministry; it is therefore in their interest to satisfy these demands by launching early language programs to attract students.
This situation is further complicated by the increasingly higher value attached to English compared to German, and by the fact that teachers are tenured in their jobs and German classes also have to be filled. As English is far more popular, schools stream students into different language groups. More able and socially more privileged students tend to start learning a FL earlier, and the ratio of English learners is higher than that of learners in German classes. Also, students with a higher socio-economic background and better achievements in other school subjects attend more intensive programs, whereas less able students, often coming from poorer and less educated families, tend to start later, are taught in fewer weekly classes, and are often placed in German classes, although they would prefer to learn English.
For these interrelated reasons, various large-scale testing projects involving representative samples of students in years 6, 8, 10 and 12 have found significant differences between the proficiency levels of students studying English and German: results tend to be higher in English (Csapó & Nikolov, 2009; Nikolov, 2011; Nikolov & Józsa, 2006). Another important outcome is that a very wide range of achievements is typical across all levels of education, and the differences increase as students make progress in their studies; thus, many children are left behind. Learners of English tend to achieve higher scores, and their attitudes and motivation are consistently more favorable than those of their peers learning German (Dörnyei, Csizér, & Németh, 2006; Nikolov, 2003). Classroom practice, however, tends to be similar for the two languages, often focusing on form and applying grammar-translation type drills rather than focusing on meaning, even in the younger age groups (Nikolov, 2003, 2008).
As for how much it matters when children start learning a foreign language, an early start was found to contribute only minimally in a national project involving representative samples of English and German learners in their 6th and 10th grades (ages 12 and 16). As the results of the regression analyses in Table 1 indicate, the number of years students had studied English or German explains 3–4 % of the variance in their scores, and the number of weekly classes between 10 and 14 %; students’ socio-economic status, however, explains 25 and 24 % of the variance in English and 18 and 17 % in German achievements (in years 6 and 10, respectively). In other words, whether students started early or late made hardly any difference to their levels of proficiency in either year or language.
Table 1 Variables contributing to Hungarian learners’ performances in English and German (rβ%) (Nikolov & Józsa, 2006, p. 211)

                              Year 6               Year 10
Independent variables         English   German     English   German
Parents’ education (SES)      25        18         24        17
Weekly classes                13        10         14        13
Years of language study        3         3          4         4
Private tuition               ns        ns          2         2
Variance explained (%)        41        31         44        36
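A note on the rβ% metric in the table header: it appears to denote each predictor's contribution to the explained variance, computed as the product of its zero-order correlation (r) with the achievement score and its standardized regression coefficient (β), expressed as a percentage. This reading is our inference rather than a definition given in the chapter, but it is supported by the fact that the rows sum (allowing for rounding) to the reported total:

$$r\beta\% = 100\, r_i \beta_i, \qquad R^2 = \sum_i r_i \beta_i$$

For Year 6 English, for example, 25 + 13 + 3 = 41, matching the bottom row.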
The aims and achievement targets of the Hungarian diagnostic assessment project
had to be in line with theories on how children learn a FL, curricular requirements,
and realities in schools. For FLs, various versions of National Core Curricula (NCC;
for a critical overview see Medgyes & Nikolov, 2010) preceded the version pub-
lished in 2007. This was the version the diagnostic project had to be in line with.
Despite the fact that in 2006 every fourth school started teaching a FL before the
mandatory grade 4 (Nikolov, 2009), the official curriculum maintained that all
schools had to offer students at least one FL from fourth grade (age 10) and it
allowed them to start earlier upon parents’ requests. However, no official curricu-
lum was available for the first three grades (ages 6–9), and no goals and achieve-
ment targets were set for the first years (Nikolov, 2011). Therefore, one of the aims
was to outline a framework for EFL for the first six grades of public schools.
The NCC (2007) prescribed dual levels of achievement targets for the 9 years of
compulsory FL learning between grades 4 and 12 (age 10 and 18), depending on
long term goals: whether students aimed to take an intermediate (B1 level) or
advanced (B2) level school-leaving examination at the end of their education at age
18. The NCC explicitly stated that the construct was communicative competence
(useful language ability) in the four skills (listening, speaking, reading, and writing)
and the required levels were in line with the CEFR (2001); the levels students had
to achieve were independent of when they started learning a FL and how intensive
their courses were. By the end of year 6, students were expected to be at the A1- or
A1 level, whereas at the end of year 8, at the A1+ or A2- level in the four skills. The
NCC specified provision in loose terms: in grades 1 to 4, 2–6 % of the overall classes (1 or 2 per week) could be devoted to teaching a FL, whereas in grades 5 to
8, 12–20 % (2–6 classes). However, some schools could also launch content and
language integrated learning type of dual-language classes, teaching some subjects
in the target language, but achievement targets were not specified until a new ver-
sion of NCC (2012) was published.
Besides achievement targets in the foreign language, the NCC (2007) specified
some further aims: they included the development of learners’ positive attitudes
towards language learning and towards other cultures, their motivation to improve
their proficiency and to learn about the target culture as well as other cultures, and
their language learning strategies. Therefore, these were also included in the
framework.
The language testing background to our study is based on the conceptualization
of communicative competence and language ability (Council of Europe, 2001):
learners’ performances are assessed in their four language skills. In the choice of
task and text types, piloting and validating tests, and evaluating results, we followed
the principles of communicative language testing in general (Alderson, 2005;
Alderson et al., 1995; Bachman & Palmer, 2010), and assessing young learners in
particular (McKay, 2006; Nikolov, 2011; Nikolov & Szabó, 2011a, 2011b).
The aims for the first phase of the diagnostic assessment project were (1) to design
a framework based on research into how young learners of a FL develop and the
main principles of teaching and assessing them; (2) to draw up a list of can do state-
ments for young learners in the first six grades of public schools for the levels
required in the curriculum; (3) to identify topics, text types and task types that
would allow valid, reliable and age-appropriate diagnostic assessments of the target
age group in the four skills in EFL in line with curricular requirements. In the fol-
lowing sections these three points are discussed.
As a first step, a detailed analysis of the literature was conducted with the follow-
ing focal points: (1) how young learners of various first languages, including
Hungarian as L1, develop in English as a foreign language, (2) the main principles
of teaching and assessing children in their new language in the first six grades, and
(3) what is known about classroom practice in the first 6 years of EFL in Hungarian
public schools. In addition to these, a small-scale focused project was implemented
to explore (4) what teaching materials and tests are used in EFL classes and how
teachers apply them for assessment.
In this section we summarize the main points related to how children between the
ages of 6 and 12 develop in a FL and outline the most important principles for
assessing their development. This short overview is based on a range of handbooks
and empirical studies on early language learning and teaching (e.g., Nikolov & Mihaljević Djigunović, 2006, 2011).
It has been widely accepted that the younger the learners are, the more similar the process of their FL development tends to be to the acquisition of their first language(s), and the less able they are to learn and apply language rules consciously. Language learning is based on two distinct processes (MacWhinney, 2005; Paradis, 2004, 2009; Skehan, 1998; Ullman, 2001): implicit learning is based on memorizing exemplars of language in use, whereas explicit learning relies on consciously learned and applied rules.
Errors are typical and they indicate where children are in the process of learning the target language; similarly to L1 development, errors emerge and then tend to disappear over time if enough learning opportunities are offered. Certain features of interlanguage indicate the developmental stages children are at. Many young learners start with a silent period in their foreign language class, during which they may be willing to respond with movements or body language, or in their FL, indicating their level of listening comprehension. Typical developmental stages are marked, for example, by the use of one-word or two-word utterances, the omission of certain words (e.g., the copula), or the use of external no in negation (no dog) in speaking. Children often transfer their L1 pronunciation in the case of cognates (e.g., elephant, television, computer) or intonation patterns in questions, for example. These developmental errors indicate the learning process, and they tend to disappear over time or with the help of tasks helping children notice gaps (Schmidt, 1990) at a further developmental stage, when they are ready.
The distinction between basic interpersonal communication skills and cognitive academic language skills (Cummins, 2000) highlights yet another important principle: most children develop along similar lines in their oral and aural skills, but more visible individual differences tend to emerge in their literacy development. These differences are related to children’s aptitude and literacy skills in their first and other languages, and these interact with their socio-economic status. Several empirical studies have revealed important relationships between young learners’ level of aptitude, their L1 skills, and their socio-economic status, in the Hungarian educational context (Bacsa & Csíkos, 2016 in this volume; Csapó & Nikolov, 2009; Kiss & Nikolov, 2005; Nikolov & Csapó, 2010; Nikolov & Józsa, 2006) and in other countries as well (e.g., Alexiou, 2009; Mihaljević Djigunović, 2012; see also the findings by Wilden & Porsch, 2016 in the present volume on multilingual young learners’ receptive skills in English and German).
The interaction between young learners’ languages is further underpinned by findings in classroom research. In a language like Hungarian, with a highly transparent sound-letter correspondence, all children who can read words in their L1 will apply their L1 phonetic rules in English and read out words phonetically. This strategy may support memorizing the spelling of words, as it does in L1, but may negatively impact reading (Nikolov, 2002). Hungarian learners of all ages who can spell and write well tend to apply this strategy.
The younger the learners are, the slower their development is in their new language compared to older learners (Krashen, 1985; Nikolov & Mihaljević Djigunović, 2006). Findings of two longitudinal studies provide evidence for this in European EFL contexts (García Mayo & García Lecumberri, 2003; Muñoz, 2006), whereas studies in English as a second language (ESL) contexts have found that 5–7 years are necessary for children to achieve native-like proficiency in immersion programs (Wong Fillmore, 1998), where the teachers and many of the peers are native speakers and the language of instruction is English. This slow speed of progress has important implications for teaching and assessment.
The main argument for an early start is often the critical period hypothesis: the assumption that language acquisition has to start before a certain time in one’s life.
able to understand what they can and cannot do well. An emphasis on positive
outcomes and encouragement are crucial when assessing young learners; as they
need to feel successful, tasks should be doable to avoid frustration. It is also impor-
tant to bear in mind that performing in front of others may induce anxiety in children, so working in pairs or small groups should be offered as alternatives (Nikolov & Mihaljević Djigunović, 2011).
Tasks should focus on meaning (not form) and allow young learners to communicate with their peers and their teacher (Nikolov, 2011). As children at the early stages of language learning are not yet proficient in their mother-tongue literacy skills, both teaching and assessment should focus on listening comprehension and speaking skills; reading comprehension and writing should be introduced gradually, when children are ready for them.
Tasks used in course books often integrate more than one language skill; during assessment, however, it is important to focus on skills separately, so that the skill and subskill in which children’s strengths and weaknesses are identified can be specified (Alderson, 2005; McKay, 2006).
Feedback and evaluation must always come right after students’ performance; they should be individualized and also motivating for further learning (Nikolov, 2011).
Diagnostic assessment should be regular; it should tap into the small developmental steps and should provide clear feedback so that young learners can feel that they are making progress and achieving what is expected of them (Nikolov, 2011).
Both self- and peer-assessment can be effectively used in diagnostic assessment,
as they may contribute to learner autonomy, encourage the use of learning strategies
and children can scaffold one another’s FL learning (for detailed discussions see
Rixon, 2016 and Hung et al., 2016 in this volume and McKay, 2006).
As for the content of assessment tasks, themes and topics listed in curricula and covered in typically used teaching materials should be drawn on, bearing in mind both the children’s local culture and the target language cultures.
The first draft of the above framework for English as a foreign language was one
of the documents used in the project and then, after piloting diagnostic tests, inte-
grated into the final framework published in Hungarian (Nikolov, 2011).
Prior to the project, a lot of data were available on how teachers develop, but less on how they assess, their young EFL learners in primary schools. Observations and interviews were conducted (Bors, Lugossy, & Nikolov, 2001; Nikolov, 2008) and
questionnaire data were also collected from students in large-scale national testing
projects (Csapó & Nikolov, 2009; Nikolov, 2003; Nikolov & Józsa, 2006). The
main findings indicated that the most typical classroom activities were rarely in line
with age-appropriate teaching methodology; teachers tended to focus on grammar,
and translation and reading out loud were the most often applied techniques of
meaning making and testing. These activities were the most disliked ones among
students in addition to other written tests, whereas the most favored classroom
activities included watching videos, acting out role plays, and other oral tasks; these
were the least often applied. Overall, these classroom-based studies shed some light on why the efficiency of early start programs was low and pointed to the need for further research.
As these surveys did not directly focus on teachers’ assessment practices, a
small-scale project was designed to explore what specific tests highly experienced
teachers of young learners used and how they assessed their learners with the help
of these test tasks in their classrooms in the first six grades (Hild & Nikolov, 2011).
A convenience sample of twelve Hungarian teachers of English volunteered (for
payment) to choose and characterize tests they often used in their lessons for diag-
nostic assessment purposes. The respondents were asked to describe and attach the
actual tasks and to fill in a short questionnaire on them to reveal how they actually
diagnosed their students’ strengths and weaknesses, how they gave them feedback,
and what level the tasks were in their views. Teachers analyzed 119 tasks; most of
them integrated various skills or comprised a sequence of tasks building on one
another. The largest category of tasks integrated reading comprehension and writing
skills; tasks in the second main category integrated listening comprehension and
speaking skills, whereas the third group integrated three skills. Five tests were
meant to develop listening, speaking and writing; four reading, writing and speak-
ing; two listening, reading and speaking; and one listening, reading and writing. Two of these tasks assessed surprising combinations of domains: one reading comprehension together with practice of punctuation and negative forms, the other listening, lip reading, and speaking. Twelve tasks
assessed speaking exclusively. The fifth category comprised eleven tasks that integrated reading and speaking, whereas eleven tasks assessed writing and nine assessed listening skills. The last three categories comprised seven speaking and writing tasks, seven reading tasks, and two other tasks (listening and reading; reading and vocabulary) (Hild & Nikolov, 2011).
In sum, the tests teachers used varied to a great extent, and the main findings were that (1) teachers found it hard to apply the categories which we had clarified in the data collection instrument and which they were supposed to be familiar with; (2) they applied fuzzy terms, rather than criteria, for assessing learners’ performances; (3) the tests either tapped into two or more skills in an integrated manner, so that it was not possible to find out which skill they measured, or comprised sequences of tasks in which the outcomes of the first part determined how well students could perform on the next ones; (4) they tended to focus on errors, accuracy, and what students cannot do rather than on fluency, vocabulary, and what students can do; and (5) the feedback teachers gave learners typically meant rewards for the best performances but no reward for less good ones, so only top achievers got feedback. These techniques could demotivate less able learners, and rewards did not give information on what areas needed improvement (Hild & Nikolov, 2011).
As a result of the above small-scale survey and an extensive analysis of the task and text types used in teaching materials, an exhaustive list of task and text types was
compiled. Then, these were compared and contrasted with can do statements in the CEFR at A1 and A2 levels in a two-day workshop in Pécs in June 2010. Participants
included highly experienced primary-school teachers of English, and a team of
Hungarian and international experts on researching and testing young learners (see
Acknowledgements). The themes and topics in the teaching materials were also
overviewed and matched with the ones listed in the NCC (2007) before the final list
was drawn up.
In the final list of task types, the following criteria were used for inclusion: (1) the task was age-appropriate; (2) the task was in line with how children develop in an L2; (3) it reflected good practice; (4) children’s performance on the task could be measured (quantified); (5) the task was appropriate both for developing and for testing one or more clearly specified skills or subskills in the ‘can do statements’ listed in the framework; (6) the task was within the attention span of the target group; (7) the task was expected to be intrinsically motivating for young learners. In the next sections the results are presented: first the ‘can do statements’, then the topics, text types, and task types are discussed.
One of the many challenges in drawing up what children can do concerns their slow progress in the first few years of learning a new language. Some previous work was available on how the CEFR had been adapted to accommodate young learners’ needs (e.g., Hasselgren, 2005; Papp & Salamoura, 2009; Pižorn, 2009); these sources were consulted before the actual list of can do statements was drawn up.
As the teachers we intended to involve in the project needed reference points to
guide them in estimating the level of their students in an educational context where
children may start learning a foreign language in any of the grades, we tried to
establish three levels within the continuum specified in the curriculum for grades 1
to 6. The following criteria were used to define these levels and we labelled them as
(1) beginner, (2) beginner plus, and (3) elementary levels, corresponding to the A1-,
A1, and A2- levels in the CEFR (2001).
An important point concerned how teachers who joined the piloting phase of the
diagnostic assessment project could decide which level their classes should target.
The list of can do statements was meant for them, too, to help them estimate the level of difficulty of the tasks. The following criteria were drawn up to help teachers decide which level was most probably in line with the amount of instruction their students had received.
A1- Beginner: This level describes what children can realistically be expected to do
by the end of 4th grade (age 10), after studying EFL for 1–4 years, in 1–3 h per
week. Included in this level are absolute beginners (with no previous exposure to
English at all) as well as false beginners (who may have been exposed to some
English by hearing it from their parents, in kindergarten, in private lessons, on
television, in computer games or while staying abroad).
A1 Beginner Plus: This level describes what learners can realistically be expected to
do by the end of 5th grade (age 11), after studying EFL for 2–5 years in 1–3 h per
week.
A2- Elementary: This level is assumed to be what learners can realistically achieve
by the end of 6th grade (age 12), after studying EFL for 3–6 years in 1–3 h per
week.
In addition to these points, it was clarified that as children starting to learn
English at age 6 are at a very low level in their literacy skills in their L1, the can do
statements in reading and writing are not relevant in their case, only the listening
comprehension, speaking and interaction ones are. In other words, the levels in the
various skills can vary. Thus, young learners are not expected to achieve the same
level in the four skills, as curricula may vary a lot.
As Table 2 shows, the can do statements are arranged in three skill areas. In the
first one listening comprehension, speaking and interaction are listed together,
whereas reading and writing are put in two groups. There are many more statements
in the first group, as this is where at this very low level (A1-) young learners are
expected to be able to do more in listening comprehension, speaking and interaction
than in their literacy skills.
As Table 3 shows, the list of can do statements is longer, and in some statements only a single word or expression differs from the wording in Table 2. The statements
are listed in the same order as in Table 2 in order to allow users to notice the differ-
ences. Some of the can do statements are specific to the teaching traditions of
Hungarian learners, for example, spelling is included under reading. This is level
A1 in CEFR.
As Table 4 shows, can do statements for the elementary (A2-) level expand the
ones in the previous two tables. In some of them references to classroom contexts
are included, for example, “Can ask a question or help peers when they are stuck.” An additional feature refers to accuracy: at this level students are expected to be able to do without mistakes what they could previously not do very well. It was felt that this was a necessary addition in order to avoid fossilization and the typical complaint on the part of teachers in later years that there is hardly anything to rely on when young learners enter secondary schools.
Learning strategies are considered crucial; they are important across all skills and
have to be developed systematically during the long process of learning English.
Teachers should consciously focus on these strategies from the earliest stages of
language development.
Learners should be able to
1. distinguish familiar words and expressions from unfamiliar ones;
2. guess meanings of words and expressions (in L1 and L2) in context by relying
on their background knowledge of the world;
3. use visual and other contextual information for guessing meaning;
4. help their peers if they do not understand something;
As the Hungarian NCC (2007) includes hints at what young learners should know
about the target language cultures, it seemed reasonable to include some guidance
in this domain at the three levels (Table 5).
The following themes and topics were typically found in teaching materials and were considered relevant for developing diagnostic tests:
Table 5 What young learners should be familiar with in the target language cultures

Beginner: Learners know some English rhymes, songs, games, stories, and tales. They know 1–2 holidays and customs related to L2 cultures.

Beginner plus: Learners know several English rhymes, songs, games, stories, and tales. They know 3–4 holidays and customs related to L2 cultures. They are familiar with a few objects, expressions, books, and places related to the L2 culture. They know that English speaking cultures are different from Hungarian culture in a few areas.

Elementary: Learners are familiar with a few objects, expressions, stories, tales, heroes, and places related to the L2 cultures. They know of many ways in which English speaking cultures are both similar to and different from Hungarian culture.
• Website, blog
• Dialogue and conversation
• Telephone conversation
• Interview
• Oral description
• Announcement
This final section includes the task types recommended for the assessment of young
learners in their four skills. Some general principles were agreed on. All tasks
should include an example (the first item). In all multiple matching tasks there are
one or two more options than necessary. All multiple choice tasks include four
options. Most tasks include six to nine items. No task should take longer than
5–7 min. All performances on tasks can be quantified. Children should get feedback
on their performances right after taking the task. All tasks are appropriate for teach-
ing as well as diagnostic assessment.
5.6.1 Listening
A total of 26 task types were identified. Some are variations, for example, one ver-
sion is multiple choice, and the other one is multiple matching. Some tasks integrate
other skills with listening.
1. Listen and do. Listen to the instructions and do what you are asked to do. Voice
on tape gives instructions and students act accordingly.
2. Listen and do. Listen to the instructions. Color the pictures according to what
you hear.
3. Listen and do. Circle the things you hear in the instructions.
4. Listen and point. Point to the items you hear (separate pictures or realia placed
in various places in the classroom or on a worksheet).
5. Listen and point: point to the items you hear in a larger picture (e.g., large pic-
ture showing scene with details).
6. Listen and tick what you hear: tick the items you hear on a worksheet (words or
short sentences).
7. Listen to numbers and put them down.
8. Listen and write down words spelt out. (integrated with writing)
9. Listen to short definitions and choose which picture they match.
10. Listen to short definitions and guess what they mean by putting a number next
to or crossing the item in a picture (large ones with details).
11. Listen to short dialogues and choose where the dialogues take place (multiple choice items of small pictures of places).
12. Listen to short dialogues and choose where the dialogues take place (multiple
matching)
13. Listen to short dialogues and choose who are talking (multiple choice items of
small pictures).
14. Listen to short dialogues and choose who are talking (multiple choice items of
short texts).
15. Listen to picture descriptions. Choose what or who they are talking about in the
pictures. Multiple matching of pictures.
16. Listen to picture descriptions. Choose what or who they are talking about in a
picture. Multiple matching of words or expressions.
17. Listen to a picture description and look at short sentences about the picture.
There is a mistake in every item. Correct the mistakes. (Integrating listening,
reading, writing.)
18. Listen to short dialogues and choose the correct answers from options 1, 2, 3 or
4. (Items on specific information.)
19. Listen to a short dialogue and tick the things you hear.
20. This is a picture dictation task. Listen and draw a picture of what you hear.
21. Fill in chart, diary, timetable, or number according to the information you hear.
Write down words and numbers in context.
22. Listen to short definitions and guess what they mean. Choose words from a list
(multiple matching).
23. Listen to short definitions and guess what they mean. Put down the words.
24. Listen to a story and look at the pictures. Correct the mistakes you hear and put
down the correct versions. (e.g., in text: three monkeys are going for a walk; in
picture: two. In text man is happy, in picture unhappy.) Writing words.
25. Listen to a short story and look at the pictures. Match the pictures with what
you hear by putting the number in the box in the picture.
26. Listen to a picture description. Something is wrong in every sentence. Correct the mistakes by filling in not ……, but …… . Writing words.
5.6.2 Speaking
1. Look at this picture and answer my questions. (Is this a ....? Are there any …?
How many ....? What is this? What is the bear doing? Where is the ....?)
2. Tell a rhyme or sing a song (in small group, or pairs, or individually).
3. Here are some picture cards facing down. Guess what’s on them. Ask ques-
tions. (Child is to guess what is in the picture cards not seen by asking, for
example, Is it a fruit? Is it an animal? Is it green? Does it have two legs? Limited
choice of items known to children. Another variation: instead of picture cards
children guess what the objects are under a cover.)
4. Look at this picture. I’m thinking of one of the ....s (vehicles, plants, animals,
people, objects…). Ask me questions and guess what I’m thinking of. (Is it
a…? Is it yellow? Is it next to ....? Limited choice of items known to children…
in context. E.g.: an object in a kitchen, a room in a house, a person in a crowded
street or park, a fruit at a market…)
5. Look at these two pictures. One is mine, the other one is yours. They are not the same. There are X differences. Let's find them. (I start by saying, e.g., In my picture there are two houses. How about your picture? … or In my picture there are three. In my picture a boy is going home….) Task in pairs (first with the teacher). Both of you can see both pictures.
6. Here are two pictures (facing down). One is mine, the other one is yours. They are not the same. There are X differences. Let's find them. Let me start: (e.g., How many cars are there in your picture? … Person A asks a question, B answers it. Then B asks a similar question: How many dogs are in your picture? What color is the biggest …?) You cannot see one another's pictures.
7. Look at this picture (with several items, like in a picture dictionary). Tell me five things you like to eat and five you don't. Or name four yellow things and three red items, or five animals and five objects… in the picture. (One- or two-word answers are expected.)
8. Short role play in pairs. (E.g., You are at the market. You have X pounds and you'd like to buy three things. Look at the picture and the prices. Or act out a role play on a topic or from a story. Exchange 4–5 utterances. E.g., shopping, asking the way, offering food at a birthday party, packing for a holiday, school, …)
9. Ask and answer personal/interview questions in pairs. Look at your card with a (famous) character on it. (Some data are written: name, age, address, number of sisters, brothers, pets, hobbies, etc.: What's your name? How old are you? Where do you live?)
10. This is a board game played by 2–4 learners. They use two dice and a list of 11 or 36 numbered questions (personal or quiz questions). The questions should be written one by one on numbered cards (random choice). Players throw the two dice and either add up (2–12) or multiply (1–36) the numbers on top of the dice. The person throwing the dice must answer the question with that number on the list. This can be a paired task, or 3–4 students can take turns. Reading the questions is also part of the task: one person throws the dice, another reads the question, the next one answers, and so on (a short sketch of the number-to-question rule follows the question list below), for example:
• What’s your friend’s name?
• Could you spell your surname, please?
• What’s your favorite school subject?
• What subject do you dislike, if any?
• What’s your favorite food?
• What are your hobbies?
• How many sisters or brothers do you have?
• What does your (older) sister/brother do?
• What pets do you have?
• What TV programs do you watch?
• How often do you do sports? Etc.
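For readers who would like to see the number-to-question rule of task 10 concretely, here is a minimal simulation sketch in Python; the placeholder question list, the function name, and the turn logic are invented for illustration and are not part of the original task materials.

import random

# One turn of the dice board game in task 10 above (illustrative sketch only).
# The "add" rule needs 11 questions (sums 2-12); the "multiply" rule needs a
# list numbered 1-36, although some products (e.g., 7) can never be rolled.

def take_turn(questions, rule="add"):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    number = d1 + d2 if rule == "add" else d1 * d2
    offset = 2 if rule == "add" else 1  # lowest number on the question list
    print(f"Dice show {d1} and {d2} -> answer question {number}:")
    print(questions[number - offset])

# Placeholder personal questions numbered 2-12 for the "add" rule.
personal_questions = [f"Sample question {n}?" for n in range(2, 13)]
take_turn(personal_questions, rule="add")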
11. This is a paired task. Think about a famous person. Introduce the person by
telling five important things about them (their age, nationality, hobbies,
where they live, etc.). The other person should guess who it is. Then it is his/
her turn.
12. Students choose one picture (from a picture dictionary) out of a choice of, for example, six. They are asked the following questions: 1. Please describe the picture you chose. What can you see in it? 2. Who are the people in the picture? 3. What are they doing? 4. How is this home (kitchen/garden/town/village/supermarket) similar to your home (kitchen, etc.)? 5. What are the differences between your home (kitchen, etc.) and the home in this picture?
13. This is a paired task. There is a list of 99 questions and slips of paper numbered 1–99. Students take turns picking a slip facing down, reading the question corresponding to that number on the list, and answering it. The task could also be used with an adult interlocutor.
14. Describing pictures to one another. Students work in pairs. They both look at
the same nine pictures (for example about a girl’s hobbies). They take turns and
their partners need to point to the picture they describe (so listening and speak-
ing are integrated in the task).
15. Tell a story shown in pictures. For example, nine small pictures show a story:
The Story of a Giraffe Family. This is a paired or individual task. By describing
the pictures the story unfolds.
5.6.3 Reading
1. Match pictures and words. Read out the words as you match them. Pictures and words are printed on one page in random order. It can be an individual or a paired task.
2. Match picture cards and word cards. Read out the words as you match them. It
is a paired task.
3. Read out the words on word cards. Paired task with turn taking.
4. Find words with similar meanings. Read the words and find their synonyms in
a list.
5. Find opposites of the words. Read the words and find their opposites in a list.
6. Read out familiar short sentences under pictures in a picture story. Reading
aloud task.
7. Look at pictures and match them with short texts describing them.
8. Read short definitions/descriptions and match them with words.
9. Read the sentences and match them with pictures from the story.
10. Read out short instructions on slips one by one. Your partner should act accordingly. (Drink your tea! Brush your hair!) Reading aloud task.
11. Read the questions of a short dialogue. Match them with their answers (multiple choice).
12. Read the questions of a short dialogue. Match them with their answers (multiple matching).
13. Read a short text with a title. Answer questions by finding specific information
in the text. Multiple choice short answers.
14. Read short texts with no titles for holistic understanding. Choose titles from four options.
15. Read a short text. Answer questions on specific information in the text. Write
short answers.
16. Read a short gapped text with missing words (form, invitation, letter, story, description). Fill in the missing words from a given list. Multiple matching – more items than gaps.
17. Read a text with missing phrases/expressions. Fill in the missing phrases from a given list. Multiple matching.
18. Read a text with missing sentences. Fill in the missing sentences from a given list. Multiple matching.
19. Match the titles of books with pictures on book covers.
20. Match titles of books, stories, or films with short ads or descriptions of them (about 20–30 words). Multiple matching task.
21. Match quiz questions (where, why, what, who, which, how, how many) with
answers. Multiple matching task.
22. Match public signs with where they can be found. Multiple choice or multiple
matching tasks.
23. Match short texts on postcards with the pictures on them (where they come from, pictures of places, what people are doing, etc.).
24. Read a text and complete a timetable or chart with the information in the text.
25. Read a text and fill in the missing information in a picture, map, or diagram.
26. Draw lines between words in a list and things in a picture (e.g., a bathroom or
market).
27. Choose pictures showing the place where short written dialogues take place –
multiple matching.
28. Choose places (cinema, swimming pool, at home) where short written dia-
logues take place – multiple matching.
5.6.4 Writing
1. Copy words in categories. Look at the list of nine words. Copy the words under the category (umbrella term) where they belong. E.g.: foods and drinks; plants and animals; black, white, other colors.
2. Look at pictures and words in random order (e.g., fruits). Copy the names of the
fruits under the pictures.
3. Fill in the missing letters in words (1 line = 1 letter): ele_ _ant, hors_, crocod_l_, do_, etc. Choose letters from the list: g, e, p, e, h, etc.
4. Fill in missing letters in words: no letters are given, but, for example, all are
drinks or animals.
5. Write down ten words after dictation. All of them are colors or parts of the body, etc.
6. Write down five short sentences after dictation (the text is a story or description with a title). Every sentence is dictated twice, then the whole text once more.
7. Look at a picture of a house/park… Some animals/people are hiding there. Finish the sentences by adding words.
8. Fill in words in gapped story or description. Choose from list of items. Multiple
matching.
9. Fill in words in a timetable, chart, or shopping list, where most of the information is already in place; the remaining items should be chosen from a list (multiple matching), e.g., school subjects, breakfast, lunch, dinner.
10. Read a short text. Answer questions with specific information in the text. Write
short answers.
11. Fill in personal data in a form. A short text is given about the person whose data are to be filled in. (Integrating reading.)
12. Picture description: write short sentences about a picture. For example, what
are children doing in a park? Write as much as you can about what they are
doing.
13. Picture description: compare two pictures. Write about five differences.
14. Write a short personal letter/card in response to a letter/card worded similarly.
15. Write down some information after dictation, e.g., a shopping list.
16. Error correction based on pictures (reading integrated). Look at the pictures and the sentences. Something is wrong in every sentence; correct the mistakes.
17. Write down what animals/vehicles/foods/drinks/sports you can see in the
pictures.
The aim of this chapter was to share the findings of a national project implemented in Hungary. At the initial stage, we looked for sources to draw on and found some useful materials and ideas; nevertheless, it took considerable work and effort to design and create what we finally came up with. Having developed a framework, a list of can do statements, topics, and task types, and having learnt many lessons in the process, we assume that colleagues developing frameworks and tests for young learners may be interested in them and that, after critical review, some of these ideas may prove useful and relevant in other situations. We hope some of the outcomes can be adopted in new educational contexts and that readers may find them relevant not only for EFL but also for other foreign languages.
The chapter gave insights into the outcomes of a diagnostic assessment project in which an assessment for learning approach was applied; the tasks, however, could also be considered for other assessment purposes. The chapter presented the most
important characteristics of young language learners and how they learn a FL; it
also outlined the main principles of assessing children. As was shown, based on the
framework and the lists of can do statements, text types and task types, over 200
new diagnostic tests were developed and piloted in the second phase of the project.
Findings on various aspects of the piloting phase, involving a large sample of young learners and their teachers of EFL in the first few grades of primary school, were published in English. Publications explored teachers' views on tasks that work (Hild & Nikolov, 2011) and how the tests were piloted and their difficulty levels established (Nikolov & Szabó, 2011a, 2011b, 2012a); children's feedback on the actual tests was also analyzed (Szabó & Nikolov, 2013). In the next phase, these calibrated diagnostic tests are to be made available to teachers for classroom use in an online database.
In addition to these ideas, the framework and the task types could be used in
teacher education programs to explore to what extent they would meet the needs of
children and their teachers in various contexts. Also, the actual tasks could serve as
excellent materials for small-scale classroom research; both in-service and pre-
service teachers could experiment with them and explore how they work with their
learners in specific contexts and why. The tasks could be further developed, and similar tasks could be designed and piloted on new topics. Finally, yet another perspective for further classroom research is offered by asking learners, after completing tasks, about the extent to which they liked the tasks, were familiar with them, and found them easy or difficult. By involving learners in these discussions after task completion, teachers may gain valuable insights into their learners' experiences, tailor their teaching to learners' needs, and develop their young language learners' self-assessment and autonomy.
Acknowledgements Special thanks go to members of the team for developing the can do state-
ments, drawing up the list of topics and task types based on curricula, teaching materials and
research findings. The team included twelve anonymous classroom teachers with over a decade of
teaching experience in Hungarian classrooms and the following experienced teachers of young
learners and experts on assessment and early language learning research: Lidia Bors, Judit Font,
Gabriella Hild, Csilla Kiss, Réka Lugossy, Ildikó Pathó, Gábor Szabó, Zsófia Turányi (Hungary), and four international experts: Heini-Marja Järvinen (Finland), Lucilla Lopriore (Italy), Jelena Mihaljević Djigunović (Croatia), and Karmen Pižorn (Slovenia).
I am grateful to Jelena Mihaljević Djigunović for her helpful comments on the first draft of this
chapter.
References
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning
and assessment. London: Continuum.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation.
Cambridge: Cambridge University Press.
Alexiou, T. (2009). Young learners' cognitive skills and their role in foreign language vocabulary
learning. In M. Nikolov (Ed.), Early learning of modern foreign languages: Processes and
outcomes (pp. 46–61). Clevedon/Avon: Multilingual Matters.
Assessment Reform Group. (2002). Assessment for learning. Retrieved from http://www.assessmentforlearning.edu.au/default.asp
Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford: Oxford University
Press.
Bacsa, É., & Csíkos, C. (2016). The role of individual differences in the development of listening
comprehension in the early stages of language learning. In M. Nikolov (Ed.), Assessing young
learners of English: Global and local perspectives. New York: Springer.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education,
5(1), 7–71.
Bors, L., Lugossy, R., & Nikolov, M. (2001). Az angol nyelv oktatásának átfogó értékelése pécsi
általános iskolákban [A comprehensive evaluation study of the teaching of English in Pecs
primary schools]. Iskolakultúra, 11(4), 73–88.
Council of Europe. (2001). Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge: Cambridge University Press.
Csapó, B., & Csépe, V. (2012). Framework for diagnostic assessment of reading. Budapest:
Nemzeti Tankönyvkiadó.
Csapó, B., & Nikolov, M. (2009). The cognitive contribution to the development of proficiency in
a foreign language. Learning and Individual Differences, 19, 203–218.
Csapó, B., & Szendrei, M. (Eds.). (2011). Framework for diagnostic assessment of mathematics.
Budapest: Nemzeti Tankönyvkiadó.
Csapó, B., & Szabó, G. (Eds.). (2012). Framework for diagnostic assessment of science. Budapest:
Nemzeti Tankönyvkiadó.
Csapó, B., & Zsolnai, A. (Eds.). (2011). A kognitív és affektív fejlődés diagnosztikus mérése az iskola kezdő szakaszában [Diagnostic assessment of cognitive and affective development in the first school years]. Budapest: Nemzeti Tankönyvkiadó.
Cummins, J. (2000). Language, power and pedagogy: Bilingual children in the crossfire. Clevedon/
Avon: Multilingual Matters.
Curtain, H. A., & Dahlberg, C. A. (2010). Languages and children – Making the match: New lan-
guages for young learners (4th ed.). Needham Heights, MA: Pearson Allyn & Bacon.
Davison, C., & Leung, C. (2009). Current issues in English language teacher-based assessment.
TESOL Quarterly, 43, 393–415.
DeKeyser, R., & Larson-Hall, J. (2005). What does the critical period really mean? In J. F. Kroll &
A. M. B. De Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches
(pp. 88–108). Oxford: Oxford University Press.
Dörnyei, Z., Csizér, K., & Németh, N. (2006). Motivation, language attitudes and globalisation: A
Hungarian perspective. Clevedon, England: Multilingual Matters.
Dweck, C. (2006). Mindset: The new psychology of success. New York: Ballantine Books.
Eurobarometer. (2006). Europeans and their languages. Brussels: European Commission.
Eurobarometer. (2012). Europeans and their languages. Brussels: European Commission.
García Mayo, M. P., & García Lecumberri, M. L. (Eds.). (2003). Age and the acquisition of English
as a foreign language. Clevedon/Avon: Multilingual Matters.
Hasselgren, A. (2005). Assessing the language of young learners. Language Testing, 22,
337–354.
Hild, G., & Nikolov, M. (2011). Teachers’ views on tasks that work with primary school EFL
learners. In M. Lehmann, R. Lugossy, & J. Horváth (Eds.), UPRT 2010: Empirical studies in
English applied linguistics (pp. 47–62). Pécs: Lingua Franca Csoport. Retrieved from http://
mek.oszk.hu/10100/10158
Hung, Y.-J., Samuelson, B. L., & Chen, S.-C. (2016). The relationships between peer- and self-
assessment and teacher assessment of young EFL learners’ oral presentations. In M. Nikolov
(Ed.), Assessing young learners of English: Global and local perspectives. New York: Springer.
Kiss, C., & Nikolov, M. (2005). Preparing, piloting and validating an instrument to measure young
learners’ aptitude. Language Learning, 55, 99–150.
Krashen, S. (1985). The input hypothesis: Issues and implications. New York: Longman.
Leung, C., & Scott, C. (2009). Formative assessment in language education policies: Emerging
lessons from Wales and Scotland. Annual Review of Applied Linguistics, 29, 64–79.
MacWhinney, B. (2005). A unified model of language development. In J. F. Kroll & A. M. B. De
Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 49–67). Oxford:
Oxford University Press.
McKay, P. (2006). Assessing young language learners. Cambridge: Cambridge University Press.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Oxford: Blackwell
Publishing.
Medgyes, P., & Nikolov, M. (2010). Curriculum development: The interface between political and
professional decisions. In R. Kaplan (Ed.), The Oxford handbook of applied linguistics (2nd
ed., pp. 264–274). Oxford: Oxford University Press.
Medgyes, P., & Nikolov, M. (2014). Foreign language learning and teaching in Hungary: A review
of empirical research literature from 2006 to 2012. Language Teaching, 47(4), 504–537.
Mihaljević Djigunović, J. (2001). Do young learners know how to learn a foreign language? In
Y. Vrhovac (Ed.), Children and Foreign Languages III (pp. 57–71). Zagreb: Faculty of Philosophy.
Mihaljević Djigunović, J. (2009a). Impact of learning conditions on young FL learners’ motiva-
tion. In M. Nikolov (Ed.), Early learning of modern foreign languages. Processes and out-
comes (pp. 75–89). Bristol, UK: Multilingual Matters.
Mihaljević Djigunović, J. (2009b). Individual differences in early language programmes. In
M. Nikolov (Ed.), The age factor and early language learning (pp. 199–225). Berlin/New
York: Mouton de Gruyter.
Mihaljević Djigunović, J. (2012). Early EFL learning in context – Evidence from a country case
study. London: The British Council.
Muñoz, C. (Ed.). (2006). Age and the rate of foreign language learning. Clevedon/Avon:
Multilingual Matters.
National core curriculum (NCC). (2007). Budapest: Oktatási és Művelődési Minisztérium.
National core curriculum (NCC). (2012). Budapest: EMMI.
Nikolov, M. (1999). “Why do you learn English?” “Because the teacher is short”. A study of
Hungarian children’s foreign language learning motivation. Language Teaching Research,
3(1), 33–56.
Nikolov, M. (2001). A study of unsuccessful language learners. In Z. Dörnyei & R. Schmidt (Eds.),
Motivation and second language acquisition (pp. 149–170). Honolulu, HI: The University of
Hawaii, Second Language Teaching and Curriculum Center.
Nikolov, M. (2002). Issues in English language education. Bern: Peter Lang AG.
Nikolov, M. (2003). Angolul és németül tanuló diákok nyelvtanulási attitűdje és motivációja
[Attitudes and motivation of English and German learners]. Iskolakultúra, XIII(8), 61–73.
Nikolov, M. (2008). “Az általános iskola, az módszertan!” Alsó tagozatos angolórák empirikus
vizsgálata [“Primary school means methodology!” An empirical study of lower-primary EFL
classes]. Modern Nyelvoktatás, 10(1–2), 3–19.
Nikolov, M. (2009). Early modern foreign language programmes and outcomes: Factors contribut-
ing to Hungarian learners’ proficiency. In M. Nikolov (Ed.), Early learning of modern foreign
languages: Processes and outcomes (pp. 90–107). Clevedon/Avon: Multilingual Matters.
Nikolov, M. (2011). Az angol nyelvtudás fejlesztésének és értékelésének keretei az általános iskola
első hat évfolyamán [A framework for developing and assessing proficiency in English as a
foreign language in the first six years of primary school]. Modern Nyelvoktatás, XVII(1), 9–31.
Nikolov, M., & Csapó, B. (2010). The relationship between reading skills in early English as a
foreign language and Hungarian as a first language. International Journal of Bilingualism, 14,
315–329.
Nikolov, M., & Józsa, K. (2006). Relationships between language achievements in English and
German and classroom-related variables. In M. Nikolov & J. Horváth (Eds.), UPRT 2006:
Empirical studies in English applied linguistics (pp. 197–224). Pécs: Lingua Franca Csoport,
PTE.
Nikolov, M., & Mihaljević Djigunović, J. (2006). Recent research on age, second language acqui-
sition, and early foreign language learning. Annual Review of Applied Linguistics, 26,
234–260.
Nikolov, M., & Mihaljević Djigunović, J. (2011). All shades of every color: An overview of early
teaching and learning of foreign languages. Annual Review of Applied Linguistics, 31, 95–119.
Nikolov, M., & Szabó, G. (2011a). Az angol nyelvtudás diagnosztikus mérésének és fejlesztésének lehetőségei az általános iskola 1–6. évfolyamán [Possibilities of developing English diagnostic tests for years 1–6 in the primary school]. In B. Csapó & A. Zsolnai (Eds.), A kognitív és affektív fejlődés diagnosztikus mérése az iskola kezdő szakaszában (pp. 13–40). Budapest: Nemzeti Tankönyvkiadó.
Nikolov, M., & Szabó, G. (2011b). Establishing difficulty levels of diagnostic listening compre-
hension tests for young learners of English. In J. Horváth (Ed.), UPRT 2011: Empirical studies
in English applied linguistics (pp. 73–82). Pécs: Lingua Franca Csoport. Retrieved from http://
mek.oszk.hu/10300/10396
Nikolov, M., & Szabó, G. (2012a). Developing diagnostic tests for young learners of EFL in grades
1 to 6. In E. D. Galaczi & C. J. Weir (Eds.), Voices in language assessment: Exploring the
impact of language frameworks on learning, teaching and assessment – Policies, procedures
and challenges, Proceedings of the ALTE Krakow Conference, July 2011 (pp. 347–363).
Cambridge: UCLES/Cambridge University Press.
Nikolov, M., & Szabó, G. (2012b). Assessing young learners’ writing skills: A pilot study of
developing diagnostic tests in EFL. In G. Pusztai, Z. Tóth, & I. Csépes (Eds.), Current
research in the field of disciplinary didactics (Hungarian Educational Research Journal,
Special Issue, Vol. 2, pp. 50–62). Retrieved from http://herj.hu/2012/08/
marianne-nikolov-and-gabor-szabo-assessing-young-learners%E2%80%99-writing-
skills-a-pilot-study-of-developing-diagnostic-tests-in-efl/
Papp, S., & Salamoura, A. (2009). An exploratory study into linking young learners’ examinations
to the CEFR (Research Notes, 37, pp. 15–22). Cambridge: Cambridge ESOL.
Paradis, M. (2004). A neurolinguistic theory of bilingualism. Amsterdam: John Benjamins.
Paradis, M. (2009). Declarative and procedural determinants of second languages. Amsterdam:
John Benjamins.
Pinter, A. (2006). Verbal evidence of task-related strategies: Child versus adult interactions.
System, 34, 615–630.
Pinter, A. (2007a). Benefits of peer-peer interaction: 10-year-old children practising with a com-
munication task. Language Teaching Research, 11, 189–207.
Pinter, A. (2007b). What children say: Benefits of task repetition. In K. Van den Branden, K. Van
Gorp, & M. Verhelst (Eds.), Task-based language education from a classroom-based perspec-
tive (pp. 126–149). Cambridge: Cambridge Scholars Publishing.
Pižorn, K. (2009). Designing proficiency levels for English for primary and secondary school
students and the impact of the CEFR. In N. Figueras & J. Noijons (Eds.), Linking to the CEFR
levels: Research perspectives (pp. 87–102). Arnhem: Cito/EALTA.
Rixon, S. (2016). Do developments in assessment represent the ‘coming of age’ of young learners
English language teaching initiatives? The international picture. In M. Nikolov (Ed.), Assessing
young learners of English: Global and local perspectives. New York: Springer.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics,
11, 129–158.
Scovel, T. (2000). A critical review of the critical period research. Annual Review of Applied
Linguistics, 20, 213–223.
Singleton, D., & Ryan, L. (2004). Language acquisition: The age factor (2nd ed.). Clevedon/Avon:
Multilingual Matters.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Szabó, G., & Nikolov, M. (2013). An analysis of young learners' feedback on diagnostic listening comprehension tests. In J. Mihaljević Djigunović & M. Medved Krajnović (Eds.), UZRT 2012: Empirical studies in English applied linguistics (pp. 7–21). Zagreb: FF Press.
Sternberg, R. J., & Grigorenko, E. L. (2002). Dynamic testing: The nature and measurement of
learning potential. Cambridge: Cambridge University Press.
Swain, M. (2000). The output hypothesis and beyond: Mediating acquisition through collaborative
dialogue. In J. P. Lantolf (Ed.), Sociocultural theory and second language learning (pp. 97–114).
Oxford: Oxford University Press.
Teasdale, A., & Leung, C. (2000). Teacher assessment and psychometric theory: A case of para-
digm crossing? Language Testing, 17(2), 163–184.
Ullman, M. (2001). The neural basis of lexicon and grammar in first and second language: The
declarative/procedural model. Bilingualism: Language and Cognition, 4, 105–122.
Wilden, E., & Porsch, R. (2016). Learning EFL from Year 1 or Year 3? A comparative study on
children’s EFL listening and reading comprehension at the end of primary education. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Wong Fillmore, L. (1998). Supplemental declaration of Lily Wong Fillmore. Retrieved from http://
www.humnet.ucla.edu/humnet/linguistics/people/grads/macswan/fillmor2.htm
Examining Content Representativeness
of a Young Learner Language Assessment:
EFL Teachers’ Perspectives
Ching-Ni Hsieh
Abstract This study aims to provide content validity evidence for the new young
language learner assessment—TOEFL Primary—a test designed for young learners
ages 8 and above who are learning English in English as a Foreign Language (EFL)
contexts. The test focuses on core communication goals and enabling language
knowledge and skills represented in various EFL curricula. A panel of 17 experi-
enced EFL teachers, representing 15 countries, participated in the study. The teach-
ers evaluated the relevance and importance of the knowledge, skills, and abilities
(KSAs) assessed by the reading and listening items of TOEFL Primary. Content
Validity Indices (CVIs) (Popham, Appl Meas Educ 5(4):285–301, 1992) were used
to determine the degree of match between the test contents and the target constructs
and the importance of the KSAs assessed for successful classroom performance.
Results showed that the majority of the items had an average CVI above the cut-off
value of .80, indicating that the items measured what they were intended to measure
and that the KSAs assessed were important for effective classroom performance,
supporting the claim about using the test scores to support language teaching and
learning.
1 Introduction
Content representativeness, that is, the extent to which test content adequately represents the target domain, bears directly on what is measured and the validity of the inferences drawn from the test scores (D'Agostino,
Karpinski, & Welsh, 2011; Haynes, Richard, & Kubany, 1995; So, 2014; Yalow &
Popham, 1983). The study reported here examines the degree of content representa-
tiveness within the context of a new young learner language assessment, TOEFL
Primary, with the goal of providing an important piece of content validity evidence
for the test.
As the number of young English language learners worldwide continues to grow,
so too does the need for language assessments designed to target this population
(McKay, 2006; Nikolov, 2016, in this volume). While several language assessments
have been developed to serve the needs of these learners (e.g., Cambridge English:
Young Learners English Tests; TOEFL Primary; TOEFL Junior), theoretical and
empirical knowledge about the assessment of young language learners remains
underdeveloped. For instance, relatively little is known about the target language
use (TLU) domains for English communication among young learners. What is
clear, however, is that language tasks designed for young learners need to take into
consideration factors such as learners’ shorter attention span (Robert, Borella,
Fagot, Lecerf, & De Ribaupierre, 2009), memory capacity (Cho & So, 2014), longer
processing time (Berk, 2012), developing literacy, and limited exposure to and
experience of the world—factors that are distinct from those relevant to the assess-
ments of adult learners of English as a Second (or Foreign) Language (ESL/EFL).
Given these differences, it is critical for language test developers and researchers to
better comprehend how the test contents of young learner assessments reflect and
meet the communication needs of young learners and how individual characteristics
of students should influence test design.
TOEFL Primary is a new young learner language assessment developed by
Educational Testing Service (ETS). The test is designed for young learners ages
eight and above who are learning English in EFL contexts. The test measures three
English language skills: listening, reading, and speaking. Listening and reading are
offered in two steps, i.e. Step 1 (low level) and Step 2 (high level), to reflect the wide
range of language proficiency exhibited among the target population. The speaking
test is designed for language learners at many different proficiency levels of English,
from beginners to more proficient speakers, and thus is not separated into different
steps. The test items of TOEFL Primary cover a set of communication goals, a range
of difficulty, and various item types. The test is intended to support language teach-
ing and learning by providing meaningful information about the test takers' current
English ability. EFL teachers can use the test to guide their teaching goals, monitor
student progress, and identify students’ strengths and weaknesses in different areas
of language use. The test scores can also be used for placement purposes if the test
content corresponds to or is relevant to the content of the EFL curriculum that the
students are exposed to. However, the test is not intended to support high-stakes
decisions such as to inform admission decisions or to evaluate teachers’
performances.
2 Literature Review
The link between test content and EFL curricula is an important facet in establishing
content validity for tests that are developed to provide instructional support. Two
studies that examined the relationships between test contents and course contents
(Fleurquin, 2003; Wu & Lo, 2011) have specific implications for the current study.
Fleurquin reported the process of developing and validating the Alianza Certificate of
Elementary Competence in English (ACECE), a standardized test of American
English that measures young learners’ English communication skills within the
context of elementary schools in Uruguay. To examine content validity of the
ACECE, the research team enlisted experienced EFL teachers to compare the gram-
mar structures and vocabulary categories assessed in the test with the contents of
three textbooks used with the target population in local schools. The comparison
showed that the majority of the grammar structures and vocabulary assessed in the
test matched those presented in the textbooks that the students had used during their
school years, providing evidence to support the alignment between the content of
the ACECE and the three textbooks. Specific comments regarding the test items and
stimulus materials provided by the EFL teachers were also used to inform test
revisions.
Wu and Lo (2011) investigated the relationship between a standardized English
language proficiency test for young children, the Cambridge English: Young
Learners English (YLE) Tests, and the EFL teaching practices at the elementary
level in Taiwan. The study aimed to inform local teachers regarding whether the
YLE tests were suitable for young learners in Taiwan. The researchers compared the
Grades 1–9 Curriculum Guidelines published by the Ministry of Education in
Taiwan and a popular series of English textbooks published by a local publisher
with the content of the YLE. The comparison was conducted in six aspects: topics,
grammar and structures, communication functions, competence indicators, vocabu-
lary, and tasks. Results showed a moderate to high degree of alignment between the
YLE and the local teaching practices with regard to the six aspects of the compari-
son and highlighted a gap between the two in terms of cultural differences between
Taiwan and the UK as manifested in the wordlists introduced. Taken together, the
use of expert teacher judgments in Fleurquin (2003) and Wu and Lo (2011) has
proven useful in helping researchers and test developers determine content align-
ment between young learner language assessments and EFL curricula in different
EFL contexts and identify aspects of misalignment to inform test revisions.
It needs to be noted that in content validation studies that use expert judgments,
a criterion (i.e., cut-off point) is required to ensure the quality of the judgments.
While both Fleurquin (2003) and Wu and Lo (2011) used expert teachers to evaluate
the alignment between test content and local teaching practices, neither study
employed a definite cut-off value, leaving the determination of the tests' content representativeness open. Since one major purpose of content validation studies is to ensure that the test contents reflect what they are intended to measure, a criterion for making that decision is critical for judging the quality of the test content. The more
stringent the criterion is, the more confidence that can be placed in positive apprais-
als of the test content (Popham, 1992).
In this study, I examined the content representativeness of TOEFL Primary using
a traditional content validity approach based on the computation of a Content
Validity Index (CVI) (Davis, 1992; Lynn, 1986) with a predetermined criterion. The
CVI approach entails a panel of expert judges evaluating whether each test item on an assessment instrument is relevant to the target construct being measured. The percentage of items rated as relevant by each judge and the average
of the percentages across the judges are reported as an indication of the degree of
“content validity”, or more appropriately, content representativeness in this case.
The use of CVIs to determine content representativeness is widely cited in test
development literature for teacher licensure tests (Crocker, Miller, & Franks, 1989;
Popham, 1992), nursing research (Davis, 1992; Polit & Beck, 2006) and social work
research (Rubio, Berg-Weger, Tebb, Lee, & Rauch, 2003), but to the best of my
knowledge, they have not been widely used for tests of second language
proficiency.
During the initial stage of test development for TOEFL Primary, the researchers and test developers at ETS set out to conduct a two-stage process for establishing the content validity of the test (Lynn, 1986; Sireci, 1998, 2007). The first stage, or
‘Developmental Stage,’ involves the identification of the domain of content through
a comprehensive review of relevant literature and domain analysis of language use
in EFL classrooms—the TLU domain. The domain descriptions were enhanced by
the development team’s review of EFL curricula and textbooks used in nine coun-
tries, including Brazil, Chile, China, Egypt, Japan, Korea, the Philippines, Qatar,
and Singapore (Turkan & Adler, 2011). Results of the domain analysis helped
define the construct of English communication for young learners. A set of com-
munication goals that are unique to young learners’ communicative needs and the
language knowledge and skills required to fulfill these communication goals are
incorporated in the construct definitions. The communication goals targeted also
helped test developers identify specific text types that young learners encounter in
their EFL reading and listening materials and the various types of speaking activi-
ties that young learners engage in within EFL classrooms. A variety of test tasks associ-
ated with specific communication goals are developed for the test.
The second stage, the 'Judgment/Quantification' stage of content validation (Lynn, 1986), is twofold for TOEFL Primary, involving a teacher
survey on the pilot-test items and a panel judgment of the operational test items—
i.e. the current study. During pilot testing of TOEFL Primary, a teacher survey study
was conducted at local testing sites where TOEFL Primary was piloted. The survey
aimed to gather EFL teachers' feedback on the importance and relevance of the set of communication goals assessed.
4 Method
4.1 Participants
A panel of 17 EFL teachers served as the expert judges in this study. The panel of
judges was formed, to the extent possible with a relatively small sample, to have
representation by gender, professional background, and geographic location.
Participants were selected from a large pool of EFL teachers based on their exper-
tise in young learner EFL curricula and professional experience. All teachers had
experience teaching young learners similar to the target population for TOEFL
Primary, i.e. ages eight and above. Fifteen countries (Brazil, China, France, Greece,
Japan, Jordan, Kazakhstan, Mexico, Peru, Russia, Slovakia, South Korea, Spain,
Sweden, and Vietnam) were represented. The teachers were between the ages of 25
and 52 (Mean = 38.9, SD = 7.3). Their years of teaching EFL ranged from 3 to 29
years (Mean = 14.9 years, SD = 7.0). Table 1 shows the demographic information of
the teachers.
4.2 Materials
The rating materials used in this study consisted of operational listening (N = 57)
and reading (N = 57) test items of TOEFL Primary. These items were carefully cho-
sen by the test developers at ETS to cover all the targeted communication goals of
TOEFL Primary, the full range of difficulty, and all item types (see Table 2). The
number of items per item type reflected that of the operational form. The total num-
ber of the listening and reading items included in the study was larger than the
number in an operational form because these items covered the two difficulty levels
of TOEFL Primary. The inclusion of items from both steps was considered impor-
tant to ensure a comprehensive coverage of the difficulty range of the test. Including
more items in the study was also thought to produce more stable judgments overall.
The speaking section was not included in the study due to time and resource con-
straints in data collection.
4.3 Instrument
To ensure that the rating task was clear to participants during the alignment exercise, the questionnaire response formats and
scales underwent multiple rounds of trials and revisions prior to data collection. The
final survey instrument consisted of two subsections. Section I included seven parts,
each corresponding to one listening item type. Section II included eight parts, each
corresponding to one reading item type. The KSAs assessed in each item type were
provided in the questionnaire to facilitate the evaluation process.
4.4 Procedures
The 17 EFL teachers were invited from their countries to the ETS campus in Princeton,
New Jersey, to participate in the study. Each teacher was supplied with (a) a back-
ground questionnaire that was used to gather the teachers’ biographical informa-
tion, (b) a test booklet that contained the 57 listening and 57 reading test items, (c)
a copy of the scripts for the listening items, and (d) the content alignment question-
naire for the evaluation of the test items. Prior to the day of the content alignment
exercise, all teachers took the TOEFL Primary test and reviewed documents on the
test design framework and scoring guidelines to become familiar with the test con-
structs, design, and scoring criteria. On the day of data collection, the teachers first
completed the background questionnaire and then were instructed to make
judgments on two aspects of the content representativeness of each item using the
content alignment questionnaire. The two aspects were the content relevance of the items and the importance of the KSAs assessed by the TOEFL Primary test items. In addition to
the content alignment exercise, five teachers (from France, Jordan, Mexico, Peru,
and Spain) agreed to participate in follow-up interviews that were conducted after
the analyses of the rating data. The interviews focused on (1) the teachers’ views
about specific aspects of the test contents that the teachers considered less important
or relevant to their own teaching practices and (2) how the teachers used the differ-
ent types of texts and item types in their respective EFL classrooms.
4.5 Content Alignment Judgments
The two aspects of content alignment judgments the teachers were asked to perform
are described as follows.
(1) Content relevance
The first judgment asked the teachers to evaluate the degree to which the content
of each item reflected the target construct it was intended to measure. Congruent with
Lynn’s (1986) item relevance rating rules, judges were asked to provide the rele-
vance ratings on a Likert scale with four possible responses: no reflection, slight
reflection, moderate reflection and strong reflection. Responses of ‘moderate reflec-
tion’ and ‘strong reflection’ were regarded as indications of teachers’ endorsement
of the content relevance of the items, whereas responses of ‘no reflection’ and
‘slight reflection’ indicated the opposite. The responses were dichotomized in this
fashion in order to facilitate summary evaluations.
(2) The importance of the KSAs assessed
The second judgment required the teachers to rate the importance of the KSAs
required of young EFL learners for successful classroom performance in their own
teaching contexts. The importance ratings, also on a 4-point Likert scale (Lynn,
1986), had four different labels: not important, somewhat important, important and
very important. Responses of ‘important’ and ‘very important’ indicated teachers’
agreement on the importance of the KSAs assessed, whereas responses of ‘not
important’ and ‘somewhat important’ indicated the opposite. As with the content
relevance ratings, the importance ratings were also dichotomized.
4.6 Analysis
To answer the research questions, individual ratings provided by the 17 judges were
pooled and the CVIs for each item were calculated for evaluating the degree of
content relevance and importance of the KSAs assessed in the TOEFL Primary test
items (Davis, 1992; Lynn, 1986; Polit & Beck, 2006). The analyses of the degree of
content representativeness of the test items are described below.
(1) CVIs for content relevance
For the content relevance ratings, the CVI for each item was calculated by count-
ing the number of judges who rated that item as either ‘moderate reflection’ or
‘strong reflection’ and dividing that number by the total number of judges. The CVI
calculated for each item provided information about the proportion of judges who
considered an item as content relevant. The CVIs for the listening and reading sec-
tions were defined as the proportion of items on the section that achieved a rating of
‘moderate reflection’ or ‘strong reflection’ across all judges. The CVIs for listening
and reading sections were derived, respectively, by averaging the CVIs across the 57
items for each section.
(2) CVIs for the importance of the KSAs assessed
For the importance of the KSAs assessed, the CVI for each item was calculated
by counting the number of judges who rated the item as either ‘important’ or ‘very
important’ and dividing that number by the total number of judges. The CVI calcu-
lated for each item provided information about the proportion of judges who con-
sidered the KSAs assessed by an item as important for successful classroom
performance. The CVIs for the listening and reading sections were defined as the
proportion of items on the section that achieved a rating of ‘important’ or ‘very
important’ across all judges. The CVIs for listening and reading sections were
derived, respectively, by averaging the CVIs across the 57 items for each section.
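To make these computations concrete, the following minimal sketch, written in Python purely for illustration, derives item-level and section-level CVIs from dichotomized ratings; the sample ratings, function names, and values are invented for demonstration and do not come from the study.

# Illustrative sketch of the CVI computations described above (not the ETS procedure).
# Ratings use the 4-point scales described earlier: 1 = no reflection / not important
# ... 4 = strong reflection / very important; ratings of 3 or 4 count as endorsements.

def item_cvi(ratings):
    """Proportion of judges endorsing an item (rating of 3 or 4)."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)

def section_cvi(rating_lists):
    """Section CVI: the average of the item CVIs across all items."""
    return sum(item_cvi(r) for r in rating_lists) / len(rating_lists)

# Hypothetical ratings: three items, each rated by a 17-member panel.
ratings_by_item = [
    [4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4],  # 17/17 endorse -> 1.00
    [4, 3, 2, 4, 3, 4, 1, 3, 4, 4, 2, 3, 4, 4, 3, 4, 4],  # 14/17 endorse -> .82
    [2, 3, 2, 1, 3, 2, 1, 3, 2, 4, 2, 3, 2, 1, 3, 2, 4],  #  7/17 endorse -> .41
]

for i, ratings in enumerate(ratings_by_item, start=1):
    print(f"Item {i}: CVI = {item_cvi(ratings):.2f}")
print(f"Section CVI = {section_cvi(ratings_by_item):.2f}")

The same computation applies to both the relevance and the importance ratings; only the endorsement labels differ.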
To determine the degree to which TOEFL Primary test items reflect the target
constructs and assess the important KSAs required of young learners, a CVI of .80
was used as the acceptable criterion, following Davis (1992). This criterion is
widely used in the literature for determining content representativeness of new
assessments (e.g., Rubio et al., 2003). This cut-off value indicates that, when a total
of 17 judges are considered, at least 14 agree that the items reflect the intended tar-
get constructs or that the KSAs assessed are important for successful classroom
performance.
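As a quick arithmetic check of this criterion: .80 × 17 = 13.6, and because the number of endorsing judges must be a whole number, an item can meet the criterion only when at least 14 of the 17 judges endorse it (14/17 ≈ .82, whereas 13/17 ≈ .76 would fall below the cut-off).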
5 Results
Descriptive statistics of the content relevance ratings and the average CVIs for each
item type are provided in Table 3. As the table shows, all listening item types had an
average CVI above .80. The CVI for the Listening section was .95, clearly above the
cut-off criterion. Similarly, all the reading items and item types had a CVI above the
cut-off value of .80. The CVI for the Reading section was .95, indicating excellent
content relevance.
Descriptive statistics of the importance ratings and the average CVIs for each item
type are provided in Table 4. The table shows that six of the seven listening item types had an average CVI above .80; the exception was 'Academic Monologue.' The
‘Academic Monologue’ item type is only present in Step 2 of TOEFL Primary. The
item type requires test takers to listen to a monologue spoken by a teacher or another
adult teaching academic content to students. The test takers then answer three
multiple-choice comprehension questions. These questions assess the students’
abilities to understand spoken informational texts and require test takers to have
knowledge of organization features of expository texts and the ability to understand
key information in a monologue.
A similar degree of agreement among the judges is seen in the Reading section.
The majority of the reading item types had a CVI above .80, with the exception of
‘Telegraphic Sets’, which had a borderline CVI of .79. The ‘Telegraphic Sets’ item type
is present both in Step 1 and Step 2 of TOEFL Primary. This item type asks test tak-
ers to answer multiple-choice questions by locating the relevant information in tele-
graphic texts in which language is presented in the form of single words, phrases, and short sentences. Commonly used stimulus materials include posters, menus, schedules, and
advertisements. The slightly lower CVI of .79 was considered negligible, given that the majority of judges still rated the KSAs assessed by the ‘Telegraphic Sets’ items as important.
To summarize, the results of the importance of the KSAs assessed by TOEFL
Primary indicate high agreement among the judges. The Listening and Reading sec-
tions both had an average CVI of .89, suggesting that the majority of the teachers
Table 4 Descriptive statistics and average CVIs for the importance of the KSAs assessed
Listening item type Mean S.D. CVI
Listen and match 3.55 0.22 0.94
Follow instructions 3.55 0.14 0.92
Question/response 3.37 0.18 0.82
Dialogue 3.55 0.07 0.96
Social-navigational monologue 3.61 0.09 0.90
Narrative set 3.70 0.11 0.95
Academic monologue 3.26 0.05 0.72
Reading item type Mean S.D. CVI
Match picture to word 3.69 0.05 0.91
Match picture to sentence 3.76 0.12 0.97
Sentence clues 3.61 0.14 0.92
Telegraphic sets 3.79 0.93 0.79
Correspondence 3.48 0.11 0.84
Instructional texts 3.50 0.09 0.86
Narrative sets 3.68 0.12 0.97
Expository paragraph 3.49 0.07 0.88
considered that the KSAs assessed were important for their respective language
teaching contexts.
6 Discussion
This study used CVIs as a research methodology to evaluate the degree of content
representativeness of TOEFL Primary. A representative panel of experts was con-
vened to evaluate the degree of match between the test construct and the content of
the listening and reading items of the test and to evaluate the importance of the
KSAs assessed. The expert teachers’ judgments were used as the criterion on which
the content-related evidence of validity was based. Results of the study suggest that
TOEFL Primary test content largely reflects the target construct being measured and
covers the important domains of language knowledge and skills EFL learners are
required to possess in order to perform successfully in EFL classrooms.
The content alignment exercise performed by the expert judges identified one
listening item type, ‘Academic Monologue,’ that had slightly lower agreement
among the judges, warranting further discussion. As described earlier, the “Academic
Monologue” items assess test takers’ ability to understand expository texts in a
lecture and are more difficult items for the target population. These items were per-
ceived as less important perhaps because the listening input was relatively long and, for younger learners or lower-proficiency students, the cognitive load posed by the stimulus materials might be overwhelming. It may also be the case that the
“Academic Monologue” is designed for learners at a higher proficiency level than the one the participating teachers were familiar with or were currently teaching, and it was thus considered less important or relevant to their given
contexts. Follow-up interviews with the EFL teachers help to explain the
results seen here. One Peruvian teacher, who had 21 years of experience teaching
beginner to intermediate English for young learners, indicated that her students had
limited exposure to this type of listening input and thought that the academic mono-
logues were too demanding for her students. She said: “We do not have that kind of
exercise in the textbook or any other listening task we use in class; we consider this
kind of exercise a bit demanding for our students who do not have access to that
kind of input neither in their schools nor in their daily lives.”
Other teachers interviewed generally had a positive view about the inclusion of
the academic monologues; however, three suggested that the choice of topics should
take into consideration young learners’ age and life experience. A French teacher,
who had 16 years of experience teaching beginner to intermediate young EFL learn-
ers, commented that:
My students are never exposed to this kind of listening, except when it has to deal with the
culture of an English speaking country, such as the life of Nelson Mandela, the religious
wars in Ireland, the pilgrim fathers, the constitution in 1776, etc., but not things about
insects or for example the earth. Or it would be very general, like not how a volcano works,
but the different types of natural catastrophe that you can experience. That is to say, the
topic should not be too technical.
This comment indicated that the French teacher’s students, in fact, had exposure
to Academic Monologues; however, they were not familiar with the topics included
in TOEFL Primary. While this comment highlights the importance of selecting top-
ics that are accessible for young learners who have limited exposure to complex or
abstract concepts, it needs to be noted that the teachers’ perceptions of the topic
choice might have been influenced by the two academic monologues given to them
for evaluation, since both of them were science-related topics. TOEFL Primary
encompasses a wide range of topics that represent a variety of disciplines in both the social and natural sciences. The teachers' views about the topic choice might have
been different if different topics had been chosen. Another interesting point worth
discussing relates to the French teacher’s remark on introducing topics such as a
prominent historical figure from South Africa or the constitution of the United States.
These topics, albeit culturally relevant in the French context, may be less familiar to young EFL learners in other parts of the world or other EFL contexts.
The teachers’ comments also bring out an important issue in the content design
of young learner assessments—topic effects. Whereas the majority of the teachers
considered that the Academic Monologue measures what it is intended to measure,
the topics of the monologues appear to impact how the teachers perceived the
importance of the KSAs assessed with respect to their teaching contexts. This result
suggests that there might be a topic effect on the perceived difficulty of task types
and potentially on test performance—an effect that can introduce construct-
irrelevant variance (Cho & So, 2014). The impact of topics on test performance thus
warrants further investigation to inform the choice of topics for the academic
monologues.
In terms of research methodology, the investigation suggests that the use of CVIs
and an acceptable standard for the CVIs are useful in estimating the degree of con-
tent representativeness of newly developed young learner language assessments. On
the basis of the results obtained and previous research (Davis, 1992; Lynn, 1986), it
appears that content validation of young learner language assessments can be per-
formed by a judiciously selected panel of expert judges who are familiar with the
target population and that the experts’ judgments can be analyzed using the CVI
approach. Emphasis needs to be placed, however, on the careful adoption of a cut-
off point that can be used to determine a good degree of content alignment.
7 Limitations
A few limitations of the study need to be pointed out. First, while the panelists were experienced, representative EFL teachers judiciously selected from varying
EFL contexts, the sample size remains small and thus the findings might only apply
to the participating teachers’ contexts. Future research in validating content repre-
sentativeness of newly developed young learner language assessments should
include expert judges with more diverse nationalities and a larger sample size so as to ensure the generalizability of the study results. Second, this study evaluated the
reading and listening items of the TOEFL Primary test. The computer-delivered
speaking test was not included in the evaluation, leaving open the question of the
content representativeness of the speaking tasks and the importance of the speaking
communication goals for young EFL learners. Subsequent research should investi-
gate the content representativeness of the speaking tasks so that a more comprehen-
sive evaluation of the TOEFL Primary test can be made available to interested EFL
teachers and test users. In addition, future research should also investigate whether
the mode of test delivery, i.e. paper-based versus computer-delivered, plays a role in
how young language learners process input materials and test prompts in order to
inform test design. Finally, the study used information from the EFL teachers’ judg-
ments of the test items. Other sources of information (e.g., empirical response data)
were not available at the time of data collection; however, they should be considered
as potential data sources in the future.
8 Conclusion
Results of the study have provided an important piece of empirical evidence to sup-
port the content validity of TOEFL Primary and the intended uses of the test. The
KSAs assessed by TOEFL Primary listening and reading items were judged to be
important and relevant to the content of the different EFL curricula the panelists
were familiar with. This finding corroborates the findings from the domain analy-
ses of EFL textbooks conducted in the initial stage of test development and the
results of the teacher survey discussed earlier. The multiple stages of test validation
have yielded convergent results, consolidating the claims made about the test uses
by providing meaningful feedback to support language teaching and learning. In
addition, this study presented an evaluative process that can be applied to investigate
content representativeness of similar language assessments. Equally important, it
suggests a significant role for EFL teachers in the development of new tests for
young English language learners.
References
So, Y. (2014). Are teacher perspectives useful? Incorporating EFL teacher feedback in the develop-
ment of a large-scale international English test. Language Assessment Quarterly, 11(3),
283–303.
Turkan, S., & Adler, R. (2011). Conceptual framework for the assessment of young learners of
English as a foreign language. Unpublished manuscript. Educational Testing Service,
Princeton, NJ.
Wu, J., & Lo, H.-Y. (2011). The YLE tests and teaching in the Taiwanese context. Research Notes,
46, 2–6.
Yalow, E. S., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher,
12(8), 10–21.
Developing and Piloting Proficiency Tests
for Polish Young Learners
1 Introduction
& Copple, 1997). This propensity is an important signal, conditioning initiation into
formal testing.
Other important cognitive factors requiring consideration in language test devel-
opment include the ability to retrieve items from memory (e.g. words, numbers) and
correct interpretation of the test layout and symbols used (e.g., icons). Perception is
yet another important aspect of cognition at this age. As Vernon and Vernon (1976)
state, children’s ability to notice and recall details from a picture is greater than their
ability to interpret the whole picture. Therefore, test items should favour a series of
smaller pictures over a large picture, in which children might become lost.
Affective characteristics are also critical to test performance. Although children’s
attitudes towards a foreign language are generally positive (Mihaljević Djigunović
& Lopriore, 2011; Mihaljević Djigunović & Vilke, 2000), motivation to participate
in language tasks is related to classroom atmosphere and the sense of security
achieved by the rapport established with the teacher and other learners. Test admin-
istration and test characteristics that do not mimic regular daily school activities, and thus do not engender procedure and task familiarity, are likely to cause stress and result in apathy or even loss of motivation. To avoid this, a test might be supervised by the class teacher or, if this is considered inappropriate, other teachers should be present during the test. A familiar teacher present during externally administered tests can in many cases re-establish children's sense of security, and this provides solid grounds to justify their participation.
Among the challenges to the development of proficiency tests for children is
their language content (see Hsieh, 2016 in this volume). This is largely determined
by the curriculum and course books used. In Poland, the National Curriculum
(2008) consists of several descriptors formulated as expected learning outcomes at
every stage of school education. The document was designed to be suitable for all
foreign languages and does not list language items for a target language. The list of
topics to be covered within each stage is available for all stages, with the exception
of stage one (age 6–8). Table 1 shows the expected learning outcomes for foreign
language education at stage 1 (age 9).
In Poland, as in many other European countries, child target language exposure
is often limited to school. Contact with the foreign language outside school, through
television, digital media or native speakers is sporadic (Muñoz & Lindgren, 2011,
2013). For this reason, language competence is largely circumscribed by course
book content. For young learners, the content of course books is usually planned
around common topics while the choice of lexical items and phrases is often deter-
mined by the storylines used (Rixon, 1999). This results in relatively few lexical
items common between course books used nationally (Alexiou & Konstantakis,
2007; Kulas, 2012). The absence of a common point of reference means that children's lexicons vary from one school to another, depending on the choice of course book. It is, therefore, rather difficult to describe a common core
of items shared by course books for a child population of the same age.
The rate of development of literacy in the mother tongue is important in determining how foreign language skills and achievement can be tested. In Poland, it is recommended that reading and writing should not be taught before children are aged 6–7.
Table 1 Expected learning outcomes in a foreign language at educational stage 1 (age 9) in the National Core Curriculum (MEN, 2008, p. 216)

A pupil who has accomplished 3 years of FL instruction (age 9):
Listening: distinguishes between words which sound similar; recognizes everyday phrases and can use them; understands the gist of short stories told with the help of pictures and gestures; understands the gist of simple dialogues in picture comic strips (also in audio and video recordings)
Speaking: responds verbally and non-verbally to simple instructions; asks questions and responds using formulaic phrases; says rhymes, chants and sings songs; names objects in the learning environment and describes them; participates in drama activities
Reading: understands the gist of dialogues in picture comic strips; understands simple words and sentences in reading tasks
Writing: copies words and sentences
Non-linguistic skills: uses picture dictionaries, readers and multimedia; cooperates with peers
Since the ability to read and write in a foreign language follows the development
of literacy in L1, children are introduced to reading and writing in a foreign lan-
guage a few years later, usually when they are aged 8–9. Before this age neither
mother tongue nor foreign language skills are formally tested. Development of L1
and L2 literacy can be compared for listening and reading at the age of 9. Table 2
shows that age 9 achievement targets in the mother tongue are considerably higher
than for the foreign language (Table 1). The foreign language skills of young learn-
ers at this age are closer to those acquired in the mother tongue 2 years earlier
(Table 2).
The difference between expected learning outcomes for mother tongue and the
foreign language highlights the later onset of literacy in L2. This poses an obstacle
to parallel test design for the mother tongue and a foreign language. Since literacy in L2 is less developed, tests and tasks may, of necessity, appear 'childish' and below learners' levels of cognitive ability. For example, while children are exposed to longer
written instructions and passages of text in their mother tongue, in the foreign lan-
guage they are only ready to respond to short sentences supported by pictures or
icons, which they may perceive as more appropriate for preschool.
In view of these key considerations, the challenges of test item development for
large-scale measurement of children’s foreign language need to be regarded from
the perspective of test usefulness which is “an overriding consideration in design-
ing, developing and using tests" (Bachman, 2004, p. 5). According to Bachman and Palmer (1996), usefulness comprises several vital qualities: reliability, construct validity, authenticity, interactiveness, impact and practicality. McKay (2006) notes that
these qualities should be observed from the design phase. Each is discussed below
from the perspective of test item development for children aged 9.
To avoid compromising the reliability of large-scale testing of children's language skills, as in the example presented in this study, the administration stage of the test demands rigorous attention.
Table 2 Learning outcomes in the mother tongue for educational stage 1 – translation of the National Core Curriculum (MEN, 2008)

Listening
  After 1 year of mother tongue instruction (aged 7): pays attention to peer and adult contributions and is willing to understand them
  After 3 years of mother tongue instruction (aged 9): listens attentively and can respond appropriately to the information obtained
Speaking
  After 1 year (aged 7): communicates their reflections, needs and feelings in a clear way; addresses the interlocutor in a respectful manner, speaks to the point, asks and answers questions, adjusts their tone of voice to the situation; participates in conversation about family, school and literature
  After 3 years (aged 9): makes contributions a few sentences long, tells short stories, describes objects and people; participates in conversations, asks and answers questions, presenting their personal point of view, expanding lexis and syntax; pays attention to the register of the conversation, uses correct pronunciation, stress and intonation in affirmative, interrogative and negative sentences, uses pleasantries
Reading
  After 1 year (aged 7): understands the sense of coding and decoding information, understands simplified pictures, pictograms, signs and headings; knows all letters of the alphabet, reads and understands short and simple texts
  After 3 years (aged 9): reads and understands age-appropriate texts and draws conclusions; selects specific information from texts, referring to young learner dictionaries or encyclopaedias as required; is familiar with genres such as greetings, invitations, announcements, letters or notes and can respond appropriately
Writing
  After 1 year (aged 7): writes short, simple sentences, copies, writes from memory; writes clearly and follows the rules of handwriting
  After 3 years (aged 9): writes stories a few sentences long, letters, greetings and invitations; produces clear and legible handwriting; pays attention to grammar, spelling and punctuation rules; copies and writes text from memory and can formulate individual contributions
Among the requirements for test procedures for lan-
guage learners of English as a second language recommended by Butler and Stevens
(2001, p. 413), some were particularly apposite to the present study. These included:
testing spread over several sessions, administration to small groups in separate
rooms, breaks during testing, native language instructions given orally, questions
read aloud in English, answers inserted directly in a specially prepared test booklet
and the instructions explained.
Construct validity should be ensured by extensive literature review covering
child socio-psychological and cognitive development, foreign language learning at
an early age and local teaching and assessment practices (McKay, 2005; Taylor &
Saville, 2002). Test developers should acquire knowledge of the constructs to be
assessed, supported by in-depth analysis of curricula and course books (Inbar-Lourie
& Shohamy, 2009). Taylor and Saville stress the primacy of spoken over written
language with respect to young learners – hence the focus on oral/aural skills in
tests for young learners, such as the Cambridge Young Learners’ English Tests.
Task authenticity, defined as the “degree to which test tasks resemble target lan-
guage use (TLU) tasks” (Carr, 2011, p. 314) is easier to achieve during informal
classroom assessment than in large-scale external tests. To select authentic tasks
appropriate for young learners in a national context, test item writers need an appreciation of the tasks used during lessons and offered by course books and other materials supplied by teachers, as well as of materials, such as comic strips or cartoons, which children may read or look at in their spare time.
McKay (2006) asserts that only interactive tasks which require children to use
the language knowledge and skills that are being assessed can provide useful evi-
dence for inference of children’s level of language competence. In a pen-and-paper
test, listening and reading skills can be assessed if the format of the tasks and con-
tent are familiar through prior classroom exposure.
Espinoza and Lopez (2007) give a critical overview of current assessment mea-
sures for young English language learners and point out the scarcity of appropriate
standardized tests.
When testing young learners it is vital to ensure positive impact and to avoid
children – the test-takers – experiencing any negative consequences. According to
Messick’s (1989) work on validity theory, “consequences of tests must be suffi-
ciently positive to justify the use of the test”. Carr (2011, p. 55) argues that wash-
back, the effect of a test on teaching and learning, is the most commonly discussed
aspect of impact. In high-stakes tests washback may affect the curriculum, materi-
als, teaching approaches and how students prepare for tests. “Trying to plan tests
that seem likely to cause positive washback is important, because teachers will wind
up teaching to the test, at least to some extent” (Carr, p. 55).
Social consequences should also be considered when designing external tests for
young learners, especially with regard to test fairness and ethical considerations.
According to Kunnan’s Test Fairness Framework (2004), apart from being valid, a
test should be free from bias (e.g., standard setting and analysis of differential item
functioning), ensure uniform security for administration and provide equal access to
students (e.g., familiarity with equipment, conditions and the opportunity to learn
from the test) (cited in Carr, 2011, p. 155). With reference to ethical considerations,
anonymity in test administration is crucial and needs to be guaranteed by design of
suitable test procedures at the planning stage. It is paramount that neither children
nor their teachers can be identified either during transport or coding of scripts or
later from the database. The most delicate issue, however, concerns publication of
test results to be shared with teachers, schools or authorities. Reporting requires tact
and extreme care to present the results in an informative and useful way without risk
of any detrimental washback on learners or their teachers.
The aim of the present study was to assess children’s foreign language abilities after
completion of the first stage of foreign language education in primary school, Grade
3 (age 10). Accordingly, the research population was defined as those pupils
who had completed the first phase of primary education and who at the beginning of
the study had just started Grade 4. These children started school in 2008 at the age
of 7 when English as a foreign language was made compulsory in primary schools.
Since town size has been shown to be a significant factor in educational research in
Poland, to obtain a representative sample of the population, a stratified random sam-
pling framework was adopted to reflect the range of settlement size from cities and
large towns, through market towns serving farming populations to villages. As a
result, 172 primary schools were randomly selected. In schools with one or two
Grade 4 classes, all pupils were selected for the study, whilst in schools with more
than two Grade 4 classes, two classes were randomly selected. This sampling pro-
cedure resulted in 4717 pupils qualifying for the study frame.
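A minimal sketch of the proportional stratified sampling logic described above is given below; the strata, school counts and allocation rule are invented for illustration and do not reproduce the study's actual frame.

```python
# Illustrative proportional stratified random sampling of schools
# by settlement size. Stratum names and counts are hypothetical.
import random

random.seed(42)  # for a reproducible illustration

# Hypothetical sampling frame: stratum -> list of school IDs.
frame = {
    "city":        [f"city_{i}" for i in range(400)],
    "large_town":  [f"ltown_{i}" for i in range(350)],
    "market_town": [f"mtown_{i}" for i in range(300)],
    "village":     [f"village_{i}" for i in range(450)],
}

TOTAL_SCHOOLS = 172  # the number of schools drawn in the study
frame_size = sum(len(v) for v in frame.values())

sample = []
for stratum, schools in frame.items():
    # Allocate the sample to strata in proportion to stratum size
    # (rounding may shift the overall total by one or two schools).
    n = round(TOTAL_SCHOOLS * len(schools) / frame_size)
    sample.extend(random.sample(schools, n))

print(len(sample), "schools sampled across", len(frame), "strata")
```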
The pen-and-paper test was administered to the full study sample to assess listen-
ing and reading comprehension. The choice of these two skills for assessment was informed mainly by practical considerations: they can be assessed using pen-and-paper tests, which, given the sample size, was deemed practically and logistically feasible (Szpotowicz & Lindgren, 2011). Written production skills were
assessed in the second phase of the study when pupils were at the end of Grade 6
(age 12, not reported in this chapter). Oral production skills were not assessed but
an Elicited Imitation task was carried out on a sub-sample of 665 children
(Campfield, in preparation).
The constructs for listening and reading comprehension were suggested by the
National Foreign Language Curriculum (Ministerstwo Edukacji Narodowej (MEN),
2002, 2008) and the European Language Portfolio for children aged 6–10 (Pamuła,
Bajorek, Bartosz-Przybyło, & Sikora-Banasik, 2006). For children completing the
first phase of primary foreign language instruction, listening comprehension was
defined as:
(a) ability to comprehend lexical items (e.g., names of foods, animals, rooms and
items of furniture, body parts, sport and leisure activities) and simple everyday
expressions (e.g., classroom language),
(b) ability to follow the general gist of simple dialogues supported by visual
prompts/materials.
Reading comprehension was defined as:
(a) ability to comprehend single words and simple everyday expressions,
(b) ability to follow the general gist of simple texts, such as stories.
4 Method
The specific focus of this chapter is the description of the various stages of design
for the pen-and-paper listening and reading comprehension tests, through the pilot
stage to the final choice of test items with the best psychometric parameters.
4.1 Participants
The research population comprised 10-year-old children who had completed Grade 3 and
were just starting Grade 4. The study materials were piloted on a convenience sam-
ple of the target age group. The pilot sample was drawn from three geographic
areas: the North-East, South-East and central Poland, covering radii of 50 km from
the biggest town in each area, principally for economies of travel and cost for
researchers. Within each area, primary schools were selected to reflect the socio-
economic character of the area: eight schools in the North- and South-East and six
schools in central Poland. This resulted in selection of 22 schools from larger cities,
smaller towns as well as market towns serving the farming population. Care was
taken to ensure that no schools were at the extremes of the socio-economic or aca-
demic ability spectrum. The researchers involved in this study had established contact with these schools in the course of their research careers, which encouraged the schools to participate in the pilot. From the 22 schools chosen for the pilot,
42 Grade 4 classes were selected. A total of 829 pupils took part.
4.2 Materials
The design and development of the pen-and-paper test followed the preparation of
an assessment task specification formulated with reference to Carr (2011, p. 50) and
McKay (2006). The final goal of the study was to formulate recommendations con-
cerning foreign language instruction for the Ministry of Education, school heads,
teachers, parents and pupils. The aim of the assessment, therefore, was to generate
potential for a large positive impact on the acquisition of a foreign language by young
learners with all effects judged as being desirable and using a test considered fair by
all stakeholders.
To satisfy the criterion of fairness, it was important that (a) children had been
previously exposed to the proposed types of assessment task and (b) the target lan-
guage used was drawn from familiar vocabulary and structures. Therefore, for the
test to be fair, the assessment tasks had to reflect children’s classroom experience.
However, a positive washback effect was also an important aim for the assessment.
For this reason, the specification required task developers to place emphasis on
authentic language and turn of phrase and use listening material which was as real-
istic as possible. To reiterate, the aim was to be able to describe the extent to which
children had understood words and simple expressions used in situations they might
expect to encounter every day.
Test items were constructed within the Institute by a team of experienced test
developers, researchers with experience in child second language acquisition, lan-
guage teaching for young children and teacher training. The team included a native
speaker of British English, who also monitored that the requirement for authentic language and turn of phrase was satisfied. An internal and an external expert on language testing
were consulted on all materials on a continuous basis as an integral part of the task
development process.
The team of item developers worked according to a set of jointly drawn-up guidelines, such as authenticity of language and delivery in the case of the listening material, and the avoidance of incorrect English, contrived or peculiar expressions and trick questions. The language and contexts were expected to be universally
familiar, requiring unambiguous interpretation. Furthermore, responses to items
could not be made on the basis of single lexical items. The test materials had to be
conceptually and visually pleasing with clear and ample instructions supported by
sufficient examples. Finally, test items needed to be at appropriate levels of diffi-
culty to allow them to potentially function as anchor items for the second assess-
ment, at the end of Grade 6 (age 12, not reported in this chapter).
Item construction was preceded by the analysis of vocabulary and structures in
the English language textbooks approved by the Polish Ministry of Education and
available on the market in the autumn of 2010 for Grades 2 (age 8–9) and 3 (age
9–10) of primary school (Kulas, 2012). This analysis demonstrated great variance
between textbooks in terms of both the range and commonality of vocabulary but
allowed the selection of 177 lexical items common to all textbooks. Rixon (1999)
had commented on the paucity of common vocabulary between children’s textbooks
which bears scant resemblance to what would be expected for learners in the target
language environment.
In the present study it was not possible to obtain a measure of the frequency of
exposure to each of the 177 lexical items because the frequency of a word's appearance in any book does not determine its frequency of use in the classroom. To obtain
this data it would be necessary to conduct a large observation study. In the absence
of knowledge about exposure, piloting at a later stage was expected to be the best
predictor for suitability of choice of vocabulary.
Table 3 Piloted versions of listening and reading comprehension tasks with number of items in each task

Instrument   Type                        Pilot versions                                                            Number of test items
Listening 1  Multiple choice             1, 2                                                                      19
Listening 2  True/False                  1 (Family at home), 3 (In the park), 4 (In the classroom)                 11
Reading 1    Multiple choice             1, 2                                                                      18
Reading 2    Picture and text matching   1 (The story of cat and mouse), 2 (Computer), 4 (TV)                      10
Reading 3    Title and text matching     1 (Too many sweets), 2 (Play with animals every day), 4 (Holiday hobby)   5
In the first listening task, children listened to an utterance or a brief exchange and
were asked to indicate which of the three illustrations best fitted what they had heard
(Fig. 1). In the second listening task, children looked at an illustration depicting a
lively scene and heard utterances or brief dialogues requiring them to identify
whether what they heard was a true representation of the scene (Fig. 2). The tasks
were prepared in a way which avoided possible guessing based on familiarity with
any single individual word.
Translation of the instruction in Polish: Indicate which picture matches the
recording. You will hear the recording twice.
Translation of the instruction in Polish: Look carefully at the scene. Listen to the
sentences or brief dialogues and mark the appropriate box according to whether
what you hear is True or False with a cross (x). You will hear the recording twice.
Materials for the listening comprehension tasks were recorded by a male and a female native British English teacher of children, both with relevant studio experience. Recordings were made using a normal speaking voice and natural intonation.
Care was taken to ensure that the recorded material was delivered with the stress,
rhythm and intonation of natural British English.
In the first of the three reading comprehension tasks children were presented
with three sentences and a picture to illustrate one of these sentences (Fig. 3). The
second reading task presented a brief story using ten consecutive cartoon-like illustrations (Fig. 4). Below the sequence of pictures, sentences or brief exchanges/dia-
logues were presented in the wrong order, ten matched the illustrations and one
extra text did not match any of the illustrations. The task was to match sentences
with the illustrations.
Translation of the instruction in Polish: There are three sentences below each
picture. Choose the sentence which describes the picture and tick the box next to it.
Translation of the instruction in Polish: Look carefully at the pictures in the story.
There are 10 pictures in the correct order. Match the sentences with the pictures.
Write the number of the picture next to the correct sentence. There are 11 sentences,
so one is extra.
In the final reading task children were presented with five brief texts with eight
possible titles to match to these texts (Fig. 5). Two examples were given: one as an
example of a correct match and the other an example of a title that did not match any
of the texts, marked appropriately as ‘0’. With eight titles to choose from, the task
offered five items. This task was included following the advice of the external expert
and after much deliberation by the team of authors. The rationale for including this
task was twofold. First, it allowed for the assessment of a reading sub-skill: under-
standing the main idea. Additionally, as with the second reading task (picture and
sentence matching to follow a story), the aim was to introduce an important wash-
back effect on classroom practice to encourage teachers to expose young learners to
stretches of text. Particular effort was made to ensure that such texts were interest-
ing, age-appropriate and as with all other tasks, responses required reading of the
whole text and could not be guessed from individual words.
Although the authors were aware of the need to avoid item interdependence, this
was not always possible, given the narrow range of options (see Figs. 4 and 5).
It was difficult to allow for task variety without including some tasks requiring the reordering of sentences to match a story line or the matching of titles with texts. It was hoped that the additional items provided with these tasks would help mitigate this shortcoming in the last two reading tasks.
Additionally, encouraged by Nikolov and Szabó (2012; also see Nikolov, 2016 in this volume), each task was followed by three multiple-choice items enquiring about how participants rated task difficulty, familiarity and attractiveness
(see Fig. 6). The aim was to find out how children themselves reacted to the tasks,
to assess perception of task features in relation to ability to tackle the challenge.
Fig. 4 Reading task 2 (picture and text matching): ten numbered cartoon pictures tell a story about a boy called Tom; jumbled sentences below the pictures ('Tom is at home.' – matched to picture 1 as the example; 'He is playing on the computer.'; 'Mum says: "Please, come and have something to eat."'; 'But Tom is very busy.'; 'Dad goes into Tom's room and says: "Let's go and play football!"'; 'Tom looks at the dog.'; 'It's too late to go for a walk!'; 'It is time to go to bed.') are to be matched with the pictures
5 Results
Since children’s perspectives and opinions were considered vital to the creation of
suitable test materials, pre-pilot cognitive laboratories with target-age children were
organised. A cognitive laboratory aims to reconstruct possible problems with the interpretation of instructions and questions, and to evaluate tasks and the level or sources of difficulty in completing the test. It is organised in the form of a cognitive
interview (Beatty & Willis, 2007), involving the administration of draft survey
questions while collecting additional verbal information to evaluate the quality of
responses the questions generate. The procedures most often used are based on two
approaches (Beatty & Willis, 2007). In the first approach the researcher’s role is “to
facilitate the participants thought processes” (p. 289) and to follow a strict think-
aloud protocol which the researcher records. The other approach, referred to as probing, is internally varied, comprising a group of methods derived from the practice of intensive interviewing followed by probes. The researcher asks participants
about specific items in a test or questionnaire. These questions may be flexible to
allow exploration of opinions or structured for comparability of results between dif-
ferent researchers.
The Beatty and Willis (2007) review describes the advantages of both approaches, yet the authors see more benefit in probing than in thinking aloud. The chief drawback of the
latter approach is that less able participants more frequently become confused and
less tolerant of the procedure (Redline, Smiley, Lee, DeMaio, & Dillman, 1998).
This is an important consideration with child participants who tend to require indi-
vidual attention.
In this study the cognitive laboratories were in the form of interviews which fol-
lowed a relatively strict protocol but allowed some flexibility, including asking chil-
dren for additional explanation. The aims were to explore how children
• understood instructions: to ensure they had been formulated in an age-appropriate
and comprehensible way
• responded to test items: in order to estimate their level of difficulty
• felt about the illustrations: in order to check if the style and aesthetics appealed
to young learners’ tastes
• commented on the difficulty and user-friendliness of the whole test and individ-
ual items.
Sample selection aimed to obtain interviews with children of varying abilities in
English. The 36 children chosen were 9 years old and attended schools in three
geographically distinct Polish regions (Podlasie in the North-East, Mazowsze in Central Poland and Dolnośląskie in the South-West). Schools were located in rural, urban and suburban
areas with varying socio-economic characteristics. School and parental consent for
the interviews was previously obtained.
Interviews were carried out by three researchers following the same procedure
and took place with groups of four to six children in quiet classrooms. Children
were presented with the tasks sequentially and separately, so that they could attempt
to complete each task and were able to comment immediately. The researcher noted
the times children needed to complete each task. The same probe procedure was
used with all participants. It involved the following steps:
• The researcher introduced herself and explained the children’s role as advisors
for the creation of tasks for other children which would be used as teaching and
test materials.
• Copies of tasks were distributed and children were encouraged to attempt the
tasks.
• After they completed each task the researcher asked questions and recorded
answers. Children were first asked to respond spontaneously and those who did
not volunteer were approached individually and asked to share their opinions.
The questions asked during interviews were as follows:
1. Was the task easy or difficult? What made it easy or difficult?
2. Was the task interesting or boring? What made it interesting or boring?
3. Did you like the illustration, its layout and design of the page?
4. Were the instructions clear?
5. Would you change anything in the task? What and how?
On reflection on one’s performance in language tasks and self-assessment tech-
niques used in assessing young language learners (see also Butler 2016 and Nikolov
2016, both in this volume).
Children often tried to guess which title matched a text without reading it, and sometimes they found a few key words which were sufficient to provide the correct answer without the need to understand the whole text.
Task type: Picture matching with text (reading comprehension)
The task in which children matched jumbled speech bubbles to scenes in a comic strip, which seemed both age-appropriate and interesting, emerged as a serious challenge to develop. The text often appeared ambiguous and sometimes
one speech bubble matched more than one picture. On other occasions children
could number the jumbled text for a story without looking at the comic. As with the
task described above, the problem of related items remained.
Task type: Marking statements about one picture as true or false (listening
comprehension)
This task presented a relatively complex picture containing many elements and a
few people, e.g., a living room or a classroom. Next to the picture there was a chart
with item numbers and spaces to indicate the truth of the statements about the illus-
tration which children listened to in the recording. Although seemingly age-
appropriate, the task was confusing and was of low reliability. Primarily, it required
quick aural and visual processing of information (recording to picture). Although
the recording of each statement was played twice, some children needed longer to
respond.
• children’s practical advice for improving the items (e.g., changing vocabulary
items which determined comprehension of the whole reading passage): I didn’t
need to read the whole text, just the first two sentences. It was enough to know
these two words.
• corrections of inconsistencies between pictures and texts: Grandpa in the picture
is not wearing a jacket which we heard in the recording, but a sweater!
The extracts below show selected reactions and opinions expressed spontane-
ously during the cognitive interviews.
1. A boy who read the following text in reading task 3 in the cognitive laboratory
reacted as follows:
Text: “Who are you going to write about?” asks Mark. “Bella, my sister. She
is my best friend” answers Suzy. “That’s nice!”
The boy (genuinely surprised by the above text):
A sister who is the best friend? I’ve never heard of anything like that before.
2. A girl’s reaction to the artist’s illustration of a sentence describing a child doing
her homework:
The girl cannot be doing her homework. If she is sitting at the computer, she
must be playing computer games.
A letter with a broad description of the study and its aims was sent to heads of the
schools that agreed to take part in the pilot. Parents and caretakers were also sent an
information letter and were asked to consent to their child taking part in the study.
The school heads were made aware that participation in the pilot was anonymous
and confidential in that no information specific to a particular child could be easily
traced back to that child and that no person other than the researcher was to be pres-
ent during the test or able to see any element of it. It is worth pointing out that per-
formance on tasks, the reliability of which the pilot served to assess, could not form
the basis for pupil assessment, although some useful general suggestions could be
made in the form of constructive feedback.
Four staff from the Educational Research Institute supervised the pilot during
May 2011 after an internal training session. Training was intended to ensure that the
guidelines and procedures were followed in the same way at all schools. This train-
ing was a prelude to the training of test administrators recruited specially for the main
study for whom a training video and simulation scenarios were prepared. In the
pilot, each version of the tasks shown in Table 4 was administered at least 320 times.
Researchers were instructed to avoid planning pilot sessions on busy school days likely to introduce distraction or disturbance. Testing during lessons immediately before lunch was also to be avoided, and it was important that no child was hungry, thirsty, upset in any way or needed the toilet.
Researchers (during the pilot) and administrators (during the main study) were
encouraged to adopt the role and demeanour of a facilitator, supporting children
through the experience, being helpful and friendly, smiling and looking at the chil-
dren when talking to them, establishing eye contact and immediate rapport. While
they were asked to administer the test efficiently, they were also requested to avoid
looking officious, behaving formally or creating an exam atmosphere. This included
not dressing in a way that children might associate with authority.
Information the children received about the test itself and particularly about their
roles was considered vital to the success of the assessment. It was important to
thank them for agreeing to take part and emphasise their importance as helpers in
the research since their participation would provide information aimed to improve
foreign language learning for all school children in the country. The research aims
were explained to them in age-appropriate language.
Whilst there may be exceptions, the general climate in Polish schools encourages
competitiveness between children who are used to a degree of continuous assess-
ment, having their work graded and often being compared to their peers. It was
important, therefore, to emphasise that this was not the aim of this research and that
the children’s performance would not be similarly judged, nor would they receive
any points or marks for their performance. They were encouraged, however, to do
their best, without being upset if they found something difficult. They were asked to
respond to each test task reasonably quickly, to the best of their ability, before pro-
ceeding to the next. It was suggested that they should not spend too long on any one question, since they could return to problematic items at the end. They were told how long the test
would take, that it was not a race and that there would be plenty of time to answer
every question. Since the children might not have done a test like this before, they
were encouraged to understand the task first and look at the questions carefully
before answering. As a result of the pilot, it was decided that in the main study a
training exercise of about 10 min would be used to introduce children to the test (see
Appendix).
Children were asked not to talk during the test but to raise their hand if they had
any questions or still found aspects of tasks unclear. It was stressed that since only
what they could do themselves was of interest, they should not be tempted to look
at what other children were doing. For reasons of timing and logistics, the pilot was
Table 5 shows the sequence of events leading to the final version of the test.
Following the pilot, the theoretical framework applied to the design of the ability measurements relied on Item Response Theory (IRT) as guidance for the suitability of candidate tasks. IRT yielded detailed descriptions of the relationship between pupils' ability and the likelihood of their answering the task items correctly.
Descriptions of item difficulty and their discrimination indices suggested a task
construction which ensured discrimination between pupils of different levels of
ability over the expected ability range. It was important that items avoided ceiling effects and also offered the weakest pupils an opportunity to derive a sense of achievement from the assessment. A sufficient number of items of appropriate dif-
ficulty were required to measure ability in the second study phase, when the same
pupils would be tested again at the end of Grade 6.
The aim of the pilot was to (a) assess psychometric characteristics both of tasks
and items, (b) obtain reliability indices for all tasks and test versions and (c) evalu-
ate the task administration procedures intended for the main study. The task versions (see Table 3) were organised into 13 possible test versions (see Table 4), with each child taking one test comprising two listening and three reading comprehension tasks.
Table 6 Pilot reliability indices: test versions (Cronbach's alpha and IRT Rasch modelling)

Test version                A    B    C    D    E    F    G    H    I    J    K    L    M
Cronbach's alpha           .60  .63  .76  .69  .76  .71  .68  .64  .57  .80  .20  .61  .78
Person reliability (Rasch) .50  .72  .64  .81  .72  .66  .52  .81  .80  .78  .58  .60  .55
Item reliability (Rasch)   .99  .99  .99  .99  .98  .99  .99  .99  .99  .99  .99  .99  .99
Reliability analysis was carried out using both Classical Test Theory and Item
Response Theory (IRT) with the use of Rasch modelling in Winsteps v. 3.74.
Reliability indices were obtained for individual tasks and for the 13 test versions (A
to M, Table 6). Cronbach’s alpha ranging from .60 to .70 is considered ‘acceptable’
and from .70 to .90, ‘good’ for low-stakes testing. Table 6 shows that some sets of
tasks, i.e., test versions, demonstrated good reliability indices. The person reliabil-
ity index represents the replicability of rank order that could be expected if the
sample of participants were given another set of items measuring the same construct
whilst the item reliability index indicates the replicability of item ranking that could
be expected if the same items were given to the same-sized sample with different
participants behaving in the same way (Wright & Masters, 1982). Table 6 demon-
strates that all sets of tasks had very high item reliability indices but in some cases
considerably lower person reliability indices, suggesting that learners were guess-
ing or that their responses were influenced by other children’s responses.
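As an illustration of the Cronbach's alpha values reported in Table 6, the sketch below computes alpha for a small, made-up matrix of dichotomous item scores; a real analysis would of course use the full pilot data (and dedicated software such as Winsteps for the Rasch indices).

```python
# Cronbach's alpha for a (persons x items) score matrix:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
import statistics

def cronbach_alpha(scores):
    k = len(scores[0])  # number of items
    item_vars = [statistics.pvariance([row[i] for row in scores])
                 for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical 0/1 scores: six pupils x four items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```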
Apart from providing reliability indices, IRT allowed assessment of
(a) the extent to which each item difficulty matched participant ability,
(b) how well each item fitted the single parameter Rasch model by providing infit
and outfit values,
(c) the behaviour of distracter items,
(d) difference between expected and observed item measures, with an additional
map, allowing unexpected responses (an indication of possible guessing) to be
identified,
(e) differential item functioning (DIF) demonstrating the extent to which different
sample sub-sets (e.g., boys and girls) responded differently to certain items.
This analysis allowed the suitability of each item for measurement to be assessed, indicating items that needed modification or rejection.
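For readers unfamiliar with the model, the single-parameter Rasch model referred to in (b) expresses the probability that pupil $p$ answers item $i$ correctly as a function of the difference between person ability $\theta_p$ and item difficulty $b_i$, both measured in logits:

$$P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}$$

Because ability and difficulty share the same logit scale, persons and items can be displayed on a single map, which is what the Person/Item map discussed below exploits; infit and outfit statistics summarise how far observed responses depart from the model's expectations.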
To illustrate the usefulness of IRT analysis, Fig. 7 shows the Person/Item map for
one version of the first listening task (version 1 of the multiple-choice Listening 1
task in Table 3). Participants are placed on the left of the dividing line, from less
able at the bottom to more able placed towards the top of the map. The items are
placed on the right, from the easiest at the bottom to more difficult to the top of the
map. The mean measure of item difficulty at 0.00 logit was only slightly lower than
the mean measure for person ability, suggesting a good match between task diffi-
culty and participant ability. Ability ranged from −3 to +4 logits, whilst item mea-
sures ranged from −1.26 to +2.03. This suggests that there were participants whose
ability exceeded the difficulty of most difficult items and some whose ability fell
below the difficulty of the easiest items. The map allows these items to be identified and the number of participants outside the task range to be assessed. In the case of
this version of the first listening task, the map shows that almost everyone answered
item 3 correctly, whilst items 5, 8 and 10 were difficult. The map illustrates how 6 %
of children in the upper range of ability were above the range of the test, i.e., over
scale, and almost 3 % of children were below the ability required for the easiest
item.
As a result of the analysis, two items were removed from this task: a difficult
item 10 and item 18, of average difficulty. Although the infit and outfit values for all
items fell within the range of 0.5–1.5 which, according to Linacre (2012), is deemed
productive for measurement, both items had the highest infit and outfit values: 1.12
and 1.26 for item 18 and 1.10 and 1.27 for item 10. According to Classical Test
Theory, these items also had the lowest discrimination values: .08 for item 18 and
.12 for item 10, suggesting that both qualified for rejection or substantial change.
Additionally, item 10 was scored correctly by a number of participants whose scores
were otherwise weak.
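The discrimination values cited above come from Classical Test Theory; a minimal sketch of one common variant, the corrected item-total correlation, is given below. The score matrix is invented for illustration, and the exact index used in the study may differ.

```python
# Corrected item-total (point-biserial) discrimination for 0/1 items:
# correlation between an item's scores and the rest-of-test total.
import statistics

def discrimination(scores, item):
    item_col = [row[item] for row in scores]
    rest = [sum(row) - row[item] for row in scores]  # exclude the item itself
    return statistics.correlation(item_col, rest)    # Pearson r (Python 3.10+)

scores = [  # hypothetical pupils x items matrix
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
for i in range(len(scores[0])):
    print(f"item {i + 1}: discrimination = {discrimination(scores, i):.2f}")
```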
Following the pilot and the re-piloting of certain items, the finished product comprised not only the task versions demonstrating the best reliability and pupil differentiation but also the plan and instructions for test administrator recruitment and training, the administration procedures, and the arrangements for collection of scripts, coding and quality control.
Analysis of the nationwide test was to follow a strategy similar to the one employed
to assess the candidate versions. The same statistical tools and methods for item
analysis were to be used. The same criteria were to be applied to items as in the
pilot, since on a larger scale anomalies might be observed which would not be vis-
ible at the smaller pilot scale. Final dissemination of the findings is planned to
coincide with a conference together with a published report written with all stake-
holders in mind. Sound database design is needed for the final results and associated
contextual data. The tools required for this should be based on relational database
technology to allow the use of SQL to select subsamples of pupil and teacher data
according to chosen selection criteria.
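As a sketch of the relational approach recommended above, the snippet below uses Python's built-in sqlite3 module to store pupil records and select a subsample with SQL; the table layout, column names and selection criterion are illustrative assumptions only.

```python
# Illustrative relational storage of results and SQL subsample selection.
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
conn.execute("""
    CREATE TABLE pupils (
        pupil_id INTEGER PRIMARY KEY,
        school_id INTEGER,
        stratum TEXT,            -- e.g., 'village', 'market_town', 'city'
        listening_score INTEGER,
        reading_score INTEGER
    )
""")
conn.executemany(
    "INSERT INTO pupils VALUES (?, ?, ?, ?, ?)",
    [(1, 101, "village", 14, 20), (2, 101, "village", 9, 12),
     (3, 202, "city", 17, 25)],
)

# Select a subsample according to chosen criteria (here: village pupils).
rows = conn.execute(
    "SELECT pupil_id, listening_score, reading_score "
    "FROM pupils WHERE stratum = ?", ("village",)
).fetchall()
print(rows)
```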
8 Conclusions
This chapter described some solutions to the problems associated with the creation
of a large-scale language test designed, piloted and administered to young learners
as part of an empirical study. Beyond the general difficulty of ensuring the useful-
ness of a language test from the perspective of the young learners, the team of test
developers faced the following challenges: (1) How to create interesting and age-
appropriate test items from a very limited volume of common vocabulary; (2) How
to reconcile learners’ well-developed cognitive skills with their low level of foreign
language knowledge in order to create test materials; (3) How to encourage willing participation.
This study has highlighted the importance of the child perspective in terms of lin-
guistic, visual and pragmatic content of test items, the need for target-age group
consultation and careful piloting of items and test procedures. Future research
should give attention to these aspects of large-scale measurement of children’s for-
eign language and explore ways in which such measurement could better
account for the variety of lesson content, course materials and learning experiences
of young foreign language learners in instructional settings. Full verification of
assessment should include follow up, particularly of outliers.
Appendix
References
Alexiou, T., & Konstantakis, N. (2007, July). Vocabulary in Greek EFL young learners’ course
books. Paper delivered to ESCR Seminar: Models and concepts, practical needs and theoreti-
cal approaches in modelling and measuring vocabulary knowledge. Swansea University,
Swansea, Wales.
Ashman, A. F., & Conway, R. N. F. (1997). An introduction to cognitive education. London:
Routledge.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge, UK: Cambridge
University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford, NY: Oxford University
Press.
Beatty, P. C., & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing.
Public Opinion Quarterly, 71(2), 287–311.
Bredekamp, S., & Copple, C. (Eds.). (1997). Developmentally appropriate practice in early child-
hood programs. Washington, DC: National Association for the Education of Young Children.
Butler, Y. G. (2016). Self-assessment of and for young learners’ foreign language learning. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Butler, F. A., & Stevens, R. (2001). Standardized assessment of the content knowledge of English
language learners K-12: Current trends and old dilemmas. Language Testing, 18(4), 409–427.
Campfield, D. E. (in preparation). Function words and lexical difficulty – Using Elicited imitation
to study child L2.
Carr, N. T. (2011). Designing and analyzing language tests. Oxford, NY: Oxford University Press.
Enns, J. T., & Trick, L. M. (2006). Four modes of selection. In E. Bialystok & F. I. M. Craik (Eds.),
Lifespan cognition: Mechanisms of change (pp. 43–56). New York: Oxford University Press.
Espinoza, L. M., & Lopez, M. L. (2007, August). Assessment considerations for young English
language learners across different levels of accountability. Paper prepared for The National
Early Childhood Accountability Task Force and First 5 LA. Retrieved from http://www.first5la.org/files/AssessmentConsiderationsEnglishLearners.pdf
Hsieh, C.-N. (2016). Examining content representativeness of a young learner language assess-
ment: EFL teachers’ perspectives. In M. Nikolov (Ed.), Assessing young learners of English:
Global and local perspectives. New York: Springer.
Inbar-Lourie, O., & Shohamy, E. (2009). Assessing young language learners: What is the con-
struct? In M. Nikolov (Ed.), Contextualizing the age factor: Issues in early foreign language
learning (pp. 83–96). New York: Mouton de Gruyter.
Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. J. Weir (Eds.), European language test-
ing in a global context: Proceedings of the ALTE Barcelona conference, July 2001 (pp. 262–
284). Cambridge, UK: Cambridge University Press.
Kulas, K. (2012, July). The selection of vocabulary for EFL lower-primary school textbooks. Paper presented at the 10th Teaching and Language Corpora Conference, Institute of Applied Linguistics, University of Warsaw, Warsaw, Poland.
Linacre, J. (2012). Practical Rasch measurement. Retrieved from www.winsteps.com/tutorials.htm
McKay, P. (2005). Research into the assessment of school-age language learners. Annual Review
of Applied Linguistics, 25, 243–263.
McKay, P. (2006). Assessing young language learners. Cambridge, UK: Cambridge University
Press.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment.
Educational Researcher, 18(2), 5–11.
Mihaljević Djigunović, J., & Lopriore, L. (2011). The learner: Do individual differences matter?
In J. Enever (Ed.), ELLiE. Early language learning in Europe (pp. 43–60). London: British
Council.
Mihaljević Djigunović, J., & Vilke, M. (2000). Eight years after: Wishful thinking vs. the facts of
life. In J. Moon & M. Nikolov (Eds.), Research into teaching English to young learners
(pp. 67–86). Pécs, Hungary: University Press Pécs.
Ministerstwo Edukacji Narodowej (MEN). (2002). Rozporządzenie Ministra Edukacji Narodowej
i Sportu z dnia 26 lutego 2002 r. w sprawie podstawy programowej wychowania przedszkol-
nego oraz kształcenia ogólnego w poszczególnych typach szkół (Dz. U. z 9 maja 2002 r. Nr 51,
poz. 458).
Ministerstwo Edukacji Narodowej (MEN). (2008). Rozporządzenie Ministra Edukacji Narodowej
z dnia 23 grudnia 2008 r. w sprawie podstawy programowej wychowania przedszkolnego oraz
kształcenia ogólnego w poszczególnych typach szkół. Dz.U. nr 4 z dn. 15 stycznia 2009.
Warszawa, Poland: Kancelaria Prezesa Rady Ministrów.
Muñoz, C., & Lindgren, E. (2011). Out-of-school factors: The home. In J. Enever (Ed.),
ELLiE. Early language learning in Europe (pp. 103–124). London: British Council.
Muñoz, C., & Lindgren, E. (2013). The influence of exposure, parents, and linguistic distance on
young European learners’ foreign language comprehension. International Journal of
Multilingualism, 10, 105–129.
Nikolov, M. (2016). A framework for young EFL learners’ diagnostic assessment: Can do state-
ments and task types. In M. Nikolov (Ed.), Assessing young learners of English: Global and
local perspectives. New York: Springer.
Nikolov, M., & Szabó, G. (2012). Developing diagnostic tests for young learners of EFL in grades
1 to 6. In E. D. Galaczi & C. J. Weir (Eds.), Voices in language assessment: Exploring the
impact of language frameworks on learning, teaching and assessment – Policies, procedures
and challenges, Proceedings of the ALTE Krakow Conference, July 2011 (pp. 347–363).
Cambridge, UK: UCLES/Cambridge University Press.
Pamuła, M., Bajorek, A., Bartosz-Przybyło, I., & Sikora-Banasik, D. (2006). Europejskie portfolio
językowe dla dzieci od 6 do 10 lat. Warszawa, Poland: Centralny Ośrodek Doskonalenia
Nauczycieli.
Redline, C., Smiley, R., Lee, M., DeMaio, T., & Dillman, D. (1998). Beyond concurrent inter-
views: An evaluation of cognitive interviewing techniques for self-administered question-
naires. Proceedings of the section on survey research methods (pp. 900–905), Alexandria, VA:
American Statistical Association. Retrieved from https://www.amstat.org/sections/SRMS/Proceedings/papers/1998_155.pdf
Rixon, S. (1999). Where do the words in EYL textbooks come from? In S. Rixon (Ed.), Young
learners of English: Some research perspectives (pp. 55–71). Harlow, UK: Longman.
Schaffer, H. R. (2004). Introducing child psychology. Oxford, UK: Blackwell.
Szpotowicz, M., & Lindgren, E. (2011). Language achievements: A longitudinal perspective. In
J. Enever (Ed.), ELLiE. Early language learning in Europe (pp. 125–143). London: British
Council.
Taylor, L., & Saville, N. (2002). Developing English language tests for young learners (Research
Notes 7, pp. 3–6). Cambridge, UK: UCLES.
Vernon, H., & Vernon, M. (Eds.). (1976). The development of cognitive processes. London:
Academic.
Wesson, K. (2011). Attention span revisited. Retrieved from http://sciencemaster77.blogspot.com/2011/01/attention-spans-revisited.htm
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
The Development and Validation
of a Computer-Based Test of English
for Young Learners: Cambridge English
Young Learners
Abstract This chapter summarises the rationale for the development and validation
work that took place over 2.5 years before the launch of the computer-based (CB)
format of the Cambridge English Young Learners English tests (YLE). Several
rounds of trials were carried out in a cyclical way, in a number of different locations
across various countries, to ensure data was collected from a representative sample
of candidates in terms of geographical location, age, L1, language ability, familiar-
ity with the YLE tests, and experience of using different computer devices – PC,
laptop and tablet. Validity evidence is presented from an empirical study, using a
convergent mixed methods design to explore candidate performance in and reaction
to the CB YLE tests. Regression analyses were conducted to investigate which indi-
vidual test taker characteristics contribute to candidate performance in CB YLE
tests. The results indicate that CB delivery presents a genuine choice for candidates
in line with the Cambridge English ‘bias for best’ principle. Positive feedback from
trial candidates, parents and examiners suggests that CB YLE tests offer a contem-
porary, fun, and accessible alternative to paper-based (PB) YLE tests to assess chil-
dren’s English language ability.
1 Introduction
The Cambridge English Young Learners (YLE) tests are designed for children between the ages of 7 and 12. The tests are available in three levels: Starters, Movers and Flyers,
set at levels pre-A1 to A2 of the Council of Europe’s Common European Framework
of Reference (CEFR, Council of Europe, 2001). YLE tests measure achievement in
four skills in three papers: (a) Listening, (b) Speaking, and (c) combined Reading and
Writing. Candidates receive a certificate that indicates their level of success in the
test through showing a number of shields for each section of the test. The maximum
number of shields awarded for each section is five so a candidate could score a
maximum of fifteen shields per test. Achieving five shields indicates a very strong
performance on the test. A score of three shields indicates that candidates are per-
forming at the level intended by the test. In order to provide motivation to the young
children taking the test, all candidates are awarded at least one shield for each sec-
tion. It is not possible to ‘fail’ the test.
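To make the shield-scoring constraints concrete, the following minimal sketch in R (the environment whose regression output conventions appear later in this chapter) maps raw section scores to shields. The cut-offs and section maxima are hypothetical, as the chapter does not publish them; only the constraints stated above are preserved: at least one and at most five shields per section, for a maximum of fifteen per test.

# Hypothetical mapping from a raw section score to 1-5 shields.
shields_for_section <- function(raw_score, max_score) {
  proportion <- raw_score / max_score
  # Illustrative equal-width bands; the real YLE cut-offs are not published here
  shields <- 1 + findInterval(proportion, c(0.2, 0.4, 0.6, 0.8))
  min(max(shields, 1), 5)
}

# Example: illustrative raw scores and maxima for the three papers
sections <- c(Listening = 18, Speaking = 12, ReadingWriting = 30)
maxima   <- c(Listening = 25, Speaking = 15, ReadingWriting = 40)
shields  <- mapply(shields_for_section, sections, maxima)
sum(shields)  # total shields, out of a maximum of 15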
First, we discuss the rationale for developing a CB version of the tests. Next, we
discuss what methodological issues were considered in the trialling and validation
of the CB format. Then, in a mixed methods enquiry, we report on some validation
evidence generated by investigating how candidates’ performances are related to
individual differences (age, gender and preference for, and experience of, computer
use), and what candidates, parents and observers said about CB YLE tests. The
convergent mixed methods design allows us to triangulate the results and consider
evidence from various sources to mutually inform our interpretations of the data.
Cambridge English has produced CB tests in CD-ROM format since 1995, for
instance, the adaptive CB BULATS (Business Language Testing Service) or the QPT
(Quick Placement Test). Cambridge English initially used the CB format in low-
stakes testing, typically for shorter tests that were not certificated and where the test
administration was not supervised (Jones, 2000). However, higher-stakes tests have
also been delivered in CB format, including Cambridge English Skills for Life,
Teaching Knowledge Test (TKT), Cambridge English Key (KET), Cambridge
English Preliminary (PET), Business English Certificate (BEC) Preliminary and
Business English Certificate (BEC) Vantage. CB delivery of the for Schools ver-
sions of KET and PET was introduced in April 2010, only a year after their launch
in PB format in March 2009. Similarly, soon after Cambridge English First for
Schools was launched, its CB format was introduced in March 2012. The develop-
ment of CB delivery of Cambridge English Young Learners started in late 2011,
with a series of trials carried out between 2012 and 2014. These CB tests are
computer-mediated linear tests. Cambridge English also continues to develop a range of
computer-adaptive tests, such as BULATS and various placement tests in progress (e.g.,
the Cambridge English Placement Test (CEPT) and the Cambridge English Placement
Test for Young Learners).
Before developing a CB version of a test to offer an alternative to the PB delivery
mode, test providers need to carry out research to investigate comparability of the
two delivery methods, which we discuss in the next section.
The extent to which PB and CB formats of a test measure the same trait determines
whether they can replace each other (Clariana & Wallace, 2002; McDonald, 2002;
Neuman & Baydoun, 1998; Pommerich, 2004; Pomplun, Frey, & Becker, 2000;
Wang, Jiao, Young, Brooks, & Olson, 2007; Zandvliet, 1997). Jones and Maycock
(2007, p. 11) note that the goal of comparability studies can be to inform test
users that
1. the PB and CB formats can be used interchangeably;
2. they differ to some extent for practical reasons inherent to the PB and CB formats;
3. their designs differ so that one may be considered better than the other for some
purposes.
In all comparability studies between Cambridge English PB and CB test formats,
Rasch modelling has been used as a measurement tool. Item banking techniques
generally ensure that when items are made available for use in a CB test, their
difficulty is known as they have been calibrated (i.e. their difficulty has been
estimated) on a scale. Thus, it is possible to compare the difficulty of items in the
two formats (Jones & Maycock, 2007, p. 11).
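As an illustration of this approach, the sketch below fits separate Rasch calibrations to PB and CB response data and compares the resulting item difficulties on the logit scale. It is a sketch only: the data objects pb_responses and cb_responses (dichotomous 0/1 item-response matrices for the same items) are assumed, and the eRm package is one of several R implementations of the Rasch model, not necessarily the tool used by Cambridge English.

library(eRm)  # one possible Rasch implementation in R

pb_fit <- RM(pb_responses)  # calibrate items from paper-based responses
cb_fit <- RM(cb_responses)  # calibrate items from computer-based responses

# eRm reports item easiness; negate to obtain difficulty on the logit scale
pb_difficulty <- -coef(pb_fit)
cb_difficulty <- -coef(cb_fit)

# Compare the two calibrations item by item
plot(pb_difficulty, cb_difficulty,
     xlab = "PB item difficulty (logits)",
     ylab = "CB item difficulty (logits)")
abline(0, 1)  # identity line: points on it indicate comparable difficulty
cor(pb_difficulty, cb_difficulty)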
In experimental conditions, where the two test formats are completed one after
the other, the sequence effect may produce variations in performance due to fatigue,
inattention, etc. Hence test order is always controlled for in a counterbalanced
research design.
In order to gauge attitudinal and preference data on each delivery format, candi-
dates are usually asked to fill in a questionnaire or to participate in a focus group
covering their perception of test difficulty in the two formats, the appropriateness of
the length of the test, and background variables such as their attitudes (likes and
preferences) as well as their familiarity with, ability, experience and confidence in
using computers (Jones, 2000; Maycock & Green, 2005).
Candidate perceptions, preferences and attitudes are revealing as they reflect the
extent to which candidates feel at ease with either format and which format they feel
allows them to best demonstrate their language ability. Research, however, has
found that these perceptions, preferences and attitudes tend not to have an effect on
candidate scores in either format (Jones, 2000; Maycock & Green, 2005; O’Sullivan,
Weir, & Yan, 2004; Taylor, Jamieson, Eignor, & Kirsch, 1998).
Today's young learners may be more familiar with reading and typing on the computer than with reading and
writing on paper, due to the frequency of online activities in learners’ lives. Russell
and colleagues have repeatedly found that students in US schools perform better on
computers (e.g., Russell & Haney, 1997). This has led them to consider whether
writing on paper is less of a ‘real world’ task (cf, Chapelle & Douglas, 2006; Lee,
2004; Li, 2006). It is worth noting that some findings in European schools differed
from this: Endres (2012) found that, while 12–16 year-old Spanish learners of
English tend to use computers for leisure and informal communication, they do not
use them as much for schoolwork and homework.
Apart from the design features common to all comparability studies noted above,
studies among young learners need to use methods of enquiry familiar to and widely
accepted by early childhood professionals. Thus, all methods used in the validation
of CB YLE were modelled on “best practices”, complying with relevant ethical
guidelines on research with children (e.g., British Educational Research Association,
2011; British Psychological Society, 2009; Economic and Social Research Council,
2012; European Commission Information Society Technologies, n.d.; National
Association for the Education of Young Children, 2009; Social Research Association,
2003). For instance, it was considered that children may need help filling in ques-
tionnaires even if delivered in their L1. A focus group discussion may be more
appropriate. Alternative ways of eliciting data from children were also considered
(e.g., see Sim, Holifield, & Brown, 2004; Sim & Horton, 2005). For those children
who may not feel comfortable responding verbally, drawing may be an alternative
way of eliciting responses (Wall, Higgins, & Tiplady, 2009). In addition, individual
debriefing interview sessions may be more suitable with younger children where
open-ended questions allow children to respond using their own words (Barnes,
2010a, 2010b).
[Fig. 1 here: a diagram of the convergent mixed methods design. Data collection procedures (CB YLE demo versions, questionnaires, interviews/focus groups with candidates) are shown with their quantitative products (candidate scores; numerical item, task and questionnaire data) and qualitative products (open-ended questionnaire responses, candidate testimonials, examiner feedback, transcripts). Data analysis procedures (group comparisons by CIS categories; descriptive statistics and frequencies; analysis of spelling and typing errors; analysis of response length, timings, hesitations and audibility; document analysis with grounded theory; IRT Rasch and many-facet Rasch analysis) are shown with their products (percentages; mean, mode, standard deviation, minimum and maximum figures; major themes and categories; Rasch and Facets output tables).]
Fig. 1 Mixed methods research design: procedures and products in CB YLE development trials (Adapted from Creswell & Plano Clark, 2011, p. 118)
Table 4 shows the percentages of candidates who took the YLE test on an iPad
(tablet) and on a PC. Combining Movers and Flyers, 23 % of candidates (N = 90)
took the test on an iPad, while 28 % of candidates (N = 110) took the test on a PC.
Mixed method research involved the following specific steps in the CB/PB YLE
comparability study:
1. Correlations among PB and CB scores by level (Starters, Movers, Flyers) and
component (Listening: L; Reading & Writing: RW; Speaking: S).
2. Regression analyses (Fox, 2002, 2008) on combined exam score data and candi-
date background and attitudinal data collected through questionnaires (332 can-
didates in Mexico, Spain, Italy and Turkey, see Appendix A for candidate
questionnaire).
3. Analysis of verbal feedback and drawings in questionnaires and testimonials
provided by trial candidates and their parents (126 candidates and their parents
from Hong Kong, Mexico, Spain).
4. Analysis of trial observer feedback (64 observers from Hong Kong, Spain, Italy,
see Appendix B for observer checklist).
Table 5 summarises the techniques of data collection and analysis in the PB/CB
YLE comparability study.
The regression analyses reported below used both quantitative (exam score data,
candidate background information) and qualitative data (experiential and attitudinal
data related to computer use and CB tests). The following variables were used in the
quantitative regression analyses:
– Dependent variables:
• total score in CB test for Starters (Model 0),
• total score in PB test for Flyers (Model 1),
• total score in CB test for Flyers (Model 2).
Table 5 Overview of research areas, data types and sources, instruments and analyses in PB/CB YLE comparability studies

1. Candidate and parent attitude to CB YLE (Quan, Qual). Data sources and instruments: candidate questionnaires; focus group interviews with trial candidates; candidate and parental testimonials; candidate drawings; photos and video footage of trials. Data analysis: frequencies of questionnaire closed responses; regression analyses; document thematic analysis.
2. Observations from trials (Quan, Qual). Data sources and instruments: observer checklist; summary of observer verbal comments for action. Data analysis: frequencies of checklist closed responses; document thematic analysis.
3. Examiner attitude to CB YLE (Quan, Qual). Data sources and instruments: examiner survey; soft feedback in reports and emails. Data analysis: frequencies of closed responses; document thematic analysis.
4. Candidate background information (Quan). Data sources and instruments: candidate background data elicited in questionnaires and testimonials; candidate information sheet (CIS) provided on exam entry forms. Data analysis: analysis of background data in questionnaires and testimonials; analysis of CIS data; descriptive statistics; regression analyses.
5. Candidate exam performance, CB and PB (Quan, Qual). Data sources and instruments: recorded candidate responses (L, RW; CB response files and PB scripts); transcription of audio-recorded candidate responses (S); timings of candidate responses. Data analysis: candidate written response analysis (typing errors, spelling); analysis of transcripts for hesitation and examiner feedback/support; analysis of length of responses.
6. Scoring and marking, CB and PB (Quan, Qual). Data sources and instruments: marking keys (L, RW); test scores at item, task and component level (L, RW; CB and PB); Speaking examiner scores at criterion level and total (S). Data analysis: document analysis; descriptive and classical analysis of score data (e.g., facilities, discrimination); IRT Rasch analysis (PB L, PB RW); correlations (PB vs CB); regression analyses; multi-faceted Rasch analysis with Facets (S).
7. Changes in CB YLE test content and marking during test development (Qual). Data sources and instruments: test development and trialling procedural documents; successive demo versions of CB YLE tests. Data analysis: procedural document analysis; documentation on changes made to test content and test delivery systems (e.g., entry portal, examiner portal).

Note: Quan = quantitative data type, Qual = qualitative data type
[Fig. 2 here: scatter plots of PB total scores (x-axis) against CB total scores (y-axis). Left panel: Flyers, correlation between PB and CB total scores = 0.60. Right panel: Movers, correlation = 0.69.]
Fig. 2 Correlation between PB and CB total scores for Flyers and Movers
First the relationship between the total scores in PB and CB exams was investigated.
This analysis is based on PB and CB scores for Movers and Flyers (N = 274). Figure 2
shows that for Flyers the correlation was 0.60, and for Movers it was 0.69. This
provides evidence of the extent of comparability between PB and CB YLE tests during
the trials. Note that trial candidates had not been familiarised with the computer-based
delivery of YLE, so these otherwise modest correlations were encouraging, particularly
as sample practice tests (now freely available as apps on the App Store) have since
been made available to candidates. These offer guidance on how to take CB YLE and
provide advance practice on functionality for candidates.
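For readers who wish to see the shape of this analysis, a minimal sketch in R follows; the data frame scores and its columns (level, pb_total, cb_total) are illustrative names, not the authors' actual variables.

# Correlation between PB and CB total scores, by level (cf. Fig. 2)
flyers <- subset(scores, level == "Flyers")
movers <- subset(scores, level == "Movers")

cor(flyers$pb_total, flyers$cb_total)  # reported as 0.60 in the trials
cor(movers$pb_total, movers$cb_total)  # reported as 0.69

# Scatter plot corresponding to the Flyers panel of Fig. 2
plot(flyers$pb_total, flyers$cb_total,
     xlab = "PB Total Scores", ylab = "CB Total Scores", main = "FLYERS")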
Figures 3 and 4 show the distribution of CB total scores for Starters and Flyers by
country. The data are presented as boxplots, which show the distribution of scores
for each category. Each rectangle spans the data from the 1st to the 3rd quartile: its
bottom side represents the 1st quartile (25th percentile) and its top side the 3rd
quartile (75th percentile). The thick black horizontal line shows the median. The
vertical dashed lines – the whiskers – show the range of the data, and outliers are
indicated with dots beyond the whiskers.
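R's boxplot() function draws exactly the elements just described; assuming the same illustrative scores data frame with a country column, plots like Figs. 3 and 4 can be sketched as follows.

# Distribution of CB total scores by country (cf. Figs. 3 and 4);
# the box spans the 1st-3rd quartiles, the thick line marks the median,
# whiskers show the range, and dots mark outliers
boxplot(cb_total ~ country,
        data = subset(scores, level == "Flyers"),
        xlab = "Country", ylab = "CB Total Scores",
        main = "FLYERS - Distribution of CB Total Scores")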
[Fig. 3 here: boxplots of the distribution of CB total scores for Starters by country (ES, IT, TR).]
[Fig. 4 here: boxplots of the distribution of CB total scores for Flyers by country (ES, IT, MX, TR), titled "FLYERS - Distribution of CB Total Scores for Spain, Italy, Mexico and Turkey".]
5.2.1 Country
Candidates from Spain, Italy and Turkey took CB Starters, while candidates from
these countries plus Mexico took CB Flyers. Figure 3 shows that Starters candidates’ CB scores
vary slightly according to country. There is evidence of differences between
countries in results of general educational assessments among school learners
(e.g., Merrell & Tymms, 2007; Tymms & Merrell, 2009). When we look at the range
of scores in Fig. 4 for CB Flyers, we see that Turkish candidates perform the best in
the sample, followed by candidates from Spain. In order to account for differences
in performance across countries we included dummy variables for each country in
all the regression analyses below.
Next, we report the results of regression analyses carried out on the data for
Starters and Flyers, with the aim of identifying variables that explain candidate
performance in each test. The dependent variable in these regressions is (1) total
score on the PB test and (2) total score on the CB test.
5.2.2 Age
To investigate the effect of age on PB and CB test performance, scores and candi-
date age were plotted against each other. As can be seen in Fig. 5, for Flyers there
was a clear curvilinear relationship between age and scores: the older the candidates
are after age 11, the lower their scores are in both PB and CB tests during the
comparability trial. The target candidature for Cambridge English
Young Learners is up to age 12. Here we see evidence that candidates older than age
11 and a half may have been inadvertently affected by motivational and affective
variables: they may not have taken the tests seriously. Due to the curvilinear rela-
tionship between candidate age and performance on PB and CB scores in Flyers,
the regression analyses include a variable Age Squared to account for this.
[Fig. 5 here: scatter plots of CB total scores (left panel) and PB total scores (right panel) for Flyers against age in years (ages 9–14), showing the curvilinear relationship between age and scores.]
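The regression specification just described (total score regressed on age, age squared, gender, years of English instruction, delivery-mode preference and country dummies) can be sketched in R, the environment of the regression references cited in this chapter (Fox, 2002, 2008); all data frame and variable names below are illustrative assumptions, not the authors' actual variables.

# Models 1 and 2 for Flyers: total score regressed on candidate characteristics.
# I(age^2) supplies the Age Squared term; factor() expands gender, preference
# and country into dummy variables.
flyers <- subset(scores, level == "Flyers")

model_pb <- lm(pb_total ~ age + I(age^2) + factor(gender) +
                 years_instruction + factor(preference) + factor(country),
               data = flyers)
model_cb <- lm(cb_total ~ age + I(age^2) + factor(gender) +
                 years_instruction + factor(preference) + factor(country),
               data = flyers)

summary(model_cb)  # estimates, std. errors, t values and Pr(>|t|), as in Table 7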
5.2.3 Gender
According to Brown and McNamara (2004), the relationship between gender and
test performance is not linear. Historically, in PB YLE, gender tends to affect test
performance. Girls tend to achieve slightly higher than boys in terms of average
shield in each skill and at each level. A slightly higher standard deviation for boys
indicates a wider spread of ability among boys as compared with girls in each skill
at all levels. The PB/CB YLE comparability trials provided an opportunity to check
for the effect of gender on candidate performance in the CB version of the tests.
First, we investigated the influence of age, gender, years of instruction and pref-
erence for delivery mode on candidate performance in the CB test for Starters. The
variable ‘Preference for delivery mode’ describes candidate preference for delivery
mode for taking an exam – either on paper, on computer or no difference. In the
model we used preference on paper as the baseline for comparison for other groups.
The graphs in Appendix C illustrate the effects of all regression analyses presented
in this section. As Table 6 shows, in Starters, years of English instruction have a
statistically significant effect on CB scores – the longer candidates have been receiving
English instruction, the better they perform in the Starters CB test. Table 6 also shows
that the effects of gender and age are not statistically significant – there seems to be
no difference in the performance of male and female candidates and there are no
differences in performance across age. The results show, however, that performance
of Starters in the CB test is affected by the preference to take exams on computers
rather than on paper. Candidates who prefer to take the exam on computer perform
significantly better than candidates who prefer to take the exam on paper (the mag-
nitude of the effect is 3.05). Graphical effect plots can be seen in Appendix C for all
regression analyses.
To investigate which characteristics explain candidate performance in PB and
CB in Flyers two models were tested. In the first model, the effects on candidate
performance in the PB test were investigated while in the second model the perfor-
mance in the CB test was investigated.
Table 8 displays the results of Models 3 and 4 in Flyers where we investigated the
effects on CB scores of the following individual background variables and
preferences:
(5) Frequency of computer use
(6) Reason for computer use
(7) Type of computer at home.
Model 3 includes individual background variables and frequency of computer
use. The results for Model 3 in Table 8 show that the frequency of computer use
does not influence candidate performance on CB Flyers, whereas years of English
Table 7 Effects of individual background variables on PB and CB total test scores for Flyers
Model 1: PB total scores Model 2: CB total scores
Estimate Std. error t value Pr(>|t|) Estimate Std. error t value Pr(>|t|)
(Intercept) −106.07 124.68 −0.85 0.40 −182.75 111.05 −1.65 0.10
Age in years 32.00 22.97 1.39 0.17 46.78* 20.46 2.29 0.02
Age in years squared −1.59 1.05 −1.51 0.13 −2.34* 0.94 −2.49 0.01
Gender (baseline: Female)
Male 4.91 2.48 1.97 0.05 5.38** 1.93 2.79 0.01
Years of English instruction 1.89*** 0.49 3.82 0.00 1.45*** 0.41 3.56 0.00
Preference for delivery mode (baseline: On paper)
No difference 6.66 3.47 1.92 0.06 5.07 2.88 1.76 0.08
On computer 10.05*** 3.30 3.05 0.00 9.11** 2.75 3.31 0.00
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1
Model 1: N = 115; Adjusted R-squared: 0.48; F-statistic: 16.11 on 7 and 106 DF, p-value: 2.8e-14
Model 2: N = 148; Adjusted R-squared: 0.47; F-statistic: 17.36 on 8 and 138 DF, p-value: <2.2e-16
Table 8 Models 3–4 Effects of individual background variables and individual preferences for Flyers
Model 3: CB total scores Model 4: CB total scores
Estimate Std. error t value Pr(>|t|) Estimate Std. error t value Pr(>|t|)
(Intercept) −180.67 111.85 −1.62 0.11 −76.63 117.94 −0.65 0.52
Age in years 46.48* 20.58 2.26 0.03 26.96 21.77 1.24 0.22
Age in years squared −2.32* 0.94 −2.46 0.02 −1.43 1.00 −1.43 0.15
Gender (baseline: Female)
Male 5.31** 1.97 2.70 0.01 5.28* 2.04 2.59 0.01
Years of English instruction 1.42*** 0.41 3.42 0.00 1.33** 0.42 3.17 0.00
Preference for delivery mode (Baseline: On paper)
No difference 5.02 2.90 1.73 0.09 4.55 2.99 1.52 0.13
On computer 9.29** 2.77 3.35 0.00 8.20** 2.91 2.82 0.01
Frequency of computer use (Baseline: Only at weekends)
Every day −2.28 2.82 −0.81 0.42 −2.01 2.87 −0.70 0.48
Once or twice a week −1.05 2.71 −0.39 0.70 −1.42 2.76 −0.51 0.61
Using computers for English homework −0.05 2.00 −0.03 0.98
instruction and preference for taking exams on computer do. There is also an effect
of gender – boys score significantly higher than girls on the CB Flyers test. We also
see an age effect – the older the candidates the better they score on the CB Flyers
test, but this effect reverses at a certain point (the curvilinear effect of age noted
earlier). This model explains 47 % of variance in candidate performance.
In Model 4 we included additional individual preference variables – the reason
for computer use and type of computer at home. The results show that, apart from
the variables that were significant in Model 3, candidates who only have a tablet at
home perform significantly better than candidates who have a PC. Model 4 explains
48 % of variance in candidate performance.
Since CB YLE is available on PC, laptop and tablet, in order to make sure that the
type of device used for the test does not affect performance in CB YLE, we
investigated whether using an iPad or a PC makes a difference to candidates’ total
score in the CB test. For this, a combined Flyers and Movers dataset was used to
ensure a sufficient number of observations.
Figure 6 shows descriptive statistics for candidates who took the CB exam on an
iPad and on a PC.
[Fig. 6 here: boxplots of CB total scores by device (iPad vs. PC).]
Table 9 Effect of computer device on CB total scores for Movers and Flyers
Model 5: CB total scores
Estimate Std. error t value Pr(>|t|)
(Intercept) −112.06 80.36 −1.39 0.17
Age at test date 30.70* 14.73 2.08 0.04
Age at test date squared −1.44* 0.67 −2.14 0.03
Gender Male 3.66 2.0 1.83 0.07
Years of English Instruction 1.06* 0.45 2.35 0.02
Preference for delivery mode
No difference 3.50 3.61 0.97 0.33
On computer 5.40 3.32 1.63 0.11
Frequency of computer usage
Every day 2.29 3.21 0.71 0.48
Once or twice a week −2.35 3.14 −0.75 0.46
Reason for computer use
English homework 3.83 2.11 1.81 0.07
Games 0.27 2.56 0.11 0.92
Email/chat −0.17 2.17 −0.08 0.94
Other 1.45 2.31 0.63 0.53
Type of computer at home
PC/laptop/tablet −0.48 2.66 −0.18 0.86
Tablet 2.43 2.26 1.08 0.28
Device used PC −5.23 2.97 −1.76 0.08
Signif. codes: ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05
N = 203
Adjusted R-squared: 0.23
F-statistic: 4.067 on 16 and 150 DF, p-value: 2.033e-06
The figure shows that the median score for candidates taking the CB test on an iPad
is close to the median score for candidates who took the exam on a PC. Whether the
difference in performance between these two groups is statistically significant was
then tested in a regression analysis, presented below.
The computer device variable was introduced in the regression model with indi-
vidual characteristics of the candidates and variables on computer usage. The
dependent variable here is the total score in the CB test. As Table 9 shows, there is
no statistically significant difference in CB total scores between candidates who
took the test using an iPad and those using a PC when individual characteristics of
candidates and their preferences for computer usage are controlled for.
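Model 5 can be sketched in the same illustrative notation as the earlier regression code: the device dummy is simply added to the specification, so its coefficient reflects the iPad/PC contrast after the other candidate characteristics are controlled for. All names remain assumptions.

# Model 5: device effect on CB total scores, combined Movers and Flyers data
combined <- subset(scores, level %in% c("Movers", "Flyers"))

model_device <- lm(cb_total ~ age + I(age^2) + factor(gender) +
                     years_instruction + factor(preference) +
                     factor(computer_frequency) + factor(computer_reason) +
                     factor(home_computer) + factor(device),  # iPad vs. PC
                   data = combined)
summary(model_device)  # device coefficient n.s., as reported in Table 9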
Of the 322 candidates who completed the questionnaire, altogether 126 candidates
and their parents from Hong Kong, Mexico and Spain gave testimonials (in their L1
or English) during the trials. In the testimonials candidates were asked three ques-
tions in their L1:
1. Did you enjoy taking the Cambridge English Young Learners test on computer?
Why?
2. What did you like most about the test?
3. Would you recommend the Cambridge English Young Learners test on computer
to your friends? How would you describe it to them?
On a similar form, candidates’ parents were asked the following parallel ques-
tions in their L1:
1. Why did your child take the Cambridge English Young Learners test on
computer?
2. What did your child like most about taking the test on computer?
3. Would you recommend the Cambridge English Young Learners test on computer
to other parents? Why?
All feedback from trial candidates and parents was overwhelmingly positive,
confirming the suitability of the CB delivery mode for the target candidature.
Candidate feedback indicates the CB YLE exams are very popular among young
learners, as exemplified by their comments translated into English. In addition to
verbal comments in questionnaires and testimonials, candidate pictures and related
written comments add another perspective on their views and experiences of taking
CB YLE.
These additional qualitative sources of evidence (i.e., testimonials from candi-
dates and parents, and verbal and graphical comments from candidates) were care-
fully examined for common themes emerging. They were categorised by the same
candidate background variables (i.e., age, gender) that were investigated by the sta-
tistical analysis from the questionnaire data. This was done in order to look for
confirmation of findings or interpretation of the results, as is conventionally done in
a mixed methods design. Below we exemplify some of the recurring themes emerg-
ing with typical candidate, parental and observer comments and candidate
drawings.
Candidates and their parents especially appreciated the innovative nature of the
computer-based exam delivery and the new technology involved:
“I enjoyed taking the test on computer, because it’s more interactive. I liked that the
questions were oral. I would recommend it, and say: take it, it’s nice.”
(Movers trial candidate, boy, age 8, Mexico)
“I enjoyed taking the test on computer, because of the technology it uses and its
effectiveness. I like the most that it was on an iPad. I would recommend it to my
friends, as it represents a step forward for exams.”
(Flyers trial candidate, boy, age 12, Mexico)
“I enjoyed it because I’ve never done an exam on a computer. I liked the speaking
questions the most. I would recommend it to friends because it is very fast.”
(Starters trial candidate, boy, age 10, Spain)
“I think it’s an innovative method that is going to help her in the future. My child
enjoyed the interaction with the computer the best. I would recommend it to
other parents because children are becoming more familiar with this type of
technology.”
(Parent of Movers trial candidate, girl, age 8, Mexico)
Candidates and their parents thought the CB YLE tests are fun, enjoyable and
game-like, and therefore have a strong motivational effect on children. Some
observers’ comments confirmed this:
“I enjoyed taking the test because it was funny and very entertaining. I liked the
Speaking test the most.”
(Flyers trial candidate, girl, age 9, Mexico)
“I like it – it’s quicker and more fun. To tell you the truth I liked all of it, but if I had
to choose one part it would be the speaking. I would recommend it to my friends,
I would tell them: try it, it’s fun and not boring!”
(Movers trial candidate, girl, age 11, Spain)
“I liked it because it was like a game and fun. I would tell my friends to do the tests
because they are like games and are fun.”
(Starters trial candidate, girl, age 8, Spain)
“I enjoyed taking the test on the computer – it’s very fun. I would tell my friends to
do the exam because it’s fun, cool and entertaining.”
(Starters trial candidate, boy, age 8, Spain)
“My child took the test to gain more knowledge. She said it was like a game and as
a mother I have seen more motivation with the computer and overall.”
(Parent of Starters trial candidate, girl, age 8, Spain)
“Children like computers! It’s funnier.” (observer’s comment)
“Children’s comments ranged from ‘more modern’ to ‘fun’.” (observer’s comment)
Candidates said that the level of the CB YLE tests is appropriate even though chal-
lenging for some:
“Yes I enjoyed it, because it is not easy and not too hard, it just right.”
(Flyers trial candidate, boy, age 8, Hong Kong)
Starters trial candidate, girl, age 10, Italy: ‘I enjoyed doing this test to test my level of English, I
didn’t find it that difficult’
Candidates mentioned that the CB format reduces the stress conventionally associ-
ated with tests:
“I enjoyed taking the exam on the computer because you don’t get as nervous and it
is more fun. The best bit was the listening exercise. I would recommend it to my
friends because it’s a difficult exam that’s fun at the same time.”
(Movers trial candidate, girl, age 9, Spain)
Starters trial candidate, girl, age 10, Italy: ‘At first, I thought it was more difficult and I was
nervous. But I enjoyed it very much doing it on the computer.’
Parental testimonials are a rich source of information to explain why parents would
prefer their child to take YLE on the computer. Parents see the value of the CB test
in checking their children’s progress in learning English:
“Our child took the test because we would like to know his knowledge in English so
that we can continue to help him in the future. Evaluating people’s knowledge is
the only way of guaranteeing quality in their knowledge and education.”
(Parent of Movers trial candidate, boy, age 9, Mexico)
“Our child took the test because it seemed a good experience and you could learn
how good your child is with language. She liked the listening exercises because
you can hear really well with the headphones, it’s easier to concentrate.”
(Parent of Flyers trial candidate, girl, age 11, Spain)
One parent mentioned the educational value of the CB YLE test for her child with
special needs:
“Since my son suffers from ADD, it is difficult for him to take regular exams that do
not take into account the added difficulties that his attention disorder and hyper-
kinesis represent in terms of writing activities. I would recommend the test to
other parents, because there is a wide variety of children with special needs
among those taking the exam and it might be the most suitable option for many
of them.”
(Parent of Flyers trial candidate, boy, age 12, Mexico)
The qualitative analysis revealed that younger candidates (aged 12 and under)
and boys showed slightly more explicit positive attitudes towards the new CB
format. Parental and candidate feedback was also confirmed by observers.
Observers also made some general comments on their checklists. Again, this source
of evidence was used to inform the interpretations of the findings from the other
sources of evidence in the trials.
According to observers, the very high level of engagement that children exhibited in
CB YLE tests can be attributed to the following features:
“Children seemed to be engaged and motivated by the pictures, sound and interac-
tive activities.”
“In general computer based is more fun as the candidates enjoy using computers
and it’s more visual”.
Observers confirmed that children are very capable of using computers and that
they especially like using iPads/tablets. Feedback from candidates and observers
also proved very useful in improving the tests during the development phase:
“In general there were no problems. With some practice all the small problems that
the students had could be ironed out.”
There was a clear preference for tablet delivery among candidates, which was
confirmed by observer comments:
“In general, the candidates used the hardware capably and interacted well with the
software. Engagement levels were high and they clearly enjoyed doing the tests.”
“They have no problems managing PCs and iPads at all. The candidates were hap-
pier when they were told they could do the exam with iPads.”
Of course, some candidates still prefer the paper-based YLE. Candidate opinion
was sometimes divided between paper-based and computer-based delivery, as the
following drawings, mainly by girls, indicate:
Starters trial candidate, girl, age 10, Italy “The exam I would like to do is on the computer”
6.2.7 Conclusion
The results of the mixed methods validation study reported here show that the
paper-based and computer-based versions of Cambridge English Young Learners
are comparable alternatives, offering candidates a genuine choice of the exam
delivery mode they feel most comfortable with. Regression analyses have
shown that the number of years of English language instruction is the main factor in
explaining candidates’ performance both in PB and CB tests, which is in line with
expectations. Candidates who prefer to take the test on computer performed signifi-
cantly better both in PB and CB versions than those who prefer PB tests. This may
be related to personal and motivational characteristics that this study did not explore.
This result may also be related to the other interesting finding that boys were found
to perform better than girls in the trials. We can speculate that perhaps this is a result
of a set of personality and affective factors such as enthusiasm for computers com-
bined with an effective use of computers and the internet to benefit from extra expo-
sure to the English language. This interpretation was corroborated by the data
collected in the testimonials as well as the verbal and pictorial feedback from ques-
tionnaires. Candidates who revealed positive attitudes to the novelty and game-like
nature of the new test format tended to show stronger performance.
Importantly, during the trials, no statistically significant difference was observed
in CB exam performance between candidates who took the test using an iPad and
those who used a PC, indicating that the device on which candidates take the test
does not affect their performance. However, it was very clear from the children’s
feedback that they prefer touch screen devices (iPads/tablets) to mouse-operated
devices (laptops and PCs).
In sum, overwhelmingly positive feedback from trial observers, candidates and
parents indicates that CB delivery offers a contemporary, fun and accessible
alternative way to assess children’s language ability. In addition, CB YLE tests cap-
ture invaluable response and performance data for the on-going review and quality
assurance of both the test material and assessment criteria employed by Cambridge
English to assess children’s English language ability.
The development of computer-based assessments provides young learners with
an opportunity to choose the format that they prefer: PB/face-to-face or
CB. Following the ‘bias for best’ approach that Cambridge English subscribes to,
YLE candidates may choose whichever format (PB or CB) allows them to
demonstrate the best of their language ability. The purpose of test use will also
determine which format is chosen: whether candidates’ language skills are to be
demonstrated on the computer or in a PB/face-to-face test.
In spite of the wide range of countries in which CB YLE was trialled, at the time
of this study data were available for analysis from only four countries. This may
limit the generalizability of the findings to the whole YLE population, as the tests
are taken in 86 countries around the world.
In the future it would be worth exploring what causes the difference in perfor-
mance between boys and girls in paper-based and computer-based language tests.
Research on L2 learning in CLIL settings has likewise found that gender differences
between boys and girls are cancelled out. It would be interesting to investigate what
contributes to boys’ improved attitude and motivation towards L2 learning and their
improved performance in these studies.
As this study only looked at self-reported computer use, further investigation
could be conducted using objective measurement of children’s computer use in rela-
tion to language learning. Exploratory research reported on in this volume and else-
where in the emerging literature on young learners’ English language development
could be complemented by more empirical studies isolating and controlling for
intervening factors to better understand the causal relationship between variables
and their effect on learning outcomes.
In the future, further impact studies need to be conducted to investigate reasons
for choice of delivery mode (paper-based or computer based) by test takers, parents,
teachers, school heads and policy makers.
Finally, this study has shown that on-going validation studies need to be carried
out throughout various phases of CB test development for young learners. This
‘change as usual’ perspective is important in order to keep up with the changing
nature of the effect of technology on learning, teaching and assessment. As Bennett
(1998) has predicted, with the increasing role of technology in assessment, the
boundaries between learning, teaching and assessment will ultimately be blurred,
and assessment will truly be part of the teaching and learning processes, unobtru-
sively monitoring and guiding both (Jones, 2006).
Appendix A: Candidate Questionnaire (English Version)

10. What do you use computers for? ☐ English homework ☐ Email or chat with friends in English ☐ Anything else? ..........................
11. Which type of computer do you use most at home? ☐ Desktop (PC/Mac) ☐ Tablet ☐ Laptop
12. Do you prefer taking tests on paper or on computer? ☐ On paper ☐ No difference ☐ On computer

10. I liked the pictures in the test on the computer. ☐ ☐ ☐
11. The examples at the start of each task helped me understand what to do in the test on the computer. ☐ ☐ ☐
12. I liked taking the Listening and Reading & Writing tests on the COMPUTER. ☐ ☐ ☐

Speaking test on COMPUTER
Appendix B: Observer Checklist

10. What do your students use computers for? ☐ English homework ☐ Email or chat with friends in English ☐ Anything else? ……………………
11. Which type of computer do your students use most at school? ☐ Desktop (PC/Mac) ☐ Tablet ☐ Laptop
12. Do your students prefer taking tests on paper or on computer? ☐ On paper ☐ Not sure ☐ On computer
13. Which level of CB YLE tests have you observed? ☐ Starters ☐ Movers ☐ Flyers
14. What type of computer were the candidates using during the test you have observed? ☐ Desktop (PC/Mac) ☐ Tablet ☐ Laptop

Your observations on the CB YLE Speaking test (Yes / Not sure / No)
27. The candidates checked the microphone in the speaking test.
28. The candidates understood clearly what they had to do in the speaking test on the computer.
29. The candidates knew when to start talking in the speaking test on the computer.
30. The candidates knew when to stop talking in the speaking test on the computer.
31. The animations were helpful for the candidates to know how and when to start talking.
32. The animations were helpful for the candidates to know how and when to finish talking.
33. The candidates checked the timer to see how much time they had to speak.
34. I noticed some candidates rushing their answer in response to the timer.
35. The candidates had enough time to think about their answers in the speaking test on the computer.
36. The candidates had enough time to give their answers in the speaking test on the computer.
37. I noticed candidates were nervous while taking the speaking test on the computer, e.g. they hesitated, looked confused or distracted.
38. The candidates seemed to like speaking to a computer.
39. Lack of human examiner support did not prevent candidates from providing responses.

Your observations on the CB YLE Listening and Reading & Writing tests (Yes / Not sure / No)
40. The candidates changed the volume in the listening test.
41. The candidates understood what they needed to do in the Listening test on the computer.
42. The candidates understood what they needed to do in the Reading and Writing test on the computer.
43. The candidates were able to click/tap to write their answers on the computer.
44. The candidates were able to select their multiple choice answers on the computer.
45. The candidates were able to colour their answers on the computer.
46. The candidates were able to move easily between questions.
47. The candidates were able to move easily between tasks.
48. The candidates had enough time to answer all the questions in the Listening test on the computer.
49. The candidates had enough time to answer all the questions in the Reading and Writing test on the computer.
52. The candidates liked taking the Listening and Reading & Writing tests on the computer.
References
Barnes, S. K. (2010a). Using computer-based testing with young children. In Proceedings of the
NERA conference 2010: Paper 22. Retrieved from http://digitalcommons.uconn.edu/
nera_2010/22
Barnes, S. K. (2010b). Using computer-based testing with young children. PhD dissertation,
Number: AAT 3403029, ProQuest Dissertations and Theses database.
Becker, H. J. (2000). Who’s wired and who’s not: Children’s access to and use of computer tech-
nology. The Future of Children: Children and Computer Technology, 10, 3–31.
Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educa-
tional testing. Princeton, NJ: Policy Information Center, Educational Testing Service.
British Educational Research Association. (2011). Ethical guidelines for educational research.
London: BERA. Retrieved from http://content.yudu.com/Library/A2xnp5/Bera/resources/
index.htm?referrerUrl=http://free.yudu.com/item/details/2023387/Bera
British Psychological Society. (2009). Code of ethics and conduct. Leicester, UK: BPS. Retrieved
from http://www.bps.org.uk/sites/default/files/documents/code_of_ethics_and_conduct.pdf
Brown, J. D., & McNamara, T. (2004). The devil is in the detail: Researching gender issues in
language assessment. TESOL Quarterly, 38(3), 524–538.
Chapelle, C., & Douglas, D. (2006). Assessing languages through computer technology. In C. J.
Alderson & L. F. Bachman (Eds.), Cambridge language assessment. Cambridge, UK:
Cambridge University Press.
Clariana, R., & Wallace, P. (2002). Paper-based versus computer-based assessment: Key factors
associated with test mode effect. British Journal of Educational Technology, 33(5), 593–602.
Council of Europe. (2001). The common European framework of reference for languages.
Cambridge, UK: Cambridge University Press.
Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research
(2nd ed.). Thousand Oaks, CA: Sage.
Economic and Social Research Council. (2012). ESRC Framework for Research Ethics (FRE)
2010. Swindon, UK: ESRC. Retrieved from http://www.esrc.ac.uk/_images/framework-for-
research-ethics-09-12_tcm8-4586.pdf
Endres, H. (2012). A comparability study of computer-based and paper-based writing tests.
Research Notes, 49, 26–33, Cambridge, UK: Cambridge ESOL.
European Commission. (n.d.). The RESPECT project code of practice. The European Commission’s
Information Society Technologies (IST) Programme. Retrieved from http://www.respectproject.org/code/index.php
Fox, J. (2002). An R and S-PLUS companion to applied regression. Thousand Oaks, CA: Sage.
Fox, J. (2008). Applied regression analysis and generalized linear models. Thousand Oaks, CA:
Sage.
Hackett, E. (2005). The development of a computer-based version of PET. Research Notes, 22,
9–13, Cambridge, UK: Cambridge ESOL.
Hong Kong Special Administrative Region Government Education Bureau. (n.d.). General studies
for primary curriculum. Retrieved from http://www.edb.gov.hk/en/curriculum-development/
kla/general-studies-for-primary/index.html
Hong Kong Special Administrative Region Education Bureau Information Services Department.
(2014). The fourth strategy on information technology in education. Realising IT potential,
unleashing learning power. Retrieved from http://www.edb.gov.hk/attachment/en/edu-system/
primary-secondary/applicable-to-primary-secondary/it-inedu/Policies/4th_consultation_eng.pdf
Jones, N. (2000). BULATS: A case study comparing computer-based and paper-and-pencil tests.
Research Notes, 3, 10–13, Cambridge, UK: Cambridge ESOL.
Jones, N. (2006). Assessment for learning: The challenge for an examination board. In R. Oldroyd,
(Ed.), Excellence in assessment: Assessment for learning (pp. x–u). Cambridge, UK: Cambridge
Assessment.
Jones, N., & Maycock, L. (2007). The comparability of computer-based and paper-based tests:
Goals, approaches, and a review of research. Research Notes, 27, 11–14, Cambridge, UK:
Cambridge ESOL.
Lee, H. K. (2004). A comparative study of ESL writers’ performance in a paper-based and a
computer-delivered writing test. Assessing Writing, 9(1), 4–26.
Li, J. (2006). The mediation of technology in ESL writing and its implications for writing assess-
ment. Assessing Writing, 11(1), 5–21.
Maycock, L., & Green, T. (2005). The effects on performance of computer familiarity and attitudes
towards CB IELTS. Research Notes, 20, 3–8, Cambridge, UK: Cambridge ESOL.
McDonald, A. S. (2002). The impact of individual differences on the equivalence of computer-
based and paper-and-pencil educational assessment. Computers & Education, 39(4),
299–312.
Merrell, C., & Tymms, P. (2007). What children know and can do when they start school and how
this varies between countries. Journal of Early Childhood Research, 5(2), 115–134.
National Association for the Education of Young Children (NAEYC). (2009). Joint position state-
ment from the National Association for the Education of Young Children and the National
Association of Early Childhood Specialists in State Departments of Education: Where we stand
on curriculum, assessments and program evaluation. Retrieved from http://www.naeyc.org/
files/naeyc/file/positions/StandCurrAss.pdf
Neuman, G., & Baydoun, R. (1998). Computerization of paper-and-pencil tests: When are they
equivalent? Applied Psychological Measurement, 22(1), 71–83.
O’Sullivan, B., Weir, C., & Yan, J. (2004). Does the computer make a difference? IELTS Research
Project Report. Cambridge, UK: Cambridge ESOL.
OECD/CERI. (2008). New millennium learners: Initial findings on the effects of digital technolo-
gies on school-age learners. Retrieved from http://www.oecd.org/site/educeri21st/40554230.
pdf and http://www.oecd.org/edu/ceri/centreforeducationalresearchandinnovationceri-
newmillenniumlearners.htm
Pedró, F. (2006). The new millennium learners: Challenging our views on ICT and learning. Paris:
OECD/CERI.
Pedró, F. (2007). The new millennium learners: Challenging our views on digital technologies and
learning. Nordic Journal of Digital Literacy, 2(4), 244–264.
Pommerich, M. (2004). Developing computerized versions of paper-and-pencil tests: Mode effects
for passage-based tests. The Journal of Technology, Learning and Assessment, 2(6), 3–44.
Pomplun, M., Frey, S., & Becker, D. (2000). The score equivalence of paper-and-pencil and com-
puterized versions of a speeded test of reading comprehension. Educational and Psychological
Measurement, 62, 337–353.
Prensky, M. (2001). Digital natives, digital immigrants. On the Horizon, 9(5), Bradford, UK: MCB
University Press.
Rideout, V. J., Vandewater, E. A., & Wartella, E. A. (2003). Zero to six: Electronic media in the
lives of infants, toddlers and preschoolers. Menlo Park, CA: The Henry J. Kaiser Family
Foundation.
Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing stu-
dents’ performance on tests conducted via computer and via paper-and-pencil. Education Policy
Analysis Archives, 5(3), 1–19.
Sim, G., Holifield, P., & Brown, M. (2004). Implementation of computer assisted assessment:
Lessons from the literature. ALT-J, 12(3), 215–229.
Sim, G., & Horton, M. (2005). Performance and attitude of children in computer based versus
paper based testing. In P. Kommers & G. Richards (Eds.), Proceedings of ED-MEDIA World
conference on educational multimedia, hypermedia & telecommunications. Seattle, WA:
AACE.
Social Research Association. (2003). Social Research Association ethical guidelines. London:
Social Research Association. Retrieved from http://the-sra.org.uk/wp-content/uploads/
ethics03.pdf
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The relationship between computer
familiarity and performance on computer based TOEFL test tasks. TOEFL Research Reports.
Princeton, NJ: ETS.
Tymms, P., & Merrell, C. (2009). On-entry baseline assessment across cultures. In A. Anning,
J. Cullen, & M. Fleer (Eds.), Early childhood education: Society and culture (2nd ed.,
pp. 117–128). London: Sage.
Tymms, P., Merrell, C., & Hawker, D. (2012). IPIPS: An international study of children’s first year
at school. Paris: OECD.
Wall, K., Higgins, S., & Tiplady, L. (2009, September). Pupil views templates: Exploring pupils’
perspectives of their thinking about learning. Paper presented at 1st International Visual
Methods Conference Leeds.
Wang, S., Jiao, H., Young, M. J., Brooks, T. E., & Olson, J. (2007). A meta-analysis of testing mode
effects in grade k–12 mathematics tests. Educational and Psychological Measurement, 67,
219–238.
Zandvliet, D. (1997). A comparison of computer-administered and written tests. Journal of
Research on Technology in Education, 29(4), 423–438.
Learning EFL from Year 1 or Year 3?
A Comparative Study on Children’s EFL
Listening and Reading Comprehension
at the End of Primary Education
Abstract Do primary school children achieve better listening and reading skills
when they start learning EFL in year 1 instead of year 3? Addressing this question
this chapter sums up an empirical study investigating the EFL achievements of more
than 6,500 primary school children in Germany. Data was collected in 2010 and
2012 as part of the interdisciplinary longitudinal research study Ganz In, allowing
for the comparison of two cohorts who differ in the length and quantity of early EFL
instruction due to curricular changes: whereas the 2010 cohort learned EFL for two
lessons per week over 2 years (beginning at the age of ~8), the 2012 cohort learned
EFL for two lessons per week over 3.5 years (beginning at the age of ~6). In summary,
the findings show that children with three and a half years of early EFL education
demonstrated higher receptive achievements than children with 2 years of early
EFL education. Independent of their mono- or multilingual backgrounds, all learners
seemed to benefit from extended EFL education. The results of a multilevel
regression analysis indicate that the language background of young learners does
not explain any variance in their receptive EFL achievements. Instead, their reading
skills in German (the language of schooling), along with their socio-economic
status and gender, were identified as explanatory factors.
E. Wilden (*)
English Department, University of Vechta, Vechta, Germany
e-mail: [email protected]
R. Porsch
Institute of Educational Science, University of Muenster, Muenster, Germany
e-mail: [email protected]
1 Introduction
Do primary school children achieve better listening and reading skills when they
start learning English as a foreign language (EFL) in year 1 instead of year 3? This
chapter sets out to present the design and results of an empirical study relating to the
receptive EFL achievements of more than 6,500 primary school children in Germany
and to find a preliminary answer to this research question. The data that were col-
lected in 2010 and 2012 as part of the interdisciplinary longitudinal research project
Ganz In – All-Day Schools for a Brighter Future allow us to compare two cohorts
that, due to curricular changes, differ in the length and quantity of early EFL instruc-
tion. Whereas the 2010 cohort learned EFL over the course of 2 years at two lessons
per week (beginning approx. at the age of 8), the 2012 cohort learned EFL over
three and a half years at two lessons a week (beginning approx. at the age of 6). This
chapter seeks to answer a question relevant throughout Europe and beyond: whether
earlier EFL education at primary level leads to better learning outcomes.
The chapter is structured as follows: After sketching out the current curricular
situation with regards to early foreign language learning in Germany, the theoretical
concepts underlying this study, particularly listening and reading competences as
well as multilingualism, will be presented. This is followed by a summary of prior
research findings on listening and reading competences in early foreign language
education with a particular focus on research on young mono- and multilingual
learners. In the empirical section, the research questions, the research hypotheses
and the research design will be presented before the findings of the study are
described and discussed.
In Germany education is mainly the task of the federal states (Länder). As a conse-
quence, each of the 16 states has its own school system and own curriculum.
However, all of the different school systems do share most of the following charac-
teristics: In general, children enter primary education at the age of 6. In most states,
children enter secondary education after year 4, in two states after year 6. It is com-
pulsory for children to attend at least 10 years of schooling; teenagers aiming at
university education attend school for 12 or 13 years in total. Most federal states
begin EFL education at the primary level in year 3; in five states children
already start learning EFL in year 1 (Rixon, 2013, pp. 116–117; Treutlein, Landerl
& Schöler, 2013, pp. 20–22). As the present study was conducted in the federal state
of North-Rhine Westphalia (NRW), the political and curricular situation in this par-
ticular state is outlined in greater detail. Compulsory EFL education in year 3 was
first introduced in NRW in the 2003/2004 school year. Just 5 years later it was
moved forward to the second term of year 1. These curricular changes caused
significant transformations within a relatively short time span for both teachers and
school management. Even though early EFL education was embraced by both EFL
researchers and many teachers, there was a huge media controversy about these cur-
riculum changes as exemplified in an article by Kerstan (2008) in the German
broadsheet Die Zeit titled, “No Murks, please. Stoppt den Fremdsprachenunterricht
an Grundschulen! [No screw ups, please. Stop foreign language teaching in primary
schools!]”.
As a consequence of these curricular changes in NRW, the two cohorts tested in
this study differ in two respects: On the one hand, they differ in the length of EFL
education with the groups having two years and three and a half years respectively
(approximately eighty 45-min lessons per school year). On the other hand, they
were taught on the basis of two different curricula: The cohort tested in 2010 was
taught on the basis of the 2003 curriculum (see MSWNRW, 2003) which first intro-
duced primary EFL education in NRW. The second cohort tested in 2012 was the
first group to be taught in accordance with the 2008 curriculum (see MSWNRW,
2008). A comparative analysis of these curricula (Wilden, Porsch & Ritter, 2013,
pp. 173–176) showed that the latter curriculum prescribed a more pronounced inte-
gration of written language: teachers were asked to give written input to support
EFL learning right from the start. Furthermore, the 2008 curriculum for the first
time determined explicit EFL competence levels for the end of primary education in
year 4 after 4 years of schooling. Both curricula highlight oral competences as one
of the main objectives of early EFL education along with the acquisition of listening
and audio-visual skills (also see Benigno & de Jong, 2016; Nikolov, 2016b in this
volume).
In this study, primary school children were tested on their English reading and lis-
tening skills. In this context, listening concerns the ability to extract information
from spoken English. This is a complex, dynamic, active and two-sided (bottom-up
and top-down) process during which learners deduce and attribute meaning and
interpret what they heard (see Field, 2008; Nation & Newton, 2009; Vandergrift &
Goh, 2012 for a detailed introduction to the listening construct).
The term ‘reading’ or ‘reading comprehension’ describes the ability to extract
information from written English texts. This includes various simultaneous pro-
cesses of understanding in the course of which readers construct meaning with the
help of information given in the text (bottom-up), world knowledge gained from
experience (top-down) as well as reading strategies (see Grabe & Stoller, 2011;
Nation, 2008; Urquhart & Weir, 1998 for a detailed introduction to the reading
construct).
A special focus of this study is on the EFL achievements of children with mono- and
multilingual backgrounds in German primary schools (also see Mihaljević
Djigunović, 2016 in this volume). The concept of multilingualism is used in various
disciplines with different, though overlapping meanings (see Hu, 2010; Roche,
2013a, pp. 189–199). In foreign language education, multilingualism is considered
to be both a prerequisite and a goal (Hu, 2004, p. 69). On the one hand, the European
Commission set the political goal that every European should have communicative
competences in several languages. On the other hand, active use of several lan-
guages is already part and parcel of the life of many school children in Germany
even though German is the official and predominant language in Germany. This is
due to the fact that there is a significant population of immigrants in Germany: according to the most recent 2012 census, about 16.3 million people living in
Germany (out of a total of about 80.5 million people) have a migration background
(Statistisches Bundesamt Deutschland, 2013).
In line with the interdependence hypothesis (Roche, 2013b, p. 136; Rohde, 2013,
p. 38) as well as the cultural dimension of multilingualism, knowledge and use of
several languages outside of school should therefore be considered as a factor in
further school-based language education (Hu, 2003; Roche, 2013b, pp. 193–195;
Schädlich, 2013, p. 33).
We consider children to be multilingual if the following aspects apply to their
lives: (a) They use German as the language of schooling and it is not their first, but
their second or even third, etc. language, and (b) they either grew up with more than
one language before starting their formal education or they changed to the German
education system from another one to learn German as the official language alongside
other foreign languages on the basis of their first language (Hu, 2010, pp. 214–215).
In this sense, children are categorized as being ‘multilingual’ in this study if they are
growing up with more than one language in their lives outside of school and are
learning English as a third or fourth language. In contrast, children are categorized
as ‘monolingual’ if they are growing up with only German. The English they learn
in primary EFL education is their second language.
In what follows, several empirical studies on listening and reading in early language
learning of EFL will be outlined with a particular focus on studies situated in
Germany. In order to limit the scope of the overview, studies relating to other aspects
of early foreign language education are not considered (however see in this
volume Szpotowicz & Campfield, 2016; Papp & Walczak, 2016; Mihaljević
Djigunović, 2016).
The EVENING study (Paulick & Groot-Wilken, 2009) tested children in Germany
(NRW) in 2006 (N = 1748) and 2007 (N = 1344) at the end of primary education in
year 4 (age 9–10 years) on their listening and reading skills after two years of EFL
learning. The tests developed in the study complied with the requirements of the
relevant curriculum (MSWNRW, 2003) and even exceeded them considerably in
terms of the listening test (Paulick & Groot-Wilken, p. 185). However, there were
some differences between the two parts of the listening test (cf., pp. 185–187): In
the first part, children heard isolated sentences and scored a mean value of 11.5 out
of 17 points, which the authors of the study interpreted as being ‘good’ or even
‘very good’ results (p. 185). The second part of the test (in which children answered
questions on a story they heard twice) appeared to be more challenging, for they
scored a mean value of 5.5 out of 11 points. More than 73 % of the children tested
were able to answer more than half of the listening items correctly and 15 %
answered correctly more than three quarters of the items. The report by Paulick and
Groot-Wilken does not specify whether the data analysis was based on both surveys.
The absolute values in the tables on pp. 191–192 seem to indicate, however, that the
results of the data analysis presented are solely based on the 2007 survey (N = approx.
1300). These results occurred in spite of the fact that the listening test was far more
demanding than required in the curriculum and many teachers had assessed it as too
difficult prior to its administration (p. 186).
The KESS 4 study (May, 2006) tested all primary school children in the federal
state of Hamburg at the end of year 4 (ages 9–10 years) on their EFL listening
achievements with a test developed for the study. The results indicated that most of
the children were able to understand individual statements and answer simple ques-
tions after 2 years of EFL learning (p. 223). Twenty-five percent of the children
belonged to the high-achieving group who were able to understand a coherent text
read to them and connect different parts of the text with one another.
The 3-year longitudinal ELLiE study (Enever, 2011) examined among other
aspects the listening skills of roughly 1,400 children in seven European countries
(Germany did not take part). Beginning in the second year of EFL learning, pupils
aged 7–8 years were tested in listening at the end of each school year from 2007 to
2010 (The ELLiE team, 2011, pp. 15–16). By repeating four items (at the CEFR
level A1; see Szpotowicz & Lindgren, 2011, p. 129) in each testing phase, the study
was able to analyse the development of children’s listening skills. The results
showed, with only a few exceptions and country-specific variations, an improvement
of children’s listening achievements during the three years (pp. 130–133). The
authors identified non-school related factors such as the use of the language in
society or the media as factors influencing the development of foreign language
listening skills (p. 133).
In a quasi-experimental study with 10 year 3 classes (age 8–9 years), Duscha
(2007) researched the influence of reading and writing on various aspects of early
language learning. All ten classes were taught six parallel units, with half of the
classes receiving no written language input. The pupils were tested at the end of
each teaching unit. The impact of written language input on listening comprehen-
sion was tested with a picture-sentence-matching task at the end of a four-lesson
unit on prepositions (after a total of 15 lessons). The children who had participated
in the lessons with written language input on average scored better on the listening
test than the children who had received no written language input (p. 288). These
findings could be seen as an indicator that written language input in early language
learning could be beneficial for the development of listening skills.
In conclusion, outcomes of these studies on the listening comprehension of pri-
mary school children (school years 1–4, aged approx. 6–10 years) can be summed
up as follows (also see Bacsa & Csíkos, 2016 in this volume): The majority of
children are able to understand individual, spoken sentences after two years of EFL
learning and high-achieving children can even understand longer, coherent texts
(May, 2006; Paulick & Groot-Wilken, 2009). In a longitudinal European compara-
tive study, the majority of primary school children demonstrated a development
of their listening skills over three years (Szpotowicz & Lindgren, 2011). Written
language input in the primary EFL classroom was identified as beneficial for the
development of listening comprehension (Duscha, 2007).
In recent years there has been an increase in studies on the effect of written language
input in early foreign language learning in primary schools in Germany. This trend
stems from the academic discourse among researchers and teachers on when the
best time is to introduce written language into the early foreign language classroom
(see Bleyhl, 2000, 2007; Diehr & Rymarczyk, 2010; Doyé, 2008; Treutlein et al.,
2013). These studies explore both reading silently for comprehension and reading
out loud with a focus on pronunciation. In line with the research focus of this study,
only studies on silent reading are overviewed in this section.
On the reading test of the EVENING study, children at the end of year 4 demon-
strated good reading skills after two years of EFL education – a result similar to the
one found on the listening test. In the first part of the test, the young learners had to
read individual sentences and match them with another sentence. On average the
children scored 9.1 out of 14 points (Paulick & Groot-Wilken, 2009, pp. 188–190).
In the second part of the reading test, they had to reconstruct a narrative text through
a sentence-picture matching activity. On average they scored 5.6 out of 8 points.
Thus, the authors of the study conceded that this part of the test appeared to be too
easy for the target group (p. 189). Moreover, they stated that future studies should
also go beyond the sentence level and test reading comprehension at the text level
as well (p. 195). Overall, 74.2 % of the children solved more than half of the items
on the reading test and 32.5 % managed to get more than three-quarters right. The
authors of the EVENING study had not expected these results (p. 195), as hardly
any written language input had been presented in the 88 lessons that were evaluated
in the study (Groot-Wilken, 2009, p. 137). Moreover, the teachers interviewed in the
study had considered written language use to be a subordinate aspect of primary
EFL teaching (p. 132).
In the ELLiE study, reading comprehension was tested with a matching activity
in which the children had to fill in speech bubbles in a comic strip (Szpotowicz &
Lindgren, 2011, p. 133). This task allowed for a differentiation of reading skills
based on the level of difficulty of the different items. While more than 75 % of the
children were able to match texts to concrete objects in a picture, only 32 % were
able to correctly match a text for which they had to use contextual information and
“vocabulary knowledge from the world beyond the cartoon” (p. 135).
Rymarczyk (2011) researched the EFL reading skills of year 1 and year 3 pupils
and found that even underachieving learners demonstrated considerable achieve-
ments in reading provided that written language input was supplied in the EFL
classroom. The author identified differences in silent reading for comprehension
and reading out loud. On the one hand, the children relied on German grapheme-phoneme correspondences and thus did less well in reading out loud activities. On
the other hand, they achieved much better results in silent reading comprehension
activities in which they had to match pictures and words (pp. 61–65). On the basis
of these results, the author argues in favour of using written language input from
year 1 of primary EFL education (p. 65).
In a study examining two primary school classes who had learned English from
the second semester of year 2, Frisch (2011) researched both the participants’ read-
ing comprehension and pronunciation in EFL reading. Over a period of 10 months
they were taught according to two different methods. Whereas one class was taught
following the whole word approach, the other one was taught following the phonics
method (see Thompson & Nicholson, 1998). The study was motivated by the grapheme-phoneme correspondence of the English language which, compared to the more regular German grapheme-phoneme correspondence, is rather opaque (Frisch, p. 71). While the whole word approach aims at inductive-implicit reading, the phonics method explicitly deals with sound-letter relationships. At the end of the project, both groups showed good test results in reading comprehension (p. 82).
Moreover, the children’s pronunciation appeared to have benefited from learning
EFL following the phonics method (p. 84). On the basis of these findings, Frisch
argued for using written language input in the early EFL classroom explicitly and
systematically.
In conclusion, these empirical studies on the EFL reading comprehension of
primary school children (school years 1–4) can be summed up as follows: After 2 or
3 years of early foreign language education, most children are able to understand
simple sentences (Paulick & Groot-Wilken, 2009; Szpotowicz & Lindgren, 2011)
as well as to reconstruct narratives with the help of pictures (Paulick & Groot-
Wilken). The children demonstrate these good reading skills even if the teaching
mainly focused on fostering oral skills (Paulick & Groot-Wilken, 2009; Szpotowicz
& Lindgren, 2011). From the first year of FL learning children appear to benefit in
their reading comprehension from written language input and the explicit teaching
of reading comprehension (Frisch, 2011; Rymarczyk, 2011).
4 Research Design
The aim of this study is to determine the effect of extending the EFL learning time
at German primary schools on listening and reading comprehension and to compare
test results from young learners after learning EFL for two years with those who
have learned EFL for three and a half years. Relating to the discourse on the pros
and cons of early foreign language learning for children growing up in mono- or
multilingual families, the data on receptive EFL achievements are further analysed
to see whether all children benefit from the extended learning time. In other words,
the study compares the test results of children with different linguistic backgrounds.
The study aims at answering the following research questions:
(1) Do learners with three and a half years of early EFL learning show higher lis-
tening and reading competences than learners with two years of early EFL
learning?
(2) Considering their mono- and multilingual backgrounds, do learners show
higher degrees of listening and reading competences after three and a half years
of EFL learning than after two years?
(3) Do the mono- or multilingual backgrounds of EFL learners influence their EFL
listening and reading competences at the end of primary education when statistically controlling for gender, socio-economic status (SES) and German reading skills?
The following hypotheses were devised on the basis of prior research findings:
Children who learned EFL for three and a half years demonstrate higher listening
and reading skills than those who learned English for only two years (hypothesis 1).
All children demonstrate higher receptive EFL achievements through extending the
EFL learning time – independent of their linguistic backgrounds (hypothesis 2).
Children growing up in multilingual families with German will demonstrate higher
receptive EFL achievements than children growing up in multilingual families with-
out German (hypothesis 3). Regarding research question 3, the existing empirical
evidence is currently insufficient to devise a hypothesis.
The data for this study were collected as part of the research dimension of the
German Ganz In project. This project supports 30 secondary schools (Gymnasien)
in NRW as they restructure their school organizations to become all-day schools. In
2010 (group 1) and 2012 (group 2) two cohorts of year 5 pupils were tested imme-
diately after their transition from primary to secondary school (in the first 6 weeks
after the summer holidays). The paper-and-pencil tests were administered by trained
test administrators following standardized test manuals. The children in group 1
(N1 = 3216) had learned EFL for two years, whereas those in group 2 (N2 = 3279) had
learned EFL for three and a half years. The composition of both groups was
compared with regard to various background variables (nominal-scaled responses):
gender, first language (German or other) and place of birth (Germany or other) in
order to ensure the comparability of the two groups (see Table 1).
For the metric-scaled variables (age, number of books at home (SES) and the
grades in German and English) descriptive values and results from t-tests are
presented in Table 2.
Table 3 Number of pupils grouped by the languages acquired at home (percentages in brackets)
                                   Group 1 (2010)   Group 2 (2012)   Total
(a) Monolingual with German        1647 (63.3)      1749 (66.5)      3431 (64.9)
(b) Multilingual with German       614 (23.6)       557 (21.2)       1180 (22.3)
(c) Multilingual without German    342 (13.1)       325 (12.4)       679 (12.8)
4.3 Measures
At both times the participants completed the same measures of EFL listening and
reading comprehension as well as a socio-demographic background questionnaire.
The EFL listening comprehension test that was developed for the EVENING study
consisted of two tasks with a total of 28 items (α = .68). The EFL reading compre-
hension test, with a total of 24 items (α = .69), was also partially developed in the
EVENING study (Börner, Engel & Groot-Wilken, 2013; Paulick & Groot-Wilken,
2009). On both tests, the items were either multiple-choice or short answer ques-
tions designed to test the Common European Framework of Reference for Languages
(Council of Europe, 2001) levels A1 and A2. Furthermore, proficiency scores from a reading comprehension test in German (Adam-Schwebe, Souvignier & Gold, 2009) were estimated using a Rasch model (18 items, α = .70). In addition to the aforementioned background variables, an index for estimating the SES was collected.
This index is based on Bourdieu’s theory (1983) and includes the pupils’ and
parents’ responses to assess their economic, social and cultural capital, thus allowing
for the allocation of pupils to four different groups (1–4) that indicate a lower or
higher SES.
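The reliabilities reported above are Cronbach's alpha coefficients. As an illustration only (a minimal sketch, not the authors' analysis code), alpha can be computed from a matrix of dichotomously scored responses as follows; the simulated data are hypothetical:

```python
# Illustrative sketch: Cronbach's alpha for a dichotomously scored test.
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """responses: (n_students, n_items) matrix of 0/1 scores."""
    k = responses.shape[1]                          # number of items
    item_var = responses.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical data generated from a Rasch-style model (not the study's data)
rng = np.random.default_rng(42)
ability = rng.normal(0, 1, (500, 1))      # simulated person abilities
difficulty = rng.normal(0, 1, (1, 28))    # simulated item difficulties
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
responses = (rng.random((500, 28)) < p_correct).astype(int)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```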
In order to obtain proficiency scores, the students' responses were first coded as either correct or incorrect (dichotomous variables). Second, the data for both
cohorts were scaled in one model using a probabilistic approach (Rasch model; see
Rasch, 1960/1980), but for each domain (listening comprehension and reading
comprehension) separately in order to get a common mean value for both groups.
Analyses were computed with ConQuest 2.0 (Wu, Adams, Wilson, & Haldane,
2007) estimating person parameters (weighted likelihood estimates, WLE; Warm,
1989). The estimates are based on the logit scale provided by ConQuest and range from roughly −3 to +3 with a mean of zero. Following the conventions of international studies such as PISA, the scores were transformed onto a scale with a mean of 500 and a standard deviation of 100. Finally, in order to test whether the differences between the group means are statistically significant, t-tests were conducted with an adjustment of the probability value (Bonferroni correction; see e.g., Mayers, 2013).
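As a hedged sketch of these two post-scaling steps (simulated person estimates stand in for the actual ConQuest output), the transformation onto the 500/100 metric and the Bonferroni-adjusted comparison might look as follows:

```python
# Sketch of the reported rescaling and significance test (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
wle_2010 = rng.normal(-0.08, 1.0, 3216)  # cohort 1 person estimates (logits)
wle_2012 = rng.normal(0.07, 1.0, 3279)   # cohort 2 person estimates (logits)

# Transform onto a scale with mean 500 and SD 100 (PISA convention), using
# the pooled mean and SD so both cohorts stay on a common metric.
pooled = np.concatenate([wle_2010, wle_2012])

def rescale(x):
    return 500 + 100 * (x - pooled.mean()) / pooled.std(ddof=1)

s_2010, s_2012 = rescale(wle_2010), rescale(wle_2012)

# Two-sample t-test; with two domains tested (listening and reading), the
# Bonferroni correction multiplies each p-value by 2.
t, p = stats.ttest_ind(s_2012, s_2010)
print(f"t = {t:.2f}, Bonferroni-adjusted p = {min(p * 2, 1.0):.4f}")
```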
In order to answer the research questions, three multi-level analyses were con-
ducted (random intercept models) instead of traditional linear regression models.
Multilevel modelling accounts for the variability at different levels, as it takes into account that the data structure is nested or hierarchical in nature (i.e., children nested
within classrooms within schools). Failing to use multi-level analyses would lead to
an inaccurate picture of the results, for the assumption of independent samples
would be violated regarding the nested data and the standard errors of the parame-
ters would be underestimated. All of the children tested at grade 5 were from the
same school type (Gymnasium); however, the schools were regionally diverse
(urban and rural), which influenced the composition of the cohorts (e.g., SES, the
proportion of children from migrant families). All predictors were z-standardized,
which has the advantage that the regression coefficients from multilevel models can
be interpreted nearly as standardized regression coefficients (Bryk & Raudenbush,
1992). The analyses were conducted using the free software “R” (package: lme4).
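A rough Python analogue of such a random-intercept model (a sketch under assumed column names, not the authors' R code) could be specified with statsmodels:

```python
# Sketch of a random-intercept model with children nested within schools.
# The file and column names ("efl_2012.csv", "school", etc.) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("efl_2012.csv")

# z-standardize the continuous predictors so that coefficients can be read
# approximately as standardized regression weights
for col in ["german_reading", "ses"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std(ddof=1)

# "Model 3": all predictors, with a random intercept per school
model = smf.mixedlm(
    "efl_listening ~ german_reading + ses + C(gender) + C(lang_background)",
    data=df,
    groups=df["school"],
).fit()
print(model.summary())
```

In R, the corresponding lme4 call would be along the lines of lmer(efl_listening ~ german_reading + ses + gender + lang_background + (1 | school), data = df).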
5 Results
The results for answering research question 1 are provided in Fig. 1: On average, the
children with three and half years of primary EFL education (group 2 in 2012) dem-
onstrated higher receptive achievements than those with two years of EFL education
(group 1 in 2010). On the listening comprehension test, the 2010 cohort scored a
mean of 492 points and the 2012 cohort a mean of 507 points (M = 500, SD = 100).
Similarly, on the reading comprehension test the former group scored a mean of 491
points whereas the latter scored a mean of 508 points. These differences (15 and 17 points, respectively) are statistically significant in both domains.
[Fig. 1 Results for listening and reading comprehension after 3.5 vs. 2 years of English at primary school: listening 507 (2012) vs. 492 (2010), reading 508 (2012) vs. 491 (2010), both p < .001]
Table 4 Proportion of students in four proficiency groups for listening and reading comprehension (percentage in each group)
                                          Less than 400   400–499   500–599   600 or more
Listening comprehension   Group 1 (2010)  10.3            46.1      35.6      7.9
                          Group 2 (2012)  7.7             38.0      41.6      12.7
Reading comprehension     Group 1 (2010)  15.1            38.1      35.2      11.5
                          Group 2 (2012)  12.0            33.9      39.0      15.1
[Fig. 2 Results for listening comprehension after 3.5 vs. 2 years of primary EFL education grouped according to language background (mean values and t-test results). 2012 means: 515, 501, 496; t-tests: (a) vs. (b) < .05, (a) vs. (c) = .264, (b) vs. (c) = .108. 2010 means: 486, 488; t-tests: (a) vs. (b) < .05, (a) vs. (c) = .09, (b) vs. (c) = .662]
[Fig. 3 Results for reading comprehension after 3.5 vs. 2 years of primary EFL education grouped according to language background ((a) monolingual German; mean values and t-test results). 2012 means: 512, 506, 505; 2010: 481; 2010 t-tests: (a) vs. (b) = .301, (a) vs. (c) < .05, (b) vs. (c) = .162]
However, the differences between the two groups tested in 2010 and in 2012 are small and statistically not significant. Nevertheless, regardless of the length of EFL education, the children growing up monolingually with German demonstrated the highest
receptive achievements in both domains, even if only some of the differences
between the groups of the monolingual and multilingual learners were statistically
significant (with a maximum difference of 14 points).
The third question in this study addresses the influence of the children’s language
backgrounds on their receptive EFL achievements. The scores from the receptive
EFL tests were taken as the dependent variables. Apart from the language back-
ground of the children (monolingual German, multilingual with or without German)
the following variables were controlled for: gender, SES, and reading comprehension
skills in German using the proficiency score from the reading test. The analyses are
based on the data from the 2012 study. First, the intraclass correlations (ICC) were
calculated by applying random intercept models without any predictors (zero
model). As a result, the variance proportion of the total variance that can be explained
by the different schools is given. For listening comprehension the ICC is .085, meaning that only about 9 % of the variance in performance can be explained by differences across schools; for reading comprehension it was even lower at only 3 %.
Three models were computed: Model 1 includes the participants’ reading compre-
hension skills in German, SES and gender as predictors for EFL listening and
reading comprehension. In model 2 only the language background serves as a
predictor. Finally, in model 3 all of the variables were included as predictors
(see Tables 5 and 6).
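The ICC reported above comes from the zero model: the between-school variance divided by the total (between-school plus residual) variance. A minimal sketch, again with hypothetical file and column names:

```python
# Sketch: intraclass correlation from a "zero model" (no predictors).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("efl_2012.csv")  # hypothetical data file
m0 = smf.mixedlm("efl_listening ~ 1", data=df, groups=df["school"]).fit()

between = m0.cov_re.iloc[0, 0]  # between-school variance
residual = m0.scale             # within-school (residual) variance
icc = between / (between + residual)
print(f"ICC = {icc:.3f}")       # the study reports ~.085 for listening
```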
Table 7 Results for German reading comprehension grouped according to pupils' language backgrounds (means and standard deviations in brackets)
                                  (a) Monolingual   (b) Multilingual   (c) Multilingual
                                  (German)1         with German2       without German3
Reading comprehension in German   513 (99.57)       492 (94.94)        477 (101.27)
Note. N1 = 1634; N2 = 511; N3 = 300
The results show that the children’s language background cannot explain any
variance in their performance on the receptive EFL tests. Instead, their reading com-
prehension skills in German, their SES and their gender were identified as factors
that explain some variance. However, the proportion of performance variance
(regarding the receptive skills) that can be attributed to the individual level explained
by the predictors included in the models is very small. A maximum of 9 % of the
listening comprehension skills and 5 % of the reading comprehension skills are
explained by model 3. Nevertheless, the findings suggest that instead of the language background of young EFL learners (whether they grow up mono- or multilingually, with or without German in their families) it is actually their German reading skills, in addition to their SES and gender, that impact their receptive
EFL achievements. Therefore, the data from the German reading comprehension
test were also analysed to differentiate the test results according to the participants’
language backgrounds (see Table 7). The proficiency scores from an IRT analysis
were transformed and put onto a scale with a mean of 500 and a standard deviation
of 100.
The results show considerable differences in the German reading comprehension
test scores depending on the children’s language background. As expected, children
growing up monolingually with German achieved the highest scores. Children
growing up in multilingual families with German scored 21 points less, but were
still 15 points ahead of children growing up in multilingual families without
German. The differences between the three groups were tested using an ANOVA
model (F[2, 2442] = 21.942, p < .001). Interestingly, the large difference in their
German reading comprehension skills appears to have only a small effect on their
performance on the EFL tests. Comparing the mean differences in receptive EFL
skills (see Figs. 2 and 3), the largest difference is 14 points between the three
mono- and multilingual groups. In contrast, the largest difference between the three
groups on the German reading comprehension test is 36 points. The results from the
multilevel analyses point to the general importance of the language proficiency in
German – the language of schooling – for achievements in the EFL classroom that
cannot be explained by the individual language background of these young learners.
This indicates that potentially there are underlying competences which help chil-
dren to understand written and oral texts across languages.
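For illustration, the reported one-way ANOVA corresponds to a call like the following; the simulated arrays merely mimic the group means, SDs and sizes from Table 7 and are not the study's data:

```python
# Sketch of the one-way ANOVA across the three language-background groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mono_german = rng.normal(513, 100, 1634)    # (a) monolingual (German)
multi_with = rng.normal(492, 95, 511)       # (b) multilingual with German
multi_without = rng.normal(477, 101, 300)   # (c) multilingual without German

f, p = stats.f_oneway(mono_german, multi_with, multi_without)
print(f"F(2, {1634 + 511 + 300 - 3}) = {f:.3f}, p = {p:.4f}")  # df = 2442
```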
The findings from this study can be summarized as follows. On average, the children
with three and a half years of early EFL education demonstrated higher receptive
achievements than children with two years of early EFL education. In the 2012
cohort, which had three and a half years of early EFL learning, there were more
high-achieving children who demonstrated very high receptive EFL achievements. In contrast, there were more low-achieving children with regard to their receptive EFL skills in the 2010 cohort, which had two years of
early EFL learning. The comparison of the receptive EFL achievements of children
growing up in (a) monolingual families with German, (b) multilingual families with
German and (c) multilingual families without German showed that all learners
seemed to benefit from extending the EFL learning time from two to three and a half
years, for all three groups demonstrated higher receptive EFL skills after three and
a half years of EFL learning.
Furthermore, the results of a multilevel regression analysis indicated that the
language background of young learners – whether they are mono- or multilingual –
cannot explain any variance in their receptive EFL achievements. Instead, their
reading skills in German (the language of schooling) in addition to their SES and
gender were identified as factors that explain a small proportion of variance in the
receptive EFL achievements of these young learners. A comparison of mono- and
multilingual learners’ German reading skills showed considerable differences
between the three groups. While the children growing up in monolingual families
with German demonstrated the highest German reading skills, the children growing
up in multilingual families with German demonstrated considerably lower German
reading achievements, but were still significantly ahead of the children growing up
in multilingual families without German. However, the large differences in the
German reading skills seemed to have only a small effect on their receptive EFL
achievements, as the differences between the EFL proficiency scores of the three
groups are much smaller. Nevertheless, these findings indicate a general importance
of proficiency in the language of schooling for successful EFL learning on the part
of young learners.
One possible explanation for this particular finding in the present study might be
that children with good German competences benefit more from what teachers say
in German in the EFL classroom (even though teachers should predominantly speak
English). The DESI study (Helmke et al., 2007) conducted in Germany in 2003/2004
measuring among other aspects the proportion of English and German spoken in
year 9 English classrooms found that 84 % of all teacher utterances were in English.
However, correlations of the proportion of German/English in the classroom with
students’ performance in English were not reported. Unfortunately, the present
study did not collect data on the language of primary EFL teacher utterances. It
might be worth considering this aspect in future research studies in the field of early
EFL education.
The results of the present study should be interpreted cautiously, and it would be
ill-advised to hastily conclude ‘The earlier, the better’. A few limitations of the
study should be considered when discussing these results (Wilden et al., 2013,
pp. 194–196): On the one hand, the sample is not representative in spite of its being
large and standardized, for only children in one German federal state who attend one particular secondary school type (Gymnasium) in a multipartite school system were tested. Furthermore, the instruments used in the study cannot be
linked to any model of competence levels. On the other hand, the curricula have also
changed and there were considerable changes in EFL teacher education in NRW
which coincided with the introduction of early EFL education in primary schools.
These two aspects were not measured in the study; thus, it is not possible to say
whether they had an impact on the findings.
Nevertheless, the findings from this study seem to indicate that – in spite of some
of the arguments put forward in the German media controversy – early EFL educa-
tion from year 1 seems to ‘work’ as all children appear to benefit from the extended
learning time. However, whether the children learn ‘enough’ in the early EFL class-
room cannot be determined on the basis of this study. In any case, EFL teachers
ought to be concerned with fostering their pupils’ skills in the language of schooling
(here: German) in order to support their foreign language competences as well. This
could be done in accordance with a ‘language across the curriculum’ policy which
many schools pursue in order to develop pupils’ literacy skills in all school
subjects.
Against this background, further research is planned to complement this study by
(1) extending it to other secondary school types and federal states, and (2) conduct-
ing a longitudinal study on the medium and long-term developments of young EFL
learners based on tasks that are linked to a competence scale.
References
Adam-Schwebe, S., Souvignier, E., & Gold, A. (2009). Der Frankfurter Leseverständnistest (FLVT
5–6). In W. Lenhard & W. Schneider (Eds.), Diagnose und Förderung des Leseverständnisses
(pp. 113–130). Göttingen, Germany: Hogrefe.
Bacsa, É., & Csíkos, C. (2016). The role of individual differences in the development of listening
comprehension in the early stages of language learning. In M. Nikolov (Ed.), Assessing young
learners of English: Global and local perspectives. New York: Springer.
Benigno, V., & de Jong, J. (2016). The “Global Scale of English Learning Objectives for Young
Learners”: A CEFR-based inventory of descriptors. In M. Nikolov (Ed.), Assessing young
learners of English: Global and local perspectives. New York: Springer.
Bleyhl, W. (2000). Empfehlungen zur Verwendung des Schriftlichen im Fremdsprachenerwerb der
Grundschule. In W. Bleyhl (Ed.), Fremdsprachen in der Grundschule: Grundlagen und
Praxisbeispiele (pp. 84–91). Hannover, Germany: Schroedel.
Bleyhl, W. (2007). Schrift im fremdsprachlichen Anfangsunterricht – ein zweischneidiges Schwert.
Take off! Zeitschrift für frühes Englischlernen, 1, 47.
Börner, O., Engel, G., & Groot-Wilken, B. (Eds.). (2013). Hörverstehen. Leseverstehen. Sprechen:
Diagnose und Förderung von sprachlichen Kompetenzen im Englischunterricht der
Primarstufe. Münster, Germany: Waxmann.
Bourdieu, P. (1983). Die feinen Unterschiede: Kritik der gesellschaftlichen Urteilskraft (2nd ed.).
Frankfurt am Main, Germany: Suhrkamp.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Applications and data
analysis methods. Newbury Park, CA: Sage Publications.
Council of Europe. (2001). Common European framework of reference for languages. Strasbourg,
France: Council of Europe.
Diehr, B., & Rymarczyk, J. (Eds.). (2010). Researching literacy in a foreign language among pri-
mary school learners/Forschung zum Schrifterwerb in der Fremdsprache bei Grundschülern.
Frankfurt am Main, Germany: Peter Lang.
Doyé, P. (2008). Sprechen. Zuhören. Schreiben? Lesen? Gehört die Schrift in den
Fremdsprachenunterricht der Grundschule? Grundschule, 40(3), 53.
Duscha, M. (2007). Der Einfluss der Schrift auf das Fremdsprachenlernen in der Grundschule.
Dargestellt am Beispiel des Englischunterrichts in Niedersachsen. Doctoral dissertation,
Technische Universität Braunschweig, Germany. Retrieved from http://www.digibib.tu-bs.de/?docid=00021088
Elsner, D. (2007). Hörverstehen im Englischunterricht der Grundschule. Frankfurt am Main,
Germany: Peter Lang.
Enever, J. (Ed.). (2011). ELLiE. Early language learning in Europe. London: British Council.
Field, J. (2008). Listening in the language classroom. Cambridge, UK: Cambridge University
Press.
Frisch, S. (2011). Explizites und implizites Lernen beim Einsatz der englischen Schrift in der
Grundschule. In M. Kötter & J. Rymarczyk (Eds.), Fremdsprachenunterricht in der
Grundschule: Forschungsergebnisse und Vorschläge zu seiner weiteren Entwicklung
(pp. 69–88). Frankfurt am Main, Germany: Peter Lang.
Grabe, W., & Stoller, F. L. (2011). Teaching and researching reading (2nd ed.). New York:
Routledge.
Groot-Wilken, B. (2009). Design, Struktur und Durchführung der Evaluationsstudie EVENING in
Nordrhein-Westfalen. In G. Engel, B. Groot-Wilken, & E. Thürmann (Eds.), Englisch in der
Primarstufe – Chancen und Herausforderungen: Evaluation und Erfahrungen aus der Praxis
(pp. 124–139). Berlin: Cornelsen Scriptor.
Helmke, A., Helmke, T., Kleinbub, I., Nordheider, I., Schrader, F., & Wagner, W. (2007). Die
DESI-Videostudie. Der Fremdsprachliche Unterricht: Englisch, 90, 37–45.
Hu, A. (2003). Schulischer Fremdsprachenunterricht und migrationsbedingte Mehrsprachigkeit.
Tübingen, Germany: Narr.
Hu, A. (2004). Mehrsprachigkeit als Voraussetzung und Ziel von Fremdsprachenunterricht: Einige
didaktische Implikationen. In K.-R. Bausch, F. G. Königs, & H.-J. Krumm (Eds.),
Mehrsprachigkeit im Fokus – Arbeitspapiere der 24. Frühjahrskonferenz zur Erforschung des
Fremdsprachenunterrichts (pp. 69–76). Tübingen, Germany: Narr.
Hu, A. (2010). Mehrsprachigkeitsdidaktik. In C. Surkamp (Ed.), Metzler Lexikon
Fremdsprachendidaktik (pp. 215–217). Stuttgart, Germany: Metzler.
Husfeldt, V., & Bader-Lehmann, U. (2009). Englisch an der Primarschule. Lernstandserhebung
im Kanton Aargau. Kanton Aarau, Switzerland: Department für Bildung, Kultur und Sport.
Kerstan, T. (2008, December 17). No Murks, please. Stoppt den Fremdsprachenunterricht an
Grundschulen! Zeit Online. Retrieved from http://pdf.zeit.de/2008/52/C-Seitenhieb-52.pdf
May, P. (2006). Englisch-Hörverstehen am Ende der Grundschulzeit. In W. Bos & M. Pietsch
(Eds.), KESS 4 – Kompetenzen und Einstellungen von Schülerinnen und Schülern am Ende der
Jahrgangsstufe 4 in Hamburger Grundschulen (pp. 203–224). Münster, Germany: Waxmann.
Mayers, A. (2013). Introduction to statistics and SPSS in psychology. Harlow, UK: Pearson
Education Limited.
Abstract This case study looks at the results of students who took English as a foreign language achievement tests in Years 4–6 (ages 10–12) at Chongqing Nanping Primary School (CNPS) and analyzes them for the period 2010 to 2013. The students were divided into two groups according to the course books they used: PEP English and
Oxford English. The investigation of the test papers and scores of the students in the
two groups has yielded the following findings: (1) As shown by the test components, in the lower grades of both groups CNPS put more emphasis on speaking and listening than on comprehensive abilities; (2) For the language areas assessed, the PEP English Test prioritized vocabulary and grammar while the Oxford English Test devoted many items to assessing communicative skills; (3) Both groups had high achievers; however, students' performance showed a moderate decline as the grades went higher;
(4) In-depth interviews with teachers revealed that students and teachers were more
motivated in the Oxford English group. The test scores also indicate that this group
performed better than the PEP English group.
1 Introduction
In China, English has been offered from grade three (age 9) in elementary schools
since 2001. The New English Curriculum Standards (NECS, 2001b) and Basic
Requirements of English Teaching in Elementary School (BRETES, 2001a), which
were issued by the Ministry of Education of the People’s Republic of China, specify
2 Method
2.1 Participants
Participants were 498 students and seven English teachers at Chongqing Nanping
Primary School (CNPS). The students were randomly divided into two groups
according to different course books they would use in 2010 when they entered grade
4. The first group consisted of 304 students in six classes. The other 194 students
were put in the second group, comprising five classes. Table 1 offers information on the students.
In addition to the students, seven English teachers (including the vice principal,
Teacher 2) were interviewed. A demographic profile of the teachers is given below
in Table 2.
Four of the seven teachers (Teacher 1, Teacher 2, Teacher 3 and Teacher 4) participated in the construction of the respective English test papers taken by the students during 2010–2013. They were qualified as 'backbone' teachers at CNPS, which meant that they had received special training in teaching and assessing young learners.
2.2 Instruments
The instruments used in the present study included the course books the students used, the students' test scores and the teachers' feedback. The results of the test score analysis were interpreted by examining whether or not the course books exerted an influence, an interpretation which was then corroborated by the feedback provided by the teachers.
Table 1 Numbers of students in classes and the course books they used in grades 4, 5 and 6 in years 2010–2013
                     PEP English group           Oxford English group
Number of students   304                         194
Class numbers        Classes 1–6                 Classes 7–11
Course books         PEP English (Gong, 2003)    Oxford English (Shi, 2010)
Students used different course books: the PEP group adopted PEP English (Gong, 2003), which is published by the People's Education Press and widely used in public schools in China. The Oxford group used Oxford English (Shi, 2010), published by Shanghai Foreign Language Education Press; it was introduced from Britain and then adapted by members of the Committee of Curriculum Reform in Shanghai. As it is deemed more difficult than PEP English, Oxford English is less frequently used in primary schools. The purpose of using two English course books at CNPS was to examine their different impacts on students' learning interest and outcomes.
The achievement tests had two versions based on the course books. The PEP group
took the PEP English test, whereas the Oxford group took the Oxford English test.
Both tests comprised an oral and a written component; the oral test was administered face to face, whereas the written test was a traditional paper-and-pencil test. For
the 2010–2011 academic year, students of grade 4 were required to take the test to
move to grade 5. For simplicity, in the present research, we also refer to academic
year 2010–2011 as Year 4, 2011–2012 as Year 5 and 2012–2013 as Year 6.
The test papers were developed by backbone teachers in respective grades.
Usually, in late December each year, the vice principal called a meeting to brief
them about the requirements of test drafting. Then, after a week or so, the first draft
was produced, which then went through several editing phases before administra-
tion. The final versions of test papers were administered to students at the end of
each academic year (at the beginning of January). In the written test, some 30–40
students were allocated to each examination room, which was invigilated by an
external teacher (who did not know the students being tested). The written test lasted an
hour. The oral test took place (before or after the written test, depending on the testing schedule) in the teachers' offices (N = 10), where an examiner (the students' class teacher, who also played the role of interlocutor in the oral dialogue) evaluated the performance of the students in pairs or in threes (when the number of students was odd; the procedure was the same). The oral
test usually took 15–20 min for each group of test takers. Following the administra-
tion of tests, oral test scores were immediately reported back to the head of the
English department in each grade whereas written tests were graded (cross-graded
by English teachers from different grades) on the day after administration. A final
report card registering the numerical grades was sent to students and their parents.
The present study employs test item analysis. As the test was administered annu-
ally during 2010–2013 to students in the PEP and Oxford groups, altogether six test
papers were meticulously reviewed and analyzed in terms of the number and format of items and the language areas assessed. The students' scores on the tests were also computed and
interpreted. Distribution charts and graphs were produced using Excel.
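A brief sketch of this kind of descriptive score analysis (carried out by the authors in Excel; the file and column names below are hypothetical):

```python
# Sketch: descriptive analysis of test scores, analogous to the Excel work.
import pandas as pd

# Hypothetical layout: one row per student with group, year, oral and
# written scores (together totalling 100 marks)
scores = pd.read_csv("cnps_scores.csv")
scores["total"] = scores["oral"] + scores["written"]

# Mean and spread of total scores per group and academic year
print(scores.groupby(["group", "year"])["total"].agg(["mean", "std"]))

# Distribution of total scores in 10-mark bands, as in a distribution chart
bands = pd.cut(scores["total"], bins=range(0, 101, 10))
print(scores.groupby(bands, observed=False)["total"].count())
```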
Following the data analysis of test papers and scores, a semi-structured interview
was conducted with the teachers (N = 6) and the vice principal. The questions con-
cerned their views on test paper construction and the students’ performance. We
devised two groups of questions (nine in total), five of which were for the test writers (N = 4) only.
2.3 Procedures
To obtain the original test papers of both groups for the 3 years, a brief meeting was
arranged with the vice principal on Jan 12, 2014. During the meeting, she reviewed
the research proposal and agreed to be of assistance in gathering the test papers and
score reports. She also appointed the head of the English department as the liaison
between CNPS and the researchers.
A week later, a dozen test papers and score reports in JPG format were emailed to the researchers, which were then printed and reorganized. After that, the test papers were thoroughly reviewed and statistics on the types, formats and numbers of items were collected. The raw data acquired from the students' tests were then entered into a spreadsheet to be analyzed.
While examining the data, we identified problems and wrote them down. Concerning these issues, an interview outline was drafted. Next, interview questions were discussed and proposed, with nine open-ended questions established (see
Appendix 1). On May 7, 2014, face-to-face interviews were conducted with all the
seven teachers. Each interview lasted approximately half an hour depending on the
informants' responses. The interviews were carried out in Chinese so that both parties could express their ideas clearly, reducing the chance of any unnecessary misunderstanding. The feedback from each interviewee was written down immediately and the interviews were also recorded with the participants' consent. All data from the interviews were stored on a computer, transcribed and categorized according to the research questions. The transcripts were then read, compared and analyzed repeatedly, and conclusions were drawn. In the present study, some of the interviewees' words are quoted (translated from Mandarin Chinese by the researchers).
3 Results
Both the PEP English test and the Oxford English test consisted of oral and written
components, as displayed in Table 3.
Table 3 Marks allocated to oral and written test components in PEP and Oxford tests
               PEP                                          Oxford
Grade   Oral   Listening   Comprehensive skills     Oral    Listening   Comprehensive skills
4       50     20          30                       50      20          30
5       40     20          40                       40      20          40
6       0      30          70                       0       30          70
(Listening and comprehensive skills together form the written component.)
The total mark for each test paper was 100. No difference was detected in the component make-up in the same grade across the two tests. The ratio between the oral and written components was 50:50 in Grade 4 and 40:60 in Grade 5, and there was no oral test in Grade 6. The written tests comprised two sections: listening and comprehensive skills. The second section took up the larger share of the written test, with 60 % in grade 4, around 66.7 % in grade 5 and 70 % in grade 6. Within the whole test, this section also accounted for a high proportion of the marks, especially in grade 6.
The kind of test methods or formats used can affect test performances as much as
the abilities we want to measure (Brown, 1996). Thus, it is necessary to examine
them to see how they function in testing the students. Some common formats were
included in the items of both the PEP and Oxford English tests. In this part, the item formats
of the oral, listening and comprehensive skills sections are discussed.
In the oral section, the tasks ranged from reading a sentence or a passage and answering questions to doing a talent show, such as singing an English song or reciting an English poem. Table 4 describes the make-up of the oral section in terms of item format. Most marks in the oral section were allocated to reading aloud: up to 62.5 % in the grade 5 PEP test. The least frequently used format required speaking on a given topic.
Table 4 Marks allocated to and distribution of each item format in oral section
                                  Grade 4                    Grade 5
Formats                           PEP         Oxford         PEP           Oxford
Read aloud                        30 (60 %)   25 (50 %)      25 (62.5 %)   10 (25 %)
Talent show                       10 (20 %)   5 (10 %)       5 (12.5 %)    15 (37.5 %)
Dialogue with the interviewer     10 (20 %)   20 (40 %)      5 (12.5 %)    5 (12.5 %)
Describe pictures                 –           –              –             10 (25 %)
Speak on the given topic          –           –              5 (12.5 %)    –
Total                             50          50             40            40
In reading aloud items, students were given a few seconds to glance through an
extract (of 10–15 words) or a familiar passage in the textbooks before reading it out.
The risk, however, is that such items assess pronunciation rather than free speaking. After all, the ability to read aloud does not equal the ability to converse and communicate with another person. Indeed, Heaton (1988) points out that the backwash of this kind of item may be harmful. According to the NECS (2001b), on the other hand, reading aloud is necessary for beginners, familiarizing them with the sounds of English so that they can learn to read and speak English by osmosis. Yet the NECS does not specify whether reading aloud can or should be included.
Participating in a dialogue was the second most often used item type. A close
examination of these items reveals that the so-called dialogue was more of a single
question-answer sequence. For instance, many items were similar to the following:
Example 1 (taken from Oxford oral test, grade 5)
1. What did you have for breakfast/lunch/dinner yesterday?
Model Response: I had…yesterday.
2. What’s your favorite subject?
Model Response: My favorite subject is…
3. What’s the weather like today?
Model Response: It’s…
The examiners would first ask the question which was to be answered by the
students using words or sentences provided in Model Responses. When answering
question one, students only needed to produce the names of the food to provide the
information needed for scoring. After that, the conversation was terminated without
any feedback from the examiner who moved on to the next question immediately.
Thus, questions were unrelated and restricted both students and teachers to a drill
with no real communication, except for directing students’ attention to specific sen-
tence collocations. According to Heaton (1988), these items are strictly controlled,
lacking the essential element of constructive interplay with unpredictable stimuli
and responses, leaving no room for authentic and genuine interaction. However, for
beginners, these questions may successfully elicit vocabulary and formulaic expres-
sions. Once they have passed this phase, the complexity of the questions can be
increased and some unpredictability can be added.
The third most used format was talent show, which provided the students with a
stage to showcase their language-related skills and talents. When tested, students were required to perform solo. The time limit was 5 min, as in this example:
Example 2 (taken from PEP oral test, grade 4)
Item 4: Choose one of the favorite songs you have learned in class to perform.
As a traditional item in oral tests, singing or reciting occurred twice in PEP tests
and three times in Oxford tests over the 3 years. Students came to be tested knowing
what they were expected to do and prepared for it. However, when they recited texts
in class in order to do well on the oral tests, they relied on their memory as well as
their speaking skills. NECS (2001b) mentioned the importance of children reciting
materials in English without specifying whether orally or in writing.
The second least favored format of oral test items was describing pictures. First,
the students were given 1 min to study the picture in front of them. Then, they
described the picture in response to the examiner's question (for instance, How many people are there in the picture?). The description in this sense, however, was not creative in that the students were merely answering questions instead of structuring their own perceptions and putting them into words by themselves.
The least often used item was speaking on a given topic. Students were required
to give a short talk on a theme they chose. They were allowed a few minutes to pre-
pare and, in some cases, were provided with textbooks for reference. In the six test
papers, only the PEP test in Grade 5 adopted this item format, which listed five
available topics, one lifted from the textbook, the other four covering topics related
to the ones in the textbook. Although these tasks are useful for stimulating and provoking students' thinking and learning, such items pose great challenges for EFL learners, especially at the beginning stages (McKay, 2006).
In the listening section, there were three task types: phoneme discrimination, choosing an answer to a short question, and completing a passage. Table 5 shows that the first type was the most favored format in the listening sections of both the PEP and Oxford tests, except for the PEP test in Grade 6. Usually, children heard a word or sentence and had to decide which one of the three or four words or sentences printed in
the answer booklet corresponded to what they heard. Hence, these items not only
tested the ability to discriminate between the different sounds of a language but also
the knowledge of vocabulary. However, they may appear to be of limited use, mostly
for diagnostic purposes because the ability to distinguish between phonemes does
not in itself imply an ability to understand verbal messages in real life. In contrast,
the second type can be more suitable if we want to measure how well students can
understand samples of speech by interpreting and analyzing what they have heard.
As for the third type, a short written passage was provided with words omitted at
regular or irregular intervals; students were asked to listen to the text and to fill in the
Table 5 Marks allocated to and distribution of each item format in listening section
                                Grade 4                   Grade 5                   Grade 6
Formats                         PEP         Oxford        PEP         Oxford        PEP           Oxford
Phoneme discrimination          15 (75 %)   10 (50 %)     15 (75 %)   12 (60 %)     10 (33.3 %)   20 (66.7 %)
Choose an answer to
  a short question              5 (25 %)    5 (25 %)      5 (25 %)    8 (40 %)      15 (50 %)     10 (33.3 %)
Listen to complete a passage    –           5 (25 %)      –           –             5 (16.7 %)    –
Total                           20          20            20          20            30            30
missing words. Also referred to as “aural cloze” items, they focus more on students’
ability to detect sounds of the words being used (McKay, 2006). In fact, students who
do not possess appropriate literacy levels to understand the whole passage can write
the words down as they hear them, which resembles what they do in a dictation.
Some common item formats were found in the comprehensive skills section in both
the PEP and Oxford English tests: multiple-choice, true-false, matching, fill-in-the-blanks, short answer and essay. Table 6 demonstrates the differences in the weighting of each item format.
We can see that from grade four to six, the most frequently used item format in
both the PEP and the Oxford tests was multiple choice, followed by true-false, and
matching. Multiple choice items accounted for at least 35.7 % of all the test items and even made up half of the items in grade 4. However, Kohn (2000) claims multiple choice items are the "most damaging" type, limiting assessment to raw data and neglecting the most important features of learning, such as initiative, creativity, curiosity, and imagination. Despite the fact that these items run the risk of assessing mere recall of knowledge as well as guessing, they are an indispensable part of achievement tests, and if well designed, they can be applied to challenge students' higher levels of thinking (Berry, 2008).
The essay items pushed the task beyond discrete-point tests that measured small bits and pieces of a language to challenge students' higher-level cognitive skills (Brown, 1996). According to the NECS (2001b), an appropriate proportion of essay items can be introduced; however, no yardstick is offered for what this proportion should be. It was found that the least favored item format (especially in the PEP test) was essay. This might stem from the view that writing is age-inappropriate for young EFL learners, since it exerts far more cognitive demands on children than they can process (Weigle, 2002).
Thus, merely judging from the number of item formats, it is not possible to decide whether they are appropriate for the test takers without evaluating what is being tested. In fact, all of the above item formats have been applied widely in tests of young EFL learners and have proven useful (e.g., Hasselgren, 2005; McKay, 2006). It is therefore imperative to look at what the expectations are and what knowledge and skills students should possess to perform well on the tests. To answer this question, we have to study which areas of the English language are assessed in the tests. This is the focus of the next section.
A thorough review of the test papers and the items yielded Fig. 1, which shows
the difference between marks the PEP and the Oxford tests allocated to items assess-
ing different language areas.
Over the three grades, the weighting of the four language areas varied in both the PEP and the Oxford tests. Both tests focused on phonology items in the first 2 years, whereas vocabulary was highlighted in Grade 6. Grammar also secured its place in the test papers of both groups, with the PEP tests giving it a higher proportion. As for function-notion items, the Oxford tests devoted more items to assessing language use than the PEP tests in all three grades.
The last example (provided above) represented one of these items where students
were required to choose the most appropriate answer in a context. The item went
beyond language knowledge to assess students’ communicative ability. In this con-
text, students needed to understand how to report attitudes properly to the speaker
who was in trouble. All the three options were grammatically acceptable but only
one of them was appropriate in the context where the dialogue took place. The
appropriate response could only be chosen if students understood how to perform
the expressive function and express regret in western culture. Even students who had mastered the relevant language elements (the meaning of each option, for example) could still have chosen a wrong answer. An item like this offered the students authentic language and, though more demanding than retrieval or rote memorization of factual information, provided them with an opportunity to use the language. Such an item is acclaimed by Heaton (1988, p. 10) as “the best test of mastery of a language”. Hymes (1972) also points out that learners not only have to produce sentences that conform to grammar rules, but should also have the ability to use them appropriately in different contexts. Therefore, in an English test paper, it is necessary to
develop items with authentic materials in authentic contexts to serve a purpose,
which Ao (2002, p. 31) described as “observing if the learners have the competence
of using language to achieve the aims of communication.”
Fig. 1 Marks allocated by the PEP and Oxford tests to items assessing phonology, vocabulary, grammar and function-notion in Grades 4, 5 and 6
The students’ performance on the tests was described by their scores. Before we
discuss comparisons of the two test papers, it is necessary to take a look at the level
of difficulty.
The data we collected allowed us to estimate the mean level of difficulty (P) using the formula P = M/T (Yao & Duan, 2004), where M is the mean score of the students and T is the total score of the test paper (100 marks). The value of P ranges from 0 to 1: the higher the value, the easier the test paper. The M and P values for the 3 years of both the PEP and the Oxford tests are given in Table 7. The mean level of difficulty for both tests was relatively low, and it increased over the years. Although two different test papers were used, the levels of difficulty were comparatively close, with the PEP test paper showing a slightly (almost negligible) higher P value than the Oxford test. The highest level of difficulty was found for the PEP test paper of Year 6, where the P value reached 0.82.
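To make the computation concrete, the following minimal sketch in Python uses invented scores (the raw CNPS scores were not available); a mean score of 82 on a 100-mark paper reproduces the P value of 0.82 reported for the Year 6 PEP paper.

# Difficulty (facility) index P = M/T (Yao & Duan, 2004): the mean score M
# divided by the total marks T; the higher P, the easier the paper.
def difficulty_index(scores, total_marks=100):
    mean_score = sum(scores) / len(scores)
    return mean_score / total_marks

# Invented scores whose mean is 82, so P = 0.82 as reported for PEP Year 6.
sample_scores = [90, 85, 78, 70, 87]
print(difficulty_index(sample_scores))  # 0.82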
As for the score distribution, four bands were applied to see how students performed on the tests: 90–100, 80–89, 60–79, and below 60. Teachers at CNPS generally viewed students who scored in the first band as outstanding performers and those in the second band as good performers; those in the third band were considered poor performers, and students in the last band failed to achieve the required level.
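These band boundaries translate into a simple classification rule; the following sketch (a hypothetical helper, not part of the CNPS procedure) makes the cut-offs explicit.

# The four score bands used at CNPS, as described above.
def band(score):
    if score >= 90:
        return "outstanding (90-100)"
    elif score >= 80:
        return "good (80-89)"
    elif score >= 60:
        return "poor (60-79)"
    return "failed (below 60)"

print(band(93))  # outstanding (90-100)
print(band(58))  # failed (below 60)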
The vertical axis in Fig. 2 shows the number of students who scored in each band. In both the PEP and the Oxford groups, while the outstanding performers made up the largest proportion throughout the 3 years, their number declined over the years. As for good performers, their number grew steadily in both groups, with the PEP group outnumbering the Oxford group. Poor performers could be observed throughout the 3 years, with the lowest number appearing in the Oxford group in Year 4, when only ten students were counted. The number of students scoring below 60 (failed) increased gently every year. In Year 4, no students failed in either group, whereas at the end of primary school education (Year 6), 25 students (8.2 %) in the PEP group failed, the largest proportion; in the Oxford English group, seven students (3 %) failed.
Fig. 2 Number of students scoring in each band (90–100, 80–89, 60–79, below 60) in the PEP and Oxford groups in Years 4, 5 and 6
Fig. 3 Mean score attained for six classes in the PEP group
We computed the mean score for each class in the PEP and the Oxford groups, as
depicted in Figs. 3 and 4. The vertical axis denotes the mean score attained by the
different classes. There was a general trend of decline in the mean scores in all 11
classes as they entered higher grades. In the Oxford group, however, the situation
changed in Year 6: the mean scores in classes 8 and 10 increased slightly. Over 3
years, in the PEP group, the mean score ranged between 75 and 95; whereas it was
between 80 and 100 in the Oxford group. By this measure, it is safe to say that most students in the 11 classes performed well on the tests; those in the Oxford group, overall, performed better than their peers taking the PEP tests.

Fig. 4 Mean score attained for five classes in the Oxford group
We have described above the English tests of the PEP and the Oxford groups at CNPS, attempting to answer what the tests comprised, what test formats were used to assess which language areas, and how the students performed on the tests over 3 years. This section explores what the teachers and test developers had to say about the tests and how they scored them, and probes into some issues of test quality. Another focus is on the students’ performance: how they performed and why.
As was shown in Table 3, the oral test took up 50 % in grade 4, 40 % in grade 5, and
no oral component was used in grade 6. When asked why she included oral tests,
Teacher 1, an item writer, explained her belief as follows:
We (and I) believe… learning to “speak” English at critical ages would exert great influ-
ence on children’s EFL study. Thus, it’s necessary to develop oral tests to signal that oral
abilities are important.
(Teacher 1, interview extract, 05/07/2014)
This view concurs with the literature on children’s language learning indicating that oral abilities play a critical role (e.g., Hadley, 2001).
In scoring the oral tests, the scorers reported “following the gut”: they did not refer to any guidelines or rubrics in the grading process, which may compromise the reliability of the scores. However,
in Teacher 3’s understanding, this sacrifice was necessary to accommodate the needs of young language learners. He further added:
Learners of English at this age are very unlikely to speak English unless they are asked to.
So we should give them a break when assessing them, otherwise they will be discouraged.
(Teacher 3, interview extract, 05/07/2014)
In this vein, the oral tests served to please children rather than to assess them. Because of this, and fueled by the huge amount of time their administration consumed, some teachers proposed modifying the present oral tests, while others suggested cancelling them.
Another interesting observation is that Table 3 clearly depicts a general decrease in the number of items designed for oral tests, which, according to Teacher 2, was in line with how English teachers at CNPS prioritized their teaching goals.
The makeup of test items doesn’t come from nowhere. For example, in low grade, we believe
speaking should be given priority. In response, we develop a high proportion of these items
in grade 4. As students enter higher grades, we shift the focus to vocabulary and grammar.
Hence we design no oral test in grade 6.
(Teacher 2, interview extract, 05/07/2014)
The listening section occupied a large portion in the written tests of both the PEP
and the Oxford groups, accounting for 40 % in grade 4, around 33 % in grade 5 and
30 % in grade 6. The rationale for devising so many listening items, according to Teacher 4, was to:
…emphasize the input on the part of children so that …the likelihood of them producing
increased language output may not be a fantasy.
(Teacher 4, interview extract, 05/07/2014)
This understanding may find its root in theories of second language acquisition.
With insights gained from studies of child language acquisition, Byrnes (1984)
highlights the key role listening plays in the development of a learner’s second lan-
guage, particularly at the beginning stages of language development. Without the
input provided by listening at the right level, learning cannot begin (Nunan, 1999).
McKay (2006) also argues that “listening needs its own profile in assessment”
(p. 207) in that it plays an important role, not only in language learning, but also in
learning in general.
Despite their large number, most listening items (as shown in Table 5) were constructed to target students’ ability to discriminate between phonemes, with very little emphasis on processing at the semantic level to understand the meaning of an utterance. As Chastain (1979) put it, such items may be valid for testing conscious knowledge of the language, but they are not realistic indications of the ability to comprehend a spoken message. In real-life situations, even when listeners occasionally confuse particular pairs of phonemes, they can still use contextual clues to interpret what they hear. By this measure, the listening test was of a traditional kind, which Teacher 2 justified:
Listening poses much challenge to children…we didn’t use too many items to assess “how
well they understand a message”, not least because children are still limited in the ability
to use vocal keys to unlock the meaning of the communication.
(Teacher 2, interview extract, 05/07/2014)
It seems that the skill of “understanding a message” has given way to “recognizing and discriminating sounds”. But then, is “understanding a message” something we should expect from English learners at the beginning stage? Teacher 6 answered this question in the negative:
Should we not be more concerned with children understanding how English “sounds” than
what it means?
(Teacher 6, interview extract, 05/07/2014)
It was found that most items (35.7–50 %, as shown in Table 6) in this section of the
PEP and Oxford tests were multiple choice items. Why use these items? Teacher 1
offered her explanation:
We have a lot of content to cover in a test paper and multiple choice items can do that for
us. They can assess more topics than what can be squeezed into other forms of questions,
and also they are highly reliable and objective.
(Teacher 1, interview extract, 05/07/2014)
However, McKay (2006) cautions that some multiple-choice items elicit only a selected or limited response, hence they should be used with particular care with young learners. In the tests of the PEP and the Oxford groups, we found that up to 91 % and 83 %, respectively, of the items assessing grammar and vocabulary were designed as multiple-choice items. While such items assessing individual grammatical forms (e.g., third person singular) focus on accuracy, they do not involve children in purposeful, creative and spontaneous language use in a particular situation (McKay, 2006) because they lack contextual support and authenticity (Zhan, 2007). Likewise, Purpura (2004) commented that they are “old-fashioned
and out-of-touch with students’ language learning goals” (p. 253).
Williams (1984) pointed out that language use tasks similar to those used in the classroom (doing puzzles, solving problems, listening to and retelling stories, etc.) can be reused for the assessment of young learners. However, using these tasks for
assessment means more than handing students a piece of test paper. The administra-
tion may be more complex and impractical for teachers at CNPS, each of whom was
responsible for more than 40 or even 50 students. Besides, the scoring may be more
subjective than using multiple choice items. Considering both sides of the coin,
Teacher 3, when she was asked to make a choice, said:
I would still stick to multiple choices because they are more objective items. They make it
easier for us to ensure fairness in scoring children.
(Teacher 3, interview extract, 05/07/2014)
Her view is corroborated by Brown (1996, p. 29) who phrases this awareness as
“a tendency to seek objectivity” in assessment. But he also points out that many of
the elements of language courses may not be testable in the most objective test
types. For this reason, among others, CNPS devised a number of essay writing tasks
in both groups to assess how well students can use the English language to com-
municate meaning. These items often provided cue words/sentence pattern guid-
ance in the target or the source language to help students compose a short passage
on a topic. However, testing writing in primary school has been the subject of much
controversy. Teacher 5 voiced her doubts about constructing the essay items:
I understand the importance of writing. But we seem to follow the logic that since we have
listening and reading (input), there must be writing (output). And students might find it perplexing to put in so much effort when expected to write a passage, yet attain at most five marks.
(Teacher 5, interview extract, 05/07/2014)
Teacher 4 reported how some students came to her complaining about their low scores on the writing item:
Some students were so discouraged that they asked me why they were given a low score, but,
you know, actually, 80 % of the students get below three marks…we have so much to take
into consideration in the grading of writing, such as spelling, grammar, etc. Once we spot-
ted a mistake, 0.5 mark would be taken away.
(Teacher 4, interview extract, 05/07/2014)
As far as the assessed language areas are concerned (see Fig. 1), while both the PEP and the Oxford tests covered four language areas, the PEP tests focused more on the first three (i.e., phonology, vocabulary, grammar), whereas the Oxford tests used more items to assess the notion-function of language use. It could also be inferred that item writers for the PEP group took a struc-
tural approach to language testing, whereas those in the Oxford group adopted a
more communicative approach (Heaton, 1988). In grade six, for example, items
assessing notion-function were assigned as many as 25 marks in the Oxford English
tests. So why did the PEP and Oxford tests differ in the assessed language areas?
Teacher 4 interpreted this as a result of different textbooks and teaching.
We have to test what we teach and how we teach. Oxford English is structured in a way that
emphasizes the use of real-life and practical language while PEP highlights the importance
of flowing from words to sentences, then paragraphs.
(Teacher 4, interview extract, 05/07/2014)
In this sense, the differentiation reflected the respective textbooks and the meth-
odologies they followed. Therefore, the items were aligned with desired outcomes
defined in the textbooks. If so, then item writers in both groups did a good job. As
stated by Heaton (1988), when a more traditional, structural approach to language
learning has been adopted, the test should closely reflect such a structural approach.
The same goes for the communicative approach. A study by Li (2010) also reported that many local English tests in China at the primary stage assessed individual language performance depending on the curriculum to which pupils were exposed; thus the selection of test contents and materials was closely aligned with teaching objectives and teaching materials.
It is reasonable to state that test writers followed the guidance of the teaching materials to develop what they believed to be a good test, one which acted as an obedient servant since it followed and aped the teaching (Davies, 1968). However, Hughes (1989) proposed that testing cannot be expected merely to follow teaching. Instead, testing should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. According to communicative language testing theories, “bad teaching” only makes clear what learners know about the language, not how they use the target language in an appropriate context, and does little to help them apply language knowledge in meaningful communicative situations (Canale & Swain, 1980). To change that, using more items assessing the notion-function of language may facilitate good teaching and lead to better learning outcomes for children.
More and more researchers (e.g., Bachman, 1990; Berry, 2008; Shu, 2001) agree that sound testing entails the integration of validity and reliability to ensure its quality.
3.3.5.1 Validity
3.3.5.2 Reliability
However, attending to validity alone does not make a good test. According to Heaton (1988), for a test to be valid at all, it must be reliable as a measuring instrument. Reliability has to do with the consistency of an examinee’s performance on the test, i.e., the extent to which the results can be considered consistent or stable (Brown, 1996). Hughes (1989) points out that there are two components of test reliability: the consistency of candidates’ performance from occasion to occasion, and the reliability of the scoring. The first can be estimated with the test-retest method, in which the test in question is administered twice to the same group of students. Once completed, the pairs of scores for each student are lined up in two columns, and a Pearson product-moment correlation coefficient is calculated between the two sets of scores. The test-retest method has never been used at CNPS due to skepticism about the necessity and feasibility of administering the same test twice.
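As an illustration of the test-retest idea, the sketch below computes a Pearson product-moment correlation between two invented sets of scores for the same five students; the numbers are ours, not CNPS data.

# Test-retest reliability: correlate two administrations of the same test.
from math import sqrt

def pearson_r(xs, ys):
    # Pearson product-moment correlation between paired score lists.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

first_sitting = [78, 85, 92, 66, 88]    # invented scores, first administration
second_sitting = [75, 88, 90, 70, 85]   # invented scores, second administration
print(round(pearson_r(first_sitting, second_sitting), 2))  # 0.95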
Indeed, although researchers have demonstrated with many publicly used tests the importance of attending to reliability (e.g., Ao, 2002; Choi, 2008), few studies have probed small-scale tests; this, however, is no excuse for item writers not to bear in mind the factors affecting reliability.
When we asked what the teachers had done to keep reliability at a desirable level, the inquiry met with a detailed explanation of the test construction process. We summarized the three item writers’ words as follows, attempting to describe the process as clearly and briefly as possible:
We (item writers) would gather together several times to discuss details as to what to incor-
porate in the tests and how to distribute the weighting. Each one will assume responsibility
for one section of items. Following the completion of the first draft, the test paper will be
subject to critical scrutiny by another item writer. Then, it is sent to the Jiaoyanyuan (a leading figure in subject teaching in the district, appointed by the local educational authority to supervise and evaluate teaching at the school level) from the Teachers’ Training Institution in Nan’an District, who reviews the paper and offers suggestions in
regard to the paper quality. The final version will then be printed and prepared for
administration.
We also asked whether CNPS had given any thought to the second reliability, the
scorer reliability. In response, Teacher 2 said:
Because scores of students largely depend on the quality of their response against the cri-
teria set by the scorers, at the beginning of grading, I (or the head of the English depart-
ment) gather teachers from the same grade to discuss the scoring criterion, especially in the
case of subjective items. On some occasions, a detailed scoring key specifying acceptable
answers and assigning points will be given out to them.
(Teacher 2, interview extract, 05/07/2014)
Grading began only after the scorers agreed upon the criteria. The test papers were randomly distributed to each scorer. In the process of scoring, a leader (usually a backbone teacher) assumed responsibility for resolving doubts about the scoring standards. After the completion of grading, teachers were involved in producing “score reports” with a basic analysis of the data.
Nevertheless, it was found that not every scorer was willing to toil through such a rigorous grading procedure. Given their heavy reliance on personal judgment, oral tests and writing tasks placed huge demands on scorers. Oral tests, however, were graded in a more casual way to encourage students’ speaking. As for writing items, Teacher 3 filed his complaint:
It’s not like those high-stakes tests where the scores decide someone’s fate. Personally, I
don’t favor the idea of making a fuss in grading (writing) even if we are told to.
(Teacher 3, interview extract, 05/07/2014)
Despite such reservations, the teachers expressed willingness to improve test quality through training in quantitative analysis. Teacher 2 envisaged:
I hope professionals and experts in English language assessment will come to our rescue.
Even though we know little about some testing theories and statistical analysis, we are
never afraid of embracing the challenges when it means we can improve teaching and
learning.
(Teacher 2, interview extract, 05/07/2014)
Table 7 demonstrated that the mean levels of difficulty of the two tests were relatively close. Why, then, did the Oxford group perform better than the PEP group when they took tests of approximately the same level of difficulty? We asked the teachers and item writers what they thought, and we report three main reasons below.
3.3.6.1 Textbooks
It appears that teachers’ assessment and students’ performance in the PEP group were constrained by the textbook. When asked if the textbook was really the problem, Teacher 4 said:
We should not be shackled by textbooks. Actually, it is how we use them that determines our
teaching outcomes. I think we should induce change in our teaching…any adjustment can
be made possible if you embrace it. We all want the same things; don’t shackle yourself just
because the textbook says so.
(Teacher 4, interview extract, 05/07/2014)
3.3.6.2 Teaching Methods

The destination is the same, but the route taken makes a difference. While English proficiency was the learning outcome teachers intended for their students, the Oxford group approached it through language-focused activities and games.
Teachers used diversified and dynamic teaching methods to help students enjoy the
language-embedded activities. In contrast, few activities were introduced in the PEP classrooms, and some of these were not related to language. Teacher 1 commented
on how activities varied:
In PEP, we have to take much time to deal with words, sentence patterns as such. Sometimes,
we design activities just because students are tired from learning and we want to cheer them
up. But in Oxford, we integrate games in learning, and give children opportunities to prac-
tice language, which slides into their heads without them knowing.
(Teacher 1, interview extract, 05/07/2014)
As Chou (2012) pointed out, using games or other forms of play without a clear
objective related to language learning is likely to result in ineffective learning, in the
sense that the pupils will be unable to demonstrate what they have learned in class
through games. However, language-oriented and learner-centered games in the lan-
guage classroom can yield desirable results. McKay (2006) reports that language-
rich activities or games involving doing, thinking and moving can be used to provide
children with opportunities to listen and guess from the context, to risk using the
target language, and to engage in interactions. Therefore, it might be argued that
students in the Oxford group benefited more from carefully designed and language-
related activities than their peers in the PEP group and this is why they demon-
strated a higher level of English proficiency.
However, students in the Oxford group progressed at a slower pace at first: their good performances emerged only after a period during which students in the PEP group were already making strides ahead. According to Teacher 2:
At the beginning of learning, students find it a headache to keep up with the pace of learning
in Oxford textbooks because we have so many activities and things to learn. But as time
went by, they have displayed much higher English proficiency.
(Teacher 2, interview extract, 05/07/2014)
3.3.6.3 Motivation
From her words, we see that increasing content complexity was one of the inter-
nal factors demotivating students. This view is also supported by Teacher 2:
Knowledge covered in textbooks rolls like a snowball over the years. We encounter more
boring grammatical structures and vocabularies. Some students are afraid that … they are
unable to tackle the “hard” part, trying to run away from English and saying it is a demon.
(Teacher 2, interview extract, 05/07/2014)
Faced with this increase in content complexity, teachers tended to cut down on or omit fun activities and introduce more serious but boring tasks in higher grades so that they could deal with the “hard” parts in a step-by-step fashion. However, this may be the very reason why students became discouraged. A longitudinal study by Nikolov (1999) looked into how Hungarian children’s motivation changed over their 8 years of learning English. She found that for children (ages 6–14), intrinsically motivating activities, tasks and materials were among the most important motivating factors.
Deprived of time spent learning by doing and playing, students manifested negative attitudes towards learning English. Apart from this, having studied the subject for such a long time emerged as the second reason for the decline in motivation, as pointed out by Teacher 2:
Students become more impatient in classes. Some of their parents come to us, reporting that
their children have complained that they have studied more and longer than they could
handle.
(Teacher 2, interview extract, 05/07/2014)
This is corroborated by a study by Davies and Brember (2001), who measured the attitudes of second and sixth grade students using a smiley-face Likert scale. They
found that all participants harbored significantly less positive attitudes in the higher
grade, and concluded that the more years students spent studying a subject, the more
disenchanted they became with it.
The third factor has to do with the abolition of the general graduation examination in elementary school: since 2011, after graduating from primary school, children have automatically entered a neighborhood middle school without taking any form of exam or test. Since then, students have been under less pressure to fight for better grades. As Teacher 5 observed:
Without struggling through a formal examination to win a ticket to a middle school, some
students are slacking off in school, paying less attention during the class session and skip-
ping their homework.
(Teacher 5, interview extract, 05/07/2014)
4 Conclusions
This case study analyzed the achievement test results of the students tested in their
Years 4, 5 and 6 at CNPS between 2010 and 2013. Through examining the test
papers in the PEP group and the Oxford English group, we have answered questions concerning the components, item types and language areas measured in the two sets of tests. Then, by looking into the students’ scores, we have attempted to understand how the students performed on the two tests over 3 years and to investigate the differences between the two groups. A follow-up interview brought us closer to the beliefs upon which the teachers and test writers at CNPS built their teaching and testing.
The results document the commitment of teachers and administrators to catering
to children’s needs by developing well-scrutinized achievement tests. However, we
found that not all seven interviewees interpreted the test scores in a way that provides feedback on how students learn and how they perceive the learning process, and that then informs teaching in the best interests of their students. In addition, we have endeavored to look at how the children performed and to analyze the reasons contributing to their performance.
The study bears implications for using achievement tests to assess young EFL learn-
ers in elementary schools. The findings contribute to the body of evidence on how
primary schools apply language assessment and what can be done to refine test
papers and improve teaching, which entails teamwork where teachers, school
administrators as well as students themselves all play a part.
As indicated by the differences in students’ performance and motivation between the two groups, teachers are advised to reflect on how they use textbooks, how they teach, and how they develop good quality tests (see also Hsieh, 2016 in this volume). To
accommodate the young language learners’ age and personality, both teaching and
assessment need to be engaging and flexible, without intimidating children and
causing boredom. In addition, testing should not be limited to measuring the learn-
ing results, but also serve to provide feedback for teaching and support for learning.
As Berry (2008) points out, the paradigm of assessment should not only be of learning,
but more importantly, for and as learning, which “places special emphasis on the
role of the learner and highlights the use of assessment to increase learners’ ability
to control their own learning” (p. 9).
As teachers prepare themselves for these changes, the ensuing question is whether school administrators will support reform in teaching and assessment and welcome new ideas that might seem to undermine or even contradict what is prescribed by the education authorities and what is expected by those parents who care only about higher grades. Although the general graduation examination in elementary schools has been cancelled, the mindset of some school leaders and parents is still score-oriented. This in a way poses a threat to teachers exploring
better ways to serve pedagogy.
It should also be noted that students can participate in the assessment process by
providing feedback to teachers on how they feel towards the test, what they think is
difficult or easy. Information of this kind would give teachers some perspective on what the students have learned, whether the test has achieved the goals they set, and what to do in the next phase of teaching. However, it cannot substitute for the analysis of test papers and test scores.
The limitations of the present research are manifold. First, we were not able to conduct statistical analyses of test items, because teachers and school administrators did not store the raw scores or allow us to process them. This raises a pressing question: should schools like CNPS evaluate the scores of small-scale, school-based, non-public test papers using quantitative methods at all? However, from the in-depth interviews with teachers, we find that, despite their limited awareness of the need to check test paper quality, they expressed willingness to use what they called “high-above theory” to guide their test development. Some teachers have already
begun to consider issues like reliability and validity, and they are looking forward to
receiving training in assessment. It is hoped that teachers will use the expertise they
gained to create and administer tests, and to interpret the evidence they generate in
a scientific way, and eventually, they will be able to reflect on the findings in order
to change their practice.
Second, although we have managed to probe into the motivational factors exert-
ing influence on students’ performance through interviews, the participants were
only seven teachers. Some of their opinions could be personal and biased without
cross-checking with students what actually happened to them and what they truly
thought. Also, the extent to which motivational factors contributed to the differences in performance remains to be investigated.
Another drawback is that what has been explored at CNPS cannot be generalized
to the whole picture of English language tests in China at the primary level. Given
the specific learning context and the relatively small sample size, future research in other contexts and with a wider population of children of the same age group is strongly warranted.
Appendix 1
Interview Questions
References
Ao, S. (2002). English language testing in China: A survey of problems and suggestions for reform. Unpublished master’s thesis, Ghent University, Ghent, Belgium.
Bachman, L. F. (1990). Fundamental concepts in language testing. Oxford, UK: Oxford University
Press.
Bailey, A. L. (2005). Test review: Cambridge young learners English (YLE) tests. Language
Testing, 22(2), 242–252.
Berry, R. (2008). Assessment for learning. Hong Kong, China: Hong Kong University Press.
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.
Butler, Y. G., & Lee, J. (2010). The effects of self-assessment among young learners of English.
Language Testing, 27(1), 5–31.
Byrnes, H. (1984). The role of listening comprehension: A theoretical base. Foreign Language
Annals, 17(4), 317–329.
Cameron, L. (2001). Teaching languages to young learners. Cambridge, NY: Cambridge University
Press.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1(1), 1–47.
Chastain, K. D. (1979). Testing listening comprehension tests. TESOL Quarterly, 13(1), 81–88.
Chik, A., & Besser, S. (2011). International language test taking among young learners: A Hong
Kong case study. Language Assessment Quarterly, 8(1), 73–91.
Choi, I. C. (2008). The impact of EFL testing on EFL education in Korea. Language Testing, 25(1),
39–62.
Chou, M. H. (2012). Assessing English vocabulary and enhancing young English as a foreign
language (EFL) learners’ motivation through games, songs, and stories. Education, 3(13),
1–14.
Davies, A. (Ed.). (1968). Language testing symposium: A psycholinguistic perspective. London,
UK: Oxford University Press.
Davies, J., & Brember, I. (2001). The closing gap in attitudes between boys and girls: A 5-year
longitudinal study. Educational Psychology, 21(1), 103–114.
Fleurquin, F. (2003). Development of a standardized test for young EFL learners. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 1(1), 1–23.
Gardner, S., & Rea-Dickins, P. (2001). Conglomeration or chameleon? Teachers’ representations
of language in the assessment of learners with English as an additional language. Language
Awareness, 10(3), 161–177.
Gong, Y. F. (2003). PEP primary English students’ book. Beijing, China: People’s Education Press.
Gronlund, N. E. (1993). How to make achievement tests and assessments. Needham Heights, MA:
Allyn & Bacon.
Hadley, O. (2001). Teaching language in context. Boston, MA: Heinle & Heinle.
Halliwell, S. (1992). Teaching English in the primary classroom. London: Longman.
Hasselgren, A. (2005). Assessing the language of young learners. Language Testing, 22(3),
337–354.
Heaton, J. B. (1988). Writing English language tests. New York: Longman.
Hsieh, C.-N. (2016). Examining content representativeness of a young learner language assess-
ment: EFL teachers’ perspectives. In M. Nikolov (Ed.), Assessing young learners of English:
Global and local perspectives. New York: Springer.
Hughes, A. (1989). Testing for language teachers. Cambridge, UK: Cambridge University Press.
Hymes, D. H. (1972). On communicative competence. In Sociolinguistics: Selected readings
(pp. 269–293). Harmondsworth, UK: Penguin.
Jacobs, L. C., & Chase, C. I. (1992). Developing and using tests effectively: A guide for faculty.
San Francisco: Jossey-Bass.
Jia, G. J. (1996). Psychology of foreign language education. Nanning, China: Guangxi Education
Publishing House.
Kohn, A. (2000). The case against standardized testing: Raising the scores, ruining the schools.
Portsmouth, NH: Heinemann.
Koretz, D. M. (2002). Limitations in the use of achievement tests as measures of educators’ pro-
ductivity. Journal of Human Resources, 37(4), 752–777.
Lasagabaster, D. (2011). English achievement and student motivation in CLIL and EFL settings.
Innovation in Language Learning and Teaching, 5(1), 3–18.
Li, S. L. (2010). The communicative English testing framework for students at primary stage.
Unpublished master’s thesis, Gannan Normal University, Jiangxi, China.
McKay, P. (2006). Assessing young language learners. Cambridge, UK: Cambridge University
Press.
Ministry of Education. (2001a). Basic requirements of English teaching in elementary school.
Beijing, China: Beijing Normal University Publishing Group.
Ministry of Education. (2001b). The new English curriculum standards. Beijing, China: Beijing
Normal University Publishing Group.
Ministry of Education. (2011). Standard of English curriculum for basic education. Beijing,
China: Beijing Normal University Publishing Group.
Morrow, K. (2012). Communicative language testing. In C. Coombe & B. O’Sullivan (Eds.), The
Cambridge guide to second language assessment (pp. 140–146). Cambridge, NY: Cambridge
University Press.
Nikolov, M. (1999). ‘Why do you learn English?’ ‘Because the teacher is short.’ A study of Hungarian children’s foreign language learning motivation. Language Teaching Research, 3(1), 33–56.
Nikolov, M. (2016). Trends, issues and challenges in assessing young language learners. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Nunan, D. (1999). Second language teaching & learning. Oxford, UK: Heinle & Heinle Publishers.
Phelan, C., & Wren, J. (2005). Exploring reliability in academic assessment. Retrieved January 15,
2014, from http://www.uni.edu/chfasoa/reliabilityandvalidity.htm
Pinter, A. (2006). Teaching young language learners. Oxford, UK: Oxford University Press.
Purpura, J. (2004). Assessing grammar. Cambridge, UK: Cambridge University Press.
Shi, J. P. (2010). Oxford English. Shanghai, China: Shanghai Foreign Language Education Press.
Shohamy, E. (2001). The power of tests: Critical perspectives on the uses of language tests.
Harlow, UK: Longman.
Shu, Y. X. (2001). The theories and methods of foreign language testing. Beijing, China: World
Book Publishing Company.
Soto, L. D. (1988). The motivational orientation of higher- and lower-achieving Puerto Rican children. Journal of Psychoeducational Assessment, 6(3), 199–206.
Vale, D., & Feunteun, A. (1995). Teaching children English, a training course for teachers of
English to children. Cambridge, NY: Cambridge University Press.
Weigle, S. (2002). Assessing writing. Cambridge: Cambridge University Press.
Weir, C. J. (1990). Communicative language testing. London: Prentice Hall.
Wilkins, D. A. (1974). Second-language learning and teaching. London: Edward Arnold.
Williams, M. (1984). A framework for teaching English to young learners. In C. Brumfit, J. Moon,
& R. Tongue (Eds.), Teaching English to children. Harlow, UK: Longman.
Yao, J., & Duan, H. C. (2004). Research on the difficulty distribution of test papers based on the
Monte Carlo method. Computer Applications and Software, 21(9), 66–67.
Zhan, C. F. (2007). The multiple-level English testing framework (MLETF) for English teaching at
primary stage. Unpublished master’s thesis, Guangxi Normal University, Guangxi, China.
Jing Peng is a professor in the Research Centre of Language, Cognition and Language Application, Chongqing University, China. She teaches English Curriculum and Instruction to MA students at the College of Foreign Languages and Cultures. Her areas of interest include teacher education, teacher development, and ELT methodology. In recent years she has carried out school, regional and national projects for primary and secondary schools and published extensively in the areas of ELT curriculum innovations, methodology, and teacher education. She has also been involved in a teacher education program sponsored by the Tin Ka Pin Foundation, Hong Kong, and in projects supported by the British Council.
Shicheng Zheng is a graduate student majoring in English Curriculum and Instruction at Chongqing University. He has been working as a teaching assistant at the College of Foreign Languages and Cultures. His areas of interest include curriculum, teaching methodology, and teacher talk. He has carried out several studies on curriculum development and methodologies of teaching English as a foreign language.
Individual Learner Differences and Young
Learners’ Performance on L2 Speaking Tests
Abstract This chapter focuses on motivation and self-concept and their role in oral
production in early learning of English as a foreign language. A review of major
research findings considering the relationship of these individual learner differences
and oral performance by young foreign language learners is followed by presenta-
tion and discussion of the study the author carried out with Croatian learners of
English as a foreign language. The participants, aged 11 at the start and 14 at the end
of the study, were followed for 4 years. Each year their motivation and self-concept
were measured by means of smiley questionnaires and oral interviews, while their
oral production was elicited each year through picture description tasks and per-
sonal oral interviews. The study offers interesting evidence of the dynamics of
young learners’ motivation and self-concept and their relationship with their devel-
oping oral performance. Implications of the findings are considered as well.
1 Introduction
Although children are still commonly thought to be highly similar to each other where language learning is concerned, research into individual learner differences has recently extended to young L2 learners as well. Thus, major publications in the
early L2 learning field increasingly include sections on how young language learn-
ers differ in their approach to L2 learning as well as in various aspects of the lan-
guage learning process and learning outcomes (e.g., Enever, 2011; Muñoz, 2006;
Murphy, 2014; Nikolov, 2009a, 2009b). Attitudes and motivation of young L2
learners have perhaps been investigated the most extensively leading to whole vol-
umes devoted to the topic (e.g., Heinzmann, 2013). Some attention has been paid to
young learners’ language aptitude (e.g., Alexiou, 2009; Kiss, 2009; Kiss & Nikolov,
2005), learning strategies (e.g., Kubanek-German, 2003; Lan & Oxford, 2003;
Mihaljević Djigunović, 2002; Šamo, 2009; Tragant & Victori, 2006), attributions
(e.g., Julkunen, 1994), language anxiety (e.g., Low, Brown, Johnstone & Pirrie,
1995; Seebauer, 1996) and self-concept (Julkunen, 1994; Mihaljević Djigunović,
2014). In some studies interactions between different individual learner characteris-
tics as well as with some contextual factors were also investigated.
In the present study we focus on young learners’ oral performance in English as
L2 and two individual differences: motivation and self-concept. While the relation-
ship of L2 achievement with the first learner factor has been the focus of interest for
some time now, self-concept has only recently caught the attention of young learner
researchers.
Most empirical studies suggest that there is a significant relationship between
motivation and language learning achievement. Thus, Harris and Conway (2002)
report on more motivated Irish young learners of French, German and Italian being
more successful at these languages than their less motivated peers. Such a positive
relationship has been found in other studies, and has been shown to be evident with
learners as young as four (e.g., Bernaus, Cenoz, Espı & Lindsay, 1994) as well as
with 14-year-olds (e.g., Bagarić, 2007; Dörnyei, Csizér & Németh, 2006). However,
the relationship seems to be quite complex once we take into account different types
of measures of learning outcomes, or age and learning experience of young L2
learners as well as types of motivation. Studies have, thus, shown that motivation is
less strongly correlated with objective measures of language achievement than with
teacher-assigned grades or with learner self-assessment (Masgoret, Bernaus &
Gardner, 2001). Tragant and Muñoz (2006) have found motivation to be more sig-
nificantly related to performance on integrative than discrete-point measures. Quite
a few studies (e.g., Graham, 2004; Masgoret & Gardner, 2003; Tragant & Muñoz,
2000) have indicated that correlations of motivation with language achievement
tend to decrease with increasing age of learner.
Mercer (2011) defines the L2 learner’s self-concept as ‘an individual’s self-
description of competence and evaluative feelings about themselves as a Foreign
Language (FL) learner’ (p. 14). Highlighting the importance of L2 self-concept,
Arnold (2007) says that ‘(l)earners must both be competent and feel competent.’
(p. 18). Due to the common belief that young learners have a positive self-perception
as if by default, until recently this affective learner variable was not considered a
relevant topic in the early L2 learning field. However, with increasing interest in
researching young learners the young L2 learner’s self-concept has become a poten-
tially important variable which could offer deeper insight into early L2 learning
processes. Harter (2006) claims that children tend to develop too positive self-
perceptions because it is difficult for them to distinguish between their real and ideal
selves. Based on self-rating of their abilities, Pinter (2011) calls young L2 learners
‘learning optimists’. Damon and Hart (1988) suggest that young learners’ self-
knowledge becomes more complex as they mature. Kolb (2007), however, claims
that children possess quite high awareness of their L2 learning process and entertain
complex language learning beliefs: they base these on their learning experiences
and personal knowledge. Studies by Wenden (1999) and Mihaljević Djigunović and
Lopriore (2011) also suggest that young learners are capable of participating in
reflective activities and providing relevant and important data on their L2 learning
process. Mihaljević Djigunović and Lopriore have found that young L2 learners
display both inter- and intra-learner variability in their L2 self-concept. In her com-
parative study of children who started L2 learning earlier (at age 6) and those who
started later (at age 9), Mihaljević Djigunović (2016) has found that the develop-
ment of L2 self-concept of earlier and later starters follows different trajectories.
In the past two decades or so assessment of early L2 learning outcomes has
focused on different aspects of language achievement. Among these a number of
studies have been dedicated to researching the mastery of some or all of the four
language skills (e.g., García Mayo & García Lecumberi, 2003; Harris & Conway,
2002; Low, Duffield, Brown & Johnstone, 1993; Mihaljević Djigunović & Vilke,
2000; Nikolov & Józsa, 2006). Assessment of the speaking skill is not easy to carry
out on larger samples, hence many studies do not include it. Different tasks have
been used to test the speaking skills in different studies (see also Nikolov, 2016;
Hung, Samuelson & Chen, 2016 in this volume). Thus, Low et al. (1993) used
paired interviews, and found that Scottish young learners of French and German
showed different rates of progress in speaking.
Harris and Conway (2002) tested the speaking skills of Irish young learners of
French, German, Italian and Spanish by means of a complex task which tested both
listening and speaking. The speaking part involved responding to the examiner’s
questions about the pupils themselves and to questions based on a picture of a fam-
ily having a birthday party at a restaurant. The findings indicated that achievement
was connected to the young learners’ attitudes and motivation.
Studying the speaking skills of learners of Irish, Harris, Forde, Archer, Fhearaile
and O’Gorman (2006) designed a complex speaking test which was meant to mea-
sure communication, fluency of oral description, vocabulary, control of the mor-
phology of verbs, prepositions, qualifiers and nouns, and syntax of statements in
speaking. The communication component consisted of question and reply sequences
which resulted in the pupil’s telling the examiner about their life, and of role-plays
carried out by pairs of pupils.
Medved Krajnović (2007) tested the speaking skills of Croatian year 8 (age
13–14 years) and year 12 (age 17–18 years) learners of English as L2 using a set of
tests developed in Hungary (Fekete, Major & Nikolov, 1999). In the case of the year 8
participants these included first answering a set of personal questions, then describ-
ing a picture and relating it to a personal experience, followed by role-playing (with
the examiner) three different age-appropriate life situations. In the case of the year 12 participants, the third task was replaced by a different one: the participants were presented with five statements on which people had different opinions; they then
had to choose one and offer four reasons why they thought people agreed or dis-
agreed with the statement. All oral performances were assessed along four criteria:
task achievement, vocabulary, accuracy and fluency. Both subsamples scored lower
on accuracy than on the other dimensions. Positive attitudes and motivation were
found to correlate with the oral performance of all the participants.
Hoti, Heinzmann and Müller (2009) designed a similar speaking test for their 3rd
grade learners of English as L2 in Switzerland: the first part included personal ques-
tions to the pupil and picture description, while the second part involved role-
playing as a speaking task performed by two pupils. The authors analyzed the young
participants’ oral production taking into account task fulfillment, the participants’
interaction strategies, complexity of the utterances produced and vocabulary range.
Their findings indicated that whereas the third graders’ attitudes proved to be a sig-
nificant explanatory factor of their speaking skills, motivation and self-concept
emerged as unimportant in this context.
The study described in this chapter was carried out with Croatian young learners of
English as L2. A long tradition of early learning of foreign languages is character-
istic of the Croatian context. A foreign language has been a compulsory part of
the Croatian primary curriculum for more than seven decades now (Vilke, 2007).
For years the starting point was grade 5 (age 10–11 years), then grade 4, and since
2003 it has been the beginning of primary education, that is grade 1 (age 6–7 years).
English, French, German and Italian have traditionally been offered. Recently the
most popular choice has, as in many other contexts, been English. Thus, estimates indicate that over 85 % of first graders learn English, over 10 % German, while
French and Italian are present in very small numbers (Medved Krajnović & Letica
Krevelj, 2009). Those young learners who start with a language other than English
are required to take it from grade 4, so no learner exits primary school without hav-
ing had English classes (National Framework Curriculum, 2001).
Exposure to English is currently extensive, especially through the media (e.g.,
undubbed TV programmes with subtitles). Croatian users of English have a lot of
opportunity to use it with foreign visitors (e.g., business people or tourists) and can
often hear or see English expressions (e.g., advertisements in shopping malls).
The study is part of the Croatian national research project entitled Acquiring English
from an early age: Analysis of learner language (2007–2013) (for more details see
Mihaljević Djigunović & Medved Krajnović, 2016). The project was sponsored by
the Croatian Ministry of Science, Education and Sport. Motivational factors were
investigated in a number of earlier projects carried out with Croatian young learners
of English (Mihaljević Djigunović, 1993, 1995; Mihaljević Djigunović & Bagarić,
2007; Vilke, 1976, 1982), and each time their relevance for language learning achievement was underscored. The Croatian young learners’ self-concept was looked into
in a longitudinal study carried out as part of the ELLiE (Early Language Learning
in Europe) project (for details see Enever, 2011; www.ellieresearch.eu). The study
(Mihaljević Djigunović, 2014) suggested that young learners’ self-concept is a com-
plex and dynamic learner characteristic which interacts with other relevant indi-
vidual as well as contextual factors.
3.1 Aims
3.2 Sample
There were 24 participants included in the study: 12 boys and 12 girls. They were
drawn from four primary schools. In terms of language learning ability, they included four high-ability, four average and four low-ability boys, and the same numbers of girls. The level of ability was estimated by their respective teachers of English. We
followed them for 4 years: from grade 5 (age: 11 years) to grade 8 (age: 14 years).
They had all started learning English a year before, when they were in grade 4,
which means that we studied their motivation, self-concept and oral production
from their second to their fifth year of learning English as L2. Their L1 was Croatian.
3.3 Methodology
The instruments used to elicit data on the young learners’ motivation and self-concept were adopted from the ELLiE study. The participants’ motivation was measured by means of smiley questionnaires and oral interviews. Towards the end of
each year they were asked to indicate in the smiley questionnaire how much they
liked learning English and how much they liked learning new English words. The
latter item was introduced because it had been shown that learning new words can
be an important source of motivation in early L2 learning (Szpotowicz, Mihaljević
Djigunović & Enever, 2009). In the annual interviews the participants were asked
which school subject was their favourite. In earlier projects on early L2 learning in
Croatia (Mihaljević Djigunović, 1993, 1995; Vilke, 1982) it was found that such a question was a good indicator of young learners’ motivation.
Oral production was elicited each year through a picture description task and a personal oral interview. In one of the picture description tasks, the picture showed a kitchen; the mother could be seen through the kitchen window hanging the washing. The partici-
pants were then asked to compare the kitchen in the picture to their own kitchen at
home, to say whether they would like to have the kitchen like the one in the picture,
as well as to describe their ideal kitchen.
The two parts of each speaking test were assessed separately by two independent
raters. Each part of the participant’s oral production was assessed along the follow-
ing four criteria: task achievement, vocabulary, accuracy and fluency. A maximum
of five points could be assigned per criterion. The points were determined on the
basis of the extent to which the participant met the national curricular targets for
each grade.
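As a concrete illustration of this scoring scheme, here is a minimal sketch of how two raters' analytic scores for one subtask might be aggregated. The four criteria, the 5-point maximum and the two independent raters come from the description above; the function, the averaging rule and the sample ratings are hypothetical, not the authors' documented procedure.

```python
# Illustrative sketch of aggregating analytic speaking-test scores.
# The criteria and the 5-point maximum follow the study; the averaging
# rule and all ratings below are hypothetical.
CRITERIA = ("task_achievement", "vocabulary", "accuracy", "fluency")

def subtask_score(rater1: dict, rater2: dict) -> float:
    """Average the two raters' points (max 5 per criterion), summed across criteria."""
    assert set(rater1) == set(rater2) == set(CRITERIA)
    return sum((rater1[c] + rater2[c]) / 2 for c in CRITERIA)

# Hypothetical ratings for one learner's picture-description subtask:
r1 = {"task_achievement": 5, "vocabulary": 4, "accuracy": 3, "fluency": 4}
r2 = {"task_achievement": 4, "vocabulary": 4, "accuracy": 3, "fluency": 5}
print(subtask_score(r1, r2))  # 16.0 out of a possible 20.0
```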
3.4 Results

Below we first present the results concerning the individual learner differences we measured. This is followed by a presentation of the participants' performance on the oral tests. Finally, we present the interactions between individual differences and achievements on the tests.
[Fig. 1: Motivation scores across grades 5–8]
Most studies on motivation suggest that motivation is not a stable variable. With
young learners it is usually intrinsic at the start and connected with motivating class-
room activities (Nikolov, 2002, 2009b) and the teacher (Nikolov, 1999; Vilke,
1995). Low levels of motivation are usually associated with uninspiring teaching or
unfavourable conditions in which L2 is taught (Mihaljević Djigunović, 2009). With
increasing length of learning, the classroom seems to become less inspiring for young
learners. There may be multiple reasons for this. In contexts where there is high
exposure to L2, learners may find it hard to connect the L2 they are learning in
school with what they are exposed to outside school. Unfortunately, many teachers
fail to integrate the L2 knowledge which their learners bring to the L2 classroom. It
is possible as well that learners’ interest switches to the new subjects which are
introduced in later grades, as is the case with the Croatian curriculum for grades 6
and 7. Also, during the early teens young learners enter puberty and have to deal
with new challenges. The rise in motivation from grade 7 in our sample may be the
result of the young learners getting more mature and realising the value of knowing
English. Their motivation may be getting more instrumentally oriented and may
reflect awareness that all school marks are important for their entry into secondary
education (which, in Croatia, takes place after grade 8).
As far as the participants’ self-concept is concerned, its developmental trajectory
was different from that of motivation. Their self-concept peaked in grade 6, and then
steadily decreased (see Fig. 2).
If we take into consideration Pinter’s (2011) observation that young learners can
be considered ‘learning optimists’, it seems that the young learners in this study
became more realistic after grade 6. Teacher feedback, marks in English as well as
comparison with classmates probably influenced their self-perception during the
fourth year of learning English. It is interesting that, although self-concept is
generally thought to be a good predictor of motivation, in this study the trajectories
of these two learner characteristics are different.
[Fig. 2: Self-concept scores across grades 5–8]
As Fig. 3 indicates, the participants’ overall oral performance was lowest in grade 5
and highest in grade 6. After grade 6 it slowly decreased during grade 7 and remained
at more or less the same level in grade 8. It is interesting to observe that changes in
oral performance seem to follow the self-concept developmental pattern, which
suggests that their self-concept was realistic.
The lowest performance in grade 5 can perhaps be attributed to the participants' limited experience with describing pictures, a task type they encountered for the first time. It is very likely that they had practised describing pictures in class only after they were gradu-
ally familiarized with the relevant structures and vocabulary through guided class-
room activities.
Besides the overall scores on the speaking tests in each grade, we looked into
how the young participants scored on each of the four criteria (task achievement,
vocabulary, accuracy and fluency) in the two subtasks (picture description and per-
sonal conversation) taken together and taken separately each year.
3.4.2.1 Task Achievement

As can be seen in Figs. 4 and 5, task achievement was quite high over the 4 years.
In fact, there was only one participant who did not manage to complete the two
subtasks (consistently in 3 of the 4 years). These results confirm that the young
learners in this study were generally able to engage in communication in English
at the level set out in the national curriculum. It is interesting to note that in
[Fig. 4: Overall scores on task achievement across grades 5–8]
Fig. 5 Scores on task achievement separately for the two subtasks. TA task achievement, PD picture description, INT interview
grades 6 and 7 task achievement was higher in the personalized interview than in the
picture description task. It is possible that the participants were more eager to talk
about themselves in those grades. It could also be assumed that free conversation in
the interview subtask was facilitated by the preceding practice in oral production in
picture description.
3.4.2.2 Vocabulary
The overall vocabulary range (Fig. 6) was also good. Interestingly, in grades 5 and
8 it was higher in the personalized interview part than in the picture description task.
A possible explanation may be that the questions asked in these grades were more
stimulating in terms of vocabulary range (Fig. 7).
[Fig. 6: Overall scores on vocabulary across grades 5–8]
Fig. 7 Separate scores on vocabulary for the two tasks. V_PD vocabulary in picture description, V_INT vocabulary in personalized interview
3.4.2.3 Accuracy
[Fig. 8: Overall scores on accuracy across grades 5–8]
Fig. 9 Scores on accuracy separately for the two tasks. A_PD accuracy in picture description, A_INT accuracy in personalized interview
3.4.2.4 Fluency
Overall fluency is the dimension of the speaking skills that showed the most
consistent development over the 4 years (Fig. 10). It progressed in parallel in the
two subtasks. It may be assumed that fluency increases with speaking practice
(Fig. 11).
[Fig. 10: Overall scores on fluency across grades 5–8]
Fig. 11 Scores on fluency separately for the two subtasks. F_PD fluency in picture description, F_INT fluency in personalized interview
As in Haenni Hoti et al. (2009), motivation did not emerge in this study as a significant factor in explaining the development of the young learners' speaking skills.
There were no significant correlations between motivation and oral performance on
the speaking tests.
However, contrary to the Swiss study, which found that self-concept was not an
important factor in terms of explaining 3rd graders’ speaking skills, in our study the
young learners’ L2 self-concept proved to be important. Many of the correlations
we computed were found to be statistically significant. The strongest correlation
between overall oral performance and self-concept was found in grade 5 (r = .693,
p = .001), and it was also significant in grade 6 (r = .450, p = .046) and grade 7
(r = .498, p = .038). In grade 8, however, the correlation was not significant (r = .254,
p = .293). This shows that the relationships weakened over the years.
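For readers who want to reproduce this kind of analysis, a minimal sketch of a Pearson correlation in Python follows; the two score arrays are invented placeholders, not the study's data.

```python
# Sketch: Pearson correlation between self-concept and overall oral performance.
from scipy.stats import pearsonr

self_concept = [3.2, 4.1, 2.8, 3.9, 4.5, 3.0, 3.7, 4.0]        # hypothetical
oral_score = [12.0, 16.5, 10.0, 15.0, 18.0, 11.5, 14.0, 15.5]  # hypothetical

r, p = pearsonr(self_concept, oral_score)
print(f"r = {r:.3f}, p = {p:.3f}")  # significant if p < .05
```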
In Tables 1, 2, 3, and 4 below we show the correlations of the participants’ self-
concept in each grade with the four criteria along which we assessed the oral perfor-
mances in the respective years.
As can be seen in Table 1, all the correlations were statistically significant, with
L2 self-concept being the most strongly associated with fluency. In grade 6 (Table 2)
all correlations were significant except the one with accuracy. It is interesting to
observe that these significant coefficients were lower than those in grade 5. The fol-
lowing year the pattern was similar: only the correlation with accuracy was non-
significant (Table 3). In the final year (Table 4) no significant correlations were
established with any of the four criteria.
The correlational analyses suggest that self-concept is more important in earlier than in later years, and learners seem to associate their self-concept more with task achievement and fluency than with the other two criteria. We assume that other individual learner factors (e.g., willingness to communicate, anxiety) emerge as more relevant in later years and cancel out the linear relationship between L2 self-concept and oral performance.
4 Conclusions
The findings of the study described above offer, first of all, further evidence that
young learners’ motivation and self-concept are unstable affective learner variables,
and that their oral production is also characterised by inter- as well as intra-learner variability as they progress from year to year. The interactions these variables enter into are dynamic,
too. Contrary to most previous research we found that L2 achievement as reflected
in learners’ oral production need not be related to motivation as conceptualised in
this study. It seems that it might be useful to define motivation of young L2 learners
at more specific levels than is usually done. Perhaps it would be more revealing if task motivation were used as a measure when looking into the interaction of motivation with the speaking skills of L2 learners aged between 9 and 14 years. The relevance of
L2 self-concept comes as no surprise, but it seems worth noting that its interaction
with speaking skills is not linear, but more complex and dynamic than so far
assumed. Our findings suggest that L2 self-concept is more strongly associated with
the quality of oral performance during earlier years than later, and that the accuracy
dimension of oral production is the first to show non-significant relationships
between the two variables.
The findings of this study are based on a rather small sample and do not allow us to
venture more definite conclusions. Before generalisations are made about the relationship of motivation and self-concept with young L2 learners' oral performance, our findings should be verified on a larger sample. In future research it would be
useful to examine how the relationships we found are impacted by classroom prac-
tices; how teachers value, for example, fluency over accuracy in their feedback, or
how peers react to one another. Including other measures of motivation might prove
useful too. Perhaps it might be good to also include other individual learner factors
which may be relevant for the development of speaking skills, such as willingness
to communicate or language anxiety. Comparing young learners’ achievement in
the other language skills may also be revealing and could be a fruitful focus in
future research.
Classroom teachers can benefit from the insights presented in this chapter in a num-
ber of ways. The evidence the study offers of the dynamics of young learners’ moti-
vation and self-concept during the primary years can help teachers raise their
awareness of how their learners feel and, as a result, understand better their lan-
guage learning behaviour. The speaking skill is complex and hard to master and
requires a lot of time and effort on the part of both teachers and learners. Based on
the findings of the current study, teachers may try to design classroom activities that
would be more aligned with their learners' affective needs. By doing so, teaching
may become more inspiring and offer the scaffolding young learners may need at
different points during their early years of learning English.
References
Alexiou, T. (2009). Young learners’ cognitive skills and their role in foreign language vocabulary
learning. In M. Nikolov (Ed.), The age factor and early language learning (pp. 46–61). Berlin/
New York: Mouton de Gruyter.
Arnold, J. (2007). Self-concept as part of the affective domain in language learning. In F. Rubio
(Ed.), Self-esteem in foreign language learning (pp. 13–29). Newcastle, UK: Cambridge
Scholars Publishing.
Bagarić, V. (2007). English and German learners’ level of communicative competence in writing
and speaking. Metodika, 14, 239–257.
Bernaus, M., Cenoz, J., Espí, M. J., & Lindsay, D. (1994). Evaluación del aprendizaje del inglés en niños de cuatro años: Influencias de las actitudes de los padres, profesores y tutores [Assessment of EFL learning in four-year-old children: Impact of parent, teacher and guardian attitudes]. APAC of News, 20, 6–9.
Damon, W., & Hart, D. (1988). Self-understanding in childhood and adolescence. New York:
Cambridge University Press.
Dörnyei, Z., Csizér, K., & Németh, N. (2006). Motivation, language attitudes, and globalisation:
A Hungarian perspective. Clevedon, UK: Multilingual Matters.
Enever, J. (Ed.). (2011). ELLiE: Early language learning in Europe. London, UK: The British
Council.
Fekete, H., Major, É., & Nikolov, M. (Eds.). (1999). English language education in Hungary: A
baseline study. Budapest, Hungary: British Council Hungary.
García Mayo, M. P., & García Lecumberri, M. L. (Eds.). (2003). Age and the acquisition of English
as a foreign language. Clevedon, UK: Multilingual Matters.
Graham, S. (2004). Giving up on modern foreign languages? Students’ perceptions of learning
French. The Modern Language Journal, 88, 171–191.
Haenni Hoti, A., Heinzmann, S., & Müller, M. (2009). “I can you help?”: Assessing speaking skills
and interaction strategies of young learners. In M. Nikolov (Ed.), The age factor and early
language learning (pp. 119–140). Berlin/New York: Mouton de Gruyter.
Harris, J., & Conway, M. (2002). Modern languages in Irish primary schools: An evaluation of the national pilot projects. Dublin, Ireland: Institiúid Teangeolaíochta Éireann.
Harris, J., Forde, P., Archer, P., Nic Fhearaile, S., & O'Gorman, M. (2006). Irish in primary schools: Long-term national trends in achievement. Dublin, Ireland: Department of Education and Science.
Harter, S. (2006). The self. In W. Damon, R. M. Lerner, & N. Eisenberg (Eds.), Handbook of child
psychology. Vol. 3. Social, emotional, and personality development (pp. 505–570). New York:
Wiley.
Heinzmann, S. (2013). Young language learners’ motivation and attitudes. London: Bloomsbury
Academic.
Hung, Y.-J., Samuelson, B. L., & Chen, S.-C. (2016). The relationships between peer- and self-
assessment and teacher assessment of young EFL learners’ oral presentations. In M. Nikolov
(Ed.), Assessing young learners of English: Global and local perspectives. New York: Springer.
Julkunen, K. (1994). Gender differences in students’ situation- and task-specific foreign language
learning motivation. In S. Tella (Ed.), Näytön paikka: Opetuksen kulttuurin arviointi [Evaluating
the culture of teaching] (pp. 171–180). Helsinki, Finland: University of Helsinki, Department
of Teacher Education.
Kiss, C. (2009). The role of aptitude in young learners’ foreign language learning. In M. Nikolov
(Ed.), The age factor and early language learning (pp. 253–276). Berlin/New York: Mouton de
Gruyter.
Kiss, C., & Nikolov, M. (2005). Preparing, piloting and validating an instrument to measure young
learners’ aptitude. Language Learning, 55(1), 99–150.
Kolb, A. (2007). How languages are learnt: Primary children’s language learning beliefs.
Innovation in Language Learning, 2(1), 227–241.
Kubanek-German, A. (2003). Frühes intensiviertes Fremdsprachenlernen: Bericht zur wissenschaftlichen Begleitung eines Modellprojekts des Kultusministeriums des Freistaates Sachsen [An intensified programme for teaching modern languages to children by the Saxon Ministry of Education: Research report]. Braunschweig/Dresden.
Lan, R., & Oxford, R. (2003). Language learning profiles of elementary school students in Taiwan.
IRAL, 41, 339–379.
Low, L., Brown, S., Johnstone, R., & Pirrie, A. (1995). Foreign languages in primary schools:
Evaluations of the Scottish pilot projects 1993-1995. Final report. Stirling, Scotland: Scottish
CILT.
Low, L., Duffield, J., Brown, S., & Johnstone, R. (1993). Evaluating foreign languages in primary
schools. Stirling, Scotland: Scottish CILT.
Masgoret, A., Bernaus, M., & Gardner, R. C. (2001). Examining the role of attitudes and motiva-
tion outside of the formal classroom: A test of the mini-AMTB for children. In Z. Dörnyei &
R. Schmidt (Eds.), Motivation and second language acquisition (pp. 281–295). Honolulu, HI:
The University of Hawaii Second Language Teaching and Curriculum Center.
Masgoret, A., & Gardner, R. C. (2003). Attitudes, motivation, and second language learning: A
meta-analysis of studies conducted by Gardner and associates. Language Learning, 53,
123–163.
Medved Krajnović, M. (2007). Kako hrvatski učenici govore engleski?/How well do Croatian
learners speak English? Metodika, 14, 173–190.
Medved Krajnović, M., & Letica Krevelj, S. (2009). Učenje stranih jezika u Hrvatskoj: politika,
znanost i javnost [Foreign language learning in Croatia: policy, research and the public]. In
J. Granić (Ed.), Jezična politika i jezična stvarnost [Language policy and language reality]
(pp. 598–607). Zagreb, Croatia: Croatian Association of Applied Linguistics.
Mercer, S. (2011). Towards an understanding of language learner self-concept. Dordrecht/
Heidelberg/London/New York: Springer.
Mihaljević Djigunović, J. (1993). Investigation of attitudes and motivation in early foreign lan-
guage learning. In M. Vilke & Y. Vrhovac (Eds.), Children and foreign languages I (pp. 45–71).
Zagreb, Croatia: Faculty of Philosophy.
Mihaljević Djigunović, J. (1995). Attitudes of young foreign language learners: A follow-up study.
In M. Vilke & Y. Vrhovac (Eds.), Children and foreign languages II (pp. 16–33). Zagreb,
Croatia: Faculty of Philosophy.
Mihaljević Djigunović, J. (2002). Language learning strategies and young learners. In B. Voss &
E. Stahlheber (Eds.), Fremdsprachen auf dem Prüfstand. Innovation-Qualität-Evaluation
(pp. 121–127). Berlin: Pädagogischer Zeitschriftenverlag.
Mihaljević Djigunović, J. (2009). Individual differences in early language programmes. In
M. Nikolov (Ed.), The age factor and early language learning (pp. 199–225). Berlin/New
York: Mouton de Gruyter.
Mihaljević Djigunović, J. (2014). Developmental and interactional aspects of young EFL learners’
self-concept. In J. Horváth & P. Medgyes (Eds.), Studies in honour of Marianne Nikolov
(pp. 53–72). Pécs, Hungary: Lingua Franca Csoport.
Mihaljević Djigunović, J. (2015). Individual differences among young EFL learners: Age- or
proficiency-related? A look from the affective learner factors perspective. In J. Mihaljević
Djigunović & M. Medved Krajnović (Eds.), Early learning and teaching of English. New
dynamics of primary English (pp. 10–36). Bristol, UK: Multilingual Matters.
Mihaljević Djigunović, J., & Bagarić, V. (2007). A comparative study of attitudes and motivation
of Croatian learners of English and German. Studia Romanica et Anglica Zagrebiensia, 52,
259–281.
Mihaljević Djigunović, J., & Lopriore, L. (2011). The learner: do individual differences matter? In
J. Enever (Ed.), ELLiE: Early language learning in Europe (pp. 29–45). London: The British
Council.
Mihaljević Djigunović, J., & Medved Krajnović, M. (Eds.). (2015). Early learning and teaching of
English: New dynamics of primary English. Bristol, UK: Multilingual Matters.
Mihaljević Djigunović, J., & Vilke, M. (2000). Eight years after: Wishful thinking vs facts of life.
In J. Moon & M. Nikolov (Eds.), Research into teaching English to young learners (pp. 66–86).
Pécs, Hungary: University Press Pécs.
Muñoz, C. (Ed.). (2006). Age and the rate of foreign language learning. Clevedon, UK: Multilingual
Matters.
Murphy, V. A. (2014). Second language learning in the early school years: Trends and contexts.
Oxford: Oxford University Press.
National Framework Curriculum (2001). Zagreb: Ministry of Science, Education and Sport.
Nikolov, M. (1999). “Why do you learn English?” “Because the teacher is short.” A study of
Hungarian children’s foreign language learning motivation. Language Teaching Research,
3(1), 33–56.
Nikolov, M. (2002). Issues in English language education. Bern, Switzerland: Peter Lang.
Nikolov, M. (Ed.). (2009a). Early learning of modern foreign languages. Processes and outcomes.
Bristol, UK: Multilingual Matters.
Nikolov, M. (Ed.). (2009b). The age factor and early language learning. Berlin/New York: Mouton
de Gruyter.
Nikolov, M. (2016). A framework for young EFL learners’ diagnostic assessment: Can do state-
ments and task types. In M. Nikolov (Ed.), Assessing young learners of English: Global and
local perspectives. New York: Springer.
Nikolov, M., & Józsa, K. (2006). Relationships between language achievements in English and
German and classroom-related variables. In M. Nikolov & J. Horváth (Eds.), UPRT 2006:
Empirical studies in English applied linguistics (pp. 197–224). Pécs, Hungary: Lingua Franca
Csoport, PTE.
Pinter, A. (2011). Children learning second languages. Basingstoke, UK: Palgrave Macmillan.
Šamo, R. (2009). The age factor and L2 reading strategies. In M. Nikolov (Ed.), Early learning of
modern foreign languages: Processes and outcomes (pp. 121–131). Bristol, UK: Multilingual
Matters.
Seebauer, R. (1996). Fremdsprachliche Kompetenzen und Handlungskompetenzen von Grundschullehrern: Empirische Evidenz und Neuorientierung [Foreign language competence and professional competence of primary school teachers: Empirical evidence and new directions]. Praxis des neusprachlichen Unterrichts, 43(1), 81–89.
Szpotowicz, M., Mihaljević Djigunović, J., & Enever, J. (2009). Early language learning in Europe:
A multinational, longitudinal study. In J. Enever, J. Moon, & U. Raman (Eds.), Young learner
English language policy and implementation: International perspectives (pp. 141–147).
Reading, UK: Garnet Publishing Ltd.
Tragant, E., & Muñoz, C. (2000). La motivación y su relación con la edad en un contexto escolar
de aprendizaje de una lengua extranjera. [Motivation and its relationship to age in language
learning in the school context]. In C. Muñoz (Ed.), Segundas lenguas. Adquisición en el Aula
(pp. 81–105). Barcelona, Spain: Ariel.
Tragant, E., & Victori, M. (2006). Reported strategy use and age. In C. Muñoz (Ed.), Age and the
rate of foreign language learning (pp. 208–236). Clevedon, UK: Multilingual Matters.
Vilke, M. (1976). The age factor in the acquisition of foreign languages. Rassegna Italiana di
Linguistica Applicata, 3, 179–190.
Vilke, M. (1982). Why start early? In R. Freudenstein (Ed.), Teaching foreign languages to the very
young (2nd ed., pp. 12–28). Oxford: Pergamon Press.
Vilke, M. (1995). Children and foreign languages in Croatian primary schools: Four years of a
project. In M. Vilke & Y. Vrhovac (Eds.), Children and foreign languages II (pp. 1–16). Zagreb, Croatia: University of Zagreb.
Vilke, M. (2007). English in Croatia: A glimpse into past, present and future. Metodika, 8(14),
17–24.
Wenden, A. (1999). An introduction to metacognitive knowledge and beliefs in language learning:
Beyond the basics [Special Issue]. System, 27, 435–441.
The Role of Individual Differences
in the Development of Listening
Comprehension in the Early Stages
of Language Learning
Abstract This chapter discusses the results of a longitudinal project examining the
development of listening comprehension and the role of individual differences in this
process in an early language learning context. We aimed at exploring how language
learning aptitude, motivation, attitudes, the use of listening strategies, beliefs about
language learning and listening anxiety as decisive variables of individual differ-
ences (Dörnyei, AILA Rev 19:42–68, 2006; Lang Learn 59(1):230–248, 2009;
Mihaljević Djigunović, Role of affective factors in the development of productive
skills. In: Nikolov M, Horváth J (eds) UPRT 2006: empirical studies in English
applied linguistics. Lingua Franca Csoport, Pécs, pp 9–23, 2006; Individual differ-
ences in early language programmes. In: Nikolov M (ed) The age factor and early
language learning. Mouton de Gruyter, Berlin, pp 198–223, 2009) relate to each
other and to the learners’ performances on listening measures. The main objective of
the present study is to explore and identify the internal structure, roles and relation-
ships of individual variables in the development of early language learners’ listening
comprehension based on a multi-factor dynamic model of language learning (Gardner
& MacIntyre, Lang Teach 26:1–11, 1993) and its reinterpretation (Dörnyei, The rela-
tionship between language aptitude and language learning motivation: Individual
differences from a dynamic systems perspective. In: Macaro E (ed) Continuum com-
panion to second language acquisition. Continuum, London, pp 247–267, 2010).
A total of 150 fifth and sixth graders (11–12-year-olds; 79 boys and 71 girls) from
ten school classes in Hungary participated in the research. The findings are in line
with the predictions of the theoretical framework: the variables of individual
differences are themselves multifactor constructs, the components are in constant
interaction with each other and with their environment, thus changing and creating
a complex developmental pattern.
The results of the two-phase assessment project clearly indicate that language
aptitude defined as one of the main cognitive factors and parents’ education are
strong predictors of listening performance. The affective factors (e.g., listening anx-
iety) also contribute to the performance on the listening tests, but their rates change
over time and they are sensitive to the context of language learning. Beliefs and
emotions are interrelated and they also play a decisive role in the development of
listening skills in the early years of language learning. Consequently, what the
learners think or believe about language learning and how they feel about it influ-
ence the learners’ achievement in listening comprehension. In our model, these
beliefs are rooted in the students’ social background (parents’ education) and lan-
guage aptitude, and this relationship runs exactly counter to the direction displayed in Gardner and MacIntyre's (Lang Teach 26:1–11, 1993) model.
1 Introduction
In recent decades, the study of affective factors in second language learning has
gained significant ground in addition to the research of cognitive variables, which,
according to researchers of the field, could considerably contribute to the under-
standing and interpretation of individual differences (Dörnyei, 2006, 2009; Gardner, 1985; Gardner & MacIntyre, 1992, 1993; Mihaljević Djigunović, 2006, 2009). The underlying question of the research has been: what might be the main cause of significant variance in the achievement of students from similar backgrounds in similar circumstances? Hence, individual differences became the focus of study in the field,
originally covering two subfields, language aptitude (e.g., Hasselgren, 2000; Kiss &
Nikolov, 2005; Ottó, 2003; Sáfár & Kormos, 2008; Skehan, 1998) and motivation
for language learning (e.g., Dörnyei, 1998, 2001; Gardner, 1985; Heitzmann, 2009;
Martin, 2009; Nikolov, 2003a). Later on, research on learning styles (Dörnyei &
Skehan, 2003) and language learning strategies (e.g., Cohen, 1998; Griffiths, 2003;
Mónus, 2004; Nikolov, 2003b; O’Malley & Chamot, 1990; Oxford, 1990; Wenden
& Rubin, 1987) also received more attention. Yet the question remained: what could account for individual differences where no significant variance is perceived in internal and external circumstances? One possible explanation might be self-perception, which fostered the investigation of variables such as attitude to language learning,
anxiety, interest and beliefs (e.g., Bacsa, 2012; Brózik-Piniel, 2009; Csíkos & Bacsa,
2011; Csizér, Dörnyei, & Németh, 2004; Dörnyei & Csizér, 2002; Hardy, 2004;
Matsuda & Gobel, 2004; Spinath & Spinath, 2005; Tóth, 2008, 2009; Yim, 2014).
It is widely accepted that foreign language proficiency does not solely result
from language teaching, but it is the outcome of several factors related to student
achievement. Moreover, the majority of these factors are not static but change
dynamically over time. It is also clear that these factors are not independent of one another but affect learning outcomes in interaction with each other (Dörnyei,
2006, 2009, 2010; Gardner & MacIntyre, 1993; Nikolov & Mihaljević Djigunović,
2006, 2011). Research on individual differences in language learning used to study
the relationships between single variables and learning outcomes in general.
However, recent studies have had a much narrower scope, targeting one skill area.
Hence, the subfields of research on motivation, anxiety and learning strategies in
reading, writing, listening and speaking skills have been developed (e.g., Goh,
2008; Kormos, 2012; Woodrow, 2006).
In our research we focus on listening comprehension in the early stages of
English as a foreign language (EFL) learning. The review of the relevant literature
suggests that listening comprehension is a cornerstone of early language learning,
since it is based on the processes of first language acquisition, relying primarily on
memory, where language input is provided largely through listening (MacWhinney,
2005; Skehan, 1998). The development of listening comprehension is vital to
achieving verbal expression and well developed communicative competence, since
high level speech production presupposes highly developed listening comprehen-
sion (Dunkel, 1986; Mordaunt & Olson, 2010). In addition, rapidly spreading digital technology is redefining language teaching by providing promising opportunities for listening to authentic language sources. However, research in the context of the
present study found that listening comprehension was one of the most neglected
areas of language teaching even though primary school language teaching ought to
focus on listening and speaking skills (Bors, Lugossy, & Nikolov, 2001).
The present research is novel in the field of early language learning in that it is
the first survey that investigates the development of listening comprehension skills
in interaction with the multicomponent construct of individual variables, and applies
diagnostic measures of the development of listening comprehension in a school context for testing for learning in addition to testing of learning (Alderson,
2005; McKay, 2006; Sternberg & Grigorenko, 2002).
First, we provide a theoretical background to the survey; then, we describe the
methods and the procedure of the research that is followed by the discussion of find-
ings and their theoretical and pedagogical implications.
2 Literature Review
The terms early language learning and young language learners appear more and more frequently in the literature on foreign language learning and instruction. Amongst
other aspects, research is targeting the specifics of childhood foreign language
learning, the optimal time of start and the effective methods of teaching. Having
reviewed the relevant literature of the recent years, Nikolov and Mihaljević
Djigunović (2006, 2011) emphasize the importance of further research in the field
due to the increased interest in early language learning in Hungary and across the
globe. This interest is based on the widespread assumption held not only by research-
ers that starting language learning early is directly related to its success: “the
younger the better”. However, several empirical studies support “the claim that
younger learners are more efficient and successful in all respect and at all stages of
SLA is hard to sustain in its simple form” (Nikolov, 2000, p. 41; see details in Halle,
Hair, Wandner, McNamara, & Chien, 2012; Larson-Hall, 2008; Mihaljević
Djigunović, 2010; Moon & Nikolov, 2000; Nikolov, 2009; Nikolov & Curtain,
2000; Nikolov & Mihaljević Djigunović, 2006, 2011).
Researchers agree that young learners’ development significantly differs from
that of older children and adults. Krashen (1985) distinguishes language acquisition
and language learning. He claims that foreign language acquisition is mainly
instinctive, resembling the acquisition of the mother tongue, whereas language
learning is a conscious process typical after puberty.
Several models have been constructed to describe language proficiency (e.g.,
Bachman & Palmer, 1996; Canale & Swain, 1980; CEFR, 2001). In Hungary, the 2003 revision of the National Core Curriculum was the first to define the concept of usable language knowledge in addition to describing the objective of language teaching:
The objective of foreign language learning is to establish communicative linguistic compe-
tence. The concept of communicative linguistic competence is identical with usable lan-
guage knowledge. It means the ability to use adequate language in various communicative
situations. Its assessment and evaluation is possible in the four basic language skills (listen-
ing comprehension, speaking skills, reading comprehension and writing skills). (p. 38)
Nikolov (2011) outlined the theoretical framework of the assessment and devel-
opment of English language proficiency for early language learners in grades 1–6,
for children between the ages of 6 and 12. She highlighted that the assessment of
English language proficiency has to account for language knowledge as a compre-
hensive and complex construct corresponding to the level of the learners’ knowl-
edge and their age specifics (also see Nikolov, 2016 in this book).
Several studies point out that traditional summative, exam-like performance
measurements are not appropriate for this age group (Inbar-Lourie & Shohamy,
2009; McKay, 2006). Tasks are needed that can provide feedback to teachers and learners about the level of the learners' language development, their strengths and weaknesses, thus outlining the path for successful future development. In other
words, assessment for learning, conducted by teachers in the classroom and embedded into their daily developmental work, is gaining ground in addition to the practice of external evaluation, which mainly targets accountability, i.e. assessment of learning (Lantolf & Poehner, 2011; Nikolov & Szabó, 2011a).
The most important objective of assessment for learning is to positively influence
the learning process by scaffolding young learners’ language development in the
process of using measurement and feedback. However, assessment must not be
restricted to tasks measuring language knowledge, but it has to provide feedback on
other domains, like language learning strategies and motivation as they dynamically
influence the process of early language learning (Nikolov & Szabó, 2011a).
Assessment can effectively support development only if assessment and develop-
ment are in a dynamic relationship; the two have to work together as a single process for future development (Sternberg & Grigorenko, 2002).
Understanding speech in one’s mother tongue seems simple and effortless; how-
ever, in a foreign language it involves difficulties, sometimes causing frustration, and is a source of significant stress for many learners (Chang & Read, 2007). Foreign
language listening comprehension is an invisible mental process, which is difficult
to describe precisely. The learner has to distinguish the sounds, understand vocabu-
lary and grammatical structures, interpret the stress and tone of speech, keep in
mind what has been said and interpret what has been heard in the socio-cultural context (Vandergrift, 2012). Listening comprehension is rather poorly represented in
research on foreign language learning, despite being a crucial skill: it is the first skill acquired in the mother tongue as well as in early language learning.
Research in cognitive psychology revealed that listening comprehension is more
than a mere extraction of meaning from the incoming verbal text. It was found to be the process in which speech is linked to the lexical knowledge one has already acquired (Vandergrift, 2006, 2012). Hence it is obvious that listening com-
prehension goes beyond the perception and processing of acoustic signals. This skill
has been described in various ways in recent models. The currently most widely
accepted cognitive psychological approach perceives it to be a hierarchically struc-
tured interactive process. The interactive model of Marslen-Wilson and Tyler (1980)
is based on the assumption that the recognition of words simultaneously involves bottom-up processes, where information derives from the uttered word itself, and top-down processes, where information is deduced from contextual triggers (Eysenck & Keane, 2005). Hence, speech recognition can be described as a two-
directional process; on the one hand, bottom-up, when learners activate their lin-
guistic knowledge (sounds, grammatical rules etc.) to understand the message, on
the other hand, top-down, when learners activate their contextual prior knowledge
(topic, direct context, text type, cultural information etc.) to understand the message.
At the same time, listening comprehension does not only work top-down or bottom-
up, but is composed of the interaction of the two processes, since the listener uses
both prior contextual and linguistic knowledge to comprehend the message. The
rate of activation between these two processes depends on the linguistic knowledge,
familiarity with the topic and the objective of the listening task (Vandergrift, 2012).
According to Field (2004), the two processes cannot be considered alternatives to each other, since their relationship is a much more complex interdependency.
In recent decades, communicative and competence-based language teaching has
emphasized listening comprehension and its implications for teaching methodol-
ogy. All methods prioritize listening comprehension, since it is much more fre-
quently used than the other skills. Learners need to spend a significant amount of
time listening to speech in the target language and they need to comprehend what
they listen to (Mordaunt & Olson, 2010).
Dunkel (1986, p. 100) points out that we need to “put the horse (listening com-
prehension) before the cart (speech production)” in order to achieve a high level of
communicative competence. In other words, high level of speech production pre-
supposes a high level of listening comprehension. Hence, the task of language
teachers is to present their learners with a wide variety of listening comprehension
tasks (also see Wilden & Porsch, 2016 in this volume).
Foreign language listening comprehension is heavily influenced by the level of
listening comprehension in the mother tongue. Simon’s (2001) findings revealed a
close relationship between achievements of listening comprehension in L1 and in a
foreign language. The development of listening comprehension is not an end in itself, since well-developed listening comprehension significantly enhances the development of other skills (Richards, 2005; Rost, 2002).
According to Mihaljević Djigunović (2009), the term individual differences, although widely used, still represents a rather loose concept, and different authors list different learner characteristics as individual differences. She collected the most frequently listed variables in recent publications: (1) intelligence,
(2) aptitude, (3) age, (4) gender, (5) attitude and motivation, (6) language anxiety,
(7) learning style, (8) learning strategies and (9) willingness to communicate.
Others highlight some significant domains instead of giving extensive lists of
individual differences. Dörnyei (2009) mentions four important variables: (1)
Motivation refers to the direction and extension of student behaviour, including the
choice of the learner, intensity of learning and endurance. (2) Ability of language
acquisition refers to the capacity and quality of learning. (3) Learning style includes
the way of learning. (4) Learning strategies are located halfway between learning
style and motivation, indicating the proactivity of the learner in selecting the learn-
ing path. “Thus the composite of these variables has been seen to answer why, how long,
how hard, how well, how proactively, and in what way the learner engages in the
learning process” (p. 232).
Prior research predominantly investigated the learner’s characteristics in the con-
text of individual differences, and they were generally included in research as background variables that modify and personalize the picture of the language acquisition
process (Dörnyei, 2009). Today several researchers perceive foreign language
learning as the result of interaction between learner characteristics and the learning
context, assuming a complex relationship between these two factors. In addition,
increased efforts are put into a deeper understanding of connections between the
learners and the context of learning (Mihaljević Djigunović, 2009). Some IDs are
more stable and less sensitive to the changes of circumstances (e.g., intelligence,
aptitude), while others (e.g., motivation, strategies, anxiety) respond quickly to a changed context (e.g., in a training programme). The question can be raised whether an
optimal combination of individual variables could be identified that would particu-
larly enhance the effectiveness of language learning. According to Ackerman
(2003), individual characteristics can strongly influence learning success separately as well; however, any combination of these characteristics would definitely have a larger impact.
Research on IDs further highlights the fact that different variables influence suc-
cess and student achievement to different degrees. Hence, the traditional approach
identifies primary and secondary variables (Gardner & MacIntyre, 1992, 1993).
According to this classification, aptitude and motivation can be considered as
primary variables in foreign language research, since these variables have the stron-
gest demonstrable impact on student achievement: aptitude is the primary cognitive
factor and motivation is the primary affective factor. Others extended this class of
primary variables to include aptitude, attitude and motivation, social background,
status of the target language and the quality of language teaching (Csapó, 2001;
Ellis, 1994; Józsa & Nikolov, 2003, 2005; Nikolov, 2007). According to Dörnyei
(2010), the perceived effect of these variables also depends on the method applied
to measure these constructs.
Furthermore, some recent investigations question the modular approach to indi-
vidual variables. Dörnyei (2009, 2010) approaches the role of individual differences,
especially the two primary variables (aptitude and motivation), from the perspective
of a “dynamic system”. He claims that “identifying ‘pure’ individual difference fac-
tors has only limited value […]; instead, a potentially more fruitful approach is to
focus on certain higher-order combinations of different attributes that act as inte-
grated wholes” (Dörnyei, 2010, p. 267; Dörnyei, MacIntyre, & Henry, 2015).
It has been revealed that young learners do not resemble each other in every
aspect of their learning either; hence it is possible as well as desirable to study their
IDs (Mihaljević Djigunović, 2009; Nikolov, 2009). However, adequate methods and
instruments for assessment are scarce, since the majority of available measures
were developed for older age groups. According to Mihaljević Djigunović (2009),
the main line of future research should focus on exploring the relationships between
IDs among early language learners, which ultimately presupposes the development
of relevant measures and methods.
Findings of prior research draw a varied picture about the relationship between
IDs and student achievement (also see Mihaljević Djigunović, 2016 in this volume).
There has been a consensus that cognitive, affective and additional background fac-
tors all impact the success of language learning; however, the significance attributed
to individual factors varies across the studies (Csapó & Nikolov, 2009). Consequently,
the study of student achievements should cover a wide range of interactions
between individual variables (Nikolov & Mihaljević Djigunović, 2011).
3 The Study
When defining the theoretical framework of our research a language learning model
had to be found that would meet the requirements of complexity, interactivity and
dynamism (flexibility, versatility) in terms of the context and components of lan-
guage learning. The Socio-educational model of second language acquisition pro-
posed by Gardner and MacIntyre (1993) is one of the most often cited models. It
perceives the learning process embedded in a comprehensive socio-cultural context,
and highlights four different aspects, related to each other: (1) antecedent factors:
e.g., age, gender, prior learning experience and beliefs; (2) ID variables: e.g., intel-
ligence, language aptitude, strategies, attitudes, motivation, anxiety; (3) language
learning contexts: formal and informal learning contexts; and (4) outcomes: linguis-
tic and non-linguistic achievements. The model describes the factors influencing
language learning as interrelated, exerting direct and indirect impact on the process
of language acquisition, which affects achievement. The authors note that the model
is extendable, since several additional cognitive and affective factors might be pres-
ent in language learning influencing learning outcome. This model was the first to
place emphasis on the interaction of variables, perceiving language learning as a
We aimed to explore and identify the internal structure, roles and relationships of
individual variables in the development of early language learners’ listening com-
prehension based on a multi-factor dynamic model of language learning (Gardner &
MacIntyre, 1993) and its reinterpretation (Dörnyei, 2010). A further objective was
4 Method
4.1 Participants
Participants were elementary school students in grades 5 and 6. A total of 150 EFL students from ten school classes in a mid-sized town in Hungary were involved. In order to obtain generalizable results, the sample was representative with regard to gender, ability levels of the student groups and socio-economic status.
The research design applied methodologies and measures established in the field, adapted to the characteristics of the sample, with a preference for mixed methods (Moschner, Anschuetz, Wernke, & Wagener, 2008; Nikolov, 2009; Nunan & Bailey, 2009). (1)
Diagnostic listening comprehension tasks (Nikolov & Szabó, 2011a, 2011b) were
provided for teachers to measure and monitor their students’ development of
listening comprehension during the assessment period. (2) Pretests and posttests
(Nikolov & Józsa, 2006) were applied to measure listening comprehension achieve-
ments. Relevant adapted and newly developed questionnaires were used to capture
IDs in the following areas: (3) language aptitude (Kiss & Nikolov, 2005), (4) strate-
gies of listening comprehension (Vandergrift, 2005, 2006), (5) beliefs about lan-
guage learning (Bacsa, 2012), (6) attitude and motivation related to language
learning (Nikolov, 2003a, 2003b) and (7) listening anxiety (Kim, 2005). All the
questionnaires applied a 5-point Likert scale to assess statements. We used (8) inter-
views and (9) think-aloud protocols to gain in-depth insight into the functioning of
listening comprehension.
The features of the questionnaires and the tests are presented in Tables 1 and 2.
A longitudinal design was used covering the period of a semester, involving two
measurement sessions (except for language aptitude which was measured once
between the two assessment periods). All students were given a booklet including
diagnostic tasks of listening comprehension and questionnaires of individual differ-
ences. The instruments were administered with the help of classroom teachers,
whereas the aptitude and the placement tests were completed under the supervision
of the first author. The collected data were analyzed with the help of SPSS 22 and
AMOS 20 software.
The development of listening comprehension over the period of 6 months was
analyzed in previous papers (Bacsa, 2014; Bacsa & Csíkos, 2013). The specifics of
individual differences were identified by detailed investigations of the individual
variables, which provided a picture of how these variables influenced student
achievement and how they changed between the two testing sessions.
The present study provides a synthesis of the main findings of the longitudinal research on the role of IDs in the development of young language learners' listening comprehension.
5 Results
The diagnostic tasks, used for the first time in this research, were welcomed by most teachers and students, and they also received positive reviews as measurement instruments. The results of the series of assessments monitoring the development of
listening comprehension show that the majority of the sample continuously devel-
oped throughout the assessment period.
As far as the reliability of the measures is concerned, the results show that the
pretest reliability figures (Cronbach-α = 0.51) were lower than expected and lower
than what was found in prior research (Cronbach-α = 0.72 in Nikolov & Józsa,
2006), which might partially be explained by the lower item and sample size
(Dörnyei, 2007), as well as the lower number of distractors. Therefore, we decided to add validated tests, and the modified tests provided sufficient differentiation in the posttest (Cronbach-α = 0.79).
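For illustration, Cronbach's α can be computed from an item-by-learner score matrix with the standard formula; the sketch below uses hypothetical dichotomous item scores, not the study's data.

```python
# Sketch: Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of totals).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = test takers, columns = items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical right/wrong scores for six learners on four listening items:
scores = np.array([[1, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 0, 0, 0],
                   [1, 1, 1, 1],
                   [1, 0, 1, 1],
                   [0, 1, 0, 0]])
print(round(cronbach_alpha(scores), 2))
```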
A significant increase was found in overall listening comprehension over the
semester-long assessment period (t = −4.268; p < 0.001). Subsamples divided by age and gender did not show significant variance, although boys achieved somewhat lower scores than girls, as did the grade 5 subsample compared to grade 6; these non-significant differences recurred on the posttest as well. Significant inter-group variance was found on the pretest [F(9,127) = 4.90; p < 0.001] and the posttest [F(9,128) = 13.20; p < 0.001], along with considerable within-group variance.
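A minimal sketch of the two tests reported above, a paired t-test on pre/post scores and a one-way ANOVA across class groups, might look as follows; all score arrays are hypothetical stand-ins.

```python
# Sketch: paired t-test (pretest vs. posttest) and one-way ANOVA across groups.
from scipy.stats import ttest_rel, f_oneway

pretest = [10, 12, 9, 14, 11, 13, 8, 15]     # hypothetical
posttest = [12, 13, 11, 15, 13, 14, 10, 17]  # hypothetical
t, p = ttest_rel(pretest, posttest)
print(f"t = {t:.3f}, p = {p:.4f}")  # negative t: posttest mean is higher

# Inter-group variance: one score list per class group (three shown here).
group_a, group_b, group_c = [10, 12, 9], [14, 15, 13], [8, 11, 10]
F, p = f_oneway(group_a, group_b, group_c)
print(f"F = {F:.2f}, p = {p:.4f}")
```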
IDs were assessed by applying quantitative and qualitative research methods. The
questionnaires (Attitude and motivation, Beliefs about language learning) were either
originally constructed for the age of the sample or adapted (MALQ and FLLS) to their
age specifics, by reproducing the original factor structure to measure the construct
reliably. This statement is supported by several findings of the qualitative investiga-
tions. The reliability indices of subscales deriving from the internal factor structures
of the questionnaires were found to be lower in some cases than expected in social
scientific research, hence, only those factors were included in the components of
The data presented in Table 4 show that reported strategy use (metacognitive awareness, obtained from the MALQ) of focusing on keywords and on understanding scored high in both assessments without a significant difference. Observed strategy
use (think-aloud protocol) confirmed the primary usage of focusing on keywords in
the listening process. It can also be seen that, based on the average scores, students
do not think that they would have difficulties in learning EFL, since they scored high
in both assessments on the related belief scales without significant differences. The
students’ motivational self-concept (learner level) did not reflect a significant change
by the end of the school year. However, attitude and motivation in classroom learn-
ing decreased significantly by the second assessment, which might be explained by
end-of-year exhaustion or incidental negative experiences. The three components of
anxiety about listening comprehension scored below 3.00 on average in both assessments, which indicates that the participants' anxiety levels were rather low. In addition, the interviews revealed that their anxiety related mostly to the test situation and pressure for achievement rather than to the listening comprehension activity itself. The second assessment showed a significant increase in anxiety in the case of two variables; however, the increased levels still did not reach a “general” anxiety level.
In addition to the seven components of individual differences, this study includes the results of the language aptitude test and parents' education, which have proved to be the main predictors of foreign language achievements of young learners (Csapó &
Nikolov, 2009; Kiss & Nikolov, 2005). We wanted to find out to what degree the
nine ID variables explain the variance found in the two assessments. Previous
research suggested that aptitude would prove to be the best predictor of foreign
language learning achievements (Ellis, 1994; Kiss & Nikolov, 2005; Robinson,
2001; Skehan, 1991; Sparks, Patton, & Ganschow, 2011) and that cognitive vari-
ables would explain more of the variance in case of younger learners than in older
age groups (Csapó & Nikolov, 2009). The results presented in Table 5 support all
these prior research findings in both assessments.
Table 5 shows that the components included in the analysis explain 30 % of the
variance in the listening comprehension scores in the initial and 46 % in the second
assessment. It can be seen that aptitude accounts for a significant degree of variance in both cases: in the first assessment it accounted for 80 % of the total explained variance as the only significant factor, whereas in the second assessment it covered 65 % of the
total variance explained. In both assessments cognitive factors in the traditional
sense (Gardner & MacIntyre, 1992, 1993) explained a higher percentage of variance
than affective factors. In the first assessment aptitude was found to be the only sig-
nificant predictor of listening comprehension results, whereas in the second assess-
ment parents’ education also proved to be a significant indicator of student
achievement. These findings support the findings of previous research that sug-
gested the primary status of cognitive factors in predicting student achievement in
younger age groups (Csapó & Nikolov, 2009; Kiss & Nikolov, 2005). They further
indicate that variables of individual differences (e.g., attitude, motivation, strategies,
beliefs) cannot be viewed as stable constructs, but they change with time reacting to
changes in context (Mihaljević Djigunović, 2009; Robinson, 2001).
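To make the percentages above concrete, the arithmetic below multiplies aptitude's share of the explained variance by the variance explained; this is our reading of the reported figures, not a computation from the chapter.

```python
# Worked arithmetic: aptitude's share of the *total* variance in each assessment.
first = 0.80 * 0.30   # first assessment: 0.24 -> roughly 24 % of total variance
second = 0.65 * 0.46  # second assessment: about 0.30 -> roughly 30 %
print(f"{first:.2f}, {second:.2f}")
```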
In this section the relationships between the components yielded by the two assessment sessions are explored by attempting to filter out the situational effect of the context,
thus allowing us to understand young learners’ development in their English listen-
ing comprehension skills. In the following analyses we studied how the experiences,
opinions and beliefs found in the first assessment predicted the development of lis-
tening comprehension with the help of the factors outlined above. First, a cluster
analysis was conducted to see how the variables relate to listening compre-
hension, i.e. what clusters they form around achievement. The dendrogram of the
cluster analysis conducted by the furthest neighbour method is presented in Fig. 1.
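A sketch of such a clustering in Python follows; the furthest-neighbour method corresponds to complete linkage. The variable labels mirror those in Fig. 1, but the data matrix and the default Euclidean distance are assumptions for the example, as the study's distance measure is not reported here.

```python
# Sketch: hierarchical clustering of variables with complete (furthest-neighbour) linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["achievement", "aptitude", "anxiety1", "anxiety2", "anxiety3",
          "parents_edu", "beliefs", "motivation1", "motivation2", "strategy"]
rng = np.random.default_rng(0)
data = rng.normal(size=(len(labels), 150))  # rows = variables, columns = learners

Z = linkage(data, method="complete")  # furthest-neighbour linkage
dendrogram(Z, labels=labels)
plt.tight_layout()
plt.show()
```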
The dendrogram in Fig. 1 reflects four separate clusters. Variables of aptitude and
achievement are grouped in a well separated cluster. The other variables link to this
by forming smaller individual clusters. Anxiety variables are grouped together,
motivation variables are connected to strategies, linking to the cluster formed by
beliefs and parents’ education. Following the steps based on the proximity of con-
nections it can be seen that aptitude and parents’ education are followed by the anxi-
ety components which in turn are followed by beliefs about language learning.
Motivation connects last, supporting the finding that it is the component interacting most weakly with achievement.
Following the system of relationships between the variables, the predictive value of the individual differences is considered. Regression analysis was conducted to reveal these factors. The question was to what degree the independent variables of individual differences (in the first assessment) included in the analysis predicted listening comprehension achievement as the dependent variable in the posttest. Table 6
shows the results of the regression analysis.
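As a sketch of this kind of regression, the snippet below standardizes all variables so that the fitted coefficients can be read as β values and reports R², the explained variance; predictors, effect sizes and data are hypothetical.

```python
# Sketch: OLS regression of posttest listening scores on ID variables.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "aptitude": rng.normal(size=n),
    "parents_edu": rng.normal(size=n),
    "anxiety": rng.normal(size=n),
})
df["posttest"] = 0.5 * df["aptitude"] + 0.2 * df["parents_edu"] + rng.normal(size=n)

z = (df - df.mean()) / df.std()  # standardize so coefficients are beta values
X = sm.add_constant(z[["aptitude", "parents_edu", "anxiety"]])
model = sm.OLS(z["posttest"], X).fit()
print(model.params)    # standardized beta values
print(model.rsquared)  # share of variance explained (R^2)
```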
[Fig. 1: Dendrogram of the cluster analysis (furthest-neighbour method) of achievement, aptitude, the three anxiety components, parents' education, beliefs about L2 difficulty, the two motivation components and the attention strategy]
Table 6 shows the β values of the regression analysis and the explained variance of the variables (R²). Five out of the nine variables included in the analysis had
significant β values. The nine variables in total explained nearly 50 % of the vari-
ance found in the posttest. Half of this is explained by aptitude alone. Parents’ education, representing the learners’ socio-economic status, was also found to explain a significant share of variance, in line with the majority of other studies conducted in this age group in Hungary (e.g., Bukta & Nikolov, 2002; Csapó & Nikolov, 2009; Józsa & Nikolov, 2005). The three additional variables with significant explanatory power relate to students’ thoughts and feelings about the difficulty of language learning and listening comprehension.
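As an illustration of how β values and explained variance of this kind can be obtained, the sketch below fits a standardized multiple regression on simulated data; the nine predictors, the sample size, and the scores are hypothetical stand-ins, not the study’s data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60                                   # hypothetical sample size
X = rng.normal(size=(n, 9))              # nine ID predictors (aptitude, ...)
y = X @ rng.normal(size=9) + rng.normal(size=n)  # simulated posttest scores

# Standardizing both sides makes the fitted coefficients beta weights.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()

Xd = np.column_stack([np.ones(n), Xz])   # add an intercept column
coef, *_ = np.linalg.lstsq(Xd, yz, rcond=None)
betas = coef[1:]                         # standardized beta values

resid = yz - Xd @ coef
r_squared = 1 - (resid @ resid) / (yz @ yz)  # share of variance explained
print(betas.round(2), round(float(r_squared), 2))
```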
Finally, the paths hypothesized to lead to listening comprehension achievement were drawn with the help of path analysis (Fig. 2). The objective of the path analy-
sis is to reveal the degree and strength of suggested causal relationships (Münnich
& Hidegkuti, 2012). The literature (Everitt & Dunn, 1991) suggests drawing the
hypothesized path (just-identified/saturated model) prior to conducting the analysis
so that the outcome of the analysis may confirm our assumptions.
Fig. 2 Path diagram of the individual difference variables and listening comprehension achievement (standardized path coefficients are shown on the arrows)
The present analysis was based on Gardner and MacIntyre’s (1993) model modified by Dörnyei
(2010), where individual variables have both direct and indirect effects on test
achievement, and as Dörnyei (2010, p. 267) suggests “the cognitive and motivation
(and also emotional) subsystems of human mind cooperate in a constructive man-
ner”. Those components of individual variables that yielded significant β values in the regression analysis were included in the path analysis. Hence, the final model comprised five exogenous variables representing the IDs (aptitude, parents’ education, anxiety about the difficulty of understanding, anxiety about unknown words, and beliefs about the difficulty of language learning). The interactions and causal rela-
tionships of these exogenous variables could explain the development of student
achievement (endogenous variable). The path diagram is shown in Fig. 2.
The χ² test did not reject our null hypothesis, i.e. the saturated and default models were found to be statistically equivalent. The parameters were estimated with the method of maximum likelihood, which selects the parameter values that maximize the likelihood of the observed data.
In this section we describe the indexes of model fit. The saturated model had 27 parameters (NPAR), the tested model had 21, and the degrees of freedom (df) were 6. The values of χ² = 7.95, p = 0.242 indicate that the fit of the tested (default) model to the data was not significantly worse than that of the saturated model. The path coefficients (β values) next to the arrows in the diagram (Fig. 2) are significant in each case. The NFI = 0.949 and CFI = 0.986 values reflect optimal fit, since both indicators exceed the 0.9 (good fit) level. Finally, the RMSEA = 0.034 value, below the 0.05 threshold, also suggests good model fit.
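These indices follow from the underlying χ² statistics by their standard definitions, as the sketch below illustrates. The null-model χ² and the sample size are not given here, so the values used in the demonstration are assumptions chosen to be roughly consistent with the reported indices.

```python
import math

def fit_indices(chi2, df, chi2_null, df_null, n):
    """Standard NFI, CFI and RMSEA computed from model and null-model chi2."""
    nfi = (chi2_null - chi2) / chi2_null
    d_model = max(chi2 - df, 0.0)            # noncentrality of the model
    d_null = max(chi2_null - df_null, 0.0)   # noncentrality of the null model
    cfi = 1.0 - d_model / max(d_null, d_model, 1e-12)
    rmsea = math.sqrt(d_model / (df * (n - 1)))
    return nfi, cfi, rmsea

# chi2 = 7.95 and df = 6 are reported; chi2_null = 155 (df_null = 15) and
# n = 280 are assumptions that roughly reproduce the reported indices.
nfi, cfi, rmsea = fit_indices(7.95, 6, chi2_null=155.0, df_null=15, n=280)
print(f"NFI = {nfi:.3f}, CFI = {cfi:.3f}, RMSEA = {rmsea:.3f}")
# -> NFI = 0.949, CFI = 0.986, RMSEA = 0.034
```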
The five variables in the model account for 47 % of the total variance of achieve-
ment. The multivariate analysis of individual differences and test achievements
revealed that components of individual variables exert both direct and indirect effects
on student achievement. The biggest direct effect on achievement (β = 0.57) was
found in the case of language aptitude, which also directly affected the students’ beliefs
on language learning (β = 0.35). Beliefs, on the other hand, indirectly influence
achievement through feelings related to the difficulty of listening comprehension
(anxiety or the lack of it). Parents’ education has both a direct (β = 0.18) and an
indirect effect on achievement through the related beliefs and feelings. Anxiety con-
cerning unknown words was also found to exert a significant impact on achievement
directly (β = 0.24) and indirectly through anxiety about comprehension (β = 0.46).
It can be stated that students’ beliefs act as a mediator of the effects of their apti-
tude and their parents’ education, making their way to achievement through emo-
tional states. In other words, student beliefs, what they think about language
learning, and their emotions, how they feel in the learning process, interact in
determining children’s development. The effect of beliefs on anxiety about listening
comprehension (β = −0.20) and the effect of anxiety about listening comprehension
on achievement (β = −0.20) are both negative, as expected based on the correlations.
Those who are less anxious expect English to be easier and have a more positive
self-concept as language learners. Consequently, those who are more positively
inclined toward language learning achieve better, which is certainly also true the
other way around.
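The direct and indirect effects described above follow simple path-tracing arithmetic: an indirect effect is the product of the standardized coefficients along a path, and a total effect adds the direct effect to all indirect ones. The sketch below reproduces this arithmetic with the coefficients reported for Fig. 2; the snake_case variable names are shorthand introduced here.

```python
# Standardized path coefficients as reported in the text for Fig. 2.
beta = {
    ("aptitude", "achievement"): 0.57,
    ("aptitude", "beliefs"): 0.35,
    ("beliefs", "anxiety_comprehension"): -0.20,
    ("anxiety_comprehension", "achievement"): -0.20,
    ("anxiety_unknown_words", "achievement"): 0.24,
    ("anxiety_unknown_words", "anxiety_comprehension"): 0.46,
}

def path_effect(*nodes):
    """Multiply the coefficients along a chain of variables (one path)."""
    product = 1.0
    for source, target in zip(nodes, nodes[1:]):
        product *= beta[(source, target)]
    return product

# Aptitude -> beliefs -> anxiety -> achievement: 0.35 * -0.20 * -0.20 = 0.014
indirect = path_effect("aptitude", "beliefs",
                       "anxiety_comprehension", "achievement")
total = beta[("aptitude", "achievement")] + indirect
print(round(indirect, 3), round(total, 3))  # 0.014 0.584
```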
School marks, which evaluate students’ work throughout the school year, were used as an additional measure of student achievement. In the next section we discuss the relationships between IDs and the students’ English marks in order to compare the overlaps of the two achievement variables with the variables of individual differences.
Fig. 3 Dendrogram of clusters around English marks. Explanation: Motivation (1): classroom level; Motivation (2): student level (self-concept); Anxiety (1): following the text; Anxiety (2): difficulty of understanding; Anxiety (3): unknown words
According to the data in Table 7, the total variance explained comes close to
40 %. It is apparent that only three variables have significant β values predicting
English marks. Aptitude has the highest share in the variance explained, accounting
for almost 50 % of the total. The second most significant predictor is the level of student motivation and attitude, i.e. the learner’s self-concept, describing how successful the students perceive themselves to be. The third significant variable is the strategy of directing attention to keywords; as the student interviews confirmed, this metacognitive strategy is one of the most important and most frequently applied strategies of listening comprehension. It is also shown that parents’ education does not directly predict English
marks.
Finally, a path analysis was conducted involving the significant variables result-
ing from regression analysis in order to reveal causal relationships between the
variables in relation to the English marks and the paths leading from IDs to student
achievement evaluated by school marks. Figure 4 shows the path-diagram of
assumed causal relationships.
Three of the ID variables had significant β values, meaning that the direct and
indirect effects of these three variables explain the variance in English marks. First,
the parameters of model fit are reviewed. The saturated model had 14 parameters, the tested model 13, and df was 1. The values of χ² = 0.088, p = 0.767 show that the test was not significant, i.e. the tested model fits the data well. The path coefficients (β values) are significant in all relationships. The NFI = 0.999 and CFI = 1.000 values also reflect an adequate level, exceeding the 0.9 (good fit) threshold. Finally, RMSEA < 0.001 is well below 0.05, indicating good model fit.
Fig. 4 Variables of IDs and causal relationships of English language marks. Explanation: Motivation (2): student level (self-concept)
There are different paths, however, leading to school marks, the other variable of student achievement. Also, the predictive power of the ID variables was considerably lower (35 %) in this case. The most reliable predictor of English language school marks was language aptitude, which influenced school marks both directly (β = 0.37) and indirectly. In the latter case the effect of language aptitude was mediated by the attention-to-keywords strategy (β = 0.22) and the learners’ self-concept as a motive (β = 0.33). These variables had a significant direct effect on marks. As is shown in
Fig. 4, mainly cognitive factors in the traditional sense account for the English
marks. From the affective factors, only motivational self-concept impacts the marks.
The resulting paths in both analyses seem to support Dörnyei’s assumption
(2010): we interpret IDs as dynamic interactions of hierarchically organized
components and cognitive and affective factors (within and between themselves) as
overlapping rather than dichotomous constructs. It became clear that the two achieve-
ment variables, listening comprehension test achievement and English school
marks, were explained by different variables of individual differences to a different
extent.
6 Conclusion
The findings of the research are in line with the predictions of the theoretical frame-
work (Dörnyei, 2006, 2009, 2010; Gardner & MacIntyre, 1992, 1993): the ID variables are multifactor constructs in themselves; their constituents are in constant interaction with each other and with their environment, changing and consequently creating a complex pattern of development. Both the components of individual differences and the systemic models of their connections with student achievement support Dörnyei’s assumption (2010) that the traditional separation of cognitive and affective variables (Gardner & MacIntyre, 1992, 1993) can be problematic.
The findings confirmed that language aptitude and parents’ education are signifi-
cant predictors of young learners’ listening comprehension achievements (Csapó &
Nikolov, 2009; Józsa & Nikolov, 2005; Kiss & Nikolov, 2005). The other primary
factor in the traditional sense (Gardner & MacIntyre, 1993), the motivational com-
ponent, was excluded from the predictive model of listening comprehension
achievement. This seems to contradict previous findings; however, motivation was found to significantly predict school achievement, represented by the English marks in this research, in line with others’ findings (Dörnyei, 2009, 2010; Mihaljević Djigunović, 2006, 2009, 2014).
It was also revealed that listening comprehension achievement is predicted by
the interaction between IDs and the learning context, which is constantly changing
throughout the learning process (Dörnyei, 2006, 2009, 2010; Mihaljević Djigunović,
2009).
Additionally, the findings shed light on the fact that learners’ beliefs, thoughts
and feelings related to the difficulty of language learning and students’ aptitude
have a significant effect both on one another and on achievement (Aragao, 2011;
Bacsa, 2012). This means that what young learners think or believe about language
learning and how they feel about their learning experience impact their achievement
in listening comprehension. According to our model, these beliefs are rooted in the
young learners’ social background (indicated by their parents’ education) and lan-
guage aptitude, and the direction of these relationships is the opposite of that dis-
played in Gardner and MacIntyre’s (1993) model.
References
Ackerman, P. L. (2003). Aptitude complexes and trait complexes. Educational Psychologist, 38,
85–93.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning
and assessment. London: Continuum.
Aragao, R. (2011). Beliefs and emotions on foreign language learning. System, 39(3), 302–313.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford, UK: Oxford
University Press.
Bacsa, É. (2012). Az idegennyelv–tanulással kapcsolatos meggyőződések vizsgálata általános és
középiskolás tanulók körében [Beliefs about language learning]. Magyar Pedagógia, 112(3),
167–193.
Bacsa, É. (2014). The contribution of individual differences to the development of young learners’
listening performances. Doctoral thesis, University of Szeged, Szeged, Hungary. Retrieved
from http://www.doktori.hu/index.php?menuid=193&vid=13859
Bacsa, É., & Csíkos, Cs. (2013). The contribution of individual differences to the development of
young learners’ listening performances. 15th European conference for the research on learning
and instruction: “Responsible Teaching and Sustainable Learning”. Munich, Germany: TUM.
Bors, L., Lugossy, R., & Nikolov, M. (2001). Az angol nyelv oktatása pécsi általános iskolákban
[Teaching English in elementary schools in Pécs]. Iskolakultúra, 1(4), 73–88.
Brózik-Piniel, K. (2009). The development of foreign language classroom anxiety in secondary
school. Doctoral thesis, ELTE, Budapest, Hungary. Retrieved from http://www.doktori.hu/
index.php?menuid=193&vid=4014
Brown, H. D. (1994). Teaching by principles: An interactive approach to language pedagogy.
New York: Prentice Hall Regents.
Bukta, K., & Nikolov, M. (2002). Nyelvtanítás és hasznos nyelvtudás: az angol mint idegen nyelv
[Teaching a language for usable language competence: the case of English as foreign lan-
guage.] In B. Csapó (Ed.), Az iskolai műveltség [School literacy] (pp. 169–192). Budapest,
Hungary: Osiris.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to language
learning and testing. Applied Linguistics, 1, 1–47.
Chang, A. C., & Read, J. (2007). Support for foreign language listeners: Its effectiveness and limi-
tations. RELC Journal, 38(3), 375–395.
Cohen, A. D. (1998). Strategies for learning and using a second language. New York: Longman.
Council of Europe. (2001). Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge, UK: Cambridge University Press.
Csapó, B. (2001). A nyelvtanulást és a nyelvtudást befolyásoló tényezők [Factors affecting lan-
guage competence and language learning]. Iskolakultúra, 1(8), 25–35.
Csapó, B. (2002a). Az iskolai tudás felszíni rétegei: Mit tükröznek az osztályzatok? [The superfi-
cial stratum of school knowledge: What are school marks for?]. In B. Csapó (Ed.), Az iskolai
tudás [School knowledge] (pp. 37–63). Budapest, Hungary: Osiris.
Csapó, B. (2002b). Iskolai osztályzatok, attitűdök, műveltség [School marks, attitudes, and liter-
acy]. In B. Csapó (Ed.), Az iskolai műveltség [School literacy] (pp. 37–65). Budapest, Hungary:
Osiris.
Csapó, B., & Nikolov, M. (2009). The cognitive contribution to the development of proficiency in
a foreign language. Learning and Individual Differences, 19, 209–218.
Csíkos, Cs., & Bacsa, É. (2011). Measuring beliefs about language learning. Paper presented at the
meeting of the 14th biennial conference for research on learning and instruction: “Education
for a Global Networked Society”. Exeter, UK: University of Exeter.
Csizér, K., & Dörnyei, Z. (2002). Az általános iskolások idegen nyelv-tanulási attitűdje és
motivációja [Language learning attitudes and motivation among primary school children].
Magyar Pedagógia, 102(3), 333–353.
Csizér, K., Dörnyei, Z., & Németh, N. (2004). A nyelvi attitűdök és az idegen nyelvi motiváció
változásai 1993 és 2004 között Magyarországon [Changes in language attitudes and motivation
to learn foreign languages in Hungary, 1993–2004]. Magyar Pedagógia, 104(4), 393–408.
Dörnyei, Z. (1998). Motivation in second and foreign language learning. Language Teaching, 31,
117–135.
Dörnyei, Z. (2001). New themes and approaches in second language motivation research. Annual
Review of Applied Linguistics, 21, 43–59.
Dörnyei, Z. (2005). The psychology of the language learner: Individual differences in second lan-
guage acquisition. Mahwah, NJ: Laurence Erlbaum Associates.
Dörnyei, Z. (2006). Individual differences in second language acquisition. AILA Review, 19,
42–68.
Dörnyei, Z. (2007). Research methods in applied linguistics. Oxford, NY: Oxford University
Press.
Dörnyei, Z. (2009). Individual differences: Interplay of learner characteristics and learning envi-
ronment. Language Learning, 59(1), 230–248.
Dörnyei, Z. (2010). The relationship between language aptitude and language learning motivation:
Individual differences from a dynamic systems perspective. In E. Macaro (Ed.), Continuum
companion to second language acquisition (pp. 247–267). London: Continuum.
Dörnyei, Z., & Csizér, K. (2002). Some dynamics of language attitudes and motivation: Results of
a longitudinal nationwide survey. Applied Linguistics, 23(4), 421–462.
Dörnyei, Z., MacIntyre, P. D., & Henry, A. (2015). Introduction: Applying complex dynamic sys-
tems principles to empirical research on L2 motivation. In Z. Dörnyei, P. D. MacIntyre, &
A. Henry (Eds.), Motivational dynamics in language learning (pp. 1–7). Bristol, UK:
Multilingual Matters.
Dörnyei, Z., & Skehan, P. (2003). Individual differences in second language learning. In C. J.
Doughty & M. H. Long (Eds.), The handbook of second language acquisition (pp. 589–630).
Oxford, NY: Blackwell.
Dunkel, P. (1986). Developing listening fluency in L2: Theoretical principles and pedagogical
considerations. Modern Language Journal, 70, 99–106.
Ellis, R. (1994). The study of second language acquisition. Oxford, NY: Oxford University Press.
Everitt, B. S., & Dunn, G. (1991). Applied multivariate data analysis. London: Edward Arnold.
Eysenck, M. W., & Keane, M. T. (2005). Cognitive psychology: A student’s handbook. Hove, UK:
Psychology Press.
Field, J. (2004). An insight into listeners’ problems: Too much bottom-up or too much top-down?
System, 32(3), 363–377.
Gardner, R. C. (1985). Social psychology and second language learning: The role of attitudes and
motivation. London: Edward Arnold.
Gardner, R. C., & MacIntyre, P. D. (1992). A student’s contributions to second language learning.
Part I: Cognitive variables. Language Teaching, 25, 211–220.
Gardner, R. C., & MacIntyre, P. D. (1993). A student’s contributions to second language learning.
Part II: Affective variables. Language Teaching, 26, 1–11.
Goh, C. C. M. (2008). Metacognitive instruction for second language listening development:
Theory, practice and research implications. RELC Journal, 39(2), 188–213.
Griffiths, C. (2003). Patterns of language learning strategy use. System, 31(3), 367–383.
Halle, T., Hair, E., Wandner, L., McNamara, M., & Chien, N. (2012). Predictors and outcomes of
early versus later English language proficiency among language learners. Early Childhood
Research Quarterly, 27, 1–20.
Hardy, J. (2004). Általános iskolás tanulók attitűdje és motivációja az angol mint idegen nyelv
tanulására [The attitude and motivation of primary school children for learning English as a
foreign language]. Magyar Pedagógia, 104(2), 225–242.
Hasselgren, A. (2000). The assessment of the English ability of young learners in Norwegian
schools: an innovative approach. Language Testing, 17(2), 261–277.
Heitzmann, J. (2009). The influence of the classroom climate on students’ motivation. In
R. Lugossy, J. Horváth, & M. Nikolov (Eds.), UPRT 2008: Empirical studies in English
applied linguistics (pp. 207–224). Pécs, Hungary: Lingua Franca Csoport.
Moon, J., & Nikolov, M. (Eds.). (2000). Research into teaching English to young learners. Pécs, Hungary: University Press Pécs.
Inbar-Lourie, O., & Shohamy, E. (2009). Assessing young language learners: What is the con-
struct? In M. Nikolov (Ed.), The age factor and early language learning (pp. 83–96). Berlin,
Germany: Mouton de Gruyter.
Johnson, K. (2001). An introduction to foreign language learning and teaching. London: Longman.
Józsa, K., & Nikolov, M. (2003). Az idegen nyelvi készségek fejlettsége angol és német nyelvből a
6. és 10. évfolyamon a 2002/2003-as tanévben [Levels of performances in English and German
in year 6 and 10]. Budapest, Hungary: OKÉV.
Józsa, K., & Nikolov, M. (2005). Az angol és német nyelvi készségek fejlettségét befolyásoló
tényezők [Factors influencing achievement in English and German as foreign languages].
Magyar Pedagógia, 105(3), 307–337.
Kim, Y. (2001). Foreign language anxiety as an individual difference variable in performance: An
interactionist’s perspective (ERIC Document Reproduction Service No. ED 457 695).
Kim, J. (2005). The reliability and validity of a foreign language listening anxiety scale. Korean
Journal of English Language and Linguistics, 5(2), 213–235.
Kiss, C. (2009). The role of aptitude in young learner’s foreign language learning. In M. Nikolov
(Ed.), The age factor and early language learning (pp. 253–276). Berlin, Germany: Mouton de
Gruyter.
Kiss, C., & Nikolov, M. (2005). Developing, piloting, and validating an instrument to measure
young learners’ aptitude. Language Learning, 55(1), 99–150.
Kormos, J. (2012). The role of individual differences in L2 writing. Journal of Second Language
Writing, 21, 390–403.
Krashen, S. (1985). The input hypothesis: Issues and implications. New York: Longman.
Lantolf, J. P., & Poehner, M. E. (2011). Dynamic assessment in the classroom: Vygotskian praxis
for second language development. Language Teaching Research, 15(1), 11–33.
Larsen-Freeman, D., & Long, M. H. (1991). An introduction to second language acquisition
research. London: Longman.
Larson-Hall, J. (2008). Weighing the benefits of studying a foreign language at a younger starting
age in a minimal input situation. Second Language Research, 24(1), 35–63.
MacWhinney, B. (2005). A unified model of language acquisition. In J. F. Kroll & A. M. B. De
Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 49–67). Oxford,
UK: Oxford University Press.
Marslen-Wilson, W., & Tyler, L. K. (1980). The temporal structure of spoken language under-
standing. Cognition, 8, 1–71.
Martin, A. J. (2009). Motivation and engagement across the academic life span. Educational and
Psychological Measurement, 69, 794–824.
Matsuda, S., & Gobel, P. (2004). Anxiety and predictors of performance in the foreign language
classroom. System, 32(1), 21–36.
Mattheoudakis, M., & Alexiou, T. (2009). Early language instruction in Greece: Socioeconomic
factors and their effect on young learner’s language development. In M. Nikolov (Ed.), The age
factor and early language learning (pp. 227–252). Berlin, Germany: Mouton de Gruyter.
McKay, P. (2006). Assessing young learners. Cambridge: Cambridge University Press.
Mihaljević Djigunović, J. (2006). Role of affective factors in the development of productive skills.
In M. Nikolov & J. Horváth (Eds.), UPRT 2006: Empirical studies in English applied linguis-
tics (pp. 9–23). Pécs, Hungary: Lingua Franca Csoport.
Mihaljević Djigunović, J. (2009). Individual differences in early language programmes. In
M. Nikolov (Ed.), The age factor and early language learning (pp. 198–223). Berlin, Germany:
Mouton de Gruyter.
Mihaljević Djigunović, J. (2010). Starting age and L1 and L2 interaction. International Journal of
Bilingualism, 14(3), 303–314.
Mihaljević Djigunović, J. (2014). Developmental and interactional aspects of young EFL learners’
self-concept. In J. Horváth & P. Medgyes (Eds.), Studies in honour of Marianne Nikolov
(pp. 37–50). Pécs, Hungary: Lingua Franca Csoport.
Mihaljević Djigunović, J. (2016). Individual differences and young learners’ performance on L2
speaking tests. In M. Nikolov (Ed.), Assessing young learners of English: Global and local
perspectives. New York: Springer.
Mónus, K. (2004). Learner strategies of Hungarian secondary grammar school students. Budapest,
Hungary: Akadémiai Kiadó.
Mordaunt, O. G., & Olson, D. W. (2010). Listen, listen, listen and listen: building a comprehension
corpus and making it comprehensible. Educational Studies, 36(3), 249–258.
Moschener, B., Anschuetz, A., Wernke, S., & Wagener, U. (2008). Measurement of epistemologi-
cal beliefs and learning strategies of elementary school children. In M. S. Khine (Ed.), Knowing,
knowledge and beliefs (pp. 113–137). New York: Springer.
Münnich, Á., & Hidegkuti, I. (2012). Strukturális egyenletek modelljei: Oksági viszonyok és komplex elméletek vizsgálata pszichológiai kutatásokban [Models of structural equations: The investigation of causal relations and complex theories in psychological research]. Alkalmazott Pszichológia, 12(1), 77–102.
National Core Curriculum (2003). Budapest, Hungary: Oktatási Minisztérium.
Nikolov, M. (2000). Issues in research into early FL programmes. In J. Moon & M. Nikolov (Eds.),
Research into teaching English to young learners (pp. 21–48). Pécs, Hungary: University Press
Pécs.
Sparks, R. L., Patton, J., & Ganschow, L. (2011). Subcomponents of second-language aptitude and
second-language proficiency. The Modern Language Journal, 95, 253–273.
Spinath, B., & Spinath, F. M. (2005). Longitudinal analysis of the link between learning motiva-
tion and competence beliefs among elementary school children. Learning and Instruction, 15,
87–102.
Sternberg, R. J., & Grigorenko, E. L. (2002). Dynamic testing: The nature and measurement of
learning potential. Cambridge, UK: Cambridge University Press.
Szabó, G., & Nikolov, M. (2013). An analysis of young learners’ feedback on diagnostic listening
comprehension tests. In J. Mihaljević Djigunović & M. Medved Krajnović (Eds.), UZRT 2012: Empirical studies in English applied linguistics. Zagreb, Croatia: FF Press. http://books.google.hu/books?id=VnR3DZsHG6UC&printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false
Tóth, Z. (2008). A foreign language anxiety scale for Hungarian learners of English. WoPaLP, 2,
55–78.
Tóth, Z. (2009). Foreign language anxiety: For beginners only? In R. Lugossy, J. Horváth, &
M. Nikolov (Eds.), UPRT, 2008. Empirical studies in applied linguistics (pp. 225–246). Pécs,
Hungary: Lingua Franca Csoport.
Vandergrift, L. (2005). Relationship among motivation orientations, metacognitive awareness and
proficiency in L2 listening. Applied Linguistics, 26(1), 70–89.
Vandergrift, L. (2006). Second language listening: Listening ability or language proficiency? The
Modern Language Journal, 90, 6–18.
Vandergrift, L. (2012). Listening: Theory and practice in modern foreign language competence.
https://www.llas.ac.uk//resources/gpg/67
Wenden, A., & Rubin, J. (1987). Learner strategies in language learning. Hemel Hempstead, UK:
Prentice Hall.
Wilden, E., & Porsch, R. (2016). Learning EFL from Year 1 or Year 3? A comparative study on
children’s EFL listening and reading comprehension at the end of primary education. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Woodrow, L. (2006). Anxiety and speaking English as a second language. RELC Journal, 37(3),
308–328.
Yim, S. Y. (2014). An anxiety model for EFL young learners: A path analysis. System, 42,
344–354.
Éva Bacsa is deputy headteacher at Kiss Bálint Reformed School in Szentes, Hungary. She holds a PhD in educational science. Her research interests include individual learner differences in early language learning.
Self-Assessment of and for Young Learners’ Foreign Language Learning
Y.G. Butler
Abstract Despite the recent focus on self-assessment (SA) as a tool for enhancing
learning, some researchers and practitioners have expressed concerns about its sub-
jectivity and lack of accuracy. Such concerns, however, originated from the tradi-
tional, measurement-based notion of assessment (assessment of learning) rather
than the learning-based notion of assessment (assessment for learning). In addition,
existing research on SA in second/foreign language education has been concen-
trated on adult learners, leaving us with limited information on SA among young
learners. In this chapter, I address both sets of issues: the confusion between the two
orientations for assessment and age-related concerns regarding SA. First, I clarify
the two orientations of assessment—assessment of learning and assessment for
learning—and demonstrate that most of the concerns about subjectivity and accu-
racy apply primarily to the former orientation. Second, I detail the current findings
on SA among young learners and identify the most urgent topics for future research
in this area. Finally, to help teachers and researchers examine and develop SA items
that are most appropriate for their purposes, I propose five dimensions that charac-
terize existing major SAs for young learners: (a) domain setting; (b) scale setting;
(c) goal setting; (d) focus of assessment; and (e) method of assessment.
1 Introduction
In the following sections, I discuss the two approaches for assessment (assessment
of and for learning) in turn. They originated from different theoretical and epistemo-
logical traditions, and the distinctions need to be clarified. That being said, however,
these approaches are not necessarily mutually exclusive but can instead be located
on a continuum according to the degree of emphasis on learning. In practice, the
same SA tool can be used for more evaluation-oriented means (assessment of learn-
ing) or for more learning-oriented means (assessment for learning).
Among adult learners, a great deal of research has been conducted with respect to
the validity and reliability of SA as well as its use. With a few exceptions (e.g.,
Matsuno, 2009; Patri, 2002; Pierce, Swain, & Hart, 1993), there is ample evidence
indicating that SA results, at least among adults, are generally correlated with
external criteria such as teachers’ ratings, final grades in class, and objective tests
(Bachman & Palmer, 1989; Blanche, 1990; Brantmeier & Vanderplank, 2008;
Brantmeier, Vanderplank, & Strube, 2012; Dickinson, 1987; Hargan, 1994; Leach,
2012; Oscarson, 1997; Stefani, 1994). As a result, SA has been used for relatively
high-stakes purposes, such as program placement (Hargan, 1994; LeBlanc &
Painchaud, 1985) and choosing the appropriate level of tests (Malabonga, Kenyon,
& Carpenter, 2005). However, the degrees of correlations with external criteria var-
ied across studies. Factors that influenced accuracy of SA included the skill domain
being assessed, the ways in which items were constructed, and learners’ individual
characteristics.
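The validation logic behind these studies can be illustrated with a short sketch: correlate learners’ SA scores with an external criterion such as teacher ratings. The data below are simulated and the names hypothetical; an actual study would of course use observed scores.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
teacher_rating = rng.normal(size=80)                     # external criterion
self_assessment = 0.6 * teacher_rating + rng.normal(scale=0.8, size=80)

r, p = pearsonr(self_assessment, teacher_rating)
rho, p_rho = spearmanr(self_assessment, teacher_rating)  # for ordinal ratings
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f}")
```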
With respect to the skill domains being assessed, if we assume that productive
skills (i.e., speaking and writing) require higher degrees of meta-awareness, such as
pre-planning and self-monitoring, than receptive skills (i.e., listening and reading),
we may expect that learners are better at self-assessing their productive skills than
their receptive skills. Interestingly, in a meta-analysis of SA, Ross (1998) found the
opposite to be the case: adult learners could self-assess their receptive skills (read-
ing in particular) in L2/FL more accurately than their productive skills. It is not
clear, however, if receptive skills are inherently easier to self-assess. In speculating
about which factors might explain the surprising result, Ross suggested such things
as learners’ experiences (e.g., adult L2/FL learners at college are likely to have engaged in reading activities more heavily than in other activities), the reference
points that they used (e.g., the adult learners might have judged themselves in rela-
tion to the performances of other students in class), and the scales that were used in
external measurements (e.g., writing assessments often use nominal or categorical
scales that may not be readily applicable to correlational analyses). In general, peo-
ple tend to more accurately self-assess lower order cognitive skills than they do
higher order cognitive skills (Zoller, Tsaparlis, Fastow, & Lubezky, 1997).
Second, how the items are worded and constructed influences learners’ responses
to SA. College students’ responses differed based on whether the items were nega-
tively worded (e.g., “I have trouble with…” and “I cannot do….”) or positively
worded (e.g., “I can do …”), although the degree of inconsistency varied greatly
depending on the items (Heilenman, 1990). Not too surprisingly, learners’ SA accu-
racy improved when the items were provided in their L1 rather than the target lan-
guage (Oscarson, 1997).
Finally, various factors associated with individual learners are also found to
influence their SA accuracy. One of the factors studied most extensively is learners’
proficiency levels and experiences with the target language (Blanche & Merino,
1989; Davidson & Henning, 1985; Heilenman, 1990; Orsmond, Merry, & Reiling,
1997; Stefani, 1994; Sullivan & Hall, 1997). These studies generally indicate that
students with lower proficiency and/or less experience with the target language tend
to overestimate their performance, whereas students with higher proficiency tend to
be more accurate or underrate their performance. Other influential factors over the
accuracy of SA responses include the ways in which learners understand and
respond to scales and items (Heilenman, 1990), the ways in which learners retrieve
relevant memory to self-assess the given skills and performance (Ross, 1998;
Shameem, 1998), learners’ learning styles (Cassidy, 2007), their anxiety levels
(MacIntyre, Noels, & Clément, 1997), and their levels of self-esteem and motiva-
tion (AlFallay, 2004; Dörnyei, 2001). Another important factor, which is of particu-
lar relevance to the current discussion, is the age of the learners.
Fig. 1 Two major issues for self-assessment of learning for young learners (Note: SA of learning
primarily concerns how best to elicit children’s true abilities. In the process, there are two major
issues: measurement issues and interpretation issues)
sometime around the ages of 7–9, with another drop around the ages of 11–13. The
accuracy of children’s perceived competence (examined by calculating correlations
with external measures such as their teachers’ ratings) increases after the age of 8,
when they start using social-comparative information (information indicating that
one’s performance or ability is superior or inferior to others). Although social-
comparative information begins to influence children’s self-appraisal of perfor-
mance by the time they are around 7 years old, it does not influence self-appraisal
of their abilities until much later (around 11–12 years old) (R. Butler, 2005).
Researchers’ interpretations of children’s self-appraisal behaviors have been
changing in recent years. Traditionally, children’s unrealistically high self-appraisal
was mainly attributed to their lack of cognitive maturity for making accurate judg-
ments about their performance and abilities. Piaget’s (1926/1930) well-known stage
theory of cognitive development certainly made a tremendous impact on research-
ers’ interpretation. According to this theory, children at the preoperational stage
(ages 2–7) struggle with logical thinking; instead, their thoughts are dominated by
concrete reasoning and intuition. This theory also posits that children are egocentric
and have a hard time taking other people’s perspectives. The theory goes on to say
that children at the concrete operational stage (ages 7–11) gradually begin to oper-
ate logical thinking and to differentiate their own thoughts from those of others.
However, they still have difficulty handling multiple perspectives systematically
and forming abstract and causal thinking. In line with this theory, Stipek (1984)
offered an explanation for why children are not only unrealistic but also excessively
positive in their perceived competence by proposing the “wishful thinking” inter-
pretation; namely, children cannot distinguish reality from their wishes, and they
tend to make decisions based on the latter.
Similarly, interpretations based on achievement goal theory assumed that chil-
dren’s accuracy in evaluating their own abilities would be partially based on the
development of their conception of ability. The theory proposed that there are two
distinctive goal perspectives when perceiving one’s ability: a task-goal perspective
and an ego-goal perspective. The task-goal perspective is based on one’s subjective
assessment of task achievement and mastery. The ego-goal perspective relies on
one’s demonstration of superior performance compared to others (Dweck, 1986;
Nicholls, 1989). According to this theory, children up to 7 years old cannot distin-
guish between ability and effort when it comes to determining performance on a
task (referred to as undifferentiated conception of ability); for them, effort is ability.
Thus, for young children, a person with high ability refers to a person who makes
effort or obtains a high score in a given task, but they do not understand how to
conceptualize a person who makes effort but achieves low in the given task, or vice
versa. Researchers believed that young children are relatively invulnerable to failure
and that they tend to respond to failure by increasing effort. They also believed
that children do not fully develop the concept of normative difficulty; instead, they
tend to judge task difficulty in an egocentric fashion (e.g., this task is difficult
because it was hard for me) (Nicholls & Miller, 1983). As they grow, children grad-
ually understand that there is a cause-and-effect relationship between effort and
outcome (outcome is a result of effort). But according to this theory, it is only after
children reach the ages of 11–12 that they fully understand that one’s performance
(outcome) is also constrained by one’s ability (referred to as mature conception of
ability). After children reach this level, they can construct perceived competence in
relation to other people’s performance (Nicholls, 1978; see also Mihaljević Djigunović, 2016, in this volume).
If children’s self-evaluative abilities are mainly constrained by their underdevel-
oped internal mental structures, it makes sense to hold off on implementing SA of
learning until they reach a cognitively mature state. However, in contrast to the
results of experimental studies, anyone who spends sufficient time with children
may notice that they appear to have more sophisticated self-evaluative knowledge
and skills in naturalistic contexts than the cognitive-developmental theories predict.
Indeed, neo- or post-Piagetian researchers indicate that children’s self-evaluative
abilities vary greatly depending on contexts, domains, and tasks at a given age level
(see Flavell, 1999, for a review of such studies). Children’s self-appraisal becomes
more accurate if they can engage in familiar tasks and tasks that require lower levels
of cognitive demand to perform. Experiences with different domains (e.g., math,
music, language) help them develop distinct, domain-specific, and stable self-
evaluative competence. Children who have intensive social contacts with other
children can use normative information (information based on social comparison)
more appropriately and are less ego-centric than those who don’t, as we can see,
for example, in the work of Vygotsky (1934/1978). Children may also be more
vulnerable to failure than was previously thought (Carless & Lam, 2014). R. Butler
(2005) argued that:
regarding competence assessment, one implication is that self-appraisal may indeed become
more accurate, differentiated and responsive to relevant information with age, in large part,
however, because of age-related changes in children’s typical experiences and contexts,
rather than their internal cognitive structures. (p. 208)
In addition, potential problems have been raised with respect to the methodolo-
gies of many earlier studies of cognitive development. Children’s failure in tasks
may not be a sign of their lack of abilities but may be due to their misunderstanding
the researchers’ questions or intentions. For example, children as young as 4–5 who
were once thought to be incapable of rating their performance based on temporal
comparison (i.e., comparing their current performance with that in the past) turned
out to be able to do so as long as the information provided to them for evaluation was
meaningful and familiar to them (R. Butler, 2005). These more recent findings on
and interpretations of children’s assessment competence remind us that we need to
pay careful attention to contexts, assessment task choice, and the ways in which SA
is constructed and delivered.
they examined. Interestingly, the response bias observed in the youngest group was
not found to be a function of the number of choices in the Likert scales; their
responses did not differ between the three-level and five-level Likert scales. We do
not know, however, if dichotomous items would have made any difference on the
children’s responses. Judging from the previous studies conducted in domains other
than L2/FL, children do not seem to handle negatively worded items well (e.g., “I
am not good at doing math”) compared with positively worded items (e.g., “I am
good at doing math”) (e.g., Marsh, 1986). Considering the possible domain speci-
ficity of children’s responses, however, we need to examine whether a similar
response bias is observed when children self-evaluate their L2/FL.
SA items are often highly decontextualized—see, for example, the item “I can
ask questions in class,” which I quoted from O’Malley and Pierce (1996) in Sect.
2.1.2. However, depending on the age of the children, the degree of decontextualization
can be a potential threat to the validity of SA of learning. In a study I did with a col-
league (Butler & Lee, 2006), we compared children’s (9–10 year olds and 11–12
year olds) responses to two formats of SA, an off-task SA and an on-task SA, con-
cerning their oral performance in an FL. The off-task SA was a type of SA that
asked learners to self-evaluate their general performance in a decontextualized fash-
ion, as exemplified by the item I quoted above. The on-task SA was a con-
textualized SA in which learners were asked to self-evaluate their performance in a
specific task immediately after the task was completed. We compared the children’s
responses to these two types of SA items with an objective proficiency measurement
and an assessment of the children based on their teachers’ classroom observations.
We found that the children could self-assess their performance more accurately in
the contextualized format than the decontextualized format. Not too surprisingly,
the younger group (9–10 years) had a harder time with the decontextualized format
than the older group. We also found that the children’s responses to the contextual-
ized format, compared with the decontextualized format, were less influenced by
their attitudes and personality factors.
Considering the potential age- and experience-related challenges children may
face when making temporal and/or normative comparisons while self-evaluating
their abilities (see Sect. 2.1.2.1), it seems safe to assume that how researchers define
reference points for SA (e.g., setting learners’ own previous performance or other
people’s performance as a reference point) will influence children’s responses to the
SA items. Unfortunately, we know little about how children rely on different refer-
ence points when they assess their L2/FL abilities. In fact, our knowledge of the
self-assessing process is quite limited, even when considering adult learners.
Moritz’s (1995) exploratory study based on a think-aloud protocol and retrospective
interviews revealed that college students of French as FL used a variety of reference
points (both temporal and normative information) when self-assessing their French
abilities.
We can also assume that the extent to which young learners of L2/FL understand
the purpose of SA influences the accuracy of their responses. In an intervention
study of SA that I conducted with Lee (Butler & Lee, 2010), one of the challenges
that the participating primary school teachers reported was how to provide their
students with initial guidance in order for them to treat SA seriously. It was particularly
challenging to implement SA in a competitive, exam-driven environment. A teacher
who taught in a competitive environment told us that she believed that SA had to be
tied to other assessments or grades in order to ensure the accuracy of her students’
responses. However, a teacher who taught in a much less competitive environment
did not see such measures as necessary. We know from the research on the develop-
ment of self-appraisal among children that their motivation for responding to SA
accurately seems to increase with age but not in a linear fashion. Moreover, their
motivation for accurate SA is also influenced by the amount of domain-specific
knowledge they have acquired as well as by the context in which SA is conducted.
For example, children’s positive bias is encouraged if the context and culture value positive self-appraisal. Accuracy of response is also constrained (responses are more likely to be negatively biased) if the child realizes that there is a social cost to aggrandizing self-
appraisal (R. Butler, 2005). In any event, we need more studies on how best to
situate SA so that children of different ages can understand the purpose of SA and
are motivated to respond to SA accurately in their specific learning environments.
In addition to the issues related to item construction and task choice, various
individual factors likely influence the accuracy of children’s SA responses. Such
factors include cognitive maturity, personality, motivation, proficiency in the target
language, and experience with SA. The role of individual differences in children’s
responses in SA is an unexplored area of inquiry, and so I can offer no practical,
research-based suggestions for ensuring the accuracy of SA of learning among
children.
2003, p. 2). In assessment for learning, validity refers to the extent to which both the
content of the assessment and the assessment’s methods and tasks are matched with
instruction. Thus, assessment for learning is deeply embedded in the particular con-
text of the assessment. In assessment for learning, learners are no longer merely
objects being measured; they are active participants who make inferences and take
actions, together with the teachers, for formative purposes. According to Brookhart
(2003), the validity concerns of assessment for learning include the degrees and the
ways in which learners can self-reflect and benefit from having assessment enhance
their learning. Similarly, teachers’ knowledge, beliefs, and practices are all part of
the validity concerns as well. In assessment for learning, reliability refers to the
degree of stability of “information about the gap between students’ work and ‘ideal’
work (as defined in students’ and teachers’ learning objectives)” (p. 9).
By engaging learners in self-reflection, SA is considered to be effective for
developing their self-regulation, which can be defined as “the self-directive process
by which learners transform their mental abilities into academic skills” (Zimmerman,
2002, p. 65), and should enhance their motivation and learning. However, empirical
studies examining the effect of SA on learners’ motivation and learning have been
limited, particularly in relation to L2/FL.
This may make sense, particularly considering that peer-assessment was found to be
psychometrically more internally consistent and to have higher correlations with
external measures than SA (Matsuno, 2009; Patri, 2002) but that SA helped to
increase learning more than peer-assessment (Sadler & Good, 2006).
Feedback is an essential part of SA for it to be effective for learning (Sadler,
1989), but feedback by itself does not guarantee positive outcomes. Hattie and
Timperley’s (2007) meta-analysis on feedback showed that there were substantial
differences in effect sizes across studies, indicating that the quality and timing of the
feedback greatly influenced learners’ performance. Nicol and Macfarlane-Dick
(2006) listed seven principles for good feedback practice:
(1) helps clarify what good performance is (goals, criteria, expected standards); (2) facili-
tates the development of self-assessment (reflection) in learning; (3) delivers high quality
information to students about their learning; (4) encourages teacher and peer dialogue
around learning; (5) encourages positive motivational beliefs and self-esteem; (6) provides
opportunities to close the gap between current and desired performance; and (7) provides
information to teachers that can be used to help shape teaching. (p. 205)
Nicol and Macfarlane-Dick also stated that once learners have developed their
self-evaluative skills to the point where they are able to engage in self-feedback,
they can improve themselves even if the quality of external feedback is “impover-
ished” (p. 204).
In order to benefit from SA, learners themselves need to meet certain conditions.
Sadler (1989) identified three such conditions: “(a) possess a concept for the stan-
dard (or goal, or reference level) being aimed for; (b) compare the actual (or cur-
rent) level of performance with the standards; and (c) engage in appropriate action
which leads to some closure of the gap” (p. 121). From a constructivist view of
learning, such as that of Vygotsky (1934/1978), such learners’ abilities are culti-
vated through having dialogues with and receiving assistance from their teachers or
capable peers. Orsmond et al. (1997) also showed that learners’ thorough under-
standing of the subject matter makes the SA results more useful.
In the field of L2/FL, empirical studies on the effect of SA on learning are lim-
ited. Among adult learners of French in Australia, de Saint Léger (2009) found that
SA had a positive influence on their perceived fluency, vocabulary, confidence, and
sense of responsibility for their own learning. Similarly, de Saint Léger and Storch
(2009) found that SA had a positive influence on adult learners’ willingness to com-
municate in an FL (e.g., perceived participation in class activities).
It is important to note, however, that many studies that examined the effect of SA
on learning conceptualized learning as one-dimensional, sequential, and largely
knowledge-based. Sadler (1989) reminded us that not all learning can be conceptu-
alized as such, and stated that “the outcomes are not easily characterized as correct
or incorrect, and it is more appropriate to think in terms of the quality of a student’s
responses or the degree of expertise than in terms of facts memorized, concepts
acquired or content mastered” (p. 123). Indeed, we need more research examining
the effect of SA on learning when learning is conceptualized as a multidimensional, nonlinear, and nonstatic process.
Fig. 2 The process of self-assessment for learning for young learners (Note: Components in SA
described in dotted squares are key driving forces to facilitate learners’ self-reflection processes)
difficult to assess listening compared with other domains. (Note, however, that
Dann’s study was conducted in a language arts context as opposed to an L2/FL
context.) Unfortunately, we know very little about the kinds of tasks and perfor-
mances that would be suitable for children—based on their cognitive maturity and
experience—to engage in during SA.
As with adults, children need to understand the reasons for doing SA and have a
clear understanding of the criteria. Children need to understand the goals and be
invested in them in order to advance themselves (Torrance & Pryor, 1998). This
appears to be the first hurdle to deal with, as indicated by Black et al.’s (2003) com-
ment about young learners: “the first and most difficult task is to get students to
think of their work in terms of a set of goals” (p. 49). In order to overcome this chal-
lenge, teachers may need to talk with children individually, perhaps on an ongoing
basis. Although we have limited information on how children interpret the criteria
for SA and make judgments using the criteria, it has been reported that children do
not necessarily make judgments rationally—at least from the point of view of adults
(Dann, 2002).
As suggested for adult learners, peer-assessment can help children understand
the criteria better, and so it may be effective to implement peer-assessment before
SA or along with SA (for a related discussion, see Hung, Samuelson, & Chen, 2016
in this volume). Dann’s (2002) case study indicated that when children engaged in
SA, they tended to draw on personal elements, such as the effort they had put into completing the work. Evaluating their peers’ work (peer-assessment)
seemed to help them objectify the criteria. In conducting peer-assessment with
young learners, however, careful oversight is necessary. Research indicates that
children who evaluate their peers’ work and realize that their own progress and
learning are limited compared to others are likely to lower their self-efficacy
(Bandura, 1997), which in turn could negatively influence their further learning. In
my studies in China (Butler, 2014, 2015), by the 8th grade (ages 13–14), some chil-
dren started lowering their self-efficacy in FL learning at relatively early stages, and
their level of self-efficacy turned out to be a major predictor of their FL
performance.
It is also important to note that in assessment for learning, we do not necessarily
adhere to the criteria in a strict sense. Instead, Dann (2002) suggested that “the pri-
ority given to pupil learning required a large degree of sensitivity in balancing the
promotion of specific criteria with personal and individual factors” (pp. 96–97). In other words, instead of considering the criteria to be absolute and fixed and expecting everybody to follow them uniformly, in assessment for learning the criteria should be flexible so that they can be adjusted according to the specific learning goals and
needs of individual learners. Depending on the children’s cognitive maturity and
experience, they might even be able to actively participate in the process of develop-
ing criteria, in collaboration with their teachers.
SA can help teachers understand the gaps in a child’s current state of understand-
ing and his or her potential level of understanding (or an optimal goal for learning).
It is important to note that children’s judgment about their current understanding
can be very different from the teachers’ judgment, and thus dialogues are needed to
close the perceptual gaps between students and teachers. In order to become com-
petent self-regulated learners, children have to develop metacognition to figure out
what they know and what they don’t know. As Harker (1998) stated, “only when
students know the state of their own knowledge can they effectively self-direct
learning to the unknown” (p. 13). And importantly, young learners are capable of
monitoring their knowledge when they are provided with sufficient training. To
facilitate the development of children’s monitoring skills, SA should include items
that capture the process of learning in addition to those that capture the learning
outcome itself (Butler & Lee, 2010). After the gaps are understood by both the
learner and the teacher, the teacher can help the learner set a goal within the zone of
proximal development (ZPD, to use a Vygotskian term) and offer concrete assis-
tance to help the learner reach the goal.
SA for learning is a recursive process. By repeating the process, SA ultimately
aims to help children become self-regulated and autonomous learners. SA should be
designed in such a way that learners can understand the goals of the tasks, self-
reflect on their learning in relation to the goals, monitor their process of learning,
and figure out what it takes to achieve the goals.
The teachers’ role in the process of SA for learning is substantial. Y. G. Butler
and Lee (2010) found that SA improved Korean primary school students’ (ages
11–12) learning in English as well as their confidence but, importantly, the effects
differed depending on individual teachers’ attitudes toward assessment and their
teaching context. When the teaching context was exam-driven and competitive, and
if the teacher could not fully subscribe to the spirit of assessment for learning, the effect of SA on the students’ learning was limited. In other words, in order for SA to be effective, fostering a learning culture and the teachers’ understanding of assessment for learning appear to be indispensable.
Various types of SA items have been developed for young learners in recent years.
Some items are clearly designed for SA of learning, others are clearly designed for
SA for learning, and still others can be used for either purpose, depending on the
students’ and teachers’ needs and objectives. In this section, I examine major types
of existing SAs, classifying them based on the following five dimensions and where
they fall on the continua associated with those dimensions. These dimensions
should be helpful for teachers and students as well as researchers when using exist-
ing SA items or developing their own items.
Domain setting
More general (open ended) -------------------------------------- More specific
Scale setting
Fewer levels -------------------------------------- More levels
More general (open ended) -------------------------------------- More specific
Goal setting
More externally regulated ------------------------------ More self-regulated
More static ------------------------------ More dynamic
Focus of assessment
More product oriented ------------------------------ More process oriented
Method of assessment
More individual based ------------------------------ More collaborative based
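To make these continua concrete, the sketch below encodes where a given SA item falls on each of them. It is purely illustrative: the class name, field names, and numeric values are hypothetical and are not part of any published instrument.

    from dataclasses import dataclass

    @dataclass
    class SAItemProfile:
        # Each field is a position on one continuum, from 0.0 (left pole)
        # to 1.0 (right pole), following the five dimensions above.
        domain_specificity: float    # general (open ended) -> specific
        scale_levels: float          # fewer levels -> more levels
        scale_specificity: float     # general descriptors -> specific descriptors
        goal_self_regulation: float  # externally regulated -> self-regulated
        goal_dynamism: float         # static -> dynamic
        process_focus: float         # product oriented -> process oriented
        collaboration: float         # individual based -> collaborative based

    # Example 1 below ("global assessment" of speaking) would sit near
    # the left pole on almost every continuum.
    global_speaking = SAItemProfile(0.1, 0.2, 0.1, 0.1, 0.1, 0.0, 0.0)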
SAs can vary in terms of domain specifications. In Example 1, the domain is defined
very generally (i.e., speaking), and the assessment focuses only on fluency.
Oskarsson (1978) called this type of SA “global assessment” (p. 13). It allows us to
get only a rough picture of learners’ abilities.
Example 1 (Oskarsson, 1978, p. 37)¹
SPEAKING
Put a cross in the box which corresponds to your estimated level.
However, in this format, the domain can be easily defined with increasing speci-
ficity, as in examples 2 and 3: “I can ask questions in class” (Example 2) is more
specific than “speaking” (Example 1), and “I can ask where someone lives”
(Example 3) is even more specific (ignore the scales of these examples for the time
being).
¹ This item did not include descriptions for each level of the scale, and was not meant for children. All other examples in this chapter were designed for young learners.
² Some of the items in the European Portfolio are more or less specific. The items listed here are relatively specific.
The scale setting can be examined in two ways: (a) the number of levels and (b) the
degree of specificity of each level. As I mentioned above, from the assessment of
learning point of view, we don’t know how many levels are optimal for young learn-
ers (i.e., yielding the most accurate responses). We can easily assume that the
answer to this question depends, in part, on the degree of specificity of each scale
level. Providing simple descriptions of each level, as in examples 2 and 4, may not
necessarily contribute to higher accuracy. The scales still may be interpreted differ-
ently across children and, within a child, across items. It is important to make sure
that children understand what each level means. While dichotomous SA items (can-
do items), such as in Example 3, are increasingly popular at the primary school
level, we still know very little about how children process and respond to dichoto-
mous SA items, as discussed above.
Some SAs have detailed descriptions for each scale; such scales are often referred
to as “descriptive rating scales” (Oskarsson, 1978, p. 16). In Example 5 (European
Language Portfolio), each scale description corresponds to the Common European
Framework of Reference for Languages (CEFR; Council of Europe, 2001). In general, the more
detailed the descriptors, the easier it is for learners to respond. However, children
may need assistance in comprehending the descriptors. Providing some concrete
examples, as in Example 5, enhances children’s comprehension of the descriptors.
Example 5 (CILT, European Language Portfolio, 2006, p. 32)
SPEAKING AND TALKING TO SOMEONE
A1 level: I can use simple phrases and sentences to describe where I live and people
I know.
Grade 1: I can say/repeat a few words and short simple phrases
e.g., what the weather is like; greeting someone; naming classroom objects…
Grade 2: I can answer simple questions and give basic information.
e.g., about the weather; where I live; whether I have brothers and sisters, or
a pet…
Grade 3: I can ask and answer simple questions and talk about my interests
e.g., taking part in an interview about my area and interests; a survey about
pets or favorite foods; talking to a friend about what we like to do and
wear…
From the assessment for learning point of view, scales can be useful if they are designed in such a way that learners can see the process or progress of their learning, or can identify the gaps between the current and potential levels of their learning.
Scales can be set flexibly, according to individual learners’ needs and learning
trajectories.
Goal setting refers to the process of identifying the goals of the SA, and it can be
further divided into two sub-dimensions: (a) the extent to which learners have
autonomy to identify the goals; and (b) the degree of flexibility with which goals
can be defined. Granting autonomy and flexibility in goal setting may be a threat to validity in the traditional assessment of learning approach, but it can be a critical feature of SA for learning, helping children to become autonomous and self-reflective learners. In Example 6, learners choose the goals they should aim for next from a list of predefined goals. In Example 7, while some sample goals
are listed, children can either come up with their own goals or choose their goals
from the examples provided. The goals can be changed upon negotiation with the
teacher.
Example 6
Can you usually do these things?³ Use these symbols: in column 1, ✓ = I think I can and ✓✓ = I know I can; in column 2, ✓ = I aim to do this soon; in column 3, write the date when you’ve done an example of this. (Columns: yes / my aim / example)
1 I can understand what is said to me about everyday things if the other person speaks slowly and clearly and is helpful.
2 I can show that I am following what people say, and can get help if I can’t understand.
3 I can say some things to be friendly when I meet or leave someone.
4 I can do simple ask-and-answer tasks with a partner in class, using expressions we have learnt.
5 I can ask or tell the teacher about things we are doing in class.
…
³ The underline was original. There are 12 items for each category.
Example 7⁴
Example goals
To try my best to engage in conversations, songs, and games in class
To speak (English) confidently
To talk to a foreign teacher
To effectively use gestures when speaking
To make eye contact with the partner when speaking
To use newly learned words in conversation …
⁴ The original was in Japanese, and the selected part was translated by the author.
SAs can be designed as individual assessment activities or can be meant for more collaborative work. Although it is possible to use SAs collaboratively even when they were originally meant to be carried out individually, SA items can also be designed in such a way that they invite other people’s participation. This is particularly important in an assessment for learning orientation, in which a greater degree of collaboration (assistance from other, more capable individuals) is critical in the SA process, especially during the initial stages of children’s SA practices. As children develop stronger self-regulation skills, SAs can be conducted more independently.
Although recent policies often strongly encourage primary school language teach-
ers to implement SA as a tool for helping children to gain greater ownership of their
learning, many people continue to express concerns about the accuracy and subjec-
tivity of SA. Such concerns, however, primarily originate from the traditional,
measurement-based notion of assessment rather than a learning-based notion of assessment. In addition, the age factor has not been sufficiently discussed in previous research on SA. In this chapter, therefore, I clarified two notions of SA, namely SA of learning and SA for learning.
References
AlFallay, I. (2004). The role of some selected psychological and personality traits of the rater in
the accuracy of self- and peer-assessment. System, 32, 407–425.
Andrade, H., Wang, X., Du, Y., & Akawi, R. (2009). Rubric-referenced assessment and self-
efficacy for writing. The Journal of Educational Research, 102(6), 287–302.
Bachman, L. F., & Palmer, A. S. (1989). The construct validation of self-ratings of communicative
language ability. Language Testing, 6(1), 14–29.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003). Assessment for learning:
Putting it into practice. New York: Open University Press.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education:
Principles, Policy and Practice, 5(1), 7–74.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational
Assessment, Evaluation and Accountability, 21(1), 5–31.
Blanche, P. (1990). Using standardized achievement and oral proficiency tests for self-assessment purposes: The DLIFLC study. Language Testing, 7(2), 202–229.
Blanche, P., & Merino, B. J. (1989). Self-assessment of foreign-language skills: Implications for
teachers and researchers. Language Learning, 39(3), 313–340.
Brantmeier, C., & Vanderplank, R. (2008). Descriptive and criterion-referenced self-assessment
with L2 readers. System, 36(3), 456–477.
Brantmeier, C., Vanderplank, R., & Strube, M. (2012). What about me? Individual self-assessment
by skill and level of language instruction. System, 40, 144–160.
Brookhart, S. M. (2003). Developing measurement theory for classroom assessment purposes and
uses. Educational Measurement: Issues and Practice, 22(4), 5–12.
Butler, R. (2005). Competence assessment, competence, and motivation between early and middle
childhood. In A. J. Elliot & C. S. Dweck (Eds.), Handbook of competence and motivation
(pp. 202–221). New York: The Guilford Press.
Butler, Y. G. (2014). Parental factors and early English education as a foreign language: A case
study in Mainland China. Research Papers in Education, 29(4), 410–437.
Butler, Y. G. (2015). Parental factors in the children’s motivation for learning English. Research
Papers in Education, 30(2), 164–191.
Butler, Y. G., & Lee, J. (2006). On-task versus off-task self-assessment among Korean elementary
school students studying English. The Modern Language Journal, 90(4), 506–518.
Butler, Y. G., & Lee, J. (2010). The effects of self-assessment among young learners of English. Language Testing, 27(1), 5–31.
Carless, D., & Lam, R. (2014). The examined life: Perspectives of lower primary school students
in Hong Kong. Education 3–13, 42(3), 313–329.
Cassidy, S. (2007). Assessing ‘inexperienced’ students’ ability to self-assess: Exploring links with
learning style and academic personal control. Assessment & Evaluation in Higher Education,
32(3), 313–330.
Chambers, C. T., & Johnston, C. (2002). Developmental differences in children’s use of rating
scales. Journal of Pediatric Psychology, 27(1), 27–36.
CILT (The National Center for Languages). (2006). European language portfolio – Junior version:
Revised edition. Retrieved from http://www.primarylanguages.org.uk/resources/assessment_
and_recording/european_languages_portfolio.aspx
Council of Europe. (2001). Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge, UK: Cambridge University Press.
Dann, R. (2002). Promoting assessment as learning: Improving the learning process. New York:
Routledge.
Davidson, F., & Henning, G. (1985). A self-rating scale of English difficulty. Language Testing, 2,
164–169.
de Saint Léger, D. (2009). Self-assessment of speaking skills and participation in a foreign lan-
guage class. Foreign Language Annals, 42, 158–178.
de Saint Léger, D., & Storch, N. (2009). Learners’ perceptions and attitudes: Implications for will-
ingness to communicate in an L2 classroom. System, 37, 269–285. doi:10.1016/j.
system.2009.01.001.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge, UK: Cambridge
University Press.
Dörnyei, Z. (2001). Motivational strategies in the language classroom. Cambridge, UK: Cambridge
University Press.
Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41,
1040–1048.
Flavell, J. H. (1999). Cognitive development: Children’s knowledge about the mind. Annual
Review of Psychology, 50, 21–45.
Hargan, N. (1994). Learner autonomy by remote control. System, 22, 455–462.
Hacker, D. J. (1998). Definitions and empirical foundations. In D. J. Hacker, J. Dunlosky, & A. C. Graesser (Eds.), Metacognition in educational theory and practice (pp. 1–23). Mahwah, NJ: Lawrence Erlbaum Associates.
Harris, M. (1997). Self-assessment of language learning in formal settings. ELT Journal, 51(1), 12–20.
Hasselgren, A. (2003). Bergen ‘Can-Do’ project. Retrieved from http://blog.educastur.es/portfolio/
files/2008/04/bergen-can-do-project.pdf
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1),
81–112.
Heilenman, L. K. (1990). Self-assessment of second language ability: The role of response effect.
Language Testing, 7(2), 174–201.
Hung, Y.-J., Samuelson, B. L., & Chen, S.-C. (2016). The relationships between peer- and self-
assessment and teacher assessment of young EFL learners’ oral presentations. In M. Nikolov
(Ed.), Assessing young learners of English: Global and local perspectives. New York: Springer.
Kato, Y. (n.d.). Shogakko eigo katsudo-no hyoka-no arikata: jido-ga iyokuteki-ni torikumu
jikohyoka-no kufu-to hyoka-no kanten-no meikakuka [How to evaluate English activities at
primary school: Self-assessment that children are motivated to engage in and the clarification
of evaluation criteria]. Retrieved from http://www.kyoiku-kensyu.metro.tokyo.jp/09seika/
reports/files/kenkyusei/h18/k-31.pdf
Leach, L. (2012). Optional self-assessment: Some tensions and dilemmas. Assessment &
Evaluation in Higher Education, 37(2), 137–147.
LeBlanc, R., & Painchaud, G. (1985). Self-assessment as a second language placement instrument.
TESOL Quarterly, 19(4), 673–687.
MacIntyre, P., Noels, K., & Clément, R. (1997). Biases in self-ratings of second language profi-
ciency: The role of language anxiety. Language Learning, 47(2), 265–287.
Malabonga, V., Kenyon, D., & Carpenter, H. (2005). Self-assessment, preparation and response
time on a computerized oral proficiency test. Language Testing, 22(1), 59–92.
Marsh, H. W. (1986). Negative item bias in rating scales for preadolescent children: A cognitive-
developmental phenomenon. Developmental Psychology, 22, 37–49.
Matsuno, S. (2009). Self-, peer-, and teacher-assessments in Japanese university EFL writing
classrooms. Language Testing, 26(1), 75–100.
McDonald, B., & Boud, D. (2003). The impact of self-assessment on achievement: The effects of
self-assessment training on performance in external examination. Assessment in Education,
10(2), 209–220.
Mihaljević Djigunović, J. (2016). Individual differences and young learners’ performance on L2
speaking tests. In M. Nikolov (Ed.), Assessing young learners of English: Global and local
perspectives. New York: Springer.
Moritz, C. E. B. (1995). Self-assessment of foreign language proficiency: A critical analysis of
issues and a study of cognitive orientations of French learners. Doctoral dissertation, Cornell
University, New York.
Morrison, F. J., Ponitz, C. C., & McClelland, M. M. (2010). Self-regulation and academic achieve-
ment in the transition to school. In S. D. Calkins & M. A. Bell (Eds.), Child development at the
intersection of emotion and cognition (pp. 203–224). Washington, DC: American Psychological
Association.
National Council of State Supervisors for Languages. (2014). Lingua Folio®. Retrieved from
http://www.ncssfl.org/LinguaFolio/index.php?checklists
Nicholls, J. G. (1978). The development of the concepts of effort and ability, perception of aca-
demic attainment, and the understanding that difficult tasks require more ability. Child
Development, 49, 800–814.
Nicholls, J. G. (1989). The competitive ethos and democratic education. Cambridge, MA: Harvard
University Press.
Nicholls, J. G., & Miller, A. T. (1983). The differentiation of the concepts of difficulty and ability.
Child Development, 54, 951–959.
Nicol, D., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A
model and seven principles of good feedback practice. Studies in Higher Education, 31(2),
199–218.
O’Malley, J. M., & Pierce, L. V. (1996). Authentic assessment for English language learners:
Practical approaches for teachers. New York: Longman.
Orsmond, P., Merry, S., & Reiling, K. (1997). A study in self-assessment: Tutor and students’
perceptions of performance criteria. Assessment and Evaluation in Higher Education, 22(4),
357–368.
Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived marking criteria in peer
and self-assessment. Assessment and Evaluation in Higher Education, 25(1), 23–38.
Oscarson, M. (1989). Self-assessment of language proficiency: Rationale and applications.
Language Testing, 6(1), 1–13.
Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham
& D. Corson (Eds.), Encyclopedia of language and education: Vol. 7. Language testing and
assessment (pp. 175–187). Dordrecht, the Netherlands: Kluwer.
The Relationships Between Peer- and Self-Assessment and Teacher Assessment of Young EFL Learners’ Oral Presentations
Y.-J. Hung, B. L. Samuelson, and S.-C. Chen
1 Introduction
Social learning theory does not hold that learners simply imitate a model’s actions; rather, they form new response patterns by organizing the behavioral elements they observe. This modeling process is governed by four subprocesses. The first is attentional processes: learners select from the model’s numerous characteristics and attend to the most relevant ones. Associational preferences are an essential factor here, because learners associate with members of their own social groups; in classroom settings, they relate to their peers. The second is retention processes: verbal coding of the observed information facilitates cognitive processing and storage, and rehearsal, whether actually performing the behavior or mentally rehearsing it, enhances long-term retention. The third is motoric reproduction processes: learners first acquire symbolic representations of modeled activities and thereby achieve approximations of the desired behavior, then refine the new behavior patterns through self-corrective adjustments based on feedback from their own performance. The fourth is reinforcement and motivational processes: positive feedback or incentives turn acquired skills into actual performance, and the anticipation of positive consequences is one of the best motivators for reinforcing and generating an effective, high level of observational learning (Bandura, 1971).
Similarly, in the present study students rated their peers’ performances based on the criteria in the evaluation rubrics, so they selectively attended to features of their peers’ oral presentations. After each presentation, they discussed and decided on the scores for the individual assessment criteria as a group. Each group and the teacher then gave oral feedback on the strengths and weaknesses of the presentation. This verbalizing process helped them understand and retain the criteria. The assessing experience also provided students with opportunities for self-reflection by casting themselves in a similar context, a form of mental rehearsal that facilitates future performance. Afterwards, their SA reinforced their assessment ability for their own presentations and benefited their learning. Self-observation and self-judgment in the process of SA informed learners how well they were progressing toward their goals and motivated behavioral change (Schunk, 2001).
Within the framework of social learning theory, an effective, high level of observational learning of modeled behaviors is shaped and activated by three functions: the informative function, the motivational function, and the cognitive function.
Relevant studies of PA and SA have been carried out extensively in various fields in
L1 higher education contexts, but fewer studies combine both forms of assessment
of target oral performance in L2 contexts, especially with young learners. This sec-
tion reviews PA and SA in higher education first and then narrows the scope to
discuss empirical studies incorporating both forms of student-assessment with
young learners.
The PA process, in which students benefit from social interaction between assessors
and assessees, enhances development of cognition and meta-cognition, affect, and
social skills (Topping, 1998). Reviews of PA studies find general agreement between
student and teacher ratings. Falchikov and Goldfinch (2000) analyzed 48 quantita-
tive studies in L1 settings from 1959 to 1999 and found the mean value of correla-
tion coefficients was 0.69, indicating general agreement between peer and teacher
ratings. Consistent with the previous findings, van Zundert, Sluijsmans, and van
Merriënboer (2010) reviewed 26 studies of L1 PA from 1990 to 2007 and further
pointed out that peer feedback helped students revise their work, higher achievers
were more skillful in PA than lower achievers, and students had mixed attitudes
toward PA. The problems of friendship marking (Pond, Ul-Haq, & Wade, 1995), also referred to as “reciprocity effects” (Panadero, Romero, & Strijbos, 2013, p. 195), and of insufficient differentiation (Murphy & Cleveland, 1995), in which learners gave their peers higher ratings than they deserved and restricted the range of their ratings to avoid inaccurate evaluations, were commonly observed in adult learners.
Given opportunities to assess and reflect on their individual progress by engaging
in SA, learners focus on their own learning, locate their strengths and weaknesses,
and take responsibility for their own learning (Harris, 1997).
The review of SA research shows self-appraisal improves students’ achievement,
though the correlations for self- and teacher agreement are not as good as for PA
(Blanche & Merino, 1989; Ross, 2006). SA of oral skills is found to be more difficult because speaking can be highly intangible (Harris, 1997). Self-ratings may be affected by subjective errors stemming from past academic records and from peer or parental expectations (Blanche & Merino, 1989). Cultural factors, such as the pressure to
display overt modesty, which is valued in Chinese culture, may make students more
critical of their own performance (Chen, 2008; Oscarson, 1997). In contrast, Iranian
students are lenient when rating themselves since overt or false modesty concerning
one’s accomplishments is not emphasized in their culture (Esfandiari & Myford,
2013). Young children tend to over-estimate due to their wishful thinking and lack
of the cognitive skills needed to evaluate their abilities accurately (Ross, 2006).
The above reviews show benefits as well as potential problems of PA and
SA. Dochy et al. (1999) argued that incorporating both types of student assessment
could overcome the defects of over-marking and under-marking. However, the following studies show that this proposal remains questionable and that additional empirical studies are needed to verify it.
Student assessment has been found to have a positive effect on young learners’ achievement, but age-related differences appear to be a factor. Ross, Hogaboam-Gray, and Rolheiser (2002) found that 5th and 6th graders who received self-evaluation training had higher math achievement than those who did not. In another study, of 6th graders in English classes in Korea, repeated SA improved students’ assessing ability as well as their English performance on objective tests (Butler & Lee, 2010). Butler (1990) compared self-ratings by children aged 5, 7, and 10 with ratings by adult judges after the children copied drawings. The young learners were interested in and capable of comparing drawings using agreed-upon standards. However, when they were put in a competitive condition, the desire to outperform others and difficulties in evaluating relative abilities inflated their perceptions of their own work and decreased their interest.
Butler and Lee (2006) compared 4th and 6th graders’ responses to an off-task SA
and an on-task SA with teacher assessment and results of standardized tests in
Korea. In the off-task SA, learners self-evaluated their general performance in a
decontextualized way. The on-task SA was in a contextualized format, in which
learners self-assessed their performance in a specific task. The study showed that
the validity of SA in the contextualized format was higher than that of SA in the decontextualized format. The results also indicated that the 6th graders outperformed the 4th graders in terms of assessment accuracy. Though age differences in SA were found in this study, the reasons behind the differences remained unclear.
In Mok’s (2010) study, four secondary students expressed serious concerns that
they were not good enough to evaluate their peers, even though they agreed PA
helped them reflect upon their own performances. Mok called for preparation of the
students both methodologically and psychologically for the role of peer assessor.
Hung, Chen, and Samuelson (under review) examined group PA of 4th to 6th grad-
ers’ oral performance in EFL classes in Taiwan. The results showed that the 5th and
6th graders were able to assess their peers much as their teacher did, whereas the 4th
graders were not. The majority of the students at all levels reported that they enjoyed playing the role of assessor and indicated that the process benefited their subsequent performance and English learning. However, the students, particularly the 4th graders, reported challenges in accepting diverse opinions and in conducting within-group discussions to evaluate their peers.
Though the related literature offers some preliminary findings on practicing PA and SA with young learners, the effect of combining the two remains uninvestigated and is therefore the main focus of this empirical study.
4 Research Method
This classroom-based research used both quantitative and qualitative data to examine
the assessment process as well as the opinions of the students and their teacher.
Chen worked collaboratively with two university researchers to plan and implement
student assessment procedures in her class. Hung observed all classes in which
student assessment was conducted. Samuelson assisted with research data analysis,
and her prior experience as an English teacher in southern Taiwan made her familiar with the educational context of the study.
The setting for this study was a public elementary school in southern Taiwan. The
school was established in 1996 to serve a new high socioeconomic status (SES)
suburban community. The total student population was about 800 students, divided
into 30 classes (grades 1–6). This school was regarded as a high performing school
where the teachers as well as the students had received awards for excellence from
the local government and the national Ministry of Education.
Approximately 90 % of the students were Taiwanese; 10 % were Hakka (an eth-
nic Chinese group comprising 15–20 % of Taiwan’s population) or immigrants from
provinces in Mainland China or other countries. When the study was conducted,
students were required to study English from 3rd grade in elementary school (age 9)
in accordance with the national policy. However, local educational policy promot-
ing English proficiency required all students at this school to start English courses
from the 2nd grade (age 8).
Chen held an MA degree in English teaching and had been teaching English at elementary school for 14 years. After attending a workshop on student assessment held by the Ministry of Education, she carried out the PA and SA activities in two 6th-grade classes (age 12). These intact classes were selected because they were taught by the same teacher. Sixty-nine students participated in the study, with three students excluded due to absences. Forty-two were female and twenty-seven were male. All of the students began learning English in the 2nd grade and received
two 40-min English classes every week. In addition to the formal English instruc-
tion in elementary school, 58 % of the students (N = 40) started to learn English
from tutors or in private institutes before entering elementary school, and an addi-
tional 16 % of them (N = 11) started in 1st grade. Approximately 96 % of the partici-
pants (N = 66) learned English out of class when this study was conducted.
Based on routine placement tests at the beginning of the semester and the students’
final English grades the previous semester, all 6th graders had been divided into
advanced, intermediate, and basic levels and separated into different classes. The
participants in the current study were assigned to advanced classes. For the purpose
of the study, the students were arranged in groups of six for PA. There were twelve
peer groups in the two classes, six groups in each class.
Training students to ensure they are aware of the objectives and procedures of the
assessment and understand evaluation criteria is the key to successful PA and SA
activities. Several important steps mentioned in the literature include clarifying the
purpose of the kind of assessment done and expectations of the students as asses-
sors; involving participants in developing assessment criteria; providing practice
and examples of student performance; providing written checklists or guidelines,
specifying activities and timescale; giving feedback; and examining the quality of
feedback (Oscarson, 1997; Topping, 2009). Accordingly, the researchers designed
the following procedure. The entire PA and SA procedure took seven weeks to complete for each class: two class periods per week and 40 min per class period.
After Chen taught the textbook content in each class, she set aside approximately
one third of the course time for the student assessment activity. Training took one
whole class period. The process writing activity took 3 weeks. Presentations took
3 weeks. Six to eight presentations were done per class.
Step 1. Introducing PA and SA
Chen informed students that PA and SA would be used to evaluate their oral
presentations. Students’ final grades would include peer, self- and teacher ratings.
The purpose and rationale of student assessment were introduced. Students were
told that evaluation should be decided from different perspectives, not only by their
teacher, but also by their fellow students. When they did PA, they were learning
English from others at the same time. They could reflect on their own performance
by rating others and themselves and improve their own future presentation.
Chen encouraged the students to take responsibility for the process and learn from
the assessing process. After she introduced PA and SA, students moved on to pre-
pare for their oral presentations.
Step 2. Preparing oral presentations
This class used the English textbook, Enjoy 10, issued by the local Bureau of
Education (Shen et al., 2001). The first unit covered the topic of traveling, and the students had just returned from their summer vacation, so Chen decided to use “My Summer Vacation” as the presentation topic. Since the English level of this group of students was still at a beginner stage, she guided the students to draft their presentation content via process writing. After the students composed draft 1 at home and submitted it to Chen, she indicated the parts that the students could elaborate on and taught them how to look up English words online. In the second draft, Chen underlined obvious language errors. In the final draft, she corrected language errors that the students could not revise by themselves. Figure 1 shows the final draft of one of the students. In the presentation, each student memorized the content and recited it in front of the class.
Step 3. Discussing evaluation criteria
Involving students in the development of evaluation criteria has been recom-
mended in the literature to help learners understand what constitutes a good presen-
tation and to develop a sense of ownership (Harris, 1997; Topping, 2009). Chen discussed the evaluation criteria with the whole class, and the class decided on the criteria together (see Fig. 2). The students agreed that the four criteria should be weighted differently. In Chen’s previous experience with student assessment, students tended to focus on their peers’ weaknesses instead of their strengths, so the comment section was divided into strengths and suggestions to lead the students to pay more attention to their peers’ strengths and to give feedback constructively. Finally, she discussed with the students what should be considered the standard for each criterion.
Fig. 1 A student’s final draft

My Summer Vacation
In my summer vacation, I went to day care center every day. On Saturday in July, the day care center took me and many other students out. We did some interesting things. We saw a movie, Despicable Me 2, played bowling, and then ate dinner. I enjoyed that movie. I had fun playing bowling. The dinner was great. I was very happy in my summer vacation.

Fig. 2 Evaluation rubric

Voice (6 points)
Content (6 points)
Interaction with audience (6 points)
Body language & facial expression (2 points)
Total (20 points)
Strength:
Suggestion:

Step 4. Presenting and evaluating
Right before the first presentations, the students reviewed the evaluation criteria again. After each presentation, the audience discussed their classmate’s performance within their groups and assessed their peer by deciding the grades as a group. Meanwhile, the presenting student sat apart and did a SA using the same rubric. Then the teacher and each student group gave oral feedback on the performance. The assessment of all presentations followed the same pattern. Since the students’ English abilities were still developing, the discussions within groups and in the whole class were conducted in their native language, Chinese.
Step 5. Reflecting
Chen calculated the final scores across groups and compiled all the comments
from each group. In the next class, she gave each group its results. She then led the
whole class in a reflective discussion on the assessment process.
In addition to peer, self-, and teacher ratings for each presentation, data included a
post-assessment survey filled out by the students and a teacher interview. The sur-
vey items and their Chinese translations were examined by Chen to establish the
content validity, based on the premise that a subject matter expert’s judgment of
whether a measure includes the appropriate content for the construct it aims to mea-
sure is an acceptable way to establish validity (Cohen & Swerdlik, 2005). Chinese versions of the questionnaire, along with a parental consent form, were given to the students. Only students who both completed the survey and returned the consent form were included (N = 69). The design of the five-point Likert-scale questionnaire, covering the ratings and the interactions between assessors and assessees as well as among team members, was framed by social learning theory (Bandura, 1971). In addition to
students’ demographic information, the items were constructed on the basis of three
functions of reinforcement in observational learning, including informational func-
tion (Items 1–7), motivational function (Items 8–11), and cognitive function (Items
12–16). One open-ended question elicited the students’ general reflection on this
process (see Table 1).
Paired-samples t-tests were used to compare the mean scores of the peer, self-, and teacher ratings, to reveal whether the students’ perceptions of their performance accorded with their teacher’s (Isaac & Michael, 1995). Correlations were used to analyze the agreement between peer, self-, and teacher ratings on total scores and on each evaluation criterion. Agreement was confirmed if the peer or self-ratings lay within one standard deviation of the teacher’s ratings (Kwan & Leung, 1996). The maximum and minimum scores of the PA, SA, and teacher assessment were also compared to examine the range of the ratings.
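As a rough illustration of these three checks, the sketch below applies them to fabricated rating vectors; the scores (on the study’s 20-point scale) are invented for illustration, and only the paired t-test, the Pearson correlation, and the one-standard-deviation rule mirror the procedures described above.

    import numpy as np
    from scipy import stats

    # Hypothetical total scores (20-point scale) for eight presenters.
    teacher = np.array([16, 18, 14, 20, 15, 17, 13, 19])
    peer = np.array([15, 17, 15, 19, 14, 18, 12, 18])
    self_ = np.array([14, 18, 13, 20, 13, 16, 11, 19])

    for label, ratings in [("peer", peer), ("self", self_)]:
        # Paired-samples t-test: does the mean differ from the teacher's?
        t, p = stats.ttest_rel(ratings, teacher)
        # Pearson correlation: do the ratings covary with the teacher's?
        r, p_r = stats.pearsonr(ratings, teacher)
        # Kwan and Leung (1996): agreement if the mean difference lies
        # within one standard deviation of the teacher ratings.
        agrees = abs(ratings.mean() - teacher.mean()) < teacher.std(ddof=1)
        print(f"{label}: t = {t:.2f} (p = {p:.3f}), r = {r:.2f}, agreement = {agrees}")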
4.6.2 Questionnaires
Descriptive analysis was used to tabulate numbers, percentages, and mean scores of
the results of the questionnaires. Cronbach’s alpha coefficient for the 16 items is
.873, suggesting high reliability of the questionnaire.
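Cronbach’s alpha can be computed directly from the respondents-by-items score matrix. The sketch below uses a fabricated 69 × 16 matrix of five-point responses purely to show the calculation; it is not the study’s data.

    import numpy as np

    def cronbach_alpha(scores):
        # alpha = k/(k-1) * (1 - sum of item variances / variance of totals),
        # where k is the number of items.
        k = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)
        total_variance = scores.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_variances.sum() / total_variance)

    rng = np.random.default_rng(0)
    trait = rng.integers(2, 6, size=(69, 1))       # a common tendency per student
    noise = rng.integers(-1, 2, size=(69, 16))     # item-level variation
    responses = np.clip(trait + noise, 1, 5)       # 69 students x 16 Likert items
    print(round(cronbach_alpha(responses), 3))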
Students’ responses to the open-ended question in the survey and the teacher inter-
view were coded using the three functions of reinforcement of observational learn-
ing given above. Hung and Chen coded all the data independently. A Kappa measure
of the two raters’ coding was greater than 0.85, indicating acceptable inter-rater
reliability (Landis & Koch, 1977). Agreement on each coding was reached through
discussion.
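Cohen’s kappa corrects raw agreement for chance agreement. A minimal sketch follows, with invented labels standing in for the two coders’ actual codes:

    from sklearn.metrics import cohen_kappa_score

    # Invented labels for eight student comments, one list per coder,
    # using the three functions of reinforcement as categories.
    coder_1 = ["informative", "cognitive", "motivational", "informative",
               "cognitive", "informative", "motivational", "cognitive"]
    coder_2 = ["informative", "cognitive", "motivational", "informative",
               "informative", "informative", "motivational", "cognitive"]

    kappa = cohen_kappa_score(coder_1, coder_2)
    print(f"kappa = {kappa:.2f}")
    # By the Landis and Koch (1977) benchmarks, values above .80 indicate
    # almost perfect agreement.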
5 Results
We present our findings in terms of each of the research questions given at the
beginning of this article. The peer, self-, and teacher ratings are used to show the
correlations between their evaluations, and the student survey and teacher interview
are used to delineate their perceptions.
The analyses of PA, SA, and teacher assessment reveal peer, self-, and teacher rat-
ings were correlated to a certain extent in the present study. Over-marking, under-
marking, and range restriction, which appeared in previous studies of PA or SA, did
not exist in this study. As Table 2 shows, the ranges of the peer and self-ratings are 9–20 and 7–20, respectively, whereas the range of the teacher ratings is 12–20. The ranges of both the peer and self-ratings are larger than that of the teacher ratings. The mean differences
between peer and teacher ratings and between self- and teacher ratings lay within
one standard deviation of the teacher ratings, which indicates agreement between
peer and teacher ratings as well as self- and teacher ratings (Kwan & Leung, 1996).
Though the mean scores of peer- and self-ratings are slightly lower than the mean
score of the teacher ratings, paired sample t tests reveal no significant differences
between peer and teacher ratings (p > .05) and between self- and teacher ratings
(p > .05). As displayed in Table 3, the Pearson correlation coefficient between peer
and teacher ratings is .73 (p < .01), while the correlation coefficient between self-
and teacher ratings is .48 (p < .01). A correlation of 0.5 is large, 0.3 is moderate, and
0.1 is small (Cohen, 1988). The results show that PA and teacher assessment had a
strong positive correlation, whereas the correlation between SA and teacher assess-
ment was moderate and positive. Both correlations were significant.
In the interview, the teacher stated she had noticed the difference between PA
and SA. She speculated that some students might have over-marked themselves because they were more aware of their effort and took it into account. Chen thought that the students’ self-assessment of their effort was a good supplement to other assessments, since it was difficult for the instructor to evaluate the students’ preparation process. As she stated in the interview,
When a student rated their peers’ performance, he watched the performance of the student
critically. When the presenter evaluated himself, he thought ‘How much effort did I put into
this? How was my performance from my point of view?’ He evaluated his own performance
from his own perspective, not from the perspective of an outsider. I could compare the dif-
ferences of the evaluations from two perspectives. (Teacher Interview)
Table 3 also shows the correlations between peer, self-, and teacher ratings for each evaluation criterion. Though all of the criteria are positively correlated across peer, self-, and teacher ratings, slight differences exist in the correlations between
PA and teacher assessment. For the criteria of voice and interaction with audience,
PA and teacher assessment are strongly correlated (r = .76 and r = .60); in contrast,
the correlations of content and body language and facial expression are relatively
weak (r = .44 and r = .40). The criteria of voice and interaction with audience are
probably easier to observe and evaluate. For the content, the students might not have
comprehended their peers’ presentation completely or they might have had different
standards from the teacher. The total number of points for the criterion of body
language and facial expression was only 2, which may also help to explain the weak
correlation.
The students clearly recognized what they had learned from the assessing activity.
Approximately 95 % of the students strongly agreed or somewhat agreed that they
paid attention to their peers’ presentation, learned some English because of it,
learned how to do a presentation, and gave and got feedback to improve themselves
(see Table 4). As one student stated in the survey,
This was a great activity! By rating our classmates’ presentations, we gave ratings, and we
also learned to accept others’ opinions. When others evaluated us, they gave us some sug-
gestions. Their suggestions made us understand our strengths and weaknesses. We could
reflect on our presentations and think how to improve ourselves. It also let us experience
doing a presentation in front of others. We improved our performance on the stage. We
learned more and more broadly, not just limited to the content of the textbook. (Student 7)
Three of the 69 students reported that they did not learn any English from doing
the PA (Item 2), but that they did learn to give suggestions (Item 6). Since these
students only experienced this type of student assessing activity once, they might
need practice doing PA and SA before they would be able to identify the long-term
improvement in their English abilities. Also, giving concrete suggestions is rela-
tively more difficult than giving ratings and therefore needs more guidance.
The majority of the students enjoyed being empowered as assessors, and they therefore tried to fulfill the responsibilities of assessors and to learn to be fair. In Item
8 and Item 9, the students reported they liked the assessing activity and they were
able to assess their peers objectively (see Table 5). They knew they were playing the
role of a teacher.
When doing peer assessment, I felt like a judge because I could evaluate my classmates.
(Student 27)
I think peer assessment has to be fair and just. We can’t favor a particular classmate
because he is a friend. Peer assessment is also a process to test whether I can give ratings in
the stance of a teacher, so I think this is a very good activity. (Student 40)
As Chen mentioned above, she thought most students could assess their peers and themselves objectively, whereas only a few could not. In Table 5, five students reported that they could not assess themselves objectively (Item 10), and three students disagreed with the statement that their peers had assessed them objectively (Item 11). One student doubted the fairness of PA and said their group played it safe by giving a restricted range of ratings to all of the presenters:
I don’t oppose this activity, but honestly a little more than half of the class didn’t take giving
ratings seriously. It was always the same students [in the group] doing ratings. Some of the
students couldn’t get the standard, just like our group. We were terrible in assessing. We
gave two thirds of our classmates 16 [out of 20 possible points]. Once the teacher said one
presenter was good, they changed the rating to 18. Also, friends and enemies influenced
ratings more or less (I am not sure whether my class has this problem or not). (Student 13)
Through discussion, the students learned how to accept diverse opinions and to
work together to decide on a rating as a group.
When we gave ratings through group discussion, we learned not to raise or lower the stan-
dard because of particular people. (Student 50)
Sometimes everyone had different opinions. After discussion, we could give a rating
that everyone was satisfied with. (Student 31)
However, some students did not learn how to participate in and conduct an effective group discussion. A few students reported that not every member was given a chance to express their opinions and that some members did not accept each other’s opinions (Items 13–15) (see Table 6). As Student 38 said, “Some people didn’t respect others’ opinions. They didn’t learn how to work well with each other.”
6 Discussion
The finding of a strong positive correlation between PA and teacher assessments, together with the finding of a moderate positive correlation between SA and teacher assessments, implies that PA has a positive impact on SA. This is similar to what was suggested by Topping and Ehly (2001). When PA and SA were combined, the challenges that had appeared in either PA or SA alone in previous studies were overcome. Contrary to previous arguments that young learners are not able to evaluate themselves fairly due to age-related issues of under-developed cognition and wishful thinking (Ross, 2006), this group of learners demonstrated that they were able to conduct PA and SA as their teacher did, at least to a moderate extent. The problems of over-marking and under-marking were minimized, as Dochy et al. (1999) argued they would be, though subjective issues still appeared in a few SA cases and therefore should be addressed and eliminated in training.
As suggested in social cognitive theory, learning is regulated by interaction
between external influence and self-directedness (Bandura, 1991). The integration
of group PA and SA serves informative, motivational, and cognitive functions to
reinforce students’ learning to assess and assessing to learn (Bandura, 1991). For
the informative function, the reflecting experience was amplified and had a positive
impact on students in terms of being an assessor as well as a language learner. In
this context, which combined both PA and SA, the students observed their peers’ performance from the perspective of an outsider, whereas they scrutinized their own performance from the viewpoint of an insider. The process of comparing, contrasting, and cross-checking the perceptions of an outsider, an insider, and other outsiders crystallized the standard of each evaluation criterion for the students, who therefore
benefited from the experience and developed the abilities to be assessors in both PA
and SA. Meanwhile, attending to and reflecting on their peers’ as well as their own
Firstly, although the results of this study shed light on the benefits of combining PA and SA, future experimental research could compare the practice of both forms of assessment with each individual form. Secondly, the validity of SA appeared to be lower than that of PA; thus, how to help learners self-assess their performance needs to be investigated. Finally, differences between group and individual implementation of PA and SA could be examined to help future practitioners successfully implement various approaches to PA and SA in their classrooms.
References
Bandura, A. (1971). Social learning theory. New York: General Learning Press.
Bandura, A. (1991). Social cognitive theory of self-regulation. Organizational Behavior and
Human Decision Processes, 50, 248–287.
Birjandi, P., & Tamjid, N. H. (2012). The role of self-, peer and teacher assessment in promoting
Iranian EFL learners’ writing performance. Assessment and Evaluation in Higher Education,
37, 513–533.
Blanche, P., & Merino, B. J. (1989). Self-assessment of foreign-language skills: Implications for
teachers and researchers. Language Learning, 39, 313–338. doi:10.1111/j.1467-1770.1989.
tb00595.x.
Boud, D. (1995). Enhancing learning through self-assessment. London: Kogan Page.
Butler, R. (1990). The effects of mastery and competitive conditions on self-assessment at differ-
ent ages. Child Development, 61, 201–210. doi:10.1111/1467-8624.ep9102040554.
Butler, Y. G. (2016). Self-assessment of and for young learners’ foreign language learning. In
M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives.
New York: Springer.
Butler, Y. G., & Lee, J. (2006). On-task versus off-task self-assessments among Korean elementary
school students studying English. The Modern Language Journal, 90, 506–518.
doi:10.1111/j.1540-4781.2006.00463.x.
Butler, Y. G., & Lee, J. (2010). The effects of self-assessment among young learners of English.
Language Testing, 27, 5–31. doi:10.1177/0265532209346370.
Chen, Y.-M. (2008). Learning to self-assess oral performance in English: A longitudinal case study.
Language Teaching Research, 12, 235–262. doi:10.1177/1362168807086293.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to
tests and measurement. Boston: McGraw Hill.
De Grez, L., Valcke, M., & Roozen, I. (2012). How effective are self- and peer assessment of oral
presentation skills compared with teachers’ assessments? Active Learning in Higher Education,
13, 129–142. doi:10.1177/1469787412441284.
Dochy, F., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer and co-assessment in higher
education: A review. Studies in Higher Education, 24, 331–350. doi:10.1080/0307507991233
1379935.
Esfandiari, R., & Myford, C. M. (2013). Severity differences among self-assessors, peer-assessors,
and teacher assessors rating EFL essays. Assessing Writing, 18, 111–131. doi:10.1016/j.
asw.2012.12.002.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis
comparing peer and teacher marks. Review of Educational Research, 70, 287–322.
Fallows, S., & Chandramohan, B. (2001). Multiple approaches to assessment: Reflections on use
of tutor, peer and self-assessment. Teaching in Higher Education, 6, 229–246.
doi:10.1080/13562510120045212.
Harris, M. (1997). Self-assessment of language learning in formal settings. ELT Journal, 51,
12–20. doi:10.1093/elt/51.1.12.
Hung, Y.-J., Chen, S.-C., & Samuelson, B. L. (under review). Peer assessment of oral English
performance in a Taiwanese elementary school.
Isaac, S., & Michael, W. (1995). Handbook in research and evaluation for education and the
behavioral sciences (3rd ed.). San Diego, CA: Educational and Industrial Testing Services.
Kwan, K.-P., & Leung, R. (1996). Tutor versus peer group assessment of student performance in a
simulation training exercise. Assessment and Evaluation in Higher Education, 21, 205–214.
Landis, J. R., & Koch, G. D. (1977). The measurement of observer agreement for categorical data.
Biometrics, 33, 159–174.
Langan, A. M., Shuker, D. M., Cullen, W. R., Penney, D., Preziosi, R. F., & Wheater, C. P. (2008).
Relationships between student characteristics and self‐, peer and tutor evaluations of oral pre-
sentations. Assessment and Evaluation in Higher Education, 33, 179–190.
doi:10.1080/02602930701292498.
Lanning, S. K., Brickhouse, T. H., Gunsolley, J. C., Ranson, S. L., & Willett, R. M. (2011).
Communication skills instruction: An analysis of self, peer-group, student instructors and fac-
ulty assessment. Patient Education and Counseling, 83, 145–151. doi:10.1016/j.
pec.2010.06.024.
Mok, J. (2010). A case study of students’ perceptions of peer assessment in Hong Kong. ELT
Journal, 65, 230–239. doi:10.1093/elt/ccq062.
Murakami, C., Valvona, C., & Broudy, D. (2012). Turning apathy into activeness in oral commu-
nication classes: Regular self- and peer-assessment in a TBLT programme. System, 40, 407–
420. doi:10.1016/j.system.2012.07.003.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organi-
zational, and goal-based perspectives. Thousand Oaks, CA: Sage.
Orsmond, P., Merry, S., & Reiling, K. (2002). The use of exemplars and formative feedback when
using student derived marking criteria in peer and self-assessment. Assessment and Evaluation
in Higher Education, 27, 309–323.
Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham
& D. Corson (Eds.), Encyclopedia of language and education (Vol. 7, pp. 175–187). Dordrecht,
The Netherlands: Kluwer Academic Publishers.
Panadero, E., Romero, M., & Strijbos, J.-W. (2013). The impact of a rubric and friendship on peer
assessment: Effects on construct validity, performance, and perceptions of fairness and com-
fort. Studies in Educational Evaluation, 39, 195–203. doi:10.1016/j.stueduc.2013.10.005.
Pond, K., Ul-Haq, R., & Wade, W. (1995). Peer review: A precursor to peer assessment. Innovation
in Education and Training International, 32, 314–323.
Ross, J. A. (2006). The reliability, validity, and utility of self-assessment. Practical Assessment,
Research and Evaluation, 11, 1–13.
Ross, J. A., Hogaboam-Gray, A., & Rolheiser, C. (2002). Student self-evaluation in grade 5–6
mathematics effects on problem-solving achievement. Educational Assessment, 8, 43–59.
Sadler, D. R. (1998). Formative assessment: Revisiting the territory. Assessment in Education:
Principles, Policy and Practice, 5, 77–85.
Schunk, D. H. (2001). Social cognitive theory and self-regulated learning. In B. J. Zimmerman &
D. H. Schunk (Eds.), Self-regulated learning and academic achievement (2nd ed.). Mahwah,
NJ: Lawrence Erlbaum Associates, Inc.
Shen, C., Lin, F., Xu, Y., Guo, W., Guo, F., Chen, M., …, & Liu, S. (2001). Enjoy 10. Tainan,
Taiwan: Bureau of Education of Tainan City Government.
Topping, K. J. (1998). Peer assessment between students in colleges and universities. Review of
Educational Research, 68, 249–276.
Topping, K. J. (2009). Peer assessment. Theory Into Practice, 48, 20–27.
Topping, K. J. (2010). Methodological quandaries in studying process and outcomes in peer
assessment. Learning and Instruction, 20, 339–343.
Topping, K. J., & Ehly, S. W. (2001). Peer assisted learning: A framework for consultation. Journal
of Educational and Psychological Consultation, 12, 113–132.
van Zundert, M., Sluijsmans, D., & van Merriënboer, J. (2010). Effective peer assessment pro-
cesses: Research findings and future directions. Learning and Instruction, 20, 270–279.
Wells, G., & Wells, J. (1984). Learning to talk and talking to learn. Theory Into Practice, 23,
190–197.
Ye, X. (2001). Alternative assessment. In Y. Shi (Ed.), English teaching and assessment in primary
and middle schools (pp. 42–73). Taipei, Taiwan: Ministry of Education.