Rapid Online Assessment of Reading and Phonological Awareness (ROAR-PA)

Liesbeth Gijbels1,2, Amy Burkhardt 3,4, Wanjing Ma4, Jason D. Yeatman3,4,5

1. Department of Speech & Hearing Sciences, University of Washington, Seattle, WA, USA
2. Institute for Learning & Brain Sciences, University of Washington, Seattle, WA, USA
3. Division of Developmental-Behavioral Pediatrics, Stanford University School of Medicine,
Stanford, CA, USA
4. Stanford University Graduate School of Education, Stanford, CA, USA.
5. Stanford University Department of Psychology, Stanford, CA, USA.

1. Abstract
Phonological awareness (PA) is at the foundation of reading development: PA is introduced
before formal reading instruction, has predictive value for later reading abilities, is a primary
target for early intervention, and is considered one of the core mechanisms in developmental
dyslexia. Conventional approaches to assessing PA are time-consuming and resource intensive:
assessments must be individually administered, require expertise, and scoring verbal responses
is challenging and subjective. Therefore, we introduce a rapid, automated, online measure of PA
— The Rapid Online Assessment of Reading - Phonological Awareness (ROAR-PA) — that can
be widely implemented in classrooms and research studies without a test administrator. We
explored whether this gamified, online task, that relies on touchscreen/click responses, can
serve as an accurate and reliable measure of PA and as a good predictor of reading
development. We found that ROAR-PA is well correlated with standardized measures of PA
(CTOPP-2, r = .80) and reading (Woodcock-Johnson, r = .50), for children from Pre-K through
fourth grade and achieves exceptional reliability (𝜶 = .96) in a 12-minute automated, online
assessment. Furthermore, validation in 50 first and second grade classrooms shows reliable implementation in large, public school classrooms, with predictive value for future reading development.
2. Introduction
Literacy is critical to educational success. Individual differences in reading skills are predictive of
educational outcomes throughout schooling and have been linked to socioeconomic and health
disparities (Janus & Duku, 2007; Noble et al., 2006). Although the focus on learning to read
begins in kindergarten, there are large individual differences in reading abilities that persist
throughout adulthood. Understanding the mechanisms underlying these individual differences is
a continuing topic of debate (Goswami, 2015; Joo et al., 2018; Pennington et al., 2012; Ramus,
2003; Spiro & Myers, 1984; Stanovich, 2017; Vidyasagar & Pammer, 2010). However, there is a
near consensus on the importance of foundational skills like phonological awareness for reading
development (Scarborough, 1998; Snowling, 1998; Vellutino et al., 2004; Wagner & Torgesen,
1987). Phonological awareness (PA), or phonological processing more broadly, refers to an
individual’s ability to reflect upon and manipulate the sound structure of spoken language, at the
level of (1) words and syllables, (2) onset and rimes, or (3) phonemes (Stanovich, 2017;
Treiman & Zukowski, 1991). PA skills emerge early, develop rapidly throughout childhood
(Nittrouer et al., 1989), and have important implications for literacy achievement (Moyle et al.,
2013). To learn how to read a child needs to be aware of the arbitrary and conventional
correspondence between the sound structure of language and the rules for how it is written
(Anthony & Francis, 2005).

An important question that remains regarding the relationship between PA and reading
development is whether some phonological skills are more closely related to the development of
reading than others (i.e., syllabic awareness: Ziegler & Goswami, 2005; intrasyllabic awareness:
Gough et al., 1992; phonemic awareness: Boyer & Ehri, 2011; and prosodic awareness:
Wade-Woolley, 2016), or whether all aspects of PA reflect the same latent construct (Anthony &
Lonigan, 2004; Papadopoulos et al., 2009). This question has practical implications for reading
instruction regarding the importance of distinguishing between different aspects of PA in relation
to learning to read (for a review: Melby-Lervåg et al., 2012). In other words, one might ask how
much information about an individual’s PA skills is necessary in order to tailor a personalized
reading curriculum. Individual differences in PA are predictive of reading development (Anthony
& Francis, 2005; Bradley & Bryant, 1983; Wagner & Torgesen, 1987), and training PA early on is
useful for establishing a solid foundation for learning to read (e.g., Bentin & Leshem, 1993;
Brady et al., 1994; Shanahan, 2005), but it is also clear that PA is only one of a constellation of
skills that contribute to ongoing differences in reading development (Bus & Van Ijzendoorn,
1999). For example, even though PA is a useful target for early intervention (e.g., Karami et al.,

1
2013; Schneider et al., 1999), it is also clear that PA is not the only important dimension of
variability. How PA should be considered in multifactorial models of reading development is still
an active area of research (Catts et al., 2017; Catts & Petscher, 2022; Compton, 2021; Kevan & Pammer, 2008;
O’Brien & Yeatman, 2021; Pennington, 2006; Pennington et al., 2012; Vidyasagar & Pammer,
2010; Zuk et al., 2021).

One of the difficulties in resolving the constellation of factors that interact to confer risk for
reading difficulties such as dyslexia is that many findings represent small samples, in unique lab
situations, and therefore are not necessarily generalizable, replicable, or representative of the
general population (Munafò et al., 2017; Yarkoni, 2022). A major barrier to scaling research is personnel requirements, such as the need for trained test administrators. This problem is
exacerbated for multifactorial models where a researcher wants to collect measures of PA
alongside a battery of other measures in a large and diverse sample. The most well-known and
standardized PA assessments (e.g., PAT-2:NU; Robertson & Salter, 1997, CTOPP-2; Wagner,
Torgesen & Pearson, 1999, PPA; Williams, 2014) require a test administrator to read
instructions and interpret responses. Administering these PA assessments requires a high level
of expertise, and participant responses are often difficult and subjective to score. Moreover,
development of the PA skills that are measured by these verbal tasks often goes hand in hand
with development of articulatory skills, and confidence in expressive language more generally,
leading to additional sources of variability in scoring (Peeters et al., 2009). Finally, PA tests are
often long and tedious, and different subtests have variable instructions. This necessitates
multiple training items for each subtest, making testing complicated and taxing for the limited attention spans of young children.

Online experiments have grown in popularity as an alternative to one-on-one testing because they: a) increase the efficiency of collecting larger, more diverse and representative samples, b)
reduce experimenter effects/biases, and c) can be short, gamified and engaging for young
children (Long et al., 2023; Nussenbaum et al., 2020; Scott & Schulz., 2017; Yeatman et al.,
2021). Although implementations of online tests are still limited in developmental and
educational research, first steps are being taken in reading and PA research. A first example is
EarlyBird Education (Gaab & Petscher, 2021). This self-administered, tablet-based game
generates a literacy profile for early elementary school students. EarlyBird is now used as a
screener for dyslexia (and reading difficulties more broadly). A second example is the Access to
Literacy Assessment System–Phonological Awareness (ATLAS-PA; Skibbe et al., 2020), a
computer adaptive measure of PA designed for children with speech and/or language impairments that is administered using an online platform. Finally, our previous work introduced
the Rapid Online Assessment of Reading Ability (ROAR; https://roar.stanford.edu; Yeatman et
al., 2021). This is a self-administered, lightly gamified assessment of single word recognition
that overcomes the constraints of resource intensive, in-person reading assessment, and
provides an efficient and automated tool for online research into the mechanisms of reading
(dis)ability. The ROAR is a short (5-10 min), age-appropriate, lexical decision task that
correlates highly with in-person standardized measures of reading ability like the
Woodcock-Johnson Letter-Word Identification test (r = .91; Woodcock et al., 2007). The success
of this accurate, reliable (Cronbach’s 𝜶 = 0.98), expedient and automated online measure of
single word reading ability led to the present work on the development of the Rapid Online
Assessment of Reading - Phonological Awareness (ROAR-PA).

Our goals were to 1) develop an efficient, open-source, online task that measures PA skills
without verbal responses, 2) validate this measure against widely used in-person PA tests like
CTOPP-2 (Wagner, Torgesen & Pearson, 1999) and 3) assess the utility of this measure in a
classroom setting. In contrast to other measures of PA, ROAR-PA was designed as an engaging
game for young children such that it would not require a test administrator. Items are narrated
by animated characters, and the participant selects responses with a touch screen (or mouse),
meaning that scoring is completely automated. The intention of implementing a PA assessment
to run in the web-browser was to measure phonological processing skills efficiently and
accurately across a broad age range — kindergarten through 8th grade. Specifically, we were
interested to see a) whether a receptive (prerecorded 1-interval 3-alternative-forced-choice;
1I-3AFC) online task could accurately capture comparable PA skills to the CTOPP-2 which
requires verbal responses, b) whether different subtests measure unique latent traits of PA, c)
whether the test is developmentally appropriate from kindergarten up to 8th grade and d)
whether it is appropriate for large scale assessment in schools and predictive of future reading
skills. In Part 1, we assess the feasibility of using an online PA measure for five different
subtests: First Sound Matching (FSM), Last Sound Matching (LSM), Rhyming (RHY), Blending
(BLE) and Deletion (DEL) (see the Methods section for more details). Furthermore, we assess the
ideal age range for this task. In Part 2, Factor Analysis is used to examine the coherence of the
subtests in measuring a latent PA construct. Based on item response theory (IRT) we then
select suitable items spanning different difficulty levels for an efficient assessment. In Part 3 we
look at the predictive value of our task for reading performance and develop an automated
score report. And, finally, in Part 4 we test the predictive validity of these innovations by implementing ROAR-PA in 50 first and second-grade classrooms and examining sensitivity and
specificity of this measure for risk classifications based on individually administered reading
assessments.

3. Results

Part 1: Feasibility of online PA measure, subtest and age selection

The initial version of ROAR-PA consisted of 5 subtests with 1I-3AFC responses. Three subtests (FSM, LSM and RHY) each consisted of 25 items spanning 2 difficulty levels. The remaining two subtests (BLE and DEL) each had 24 items, spanning 3 difficulty levels (for more details see
Methods section). Initially, 143 participants (Age: 3.87 - 13.00, μ = 7.13, 𝝈 = 1.89; Sex: 67 F, 76
M) completed the full task (5 subtests). Participants completed 3 training trials per subtest and
then all items. Subtests were administered in a fixed order, but the items were presented
randomly within each difficulty level. The majority of these children (N = 110) were also
administered the CTOPP-2, a standardized PA assessment, by a trained researcher
(speech-language pathologist) via a video conferencing platform.

Validation of ROAR-PA composite score


To validate the feasibility of a web-browser based PA task that only requires clicks/touchscreen
responses, we performed a correlation analysis between performance on the ROAR-PA and the
well-established standardized CTOPP-2. The results (Fig. 1, left panel) revealed strong
correlations between the CTOPP-2 and all ROAR-PA subtests: LSM (r = .65), DEL (r = .62),
FSM (r = .61), RHY (r = .60), and BLE (r = .55). The correlations among the ROAR-PA subtests themselves (.47 ≤ r ≤ .65) were slightly lower and more variable than we would have expected based on the correlations among the CTOPP-2 subtests in this dataset (.70 ≤ r ≤ .73). This could
indicate that the ROAR-PA subtests tap into different latent constructs, or it could reflect the
reliability of each subtest. Each subtest of the ROAR-PA, except for Blending, shows high
internal consistency based on Cronbach’s α (LSM: α = .92, CI95 = [.89 ; .93], FSM: α = .90, CI95
= [.87 ; .93], RHY: α = .86, CI95 = [.81 ; .89], DEL: α = .84, CI95 = [.77 ; .88], BLE: α = .70, CI95 =
[.57 ; .78]) and the composite scores of both the CTOPP-2 (α = .88, CI95 = [.85 ; .91]) and
ROAR-PA (α = .85, CI95 = [.80 ; .89]) have good (0.8 ≤ α < 0.9) internal consistency.
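
As an illustration of this analysis step, a minimal R sketch (not the released analysis code; scores and lsm_items are placeholder names for a subtest-score data frame and a 0/1 item-response matrix) could compute these statistics with the psych package:

    # Minimal sketch, assuming `scores` holds per-participant subtest totals
    # plus a CTOPP-2 composite, and `lsm_items` is a 0/1 item-response matrix.
    library(psych)

    # Pearson correlations between each ROAR-PA subtest and the CTOPP-2
    cor(scores[, c("FSM", "LSM", "RHY", "BLE", "DEL")], scores$CTOPP2,
        use = "pairwise.complete.obs")

    # Cronbach's alpha with a bootstrapped confidence interval for one subtest
    alpha_lsm <- psych::alpha(lsm_items, n.iter = 1000)
    alpha_lsm$total$raw_alpha
    alpha_lsm$boot.ci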

Next we sought to create a ROAR-PA composite score that best approximated the CTOPP-2
composite index. Looking at the relationship between ROAR-PA and CTOPP-2, we found the
highest correlation when a composite was created by summing the scores on all 5 ROAR-PA
subtests (r = .76). However, a composite score based on 3 (FSM, LSM, DEL) or 4 ROAR-PA
subtests (FSM, LSM, RHY, DEL) was equally correlated with CTOPP-2 (r = .75). The 3-subtest
composite and 4-subtest composite both achieved good reliability as well: Cronbach’s alpha of
α3_subtests = .78, CI95 = [.67 ; .84] and α4_subtests = .84, CI95 = [.77 ; .88] respectively. A correlation of
r = .75 indicates the feasibility of the ROAR-PA as an automated, online PA measure and
suggests three subtests (FSM, LSM, DEL) are sufficient to obtain a useful PA measure. This
inference is further supported by an item analysis examining which items are not predictive of
overall performance. We calculated the correlations between the item responses of ROAR-PA
for each of the 123 test items and CTOPP-2 scores. Performance on items from the LSM subtest was especially highly correlated with overall ROAR-PA performance and CTOPP-2
performance (Fig. 1, right panel). This suggests LSM items are most informative about overall
PA abilities. Items from the BLE subtest were least informative: the correlation between most
blending items and ROAR-PA total score and CTOPP-2 total score was close to zero.
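
A minimal sketch of this item screening, assuming resp is a participants-by-items matrix of 0/1 responses and ctopp a vector of CTOPP-2 raw scores (both names are illustrative; a corrected item-total is used here):

    # Total score per participant
    total <- rowSums(resp, na.rm = TRUE)

    # Point-biserial correlation of each item with the corrected total
    # (item removed) and with the CTOPP-2 raw score
    item_stats <- data.frame(
      item    = colnames(resp),
      r_total = apply(resp, 2, function(x) cor(x, total - x, use = "complete.obs")),
      r_ctopp = apply(resp, 2, function(x) cor(x, ctopp, use = "complete.obs"))
    )

    # Items at or below the r = .10 threshold on either criterion are dropped
    flagged <- subset(item_stats, r_total <= .10 | r_ctopp <= .10)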

Figure 1: Left: A Pearson correlation matrix between the ROAR-PA subtests (% correct), CTOPP-2
subtests (raw scores), and overall (raw) scores on both tasks. Orange box: correlation coefficients
between ROAR-PA and overall CTOPP-2, Blue Box: correlation coefficients between ROAR-PA and
subtests of CTOPP-2, Black boxes: correlation coefficients within subtests of ROAR-PA and subtests of
CTOPP-2. Right: Item Correlation Analysis: For each stimulus, we plot the point-biserial correlation between performance on the item and ROAR-PA accuracy (x-axis) as well as the correlation between
performance on the item and CTOPP-2 raw score (y-axis). Items with low correlations (threshold r ≤ .10;
dotted red line) with overall test performance or CTOPP-2 performance were removed from the test.

Age selection
After selecting 3 subtests that make an efficient and reliable ROAR-PA composite score, we
then collected ROAR-PA data for an additional group of 127 participants (including mostly older
children) resulting in a total of 270 participants (Age: 3.87 – 14.92, μ= 9.12, 𝝈 = 2.71; Sex: 125
F, 145 M) completing ROAR-PA FSM, LSM and DEL subtests. 266 of these participants were
also administered the CTOPP-2 PA assessment. The Pearson correlation analysis with the
CTOPP-2 for this extended group of participants resulted in an overall correlation between the
CTOPP-2 and ROAR-PA composite (3 subtests) of r = .70 (as opposed to r = .75 in the initial
sample of participants). The correlation between the CTOPP-2 and the individual subtests also
went down for FSM and LSM (Fig 2. Left top). The decrease in correlation likely reflected ceiling
effects in older participants (Fig.2, right top). To examine the effect of age on the correlation
between ROAR-PA and CTOPP-2, we split our sample into 3 different age bins (3.87-6.99 years
old (N=71), 7.00-9.99 years old (N=91), 10.00-14.92 years old (N=106)). We found a correlation
coefficient between the composite scores of ROAR-PA and CTOPP-2 of r = .79 (CI95 = [.68 ;
.86], Cronbach’s 𝜶 = .88 ) for the youngest group, r = .69 (CI95 = [.56 ; .78], Cronbach’s 𝜶 = .79)
for the middle group, and r = .31 (CI95 = [.13 ; .48], Cronbach’s 𝜶 = .65) for the oldest group of
children. Further analysis (correlational measures and Rasch analysis) indicated an ideal age range of up to 9.50 years old for the ROAR-PA (Fig. 2, bottom left and right), leading to a
correlation of r = .80 (CI95 = [.73 ; .85]) between the ROAR-PA composite and CTOPP-2, and an
increase of the correlations for individual subtests (FSM, LSM, DEL) to the CTOPP-2. For this
age range, ROAR-PA composite showed good reliability, with a Cronbach’s 𝜶 of .80 (CI95 = [.73 ;
.86]). This indicates that the ROAR-PA in its current form is predictive of PA skills for children in
(pre-) kindergarten through fourth grade (Fig 2., bottom right) but has ceiling effects above
fourth grade. Interestingly, the correlation analysis in our sample shows a similar effect for the
CTOPP-2 scores, indicating both PA tasks (ROAR-PA and CTOPP-2) are most suited for
younger children.
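
The age-bin analysis above can be reproduced along the following lines (a sketch, assuming a data frame d with columns age, roarpa, and ctopp; these names are placeholders):

    # Bins matching those reported above: [3.87, 7), [7, 10), [10, 14.92]
    d$age_bin <- cut(d$age, breaks = c(3.87, 7, 10, 14.92),
                     labels = c("young", "middle", "old"),
                     right = FALSE, include.lowest = TRUE)

    # Pearson correlation (with 95% CI) between composites within each bin
    by(d, d$age_bin, function(g) cor.test(g$roarpa, g$ctopp))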

Figure 2: Left: Pearson correlation matrix between the ROAR-PA subtests (% correct) and between the
CTOPP-2 subtests (raw scores) and overall scores on both tasks for all children (N=233) age 3.87 –
14.92 years (top) and for children (N=145) age 3.87 – 9.50 years (bottom). Orange box: correlation
coefficients between ROAR-PA and overall CTOPP-2, Black boxes: correlation boxes within subtests of
ROAR-PA and subtests of CTOPP-2. Right top: Pearson correlation plot between the ROAR-PA (%
correct) and between the CTOPP-2 (raw scores), for all children (N=233) age 3.87 – 14.92 years. The red
dotted oval points to the ceiling effect of the oldest children. Right bottom: Test information functions for
the ROAR-PA. The x-axis shows ability estimates based on the Rasch model. The upper x-axis shows the
estimated CTOPP-2 raw score equivalent based on the linear relationship between ability estimates and
CTOPP-2 scores. The pink lines indicated age equivalents for CTOPP-2 scores (based on the CTOPP-2
manual). Test information is high for participants scoring between 4.5 and 9.5 years age equivalent on the
CTOPP-2.

Part 2: Factor, Item, and Difficulty Analysis

Factor analysis (FA)
Exploratory FA with oblique rotation was used to evaluate the dimensionality of the assessment, as a first step toward identifying a subset of ROAR-PA items to remove in order to both improve model fit and reduce the length of the assessment. FA
poses the question of whether there is evidence that all of these items are measuring the same
underlying phonological process, or whether the items of these subtests better represent
separable (but correlated) dimensions of PA.

Our results from the FA suggest a multi-dimensional framework. First, the scree plot (Fig. 3.) of
the different items (N=74) on these three subtests (FSM, LSM, DEL) indicates three factors
before the rate of decrease flattens. Second, the magnitudes of the loadings for the three-factor model are larger than those for the one-factor model. Finally, examining the factor loadings (Table 1), the items from each of the three subtests cleanly separate into distinct factors (with the exception of a single item: FSM_13).
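
In R, the parallel analysis and the one- versus three-factor comparison can be sketched with the psych package as follows (items stands for the 74-column response matrix; for dichotomous items a tetrachoric correlation matrix is a reasonable, though here assumed, choice):

    library(psych)

    # Parallel analysis: observed vs. simulated eigenvalues (cf. Fig. 3)
    fa.parallel(items, fa = "fa", cor = "tet")

    # Exploratory FA with oblique (oblimin) rotation: 1 vs. 3 factors
    fa1 <- fa(items, nfactors = 1, rotate = "oblimin", cor = "tet")
    fa3 <- fa(items, nfactors = 3, rotate = "oblimin", cor = "tet")
    print(fa3$loadings, cutoff = .30)   # suppress loadings < .30 (cf. Table 1)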

Figure 3: Parallel Analysis of Scree plots of 74 items on three subtests (FSM, LSM, DEL). Inspection
of the scree plot suggests three factors before the amount of variation represented by the
eigenvalues flattens out. The dotted red lines represent extracted eigenvalues from data sets that are
randomly created; there are three factors with observed eigenvalues that are larger than those
extracted from the simulated data.

Table 1: Exploratory factor analysis with all subtest items (N=74) loaded on 1, 2, or 3 factors. An oblique rotation was used as we do not assume orthogonal relationships between factors, and loadings less than .30 are suppressed for ease of interpretation. The table supports the three-factor model, and
loadings are overall larger compared to the one-factor model.

Item analysis

Because there is evidence for a multi-dimensional framework, we proceed by calibrating a
Rasch Model separately for each of the subtests (FSM, LSM and DEL). In this IRT analysis we
include data for all participants between 3.87 and 9.50 years old. For each subtest we reviewed
four criteria, compiled from both the factor analysis and Rasch Model item fit statistics, to
determine the best subset of items: (1) Does the item load on the subtest factor with a
relationship > .30? (Tabachnick et al., 2012) (2) Does the item resemble a functional form when
looking at empirical plots? (Allen & Yen, 2001) (3) Is the item flagged based on Rasch model fit
statistics? (Wright, 1996) (4) Finally, because we want items to be informative rather than redundant, is the item located near two or more other items in the difficulty distribution, such that it can be removed to yield a test length appropriate for children's attention spans (Fig. 4, left top)?

Analysis of the FSM subtest (Fig 4., right top) suggests removing 6 of the 25 items. After
removing these items, there is no major degradation or change in the key item statistics for this
assessment. Cronbach’s 𝜶 remains high (𝛼FSM_all_items = .90, CI95 = [.87 ; .93] & 𝛼FSM_adjusted = .89,
CI95 = [.85 ; .92]), and the distributions of the proportion-correct values and the point-biserial
correlations for all items remain similar. The correlation between FSM total scores and
CTOPP-2 stays about the same (rFSM_all_items = .70, rFSM_adjusted = .67). Analysis of the LSM subtest
(Fig 4., left bottom) also suggests removing 6 out of 25 items. Similar to FSM, Cronbach’s 𝜶 of
LSM remains high (𝛼LSM_all_items = .92, CI95 = [.90 ; .94] & 𝛼LSM_adjusted = .92, CI95 = [.90 ; .93]), the distributions of the proportion-correct values, the point-biserial correlations, and the correlation between the total scores and the CTOPP-2 remain similar (rLSM_all_items = .71, rLSM_adjusted = .70).
Analysis of the DEL subtest (Fig 4., right bottom) indicates removal of 5/24 items. Again,
Cronbach’s 𝜶 remains high (𝛼DEL_all_items = .86, CI95 = [.79 ; .89] & 𝛼DEL_adjusted = .85, CI95 = [.78 ;
.89]), the distributions of the proportion-correct values, the point-biserial correlations, and the
correlation between the total scores and the CTOPP-2 remain similar (rDEL_all_items = .65, rDEL_adjusted = .63).

This Rasch item analysis suggests that every subtest of the ROAR-PA task has good (DEL) to excellent (FSM, LSM) internal consistency, based on Cronbach’s 𝜶, with a strong correlation of every subtest (r > .65) to the overall CTOPP-2 scores. Item selection based on meeting at least 2 of the 4 criteria results in 19 items per subtest, for an overall task of 57 items plus 2 practice items per subtest.
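
A sketch of the per-subtest Rasch calibration with guessing fixed at chance (1/3 for a 3AFC item) is given below, using the mirt package as one possible implementation (fsm is a placeholder for one subtest's 0/1 response matrix):

    library(mirt)

    # Rasch model (equal slopes) with pseudo-guessing fixed at 1/3
    mod_fsm <- mirt(fsm, 1, itemtype = "Rasch", guess = 1/3)

    coef(mod_fsm, IRTpars = TRUE, simplify = TRUE)$items  # item difficulties
    itemfit(mod_fsm, fit_stats = "infit")                 # misfit flags (criterion 3)
    theta <- fscores(mod_fsm)                             # person ability estimates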

Figure 4: Item analysis: Rasch models with a fixed guess rate of .333, fit to students between 3.87 and 9.50 years old who were not excluded based on clicking or foil patterns, for the three remaining subtests (FSM, LSM and DEL). Left top: Items deleted based on meeting at least 2 of the 4 criteria. Right top & bottom: Correlation with CTOPP-2, Cronbach's 𝜶, and distributions of the % correct values and the point-biserial correlations for all items per subtest and for the subtests with items deleted based on the four criteria.

Difficulty analysis
Based on previous work (e.g., Stanovich, 2017; Treiman & Zukowski, 1991), ROAR-PA items
were designed to span different theoretical levels of difficulty. For the DEL subtest, difficulty
levels were based on manipulation of (1) words and syllables (item 1-8), (2) onset and rimes
(item 9-16), or (3) phonemes in the middle of the word (item 17-24). For FSM and LSM we could
not follow these levels, as the task itself focuses on the first or last phoneme(s) of the word. We
tried to create difficulty levels by manipulating single phonemes (level 1: item 1-16) or clusters of
phonemes (level 2: item 17-25). Surprisingly, based on the Rasch model item-person maps for the three subtests (Fig. 5), only the DEL subtest approximately follows the expected difficulty pattern. This analysis also shows that for FSM most items sit closer to the lower range of ability, whereas for LSM and DEL most items sit close to the mid-range of ability.
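
One way to reproduce such item-person maps is with the WrightMap package, assuming a fitted model such as mod_del from the sketch above:

    library(mirt)
    library(WrightMap)

    abilities    <- fscores(mod_del)[, 1]
    difficulties <- coef(mod_del, IRTpars = TRUE, simplify = TRUE)$items[, "b"]
    wrightMap(abilities, difficulties)   # persons on the left, items on the right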

Figure 5: Difficulty analysis: Item-person maps (or “Wright Map”) for the three subtests. Per subtest, the
distribution of the ability of the students is plotted on the left-hand side (higher ability is closer to the top of
the map). The distribution of the item difficulty is plotted on the right-hand side. The deleted items from
the item analysis are grayed out.

Summary

The analyses in Parts 1 and 2 led to the development of a new version of the ROAR-PA task (ROAR-PA_v.2), designed for children in kindergarten to 5th grade. The task is a receptive PA
assessment (1I-3AFC) consisting of 57 items (19 per subtest), divided into three subtests: First
Sound Matching (FSM), Last Sound Matching (LSM) and Deletion (DEL). Each of these
subtests loads on its own factor, and together they span a range of phonological processing abilities. The ROAR-PA_v.2 presents 2 training items per subtest, followed by the test items
in a predetermined difficulty order (based on the IRT analysis). Selection of the most appropriate
age group (kindergarten to 5th grade) and the most reliable items (19 per subtest) results in a
highly reliable (Cronbach’s 𝜶 for ROAR-PA_v.2 composite = .96; CI95 = [.95 ; .97]) automated
self-scored task that takes, on average, 12 minutes.

Part 3: Correlation to reading skills and score reports


The ultimate goal of developing an efficient, online PA task is to create a task that a) predicts
reading performance and b) provides actionable information about the individual’s PA skills. To
accomplish the first goal, we correlate ROAR-PA performance to performance on another task
of the ROAR test battery: ROAR-Single Word Recognition (SWR) (Yeatman et al., 2021) as well
as reading skills assessed with two of the most widely used standardized, individually administered measures of reading, the Woodcock-Johnson Letter-Word-Identification (WJ
LW-ID) and Word-attack (WJ-WA) tests (Schrank et al., 2014). Reading scores can only be
assessed after children learn how to read; therefore, we only acquired reading performance of
111 participants (Age: 5.66 – 9.50, μ = 7.67, 𝞼 = 1.13; Sex: 59 F, 52 M) on the WJ and
ROAR-SWR. Results show Pearson correlation coefficients of r = .47 between the ROAR-PA task and WJ LW-ID, r = .50 between the ROAR-PA and WJ-WA, and r = .50 between the ROAR-PA task and ROAR-SWR. For the sake of comparison, the CTOPP-2 was correlated at r = .66 with WJ-WA, r = .73 with WJ LW-ID, and r = .55 with ROAR-SWR.

To accomplish the second goal, we used a diagnostic classification model (DCM) to develop a
ROAR-PA score report that provides information about mastery of each PA subskill. More
specifically, we use a specific DCM known as the deterministic-input, noisy-and-gate model, or
the DINA model (Templin & Henson, 2010). The score report presents information about N=163 participants (Age: 3.87 – 9.92, μ = 7.30, 𝞼 = 1.63; Sex: 80 F, 83 M), from pre-K to the end of fourth grade.
This analysis used the 57 final items selected in Part 2. Mastery of a subtest was defined as
scoring > 50% correct per subtest. The average ROAR-PA score was 44.76 (out of 57) (min =
12, max = 57, 𝞼 =11.06). Participants’ mastery of the three skills (FSM, LSM and DEL) was
quantified into three categories: Full Mastery (i.e., mastery of all 3 subtests), Some Mastery (i.e.,
mastery of 1 or 2 subtests), or No Mastery (i.e., mastery of 0 subtests). The results of these
analyses show that with increasing grade level, the level of mastery improves. There is
substantial variability in mastery at each grade level, indicating that ROAR-PA provides
additional information for identifying students who may benefit from additional support in specific
PA skills (Fig 6., left).
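
A possible implementation of this mastery analysis uses the CDM package (a sketch; resp57 stands for the 57-item response matrix and Q for a 57 x 3 Q-matrix mapping items to the FSM, LSM, and DEL skills; the exact accessor for skill classifications may differ across CDM versions):

    library(CDM)

    # DINA: an item is answered correctly (up to slip/guess noise) only if
    # all skills required by the Q-matrix are mastered
    fit <- CDM::din(resp57, q.matrix = Q, rule = "DINA")

    # Most likely skill pattern per student, e.g. "110" = FSM and LSM mastered
    patterns   <- fit$pattern$mle.est
    n_mastered <- nchar(gsub("[^1]", "", patterns))
    mastery    <- cut(n_mastered, breaks = c(-1, 0, 2, 3),
                      labels = c("No Mastery", "Some Mastery", "Full Mastery"))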

Furthermore, students’ performance can be interpreted in relation to percentile scores representing each student’s phonemic awareness skills relative to CTOPP-2 national norms (Fig 6., right), by fitting a regression model relating ROAR-PA scores to CTOPP-2 scores. The
distribution of the estimated CTOPP-2 percentile scores allows categorization of the students as
“needs extra support” (children performing below the 25th percentile; N = 18), “needs some support” (pc 25-50; N = 47), and “at or above average” (> pc 50; N = 96), indicating that all children at or below the 50th percentile should be monitored or followed up with other measures.
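
A sketch of this percentile mapping (column names are illustrative, and the model form used for the actual score report may differ):

    # Regression linking ROAR-PA totals to CTOPP-2 percentile norms
    pct_model <- lm(ctopp_pct ~ roarpa_total + age, data = norm_sample)
    students$est_pct <- predict(pct_model, newdata = students)

    # Band students for the score report
    students$support <- cut(students$est_pct, breaks = c(0, 25, 50, 100),
                            labels = c("needs extra support",
                                       "needs some support",
                                       "at or above average"))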

Figure 6: Left: Different mastery levels represented as percentage of students by grade. Results show
(as expected) that mastery increases by grade. Right top: Predicted CTOPP-2 percentiles for the
ROAR-PA scores by age. Red line shows ceiling effects of the test for 10-year-olds, indicating that the
ability range is uninformative for the oldest participants. Right bottom: Distribution of the estimated
CTOPP-2 percentile scores, based on ROAR-PA performance, identifying children at risk (< pc 25), some
risk (pc 25-50) or doing well (> pc 50).

Part 4: Assessment of ROAR-PA_v.2 in school districts


To assess the value of ROAR-PA as a screener in a classroom setting, we ran a study in collaboration with a California school district to a) investigate the feasibility of district-wide administration and b) assess the predictive validity of the automated ROAR measures against
standard-of-practice individually administered reading assessments. The new version of the
ROAR-PA task (ROAR-PA_v.2), which consists of 57 items of the original task divided into three
subtests (FSM, LSM, and DEL), was used in a longitudinal study across six schools in the state
of California. In this version an early stopping criterion was implemented (for more details, see Methods) to improve task efficiency for students performing at chance on any subtest. Data was
collected from 18 1st grade classrooms (n = 214) in spring 2022 and 15 1st grade and 17 2nd
grade classrooms (N = 687, n = 324 1st grade; n = 363 2nd grade) in fall 2022. Students completed ROAR-PA and ROAR-SWR on Chromebooks in their classroom, and all students in a
class completed the assessments simultaneously. Additionally, each student was individually
administered the Fountas and Pinnell Benchmark Assessment (F&P; Fountas & Pinnell, 2009) as
part of standard practice by their classroom teacher. The following analyses are based on the
results of 130 1st grade students who completed both ROAR-PA and ROAR-SWR in spring 2022
and were administered the F&P assessment in November 2022, as well as 215 1st grade
students and 237 2nd grade students who completed both ROAR-PA and ROAR-SWR in fall
2022 and were administered the F&P assessment in November 2022.

First, ROAR-PA showed a moderate correlation with F&P scores in 1st grade (Fig. 7; r= .59) and
2nd grade (r= .52). This is in line with previously established correlations linking PA to reading
ability in early elementary school (r = .46, Scarborough, 1998; r = .41 (real words) - r = .43
(non-words), Swanson et al., 2003).

Figure 7: Top: Pearson correlation plot between total scores of the ROAR-PA and Fountas & Pinnell
Benchmark scores in 1st and 2nd grade. The colorbar shows ROAR-SWR theta score for each point.
Bottom: Pearson correlation plot between the ROAR-SWR theta scores and Fountas & Pinnell
Benchmark scores in 1st and 2nd grade. The colorbar shows the ROAR-PA composite score for each point.
The colormap of these plots indicates that high ROAR-PA scores correspond to high ROAR-SWR scores,
but ROAR-PA also provides additional information beyond ROAR-SWR.

Second, we asked whether a linear regression model with ROAR-PA composite scores provides
additional information about F&P reading level above and beyond ROAR-SWR scores. A
regression model with ROAR-PA and grade level predicted 56% of the variance in F&P;
ROAR-SWR and grade level predicted 74% of the variance in F&P; ROAR-PA, ROAR-SWR and
grade level predicted 76% of the variance in F&P. A model comparison indicated that the model
with ROAR-PA, ROAR-SWR and grade was a significantly better fit than the model with just
ROAR-SWR and grade (F(1,448) = 46.74, p = 2.66e-11). Thus, ROAR-PA and ROAR-SWR
provide complementary information.
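
This comparison corresponds to a standard nested-model F-test, sketched below with illustrative variable names:

    m_pa   <- lm(fp_level ~ pa_total + grade, data = d)              # R^2 ~ .56
    m_swr  <- lm(fp_level ~ swr_theta + grade, data = d)             # R^2 ~ .74
    m_both <- lm(fp_level ~ swr_theta + pa_total + grade, data = d)  # R^2 ~ .76

    anova(m_swr, m_both)   # F-test on the variance added by ROAR-PA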

Third, we examined classification accuracy of ROAR-PA and ROAR-SWR for students who
were deemed “at risk” versus “not at risk” by their school district based on F&P scores. 1st
grade students were deemed at risk if they scored below E and 2nd grade students were
deemed at risk if they scored below J. Generalized additive models (GAMs) with a binomial
distribution were used to predict risk classifications based on ROAR scores using the mgcv
package in R. Table 2 shows sensitivity and specificity of models containing ROAR-PA,
ROAR-SWR or ROAR-PA and ROAR-SWR with tensor smoothing. In 1st grade, a generalized
additive model (GAM; Wood, 2006) with ROAR-PA achieved an area under the curve (AUC) =
.82, ROAR-SWR achieved an AUC = .91 and a GAM with ROAR-PA and ROAR-SWR achieved
an AUC = .93. In 2nd grade, ROAR-PA achieved an AUC = .79, ROAR-SWR achieved an AUC =
.87 and a model with ROAR-PA and ROAR-SWR achieved an AUC = .91. Thus the ROAR
assessments are able to accurately classify nearly all the students at risk based on F&P
Benchmark scores (Table 2) and a model combining ROAR-PA and ROAR-SWR scores
performed better than either score on its own.
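
The text names the mgcv package and its s() and te() smooths; a minimal sketch of the 1st grade models might look as follows (variable names are illustrative, and the pROC package for computing AUC is our assumption, not named in the text):

    library(mgcv)
    library(pROC)   # assumed here for ROC/AUC computation

    g_pa   <- gam(at_risk ~ s(pa_total),             family = binomial, data = d1)
    g_swr  <- gam(at_risk ~ s(swr_theta),            family = binomial, data = d1)
    g_both <- gam(at_risk ~ te(swr_theta, pa_total), family = binomial, data = d1)

    auc(roc(d1$at_risk, predict(g_both, type = "response")))   # cf. AUC = .93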

Grade   Predictors     Accuracy   Sensitivity   Specificity   AUC
1       s(SWR)         0.85       0.89          0.77          0.91
1       s(PA)          0.76       0.86          0.56          0.82
1       te(SWR, PA)    0.87       0.91          0.79          0.93
2       s(SWR)         0.84       0.91          0.65          0.86
2       s(PA)          0.77       0.92          0.36          0.79
2       te(SWR, PA)    0.84       0.92          0.60          0.91
Table 2: Sensitivity and specificity of ROAR as a screener for reading difficulties. Results are shown
separately for first and second grade. The mgcv package in R was used to fit GAMs with predictors as
specified in the predictors column. Combining ROAR-PA with ROAR-SWR achieved exceptional
sensitivity in both 1st and 2nd grade.

Finally, we examined the classification accuracy of ROAR-PA and ROAR-SWR administered in the winter of 1st grade (March 2022) for classifying F&P levels 8 months later in the fall of 2nd grade (N=130; November/December). A GAM with ROAR-PA achieved an AUC = .70, ROAR-SWR achieved an AUC = .83, and a GAM with ROAR-PA and ROAR-SWR achieved an AUC = .84. These results demonstrate the promise of ROAR-PA and ROAR-SWR as screening
tools in the typical classroom setting.

4. Discussion
Phonological Awareness is an important skill at the foundation of reading development
(Scarborough, 1998; Snowling, 1998; Wagner & Torgesen, 1987), often introduced in preschool
or kindergarten, which can serve as a predictor for later reading performance, and a target for
early interventions (Hogan et al., 2005; Kirby et al., 2003). The present work developed and
validated an automated, online measure of PA for use in research and practice
(https://roar.stanford.edu/pa). Most PA assessments are time-consuming, resource intensive, and subjective in their scoring. Our primary goal was to evaluate the suitability and efficiency of an online PA assessment to both increase accessibility for practitioners and reduce bottlenecks to large-scale research studies. We found that in just 12 minutes, ROAR-PA achieved reliability akin to in-person assessment (𝛼 = .96), and had predictive validity when administered in the
classroom setting by teachers with little training, making this a highly-efficient screening and
assessment tool for both practitioners and researchers.

The initial version of ROAR-PA included 5 subtests, modeled after subtests of well-known,
standardized, in-person PA assessments (Robertson & Salter, 1997; Wagner, Torgesen &
Pearson, 1999; Williams, 2014). Results from Part 1 indicated that a composite measure based
on 3 subtests (First Sound Matching; FSM, Last Sound Matching; LSM and Deletion; DEL) was
sufficient to characterize a wide range of PA skills and reliably predict performance on the
CTOPP-2 (Fig 2., right bottom). The fourth subtest (i.e., Blending) was not sufficiently discriminative, and the Rhyming subtest did not add information to the ROAR-PA
composite above and beyond what was measured by the other 3 subtests. IRT analysis helped
to narrow down the assessment to a brief version with a maximum of 57 (1I-3AFC) PA items. Based
on an analysis of a) the correspondence between ROAR-PA and individually administered
CTOPP-2 assessments, and b) the reliability of ROAR-PA composite scores, we determined
that ROAR-PA is ideal for children between 4 and 9.5 years old. This version of the ROAR-PA
spans a continuum of PA abilities across the three subtests, and serves as a quick measure of
PA well-suited for screening in pre-k through fourth grade. Even though our score report
analysis (Part 3) showed ceiling effects for older children, it is still worth conducting future
research to examine whether ROAR-PA can detect PA difficulties in older children with reading
disabilities.

Most PA measures rely on verbal responses. An important goal of the present work was to
ascertain whether a purely receptive test (touch screen responses) could achieve a similar
measure of PA without verbal responses. We reported a strong correlation (r = .74) between the
ROAR-PA and the well-established standardized “Comprehensive Test Of Phonological
Processing - 2” (CTOPP-2; Wagner, Torgesen & Pearson, 1999). This is a promising result as
Skibbe et al. (2020) report correlations of r = .25 - .57 between the verbal-response PA
measures CTOPP-2 and the DIBELS (Dynamic Indicators of Basic Early Literacy Skills; Dewey
et al., 2012), r = .57 between the CTOPP-2 and PELI (Preschool Early Literacy Indicators;
Kaminski et al., 2014), and a Pearson correlation of r = .58 - .65 between their own receptive PA measure (ATLAS-PA) and the CTOPP-2.

After establishing the convergent validity of the test, the ultimate goal of any PA assessment is
to create a measure that can a) identify children who are struggling to develop PA skills, b)
identify specific PA skills that could be targeted for training, and c) predict future reading
performance, so that educators can intervene before reading difficulties arise. The automated
score reports from this fully automated online measure provide detailed information about the
mastery of skills underlying each subtest as well as an overall percentile score that can be used
to flag children who are at risk of reading difficulties. Reporting mastery on individual subskills
was supported by factor analysis, which revealed that First Sound Matching, Last Sound
Matching and Deletion subtests represent a multi-dimensional framework (3 factors). Each
subtest represented a correlated but distinct skill contributing to the latent PA construct.
Therefore, this work supports the hypothesis that PA assessments do not necessarily reflect a
single dimension (Boyer & Ehri, 2011; Gough et al., 1992; Wade-Woolley, 2016; Ziegler & Goswami, 2005), and even within subtests (i.e., deletion) a hierarchical structure of skills
(Stanovich, 2017; Treiman & Zukowski, 1991) should be considered (see Item difficulty
assessment in Part 2). By establishing a good correlation between the ROAR-PA and
standardized reading scores such as the Woodcock-Johnson (r = .46 real words; r = .50 non-words; Schrank et al., 2014), and between the ROAR-PA and ROAR-SWR (r = .53;
Yeatman et al., 2021), this assessment confirms the expected relationship between PA and
reading skills (r = .46, Scarborough, 1998; r = .41 (real words) - .43 (non-words), Swanson et
al., 2003).

Lastly, the ROAR-PA was designed not just to work in a research environment or clinical setting,
but to be available as a universal screener that could be used in schools. To validate the use of
ROAR-PA in schools, we ran a large study of first and second grade classrooms in California schools,
substantiating the value of ROAR-PA in real-world, classroom settings. We found that the task
could be administered simultaneously for a whole classroom and that it is predictive of current
and future reading scores. Moreover, ROAR-PA in combination with ROAR-SWR achieved very
high sensitivity for identifying students with reading difficulties. However, specificity was lower, indicating that the ROAR identified reading difficulties in some students who performed adequately on the Fountas and Pinnell Benchmark Assessment. Understanding these discrepancies is an important challenge for future research: to determine whether a) ROAR is sensitive to sub-skills that are missed by coarser measures like the running records used by F&P, or b) the automated ROAR measures are identifying some students who, in fact, perform well on measures with verbal responses.

In future work ROAR-PA will be made into an adaptive task to make it even more efficient, and
new items will be added so the task can be administered at more regular intervals. And although
PA is most valuable in early elementary school, adding more complex items could extend the
utility of ROAR-PA to older children and adolescents with reading difficulties. In conclusion, we
have shown through a series of design and validation studies that rapid, automated, online
measures can lift the burden of resource-intensive, one-on-one assessment throughout
childhood. These open-source tools can help developmental researchers pursue larger, more diverse samples and catalyze discoveries about the mechanisms of learning that have applicability in clinics and classrooms.

5. Methods
Receptive phonological awareness task (Version 1a). Data and analysis code are available
at: https://github.com/yeatmanlab/roar-pa-manuscript. A one-interval, three-alternative forced
choice task was created in the online study builder Lab.js (Henninger et al., 2020), converted to
Javascript and uploaded to Pavlovia, an online experiment platform for hosting experiments
(MacaSkill et al., 2022). This task, the ROAR-PA, was split into 5 subtests (25 or 24 items per
subtest), with each subtest consisting of 2 or 3 blocks (divided by difficulty level). Each subtest
started off with 3 training items with feedback. Training items had to be completed correctly
before the task would continue to the test items. To engage children from pre-k to 8th grade the
task was embedded in a story where the child had to help a monkey and his four friends (rabbit,
bear, otter and squirrel) collect their favorite foods. At the end of every trial some images of food
would be displayed and at the end of every block a visualization of the collected food would be
displayed, and the character would provide encouragement (e.g., “Great job! So many bananas!
Let’s get a few more!”). Every character would provide the instructions of their own subtest,
guide throughout the task, motivate the participants to take short breaks between blocks, and
introduce their next friend.

Participants were instructed to work on a computer (not a tablet or phone) while sitting at a desk.
Sound had to be turned on and set to a comfortable level, based on the instructions of one of
the characters. All game instructions were provided in both text and audio. For all trials of all
subtests, one image accompanied by an audio fragment, recorded by a native English-speaking
male, would provide the specific instruction of that trial (e.g., subtest FSM: “Which picture starts
with the same sound as dog?”). This screen would be followed by the instruction image (top)
plus three answer options (left, middle, right). All images would be verbalized (e.g., “dam”
(target), “goat” (foil 1), “mop” (foil 2)) and the position of the images was randomized for each
trial. For the BLE and DEL subtests the images were not verbalized as this could give away the
correct answer. In contrast, participants were allowed to listen to the instruction phrase two
times for these subtests. We did not implement this throughout the entire task to stay as
consistent as possible with standardized PA tasks like the CTOPP-2 in which instructions are
not repeated. Participants could pick a response by clicking the image, followed by a
visualization of the response with a random number of food images as motivation (Fig. 8). All
participants completed a total of 15 practice items and 123 trials. None of the trials, nor the
breaks were limited in time, giving participants of all ages the opportunity to complete the tasks
at their own pace.

Figure 8: Visualization of experimental set-up. Every subtest (e.g., FSM) started with visual + auditory
instructions, followed by some practice items with feedback. After the practice items 25 (or 24) items were
presented in a 1I-3AFC task with random feedback (independent of answer, as motivation). The items
were presented in semi-random order (within every block) and there were 2-3 sections per subtest. All
items were presented both visually (via clip-art images) and auditorily.

Subtests and stimuli (Version 1a). The subtests were selected based on well-known
standardized in-person PA tasks (e.g., PAT-2:NU; Robertson & Salter, 1997, CTOPP-2; Wagner,
Torgesen & Pearson, 1999, PPA; Williams, 2014): First Sound Matching (FSM), Last Sound
Matching (LSM), Rhyming (RHY), Blending (BLE) and Deletion (DEL). In FSM the participant
had to find a word with the same first sound as a provided word; in LSM the last sound had to
be recognized. In RHY, the participant had to find the word that rhymed with the provided word.
In BLE parts of words had to be merged to one word. In DEL, a section of the word had to be
omitted. FSM, LSM and RHY each consisted of 25 trials, divided into 2 blocks (16 and 9 items).
The difference between blocks of these 3 subtests was finding the first sound (FSM), last sound
(LSM), or word that rhymed (RHY) of a CVC word (difficulty level 1, 16 items) or a (C)CVC(C)
word (difficulty level 2, 9 items). For FSM the three answer options were either the target (i.e.,
same first sound), a foil that started with the last sound of the provided word (Foil 1), or a foil
with the same vowel (Foil 2). For LSM the same logic applied, but for the last sound of
the word. For RHY the target word would rhyme, whereas Foil 1 would have the same vowel but
not rhyme and Foil 2 would have the same first sound. BLE and DEL each consisted of 24 items, divided into 3 difficulty levels (i.e., syllable level, onset or rime level, phoneme level) with 8 items each. These difficulty levels were based on a suggested hierarchy within PA skills
(Stanovich, 2017; Treiman & Zukowski, 1991; Anthony & Lonigan, 2004).

Participant recruitment and consent procedure (Version 1a). The parent or guardian of each
participant provided informed consent under a protocol that was approved by the University of
Washington’s (UW) Institutional Review Board and all methods were carried out in accordance
with these guidelines. Each child participant provided assent via a conferencing call prior to task assessment. Participants were recruited from one of two participant databases
at UW: the University of Washington Reading & Dyslexia Research Program
(http://ReadingAndDyslexia.com) or the UW Communication Participant Pool. A total of 143
participants (Age: 3.87 - 13.00, μ = 7.13, σ = 1.89; Sex: 67 F, 76 M) completed the five subtests of
ROAR-PA.

Subtests and stimuli and participants (Version 1b). Based on a first round of analysis (Item
Response Theory analysis and correlation measures with the standardized CTOPP-2), the 3
most suitable subtests (FSM, LSM, and DEL) were selected to be included in version 1b of the
ROAR-PA. The stimuli for this task were not adjusted (25, 25 and 24 items with 3 practice items
per subtest). This updated task was completed by an additional group of 127 participants with
mainly older children, resulting in 270 participants (Age: 3.87 – 14.92, μ = 9.12, σ = 2.71; Sex: 125 F, 145 M) completing ROAR-PA FSM, LSM and DEL.

Assessment of other tasks. Of these 270 participants (Age: 3.87 – 14.92) from ROAR-PA_v.1, 266 completed the CTOPP-2 PA assessment via a video conferencing call. This task was administered by a speech-language pathologist, as similarly as possible to in-person testing. The Phonological Awareness Score was determined by 3 subtests: elision, blending, and phoneme
isolation/ sound matching (depending on the age). In the same virtual session, we also
assessed reading scores of 230 participants on the standardized Woodcock-Johnson
Letter-Word-ID and Woodcock-Johnson Word-Attack task. Additionally, these participants
completed the ROAR-LDT (i.e., ROAR-SWR) task via the web browser.

Subtests and stimuli and participants (Version 2). After the analysis, a second version of the
ROAR-PA was developed (ROAR-PA_v.2). We selected 3 subtests (FSM, LSM and DEL) with
19 items and 2 practice items per subtest. This resulted in a total of 57 trials + 6 practice trials.
The task was broken up into 2 blocks per subtest. The items would now be presented in a fixed order, based on difficulty analysis, rather than being randomized within each block. To allow for
efficient assessment, and to make the task accessible for both young and older children, a ceiling rule was implemented in this version of the ROAR-PA whereby a subtest ends if the participant gets either a) 3 trials in a row incorrect, or b) 4 out of 6 consecutive trials incorrect (see the sketch after this paragraph).
This design is similar to the ceiling rule implemented in CTOPP-2. Similar to the original task,
feedback would be given in the practice trials, but the actual trials only provided motivational
rewards. For this validation, ROAR-PA_v.2 was administered to 50 1st and 2nd grade classrooms
including 901 children in schools in the state of California. All study procedures were designed
in collaboration with participating school districts, and ethical approval was obtained from the
Stanford University institutional review board. At least two weeks prior to the administration of
the ROAR, letters were sent home to all the families in the school districts describing the study
and providing parents the opportunity to opt their children out of participation. The ROAR was
not administered to children whose families opted out of the study. Children completed the
ROAR-PA and ROAR-SWR tasks on Chromebooks in their classroom and were provided
support in logging in but no assistance was given on the tasks. ROAR administration was done
simultaneously for all the children in a classroom. Furthermore, reading levels were acquired via
the Fountas and Pinnell Leveled Literacy Intervention (Fountas & Pinnell, 2009) through one-on-one
assessments administered as standard of practice by the classroom teacher.
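
The ceiling rule can be stated compactly; the following R function illustrates the logic (the task itself runs in the browser, so this is only a sketch of the rule, not the deployed code). It is meant to be called after every trial with the responses so far:

    should_stop <- function(correct) {  # correct: logical vector of trial results
      n <- length(correct)
      three_in_a_row <- n >= 3 && all(!correct[(n - 2):n])
      four_of_six    <- n >= 6 && sum(!correct[(n - 5):n]) >= 4
      three_in_a_row || four_of_six
    }

    should_stop(c(TRUE, FALSE, FALSE, FALSE))   # TRUE: 3 incorrect in a row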

Data analysis. All analysis code and data are publicly available at:
https://github.com/yeatmanlab/roar-pa-manuscript. Data analysis was conducted in RStudio.
Data collected in schools are not publicly available due to privacy agreements.

6. References
1. Allen, M. J., & Yen, W. M. (2001). Introduction to measurement theory. Waveland Press.
2. Anthony, J. L., & Francis, D. J. (2005). Development of phonological awareness. Current
directions in psychological Science, 14(5), 255-259.
3. Anthony, J. L., & Lonigan, C. J. (2004). The nature of phonological awareness: Converging
evidence from four studies of preschool and early grade school children. Journal of educational
psychology, 96(1), 43
4. Bentin, S., & Leshem, H. (1993). On the interaction between phonological awareness and reading
acquisition: It's a two-way street. Annals of Dyslexia, 43(1), 125-148.
5. Bosse, M. L., Tainturier, M. J., & Valdois, S. (2007). Developmental dyslexia: The visual attention
span deficit hypothesis. Cognition, 104(2), 198-230
6. Boyer, N., & Ehri, L. C. (2011). Contribution of phonemic segmentation instruction with letters and
articulation pictures to word reading and spelling in beginners. Scientific Studies of Reading,
15(5), 440-470

7. Bradley, L., & Bryant, P. E. (1983). Categorizing sounds and learning to read—a causal
connection. Nature, 301(5899), 419-421
8. Brady, S., Fowler, A., Stone, B., & Winbury, N. (1994). Training phonological awareness: A study
with inner-city kindergarten children. Annals of Dyslexia, 44, 26-59
9. Bus, A. G., & Van IJzendoorn, M. H. (1999). Phonological awareness and early reading: A
meta-analysis of experimental training studies. Journal of educational psychology, 91(3), 403
10. Castles, A., & Coltheart, M. (2004). Does phonological awareness help children learn to read? Cognition, 91, 77-111.
11. Catts, H. W., & Petscher, Y. (2022). A cumulative risk and resilience model of dyslexia. Journal of
Learning Disabilities, 55(3), 171-184.
12. Catts, H. W., McIlraith, A., Bridges, M. S., & Nielsen, D. C. (2017). Viewing a phonological deficit
within a multifactorial model of dyslexia. Reading and writing, 30, 613-629.
13. Compton, D. L. (2021). Focusing our view of dyslexia through a multifactorial lens: A
commentary. Learning Disability Quarterly, 44(3), 225-230.
14. Dewey, E. N., Latimer, R. J., Kaminski, R. A., & Good, R. H. (2012). DIBELS Next development:
Findings from Beta 2 validation study (No. 10). Technical Report.
15. Fountas, I. C., & Pinnell, G. S. (2009). Leveled literacy intervention. Portsmouth, NH: Heinemann.
16. Gaab, N., & Petscher, Y. (2021). Earlybird Technical Manual.
17. Goswami, U. (2015). Sensory theories of developmental dyslexia: three challenges for research.
Nature Reviews Neuroscience, 16(1), 43-54.
18. Gough, P. B., Ehri, L. C., & Treiman, R. E. (Eds.). (1992). Reading acquisition. Lawrence Erlbaum Associates.
19. Henninger, F., Shevchenko, Y., Mertens, U. K., Kieslich, P. J., & Hilbig, B. E. (2021). lab.js: A free, open, online study builder. Behavior Research Methods, 1-18.
20. Hogan, T. P., Catts, H. W., & Little, T. D. (2005). The relationship between phonological awareness and reading: Implications for the assessment of phonological awareness. Language, Speech, and Hearing Services in Schools, 36(4), 285-293.
21. Janus, M., & Duku, E. (2007). The school entry gap: Socioeconomic, family, and health factors
associated with children's school readiness to learn. Early education and development, 18(3),
375-403
22. Joo, S. J., White, A. L., Strodtman, D. J., & Yeatman, J. D. (2018). Optimizing text for an
individual's visual system: The contribution of visual crowding to reading difficulties. Cortex, 103,
291-301
23. Kaminski, R. A., Abbott, M., Bravo Aguayo, K., Latimer, R., & Good III, R. H. (2014). The
preschool early literacy indicators: Validity and benchmark goals. Topics in Early Childhood
Special Education, 34(2), 71-82.
24. Karami, J., Abbasi, Z., & Zakei, A. (2013). The effect of phonological awareness training on
speed, accuracy and comprehension of students with dyslexia. Journal of Learning Disabilities,
2(3), 38-53
25. Kevan, A., & Pammer, K. (2008). Visual deficits in pre-readers at familial risk for dyslexia. Vision
research, 48(28), 2835-2839
26. Kirby, J. R., Parrila, R. K., & Pfeiffer, S. L. (2003). Naming speed and phonological awareness as
predictors of reading development. Journal of Educational Psychology, 95(3), 453.
27. Long, B., Simson, J., Buxó-Lugo, A., Watson, D. G., & Mehr, S. A. (2023). How games can make
behavioural science better. Nature, 613(7944), 433-436.
28. MacAskill, M., Hirst, R., & Peirce, J. (2022). Building experiments in PsychoPy (2nd ed.). Sage.
29. Melby-Lervåg, M., Lyster, S. A. H., & Hulme, C. (2012). Phonological skills and their role in
learning to read: a meta-analytic review. Psychological bulletin, 138(2), 322.

30. Míguez‐Álvarez, C., Cuevas‐Alonso, M., & Saavedra, Á. (2022). Relationships Between
Phonological Awareness and Reading in Spanish: A Meta‐Analysis. Language Learning, 72(1),
113-157.
31. Moyle, M. J., Heilmann, J., & Berman, S. S. (2013). Assessment of early developing phonological
awareness skills: A comparison of the preschool individual growth and development indicators
and the phonological awareness and literacy screening–preK. Early Education & Development,
24(5), 668-686.
32. Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Percie du Sert, N., ...
& Ioannidis, J. (2017). A manifesto for reproducible science. Nature human behaviour, 1(1), 1-9
33. Nittrouer, S., Studdert-Kennedy, M., & McGowan, R. S. (1989). The emergence of phonetic
segments: Evidence from the spectral structure of fricative-vowel syllables spoken by children
and adults. Journal of Speech, Language, and Hearing Research, 32(1), 120-132
34. Noble, K. G., Farah, M. J., & McCandliss, B. D. (2006). Socioeconomic background modulates
cognition–achievement relationships in reading. Cognitive Development, 21(3), 349-368
35. Nussenbaum, K., Scheuplein, M., Phaneuf, C. V., Evans, M. D., & Hartley, C. A. (2020). Moving
developmental research online: comparing in-lab and web-based studies of model-based
reinforcement learning. Collabra: Psychology, 6(1)
36. O’Brien, G., & Yeatman, J. D. (2021). Bridging sensory and language theories of dyslexia: Toward
a multifactorial model. Developmental Science, 24(3), e13039
37. Papadopoulos, T. C., Spanoudis, G., & Kendeou, P. (2009). The dimensionality of phonological
abilities in Greek. Reading Research Quarterly, 44(2), 127-143.
38. Peeters, M., Verhoeven, L., de Moor, J., & Van Balkom, H. (2009). Importance of speech
production for phonological awareness and word decoding: The case of children with cerebral
palsy. Research in developmental disabilities, 30(4), 712-726
39. Pennington, B. F. (2006). From single to multiple deficit models of developmental disorders.
Cognition, 101(2), 385–413.
40. Pennington, B. F., Santerre-Lemmon, L., Rosenberg, J., MacDonald, B., Boada, R., Friend, A., ...
& Olson, R. K. (2012). Individual prediction of dyslexia by single versus multiple deficit models.
Journal of abnormal psychology, 121(1), 212
41. Ramus, F. (2003). Developmental dyslexia: specific phonological deficit or general sensorimotor
dysfunction?. Current opinion in neurobiology, 13(2), 212-218
42. Robertson, C., & Salter, W. (1997). The Phonological Awareness Test. LinguiSystems, Inc., East Moline, IL.
43. Scarborough, H. S. (1998). Predicting the future achievement of second graders with reading
disabilities: Contributions of phonemic awareness, verbal memory, rapid naming, and IQ. Annals
of Dyslexia, 48, 115-136.
44. Schneider, W., Ennemoser, M., Roth, E., & Küspert, P. (1999). Kindergarten prevention of
dyslexia: Does training in phonological awareness work for everybody?. Journal of learning
disabilities, 32(5), 429-436.
45. Schrank, F. A., Mather, N., & McGrew, K. S. (2014). Woodcock-Johnson IV tests of achievement. Riverside Publishing.
46. Scott, K., & Schulz, L. (2017). Lookit (part 1): A new online platform for developmental research.
Open Mind, 1(1), 4-14
47. Shanahan, T. (2005). The National Reading Panel Report. Practical Advice for Teachers.
Learning Point Associates/North Central Regional Educational Laboratory (NCREL)
48. Skibbe, L. E., Bowles, R. P., Goodwin, S., Troia, G. A., & Konishi, H. (2020). The access to
literacy assessment system for phonological awareness: An adaptive measure of phonological
awareness appropriate for children with speech and/or language impairment. Language, speech,
and hearing services in schools, 51(4), 1124-1138.

49. Snowling, M. (1998). Dyslexia as a phonological deficit: Evidence and implications. Child
Psychology and Psychiatry Review, 3(1), 4-11.
50. Spiro, R. J., & Myers, A. (1984). Individual differences and underlying cognitive processes in
reading. Handbook of reading research, 1, 471-501.
51. Stanovich, K. E. (2017). Speculations on the causes and consequences of individual differences
in early reading acquisition. In Reading acquisition (pp. 307-342). Routledge.
52. Swanson, H. L., Trainin, G., Necoechea, D. M., & Hammill, D. D. (2003). Rapid naming,
phonological awareness, and reading: A meta-analysis of the correlation evidence. Review of
Educational Research, 73(4), 407-440
53. Tabachnick, B. G., Fidell, L. S., Tabachnick, B. G., & Fidell, L. S. (2012). Chapter 13 principal
components and factor analysis. Using multivariate statistics, 6, 612-680.
54. Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and
applications. Guilford Press.
55. Treiman, R., & Zukowski, A. (2013). Levels of phonological awareness. In Phonological
processes in literacy (pp. 95-112). Routledge
56. Vellutino, F. R., Fletcher, J. M., Snowling, M. J., & Scanlon, D. M. (2004). Specific reading
disability (dyslexia): What have we learned in the past four decades?. Journal of child psychology
and psychiatry, 45(1), 2-40.
57. Vidyasagar, T. R., & Pammer, K. (2010). Dyslexia: a deficit in visuo-spatial attention, not in
phonological processing. Trends in cognitive sciences, 14(2), 57-63
58. Wade-Woolley, L. (2016). Prosodic and phonemic awareness in children’s reading of long and
short words. Reading and Writing, 29(3), 371-382
59. Wagner, R. K., & Torgesen, J. K. (1987). The nature of phonological processing and its causal
role in the acquisition of reading skills. Psychological bulletin, 101(2), 192
60. Wagner, R. K., Torgesen, J. K., Rashotte, C. A., & Pearson, N. A. (1999). Comprehensive test of
phonological processing: CTOPP. Austin, TX: Pro-ed.
61. Williams, K. (2014). Phonological and Print Awareness Scale. WPS Publishing.
62. Wood, S. N. (2006). Generalized additive models: An introduction with R. Chapman and Hall/CRC.
63. Woodcock, R., McGrew, K., Mather, N., & Schrank, F. (2007). Woodcock-Johnson III NU tests of
achievement. The International Journal of Neuroscience, 117(1), 11-23
64. Wright, B. D. (1996). Reasonable mean-square fit values. Rasch measurement transactions, 2,
370.
65. Yarkoni, T. (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, e1.
66. Yeatman, J. D., Tang, K. A., Donnelly, P. M., Yablonski, M., Ramamurthy, M., Karipidis, I. I., ... &
Domingue, B. W. (2021). Rapid online assessment of reading ability. Scientific reports, 11(1),
1-11.
67. Ziegler, J. C., & Goswami, U. (2005). Reading acquisition, developmental dyslexia, and skilled
reading across languages: a psycholinguistic grain size theory. Psychological bulletin, 131(1), 3.
68. Zuk, J., Dunstan, J., Norton, E., Yu, X., Ozernov-Palchik, O., Wang, Y., Hogan, T. P., Gabrieli, J.
D. E., & Gaab, N. (2021). Multifactorial pathways facilitate resilience among kindergarteners at
risk for dyslexia: A longitudinal behavioral and neuroimaging study. Developmental Science,
24(1), e12983.
