Computerised Adaptive Testing Accurately Predicts CLEFT-Q Scores by Selecting Fewer, More Patient-Focused Questions


Journal of Plastic, Reconstructive & Aesthetic Surgery (2019) 72, 1819–1824

Conrad J. Harrison a,b,∗, Daan Geerards b,c,d,
Maarten J. Ottenhof b,c,d, Anne F. Klassen e,
Karen W.Y. Wong Riff f, Marc C. Swan a,g, Andrea L. Pusic b,c,
Chris J. Sidey-Gibbons b,c
a Department of Plastic Surgery, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, UK
b Patient-Reported Outcomes, Value & Experience (PROVE) Centre, Department of Surgery, Brigham and Women’s Hospital, Boston, MA, USA
c Department of Surgery, Harvard Medical School, Boston, Massachusetts, USA
d Department of Plastic and Reconstructive Surgery, Catharina Hospital, Eindhoven, the Netherlands
e Department of Pediatrics, McMaster University, Hamilton, Ontario, Canada
f Department of Plastic and Reconstructive Surgery, Hospital for Sick Children, Toronto, Ontario, Canada
g Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK

Received 8 September 2018; accepted 15 May 2019

KEYWORDS: Computerised adaptive testing; Computerized adaptive testing, CAT; Patient-reported outcome, PRO; PROM; CLEFT-Q

Summary

Background: The International Consortium for Health Outcome Measurement (ICHOM) has recently agreed upon a core outcome set for the comprehensive appraisal of cleft care, which puts a greater emphasis on patient-reported outcome measures (PROMs) and, in particular, the CLEFT-Q. The CLEFT-Q comprises 12 scales with a total of 110 items, aimed to be answered by children as young as 8 years old.

Objective: In this study, we aimed to use computerised adaptive testing (CAT) to reduce the number of items needed to predict results for each CLEFT-Q scale.

Method: We used an open-source CAT simulation package to run item responses over each of the full-length scales and its CAT counterpart at varying degrees of precision, estimated by standard error (SE). The mean number of items needed to achieve a given SE was recorded for each scale’s CAT, and the correlations between results from the full-length scales and those predicted by the CAT versions were calculated.

Results: Using CATs for each of the 12 CLEFT-Q scales, we reduced the number of questions that participants needed to answer from 110 to a mean of 43.1 (range 34–60, SE < 0.55) while maintaining a 97% correlation between scores obtained with CAT and full-length scales.

Conclusions: CAT is likely to play a fundamental role in the uptake of PROMs into clinical practice given the high degree of accuracy achievable with substantially fewer items.

Conflicts of Interest: The CLEFT-Q is owned by McMaster University and The Hospital for Sick Children, and it was developed by Anne Klassen and Karen Wong Riff. The CLEFT-Q can be used free of charge for non-profit purposes (e.g. by clinicians, researchers and students). The other authors declare no potential conflicts of interest with regard to the research, authorship and publication of this article.
∗ Corresponding author at: Department of Plastic Surgery, John Radcliffe Hospital, Oxford University Hospitals NHS Foundation Trust, Oxford, UK.
E-mail addresses: [email protected] (C.J. Harrison), [email protected] (C.J. Sidey-Gibbons).

https://doi.org/10.1016/j.bjps.2019.05.039
1748-6815/© 2019 British Association of Plastic, Reconstructive and Aesthetic Surgeons. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

Cleft lip and/or palate (CL/P) is one of the most prevalent birth defects, affecting approximately one in 700 live births, with major implications for patients’ appearance, speech and psychosocial development [1]. Outcome measures in CL/P have traditionally been objective and derived by care providers rather than reported by patients themselves [2]. The International Consortium for Health Outcome Measurement (ICHOM) has recently agreed upon a core outcome set for the comprehensive appraisal of cleft care, which puts a greater emphasis on patient-reported outcome measures (PROMs) [3]. Sub-scales of the CLEFT-Q form a large proportion of this recommended outcome set.

The CLEFT-Q is a unique cross-cultural PROM designed to measure the outcomes that matter to children and young adults with CL/P. The CLEFT-Q was field tested in an international study that included 2434 children from 12 countries [4]. This PROM is composed of 12 scales and a checklist that measures different aspects of appearance, facial function and health-related quality of life. The instrument was designed with a Rasch measurement theory (RMT) approach, which means that not only do the scales function independently from one another (not all scales need to be administered; the assessor can choose which scales are most relevant to a specific clinical scenario), but also each item functions independently from others on that scale (meaning that two individuals’ scores can be compared irrespective of the items they have answered) [5].

In RMT, the trait that a scale measures (e.g. social function, cleft scar appearance, etc.) is referred to as ‘person ability’. A high value of person ability represents a large amount of that trait and vice versa. Each CLEFT-Q scale provides three or four response options. For example, the cleft scar appearance scale includes seven items that ask how much a patient likes their scar (e.g. colour, width, size, shape, etc.) and provides the following four response options: not at all, a little bit, quite a bit and very much. RMT analysis of the field test data provides estimates for the thresholds of person ability that would cause a respondent to pick one response option over another [6]. It is therefore possible to estimate a new respondent’s person ability level (with some degree of error) based on their response to an individual item. Using the person ability estimate obtained from a single response, a computer algorithm can select the next item in a scale to administer based on that item’s ability to discriminate between respondents at approximately that level of person ability. As more items are administered, the estimation becomes more accurate. This process is known as computerised adaptive testing (CAT) and is most clearly illustrated with an example: in a maths test, one aims to measure mathematical ability (person ability); if the first question is of medium difficulty and the candidate answers correctly, one can estimate that the candidate has a medium to high level of mathematical ability. At this point, there is little merit in asking an easy question, as the candidate is likely to respond to that correctly. It would be more informative to ask the candidate a difficult question to discriminate between a medium and a high level of mathematical ability. The easier questions can be deleted from the test and an estimate of the candidate’s mathematical ability can be obtained with fewer, more relevant questions.

A CAT will typically start with the item that provides the most information for a patient with an average level of person ability, then select items based on a candidate’s responses until a predefined measurement reliability (standard error, SE) has been achieved [7,8]. Alternative stopping rules can be set to terminate a CAT, for example, after a certain number of questions or a time limit.

CAT has been used to administer educational and psychological tests since the 1980s, including nursing and medical licensing examinations, and aptitude tests for military personnel [9]. More recently, the technology has been applied to PROM scales used to measure quality of life in the fields of psychiatry [10,11], rheumatology [8] and orthopaedics [12]. The development of the Q-portfolio, a series of psychometrically robust PROMs designed in accordance with RMT, has been a significant advancement in the field of plastic surgery, enabling accurate measurement of quality of life in a broad range of conditions treated by plastic surgeons. A CAT can be produced for any PROM scale developed with RMT and administered using Concerto, a highly adaptable, open-source, R-based computer adaptive testing platform that is free to use [13].

By eliminating items that are not relevant to an individual, CAT has the ability to reduce the number of items that need to be administered in a PROM while maintaining a high degree of accuracy [6,10–12,14–16]. This approach to data collection is particularly appealing in the case of the CLEFT-Q, which has a total of 110 items if all scales are used and is intended for children as young as 8 years old [17]. CATs can be administered on computers or electronic tablets with an engaging user interface, and results are uploaded instantly to a secure server. They have the potential to provide real-time graphical feedback on performance in an easily interpretable format.

The aim of this study was to simulate responses to the CAT algorithms for each scale of the CLEFT-Q, evaluating the performance of each CAT against its full-length counterpart.

Methods

The CLEFT-Q CATs were developed using person ability threshold levels obtained from the field-test sample using the RUMM2030 platform [18]. The CLEFT-Q field-test study collected data between October 2014 and November 2016 from 30 hospitals across 12 countries. A total of 2434 participants aged 8–29 years were recruited. A more in-depth description of the field-test study and its participants has been published elsewhere [4].

Performance of each CAT was evaluated using a CAT simulation package called Firestar [19] in the R statistical computing environment [20]. Each Firestar simulation computes the item responses that would be endorsed by a participant with a randomly allocated level of person ability in both the full-length scale and its CAT counterpart. CATs were set to terminate at three degrees of SE: <0.32, <0.45 and <0.55 (which approximately equate to Cronbach’s alpha scores of 0.9, 0.8 and 0.7, respectively) [7]. Simulations were iterated 1000 times at each degree of SE. The mean numbers of items needed to achieve a given SE were recorded, along with the standard deviation of each mean, and the minimum and maximum number of items needed to produce that SE. The correlations between person ability values calculated from the full-length scales and those predicted by the CAT versions were then computed.

Hereafter, ‘accuracy’ will be defined as the Pearson’s correlation coefficient of person ability estimates obtained from the simulated CATs and those obtained from the simulated fixed-length forms.

Results

The results of the CAT simulations are displayed in Table 1. With the CATs programmed to terminate after the SE dropped below 0.55, we were able to reduce the mean number of items administered in all scales combined from 110 to 43.1 (a 60.8% reduction) while maintaining an accuracy of 97%. In some simulations, the CATs were able to predict combined CLEFT-Q scale scores with this accuracy from as few as 34 responses. When the stopping rules were set to an SE of <0.45, we achieved a 35.8% reduction in total items with a 99% accuracy.

The CAT that achieved the highest item reduction was the School CAT, which reduced the mean number of items administered from 10 to 2.2 with a 93.7% accuracy (SE < 0.55) or from 10 to 4.2 with a 97.7% accuracy (SE < 0.45). The CAT that was able to reduce item administration the least was the Nostrils CAT, which reduced the mean number of items in the Nostrils scale from 6 to 4.0 (with an accuracy of 99.1%, SE < 0.55).

As might be expected, CATs with the greatest propensity to reduce items were those for the longer scales; this relationship is represented graphically in Figure 1.

Figure 1 Relationship between scale length and mean item reduction (%).

Table 1 Item reduction characteristics of each CAT and their correlation with fixed-length form scores.
Scale                Number of  Standard  Mean number    Standard   Minimum number   Maximum number   Correlation with
                     items      error     of items used  deviation  of items needed  of items needed  patient response
All scales combined  110        0.32      105.546        3.068      80               110              1.000
                                0.45      70.594         4.175      51               110              0.990
                                0.55      43.112         2.207      34               60               0.970
Cleft Lip Scar       7          0.32      7.000          0.000      7                7                1.000
                                0.45      6.315          0.465      6                7                0.998
                                0.55      3.934          0.248      3                4                0.987
Face                 9          0.32      9.000          0.000      9                9                1.000
                                0.45      5.106          1.183      4                9                0.986
                                0.55      3.204          0.680      3                6                0.967
Jaws                 7          0.32      7.000          0.000      7                7                1.000
                                0.45      6.459          0.655      5                7                0.999
                                0.55      4.446          0.787      4                6                0.991
Lips                 9          0.32      9.000          0.000      9                9                1.000
                                0.45      6.515          0.794      6                9                0.994
                                0.55      4.000          0.000      4                4                0.983
Nose                 12         0.32      11.030         0.862      10               12               0.999
                                0.45      5.800          1.958      4                12               0.986
                                0.55      3.353          0.653      3                5                0.970
Nostrils             6          0.32      6.000          0.000      6                6                1.000
                                0.45      6.000          0.000      6                6                1.000
                                0.55      4.000          0.000      4                4                0.991
Teeth                8          0.32      8.000          0.000      8                8                1.000
                                0.45      5.088          1.466      4                8                0.988
                                0.55      2.905          0.902      2                5                0.966
Psychological        10         0.32      9.637          1.178      5                10               1.000
                                0.45      6.008          1.441      3                10               0.989
                                0.55      3.513          0.730      2                5                0.971
School               10         0.32      8.544          1.677      4                10               0.998
                                0.45      4.174          1.077      3                10               0.977
                                0.55      2.238          0.519      2                4                0.937
Social               10         0.32      8.770          1.574      4                10               0.998
                                0.45      4.531          1.199      3                10               0.976
                                0.55      2.996          0.595      2                5                0.951
Speech Distress      10         0.32      9.795          0.929      5                10               1.000
                                0.45      6.878          1.444      3                10               0.989
                                0.55      4.066          0.870      2                6                0.962
Speech Function      12         0.32      11.770         1.062      6                12               1.000
                                0.45      7.720          1.419      4                12               0.992
                                0.55      4.457          0.758      3                6                0.968

Discussion

CAT algorithms are an appealing tool for the collection of patient-reported outcome data, reducing the length of questionnaires without compromising their accuracy [6,10–12,14–16]. A range of PROMs have recently been developed with RMT in the field of plastic surgery, all of which could potentially be refined by CAT [21–24]. Work has begun to test the use of CAT in other patient-centred outcome measures at the Patient-Reported Outcomes, Value and Experience (PROVE) Centre, Brigham and Women’s Hospital, Harvard Medical School. CAT is likely to revolutionise the way plastic surgeons collect quality of life data across a range of sub-specialties. In addition to its use in day-to-day
monitoring of clinical progression, CAT will facilitate the study of disease severity, treatment effectiveness, comparative treatment effectiveness and treatment value from the perspective of a patient, in a way that is less burdensome than our current means.

A software platform is required to administer CATs, record their results and display clinically meaningful feedback in a way that is accessible to both the clinician and the patient. The authors of this paper currently recommend the administration of CATs through Concerto, a highly adaptable, open-source, R-based computer adaptive testing platform that is free to use for non-profit purposes.

An advantage of the CLEFT-Q is that each scale is independently functioning; therefore, researchers and clinicians can reduce the response burden by choosing which scales to administer. However, if all scales were to be used, the length (in terms of number of items) would exceed that of other paediatric quality of life measures [25–28]. The response burden of questionnaires is of particular concern in the paediatric population, and the development of a CAT for the CLEFT-Q is an exciting advancement.

In this proof-of-concept study, we demonstrate the ability of CAT algorithms to substantially reduce the number of items in the CLEFT-Q, while maintaining a remarkably high degree of accuracy. Acceptable levels of accuracy and SE for different situations (e.g. population-based research, clinical practice, etc.) will become inferable with more work to establish the minimal important difference of CLEFT-Q scores. CLEFT-Q scales have recently been demonstrated to have content validity for use in other paediatric craniofacial conditions [29], and future work may broaden the potential clinical applications of these scales and their CAT counterparts.

The ICHOM has recently agreed upon a holistic set of outcome measures for CL/P [3]. This core outcome set will facilitate patient-centred, evidence-based practice, inform clinical commissioning groups and improve the patient experience. CAT is likely to play a fundamental role in bringing the CLEFT-Q scales advocated by the ICHOM into practice.

Conclusion

The potential for CAT to decrease the number of items needed to obtain CLEFT-Q scale scores has been demonstrated. By using CATs for each of the 12 CLEFT-Q scales, we reduced the number of questions asked overall from 110 to a mean of 43.1 (range 34–60), predicting the final result with a 97% accuracy. Further work is required to refine modern outcome measures into focused, interactive tools that provide engaging feedback to patients and clinically useful results to clinicians. CAT will play a key part in the design of these tools.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.bjps.2019.05.039.

References

1. Mossey PA, Little J, Munger RG, Dixon MJ, Shaw WC. Cleft lip and palate. Lancet 2009. doi:10.1016/S0140-6736(09)60695-4.
2. Jones T, Al-Ghatam R, Atack N, et al. A review of outcome measures used in cleft care. J Orthod 2014. doi:10.1179/1465313313Y.0000000086.
3. Allori AC, Kelley T, Meara JG, et al. A standard set of outcome measures for the comprehensive appraisal of cleft care. Cleft Palate-Craniofacial J 2017. doi:10.1597/15-292.
4. Klassen AF, Riff KWYW, Longmire NM, et al. Psychometric findings and normative values for the CLEFT-Q based on 2434 children and young adult patients with cleft lip and/or palate from 12 countries. CMAJ 2018. doi:10.1503/cmaj.170289.
5. Nguyen TH, Han H-R, Kim MT, Chan KS. An introduction to item response theory for patient-reported outcome measurement. Patient 2014. doi:10.1007/s40271-013-0041-0.
6. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the patient-reported outcomes measurement information system (PROMIS). Med Care 2007. doi:10.1097/01.mlr.0000250483.85507.04.
7. Gibbons C, Bower P, Lovell K, Valderas J, Skevington S. Electronic quality of life assessment using computer-adaptive testing. J Med Internet Res 2016. doi:10.2196/jmir.6053.
8. Khanna D, Krishnan E, Dewitt EM, Khanna PP, Spiegel B, Hays RD. The future of measuring patient-reported outcomes in rheumatology: patient-reported outcomes measurement information system (PROMIS). Arthritis Care Res 2011. doi:10.1002/acr.20581.
9. Seo DG. Overview and current management of computerized adaptive testing in licensing/certification examinations. J Educ Eval Health Prof 2017. doi:10.3352/jeehp.2017.14.17.
10. Smits N, Cuijpers P, van Straten A. Applying computerized adaptive testing to the CES-D scale: a simulation study. Psychiatry Res 2011. doi:10.1016/j.psychres.2010.12.001.
11. Smits N, Zitman FG, Cuijpers P, Den Hollander-Gijsman ME, Carlier IV. A proof of principle for using adaptive testing in routine outcome monitoring: the efficiency of the mood and anxiety symptoms questionnaire-anhedonic depression CAT. BMC Med Res Methodol 2012. doi:10.1186/1471-2288-12-4.
12. Hart DL, Mioduski JE, Stratford PW. Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. J Clin Epidemiol 2005. doi:10.1016/j.jclinepi.2004.12.004.
13. Psychometrics Centre, University of Cambridge. Concerto Adaptive Testing Platform. https://www.psychometrics.cam.ac.uk/newconcerto. Published 2013. Accessed 09 August 2018.
14. Cella D, Riley W, Stone A, et al. The patient-reported outcomes measurement information system (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008. J Clin Epidemiol 2010. doi:10.1016/j.jclinepi.2010.04.011.
15. Forbey JD, Ben-Porath YS. Computerized adaptive personality testing: a review and illustration with the MMPI-2 computerized adaptive version. Psychol Assess 2007. doi:10.1037/1040-3590.19.1.14.
16. Gibbons RD, Weiss DJ, Kupfer DJ, et al. Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatr Serv 2008. doi:10.1176/appi.ps.59.4.361.
17. Tsangaris E, Wong Riff KWY, Goodacre T, et al. Establishing content validity of the CLEFT-Q. Plast Reconstr Surg - Glob Open 2017. doi:10.1097/GOX.0000000000001305.
18. Andrich D, Sheridan B, Luo G. RUMM2030: Rasch Unidimensional Models for Measurement. 2010.
19. Choi SW. Firestar: computerized adaptive testing simulation program response theory models. Appl Psychol Meas 2009. doi:10.1177/0146621608329892.
20. R Development Team. R: A Language and Environment for Statistical Computing. https://www.r-project.org. Accessed 8 August 2018.
21. Klassen AF, Cano SJ, Scott A, Snell L, Pusic AL. Measuring patient-reported outcomes in facial aesthetic patients: development of the FACE-Q. Facial Plast Surg 2010. doi:10.1055/s-0030-1262313.
22. Pusic AL, Klassen AF, Scott AM, Klok JA, Cordeiro PG, Cano SJ. Development of a new patient-reported outcome measure for breast surgery: the BREAST-Q. Plast Reconstr Surg 2009. doi:10.1097/PRS.0b013e3181aee807.
23. Klassen AF, Cano SJ, Alderman A, et al. The BODY-Q: a patient-reported outcome instrument for weight loss and body contouring treatments. Plast Reconstr Surg - Glob Open 2016. doi:10.1097/GOX.0000000000000665.
24. Klassen AF, Ziolkowski N, Mundy LR, et al. Development of a new patient-reported outcome instrument to evaluate treatments for scars: the SCAR-Q. Plast Reconstr Surg - Glob Open 2018;6(4):e1672. doi:10.1097/GOX.0000000000001672.
25. Broder HL, Wilson-Genderson M, Sischo L. Reliability and validity testing for the child oral health impact profile-reduced (COHIP-SF 19). J Public Health Dent 2012. doi:10.1111/j.1752-7325.2012.00338.x.
26. Broder HL, Wilson-Genderson M. Reliability and convergent and discriminant validity of the child oral health impact profile (COHIP child’s version). Commun Dent Oral Epidemiol 2007. doi:10.1111/j.1600-0528.2007.0002.x.
27. Varni JW, Seid M, Rode CA. The PedsQL: measurement model for the pediatric quality of life inventory. Med Care 1999. doi:10.1186/1477-7525-11-47.
28. Hullmann SE, Ryan JL, Ramsey RR, Chaney JM, Mullins LL. Measures of general pediatric quality of life: child health questionnaire (CHQ), DISABKIDS chronic generic measure (DCGM), KINDL-R, pediatric quality of life inventory (PedsQL) 4.0 generic core scales, and quality of my life questionnaire (QoML). Arthritis Care Res 2011. doi:10.1002/acr.20637.
29. Longmire NM, Wong Riff KWY, O’Hara JL, et al. Development of a new module of the FACE-Q for children and young adults with diverse conditions associated with visible and/or functional facial differences. Facial Plast Surg 2017. doi:10.1055/s-0037-1606361.
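
The adaptive procedure described in the Introduction and Methods (start from an average ability, ask the most informative remaining item, stop once the SE falls below a threshold, then correlate CAT estimates with full-length estimates) can be illustrated with a minimal Python sketch. It is not the Firestar implementation used in the study, and it makes several assumptions that do not come from the paper: a hypothetical bank of 30 dichotomous Rasch items (CLEFT-Q items have three or four response options), an expected a posteriori ability estimate under a standard normal prior, and illustrative names such as `BANK`, `run_cat` and `reliability_from_se`. The last function uses the common approximation that, on a unit-variance ability scale, reliability is about 1 minus SE squared.

```python
import math
import random

# Hypothetical bank: 30 dichotomous Rasch items, difficulties spread over -3..3 logits.
BANK = [-3 + 6 * k / 29 for k in range(30)]
GRID = [-4 + 0.1 * k for k in range(81)]  # ability grid for numerical integration


def p_endorse(theta, b):
    """Rasch probability that a person of ability theta endorses an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap(answered):
    """Posterior mean and SD of ability under an assumed N(0, 1) prior.

    answered is a list of (difficulty, response) pairs with response 0 or 1.
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for b, x in answered:
            p = p_endorse(t, b)
            w *= p if x else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answers, se_stop):
    """Ask the most informative remaining item until the SE falls below se_stop."""
    remaining = list(range(len(bank)))
    asked = []
    theta, se = 0.0, 1.0  # start at the prior mean with the prior SD
    while remaining and se >= se_stop:
        # Rasch item information is p * (1 - p), largest when b is closest to theta.
        nxt = max(
            remaining,
            key=lambda i: p_endorse(theta, bank[i]) * (1.0 - p_endorse(theta, bank[i])),
        )
        remaining.remove(nxt)
        asked.append((bank[nxt], answers[nxt]))
        theta, se = eap(asked)
    return theta, se, len(asked)


def pearson(xs, ys):
    """Pearson correlation coefficient, the paper's definition of 'accuracy'."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_people, bank, se_stop, seed=0):
    """Simulate respondents answering both the full-length form and its CAT."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_people):
        true_theta = rng.gauss(0.0, 1.0)
        answers = [1 if rng.random() < p_endorse(true_theta, b) else 0 for b in bank]
        full_theta, _ = eap(list(zip(bank, answers)))
        cat_theta, cat_se, n_used = run_cat(bank, answers, se_stop)
        rows.append((full_theta, cat_theta, cat_se, n_used))
    return rows


def reliability_from_se(se):
    """Approximate reliability on a unit-variance ability scale (an assumption)."""
    return 1.0 - se ** 2


rows = simulate(200, BANK, se_stop=0.55)
mean_items = sum(r[3] for r in rows) / len(rows)
accuracy = pearson([r[0] for r in rows], [r[1] for r in rows])
print(f"mean items used: {mean_items:.1f} of {len(BANK)}")
print(f"correlation between CAT and full-length estimates: {accuracy:.3f}")
print([round(reliability_from_se(s), 1) for s in (0.32, 0.45, 0.55)])
```

Changing `se_stop` to 0.45 or 0.32 in the call to `simulate` mirrors the trade-off reported in Table 1: a stricter threshold uses more items and brings the CAT estimates closer to the full-length ones. The exact figures from this toy bank are not comparable with the CLEFT-Q results.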
