SERIES EDITOR
L. Stephen Miller
Glenn J. Larrabee
Martin L. Rohling
Civil Capacities in Clinical Neuropsychology
Edited by George J. Demakis
Secondary Influences on Neuropsychological Test Performance
Edited by Peter A. Arnett
Neuropsychological Aspects of Substance Use Disorders: Evidence-Based
Perspectives
Edited by Daniel N. Allen and Steven Paul Woods
Neuropsychological Assessment in the Age of Evidence-Based Practice: Diagnostic
and Treatment Evaluations
Edited by Stephen C. Bowden
Neuropsychological Assessment
in the Age of Evidence-Based
Practice
Diagnostic and Treatment Evaluations
Oxford University Press is a department of the University of Oxford. It furthers
the University’s objective of excellence in research, scholarship, and education
by publishing worldwide. Oxford is a registered trade mark of Oxford University
Press in the UK and certain other countries.
9 8 7 6 5 4 3 2 1
Printed by Sheridan Books, Inc., United States of America
Series Preface
the NAN Series and, as can be seen in the pages herein, has spared no effort or
expense to provide the finest-quality venue for the success of the Series.
The Series is designed to be a dynamic and ever-growing set of resources for
the science-based clinical neuropsychologist. As such, the volumes are intended
to individually focus on specific, significant areas of neuropsychological inquiry
in depth, and together over time to cover the majority of the contemporary and
broad clinical areas of neuropsychology. This is a challenging endeavor, and one
which relies on the foremost experts in the neuropsychological field to provide
their insight, knowledge, and interpretation of the empirically supported evi-
dence within each focused topic. It is our hope that the reader recognizes the
many established scholars from our field who have taken on the task of volume
editor and/or chapter author.
While each volume is intended to provide an exhaustive review of its particu-
lar topic, there are numerous constants across the volumes. First, each volume
editor and the respective chapter authors have committed to providing only
evidence-based information that meets that definition. Second, each volume
maintains a broad consistency in format, including
an introductory chapter outlining the volume, and a final discussion chapter
summarizing the state of the art within that topic area. Each volume provides
a comprehensive index, and each chapter provides relevant references for the
reader. Third, each volume is designed to provide information that is directly
and readily usable, in both content and format, to the clinical neuropsycholo-
gist in everyday practice. As such, each volume and chapter within the volume
is obliged to provide information in such a way as to make it accessible as a “pull
off the shelf” resource. Finally, each volume is designed to work within a peda-
gogical strategy such that it educates and informs the knowledgeable neuropsy-
chologist, provides a greater understanding of each particular volume's focus, and
offers meaningful (read "useful") information geared towards enhancing her/
his empirical practice of neuropsychology. In keeping with the educational focus
of the Series, a unique aspect is a collaboration of the Series contributors and the
NAN Continuing Education Committee such that each series volume is available
to be used as a formal continuing education text via the Continuing Education
Units system of NAN.
It is my hope, and the hope of the consulting editors who provide their time,
expertise, and guidance in the development of the NAN Series, that this will
become an oft-used and ever-expanding set of efficient and efficacious resources
for the clinical neuropsychologist and others working with the plethora of per-
sons with brain disorders and dysfunction.
L. Stephen Miller
Editor-in-Chief
National Academy of Neuropsychology
Series on Evidence-Based Practices
Editor’s Preface
Contributors
STEPHEN C. BOWDEN
Paul Meehl argued that knowledge gained through clinical experience in pro-
fessional practice was inevitably a mixture of truths, half-truths, and myth
(Meehl, 1997). The possibility that learning through clinical experience gives
rise to knowledge that is not valid, or is based on myth, creates challenges for
any discipline that claims scientific credentials. These challenges have an impact
on educational practices, the development of scientific thinking in graduate stu-
dents, and on methods of professional development for mature professionals. As
is well known, scientifically unfounded practices have been described through-
out the history of clinical psychology, including the use of tests without estab-
lished validity and reliance on clinical decision-making methods that preclude
scientific evaluation (Garb, 1988; Wood, Nezworski, Lilienfeld, & Garb, 2003).
And clinical neuropsychology is not free from a history of myth, mostly aris-
ing from a neglect of scientific methods. Instead, we need methods that allow
students, young professionals, and mature professionals alike to identify clinical
knowledge that is based on good evidence and so limit the potentially mislead-
ing effects of unscientific thinking. Unscientific thinking risks wasting patients’
time, misusing scarce health-care resources, and may be potentially harmful
(Chelmsford Royal Commission, 1990; Wood et al., 2003).
As Meehl (1997) argued, scientific methods are the only way to distinguish
valid clinical knowledge from myth. Many older professional colleagues were
trained in an era when scientific methods for the refinement of professional
knowledge were less well taught. As a consequence, many colleagues developed
their approach to professional practice in an era when scientific methods to
guide clinical practice were less valued or less accessible (Grove & Meehl, 1996;
Lilienfeld, Ritschel, Lynn, Cautin, & Latzman, 2013; Wood et al., 2003). One
effect of the less rigorous scientific training in the past has been to encourage
clinicians to believe that a reliance on “clinical experience” is a valid source of
knowledge, without the need for explicit evaluation of knowledge claims (Arkes,
1981; Garb, 2005; Meehl, 1973). Younger colleagues trained in clinical neuro-
psychology at the present time, and critically, older colleagues who choose to
engage in effective professional development, have access to scientific methods to
refine clinical thinking that were relatively little known just two to three decades
ago. Using resources that are readily available on the Internet, professionals of
any age can train in methods for the scientific evaluation of clinical knowledge
that are widely adopted across health-care disciplines (see www.cebm.net/;
www.equator-network.org/). These are the methods of evidence-based practice (see
Chelune, this volume).
In fact, methods of evidence-based practice are not new, but they have often
been neglected (Faust, 2012; Garb, 2005; Lilienfeld et al., 2013; Meehl, 1973). The
methods provide a refinement of scientific thinking that has been at the center
of scientific psychology for many years (Matarazzo, 1990; Meehl & Rosen, 1955;
Paul, 2007; Schoenberg & Scott, 2011; Strauss & Smith, 2009). However, in con-
trast to many conventional approaches to evaluating validity in psychology, the
methods of evidence-based practice provide skills that are quickly learned, easily
retained if practiced (Coomarasamy, Taylor, & Khan, 2003), and provide infor-
mation of more direct relevance to clinical decisions than the broad principles of
test validity and research methods typically taught to most graduate psycholo-
gists. While good research-methods training is critical for development of the
scientific foundations of practice, evidence-based practice builds on, and brings
into sharp clinical focus, the relevance of a strong foundation of scientific educa-
tion. As Shlonsky and Gibbs (2004) have observed, “Evidence-based practitio-
ners may be able to integrate research into their daily practice as never before”
(p. 152). Ironically, however, “evidence-based practice” is in danger of becoming
a catchphrase for anything that is done with clients that can somehow be linked
to an empirical study, regardless of the quality of the study or its theoretical
rationale, any competing evidence, or consideration of clients’ needs (Shlonsky
& Gibbs, 2004, p. 137).
Methods for evaluating the quality of published evidence are described in detail in chapters by Berry and Miller, where methods
of critical appraisal are illustrated. The methods of critical appraisal are designed
to allow practitioners to quickly evaluate the quality of a published study and
so to grade the level of validity from weaker to stronger (www.cebm.net/;
www.equator-network.org/). As these chapters show, it is not necessary to be an active
researcher to be a sophisticated consumer of research and a provider of high-
quality evidence-based practice (Straus et al., 2011). Rather, a clinician needs to
understand how to identify high-quality scientific research methods and how
to communicate the relevance of study results to patients. The latter techniques
are facilitated by the methods of critical appraisal described by Berry and Miller
herein.
As Meehl (1997) also argued, the adoption of careful scientific scrutiny to
guide clinical practice is not merely the best way to refine scientific understand-
ing, but is also a fundamental ethical stance. We owe our patients accurate guid-
ance regarding which of our practices rest on good evidence and which of our
practices rely on less certain evidence or unfounded belief (Barlow, 2004). The
American Psychological Association Ethical Principles and the Standards for
Psychological Testing and Assessment require that clinicians undertake treat-
ment and assessment practices that are founded on scientific evidence (American
Educational Research Association, American Psychological Association, & the
National Council on Measurement in Education, 2014; American Psychological
Association, 2010). By extension, the ethical guidelines also require clinicians
to be explicitly cautious when practices sought by a patient, or offered by a cli-
nician, exceed the limits of our scientific knowledge, that is, lack strong scien-
tific support. The methods of evidence-based practice provide some of the most
time-efficient techniques to identify practices based on strong evidence and
to help identify when assessment or treatment practices exceed the limits of
knowledge based on well-designed studies. When supportive evidence from a
well-designed study cannot be found, then a clinician is obliged to infer that the
assessment or treatment practice does not rest on quality evidence and may be
of uncertain value.
MISUNDERSTANDING PSYCHOMETRICS
Another essential technical aspect of test score interpretation relates to the
understanding of psychometric principles. The dictionary of the International
Neuropsychological Society (Loring, 2015) defines psychometrics as the “sci-
entific principles underlying clinical and neuropsychological assessment.”
Although psychometric principles are covered in most graduate courses, many
practitioners gain only a relatively superficial appreciation of their importance
in the interpretation of test scores. As a consequence, imprecise or frankly inde-
fensible test-score interpretation is sometimes observed in clinical practice and
research. Psychometric principles underlie the scientific interpretation of diag-
nosis or the observation of changes in response to treatment interventions or
changing brain function. It is difficult to be a successful evidence-based practi-
tioner if one is using poor assessment tools or does not know how to distinguish
good tools from poor (Barlow, 2005). Unfortunately, there is a common view
that practitioners are not adequately trained in psychometric principles, and that
clinical psychology (including neuropsychology) on one hand, and psychomet-
rics on the other, have diverged as specializations when they should be more
closely integrated to better inform clinical practice (Aiken, West, & Millsap,
2008; Cronbach, 1957; Cumming, 2014; Sijtsma, 2009; Soper, Cicchetti, Satz,
Light, & Orsini, 1988).
In fact, some unfortunate misunderstandings of psychometrics persist. Rather
than psychometrics being seen as the scientific foundation of clinical assessment
for diagnosis or evaluation of change, as it should be, it is instead characterized
as, for example, an American-style fixed-battery approach to assessment (for
diverse views see Macniven, 2016). The diversity of North American approaches
to the practice of clinical neuropsychology, including the popularity of flexible
approaches, is well described by Russell (2012). In other approaches, psychomet-
rics is described as of lesser importance for true clinical insights that are best
derived from a reliance on experience and subjective intuitions, thereby down-
playing norms and test standardization. Any approach that places low emphasis
on test norms and test reliability and validity is an illustration of the older under-
standing of clinical expertise, which elevates the role of subjective judgment and
downplays the importance of well-designed research to inform clinical think-
ing (Isaacs & Fitzgerald, 1999). In this light, a rejection of psychometrics risks
throwing the scientific ‘baby’ out with the psychometric ‘bath water’ (Meehl,
1973; Wood et al., 2007).
Four chapters in the current volume provide a summary of how psychomet-
ric principles of validity and reliability inform theoretical development and
assessment precision in clinical neuropsychology. Lee and colleagues describe
the ways validity methods have been used to refine models of psychopathology
for diagnostic assessment. Riley and colleagues show how assessment of cogni-
tive disorder has been refined using validity methods. Bowden and Finch review
the interpretation of reliability and the dramatic impact on precision in clinical
assessment associated with use of test scores with lower or unknown reliability.
Hinton-Bayre shows how reliable-change criteria can be used to improve preci-
sion in the interpretation of clinical change. These four chapters review founda-
tional knowledge in scientific practice of neuropsychology.
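As a concrete, simplified illustration of the psychometric reasoning these chapters develop, consider a minimal sketch of the standard error of measurement and of one common formulation of a reliable-change index (the index described by Jacobson and Truax). The sketch below is illustrative only and is not drawn from the chapters themselves; the function names and the numerical values (a normative standard deviation of 15, a reliability of .90, and baseline and retest scores of 100 and 108) are hypothetical, and the chapters by Bowden and Finch and by Hinton-Bayre discuss more refined variants.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1.0 - reliability)


def reliable_change_index(baseline: float, retest: float,
                          sd: float, reliability: float) -> float:
    """Basic Jacobson-Truax reliable change index: the retest minus baseline
    difference divided by the standard error of the difference."""
    se_diff = math.sqrt(2.0) * standard_error_of_measurement(sd, reliability)
    return (retest - baseline) / se_diff


# Hypothetical index scores: normative SD = 15, retest reliability r_xx = .90.
sem = standard_error_of_measurement(15, 0.90)
print(round(sem, 1))              # about 4.7 points
print(round(1.96 * sem, 1))       # 95% band around an observed score: about +/- 9.3
print(round(reliable_change_index(100, 108, 15, 0.90), 2))  # about 1.19
# Because 1.19 < 1.96, an 8-point gain does not exceed what measurement
# error alone could plausibly produce at this level of reliability.
```

Lower reliability widens both the confidence band around a single score and the threshold for declaring reliable change, which is the practical point these chapters develop in detail.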
ORGANIZATION OF THE BOOK
After this introductory chapter, the next three chapters review the validity of
evidence for theories of cognitive function and psychopathology relevant to
neuropsychological practice. In Chapter 2, Riley and colleagues review the fun-
damental importance of theoretical refinement in clinical neuropsychology,
showing how the validity of tests is always enhanced by a strong theoretical
framework. Riley and colleagues show that there is a strong, reciprocal relation-
ship between the quality of our theories of neuropsychological assessment and
the validity of our assessment practices. In Chapter 3, Jewsbury and Bowden
review current models of cognitive assessment, suggesting that one particular
model stands out as a comprehensive schema for describing neuropsychological
assessment. These authors provide a provisional taxonomy of neuropsychologi-
cal tests to guide practice and promote further research. In Chapter 4, Lee and
colleagues show that refinements in models of psychopathology provide a strong
empirical guide to the assessment of psychopathology across a wide variety of
patient populations and clinical settings.
In the subsequent chapters, reviews and applications of the principles of
evidence-based practice are explained and illustrated. In Chapter 5, Bowden and
Finch outline the criteria for evaluating the reliability of test scores, showing that
simple techniques allow clinicians to estimate the precision of their assessments
and also to guard against the potentially distracting influences of tests with low
reliability, an epistemological trap for the unwary. The specific application of
reliability concepts to the detection of change over time is then reviewed by
Hinton-Bayre in Chapter 6, showing the variety of techniques that are available
to clinicians to improve detection of change related, for example, to therapeu-
tic interventions or changing brain function. Chelune describes, in Chapter 7,
the broad framework of evidence-based practice in clinical neuropsychology,
showing how clinicians, if they are conversant with the principles, can bring the
best evidence to bear on their clinical decisions. Chelune draws together best-
evidence techniques that have a long history in clinical psychology and neuro-
psychology and broader health-care research. In Chapter 8, Bigler describes the
current state of evidence supporting the clinical interpretation of neuroimaging
studies, delineating imaging techniques that have established clinical validity
and those that are under development.
The final chapters in this volume illustrate the clinical application of best-
evidence criteria and techniques for evaluation of published studies. Schoenberg
describes the EQUATOR network criteria in Chapter 9. These criteria form
the basis of study design and reporting standards that have been adopted by a
large number of biomedical journals, including an increasing number of neuro-
psychology journals (e.g., Lee, 2016; Bowden & Loring, 2016). The EQUATOR
network criteria highlight the importance of well-designed clinical studies to
REFERENCES
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, mea-
surement, and methodology in psychology: Replication and extension of the Aiken,
West, Sechrest, and Reno’s (1990) survey of PhD programs in North America.
American Psychologist, 63, 32–50.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (2014). Standards for Educational
and Psychological Testing. Washington, DC: American Educational Research
Association.
American Psychiatric Association. (2000). Diagnostic and Statistical Manual of Mental
Disorders, Fourth Edition, Text Revision. Washington, DC: Author.
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental
Disorders, Fifth Edition. Arlington, VA: Author.
American Psychological Association. (2010). Ethical Principles of Psychologists and
Code of Conduct. Available from: http://www.apa.org/ethics/code/. Accessed
June 1, 2016.
Arkes, H. R. (1981). Impediments to accurate clinical judgement and possible ways to
minimise their impact. Journal of Consulting and Clinical Psychology, 49, 323–330.
Barlow, D. H. (2004). Psychological treatments. American Psychologist, 59, 869–878.
Barlow, D. H. (2005). What’s new about evidence-based assessment? Psychological
Assessment, 17, 308–311.
Baldessarini, R. J., Finklestein, S., & Arana, G. W. (1983). The predictive power of diag-
nostic tests and the effect of prevalence of illness. Archives of General Psychiatry,
40, 569–573.
Bowden, S. C. (1990). Separating cognitive impairment in neurologically asymptom-
atic alcoholism from Wernicke-Korsakoff syndrome: Is the neuropsychological dis-
tinction justified? Psychological Bulletin, 107, 355–366.
Bowden, S. C. (2010). Alcohol related dementia and Wernicke-Korsakoff syndrome.
In D. Ames, A. Burns, & J. O’Brien (Eds.), Dementia (4th ed., pp. 722–729).
London: Edward Arnold.
Bowden, S. C., & Loring, D. W. (2016). Editorial. Neuropsychology Review, 26, 107–108.
Bowden, S. C., & Ritter, A. J. (2005). Alcohol-related dementia and the clinical spec-
trum of Wernicke-Korsakoff syndrome. In A. Burns, J. O’Brien, & D. Ames (Eds.),
Dementia (3rd ed., pp. 738–744). London: Hodder Arnold.
Bowden, S. C., & Scalzo, S. J. (2016). Alcohol-related dementia and Wernicke-Korsakoff
syndrome. In D. Ames, J. O’Brien, & A. Burns. (Eds.), Dementia (5th ed., pp. 858–
868). Oxford: Taylor & Francis.
Brehmer, B. (1980). In one word: Not from experience. Acta Psychologica, 45, 223–241.
Butters, N., & Cermak, L. S. (1980). Alcoholic Korsakoff’s Syndrome: An Information-
Processing Approach to Amnesia. London: Academic Press.
Chelmsford Royal Commission. (1990). Report of the Royal Commission into Deep
Sleep Therapy. Sydney, Australia: Government Printing Service.
Cohen, P., & Cohen, J. (1984). The clinician’s illusion. Archives of General Psychiatry,
41, 1178–1182.
Coomarasamy, A., Taylor, R., & Khan, K. (2003). A systematic review of postgradu-
ate teaching in evidence-based medicine and critical appraisal. Medical Teacher,
25, 77–81.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American
Psychologist, 12, 671–684.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29.
Davison, G. C., & Lazarus, A. A. (2007). Clinical case studies are important in the
science and practice of psychotherapy. In S. Lilienfeld & W. O’Donohue (Eds.), The
Great Ideas of Clinical Science: 17 Principles That Every Mental Health Professional
Should Understand (pp. 149–162). New York: Routledge.
Devinsky, O. (2009). Delusional misidentifications and duplications: Right brain
lesions, left brain delusions. Neurology, 72, 80–87.
Einhorn, H. J. (1986). Accepting error to make less error. Journal of Personality
Assessment, 50, 387–395.
Faust, D. (1986). Learning and maintaining rules for decreasing judgment accuracy.
Journal of Personality Assessment, 50, 585–600.
Faust, D. (2007). Decision research can increase the accuracy of clinical judgement and
thereby improve patient care. In S. Lilienfeld & W. O’Donohue (Eds.), The Great
Ideas of Clinical Science: 17 Principles That Every Mental Health Professional Should
Understand (pp. 49–76). New York: Routledge.
Faust, D. (Ed.). (2012). Coping with Psychiatric and Psychological Testimony (6th ed.).
New York: Oxford University Press.
Fowler, R. D., & Matarazzo, J. (1988). Psychologists and psychiatrists as expert wit-
nesses. Science, 241, 1143.
Garb, H. N. (1988). Comment on “The study of clinical judgment: An ecological
approach.” Clinical Psychology Review, 8, 441–444.
Garb, H. N. (1998). Studying the Clinician: Judgment Research and Psychological
Assessment. Washington, DC: American Psychological Association.
Garb, H. N. (2005). Clinical judgment and decision making. Annual Review Clinical
Psychology, 1, 67–89.
Gates, N. J., & March, E. G. (2016). A neuropsychologist’s guide to undertak-
ing a systematic review for publication: Making the most of PRISMA guide-
lines. Neuropsychology Review. Published online first: May 19, 2016. doi:10.1007/s11065-016-9318-0
Groth-Marnat, G. (2009). Handbook of Psychological Assessment (5th ed.). Hoboken,
NJ: John Wiley & Sons.
Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective,
impressionistic) and formal (mechanical, algorithmic) prediction procedures: The
clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293–323.
Haynes, R. B., Devereaux, P. J., & Guyatt, G. H. (2002). Clinical expertise in the era of
evidence-based medicine and patient choice. Evidence Based Medicine, 7, 36–38.
Isaacs, D., & Fitzgerald, D. (1999). Seven alternatives to evidence based medicine.
British Medical Journal, 319, 1618.
Kaufman, A. S. (1994). Intelligent Testing with the WISC-III. New York: John Wiley
& Sons.
Kim, E., Ku, J., Jung, Y.-C., Lee, H., Kim, S. I., Kim, J.-J., … Song, D.-H. (2010).
Restoration of mammillothalamic functional connectivity through thiamine
replacement therapy in Wernicke’s encephalopathy. Neuroscience Letters, 479,
257–261.
Kopelman, M. D., Thomson, A. D., Guerrini, I., & Marshall, E. J. (2009). The Korsakoff
syndrome: Clinical aspects, psychology and treatment. Alcohol & Alcoholism, 44,
148–154.
Larrabee, G. J. (Ed.). (2011). Forensic Neuropsychology: A Scientific Approach (2nd ed.).
New York: Oxford University Press.
Lee, G. P. (2016). Editorial. Archives of Clinical Neuropsychology, 31, 195–196.
Lezak, M. D. (1976). Neuropsychological Assessment. New York: Oxford University Press.
Lilienfeld, S. O., Ritschel, L. A., Lynn, S. J., Cautin, R. L., & Latzman, R. D. (2013). Why
many clinical psychologists are resistant to evidence-based practice: Root causes
and constructive remedies. Clinical Psychology Review, 33, 883–900.
Loring, D. W. (2015). INS Dictionary of Neuropsychology and Clinical Neurosciences.
Oxford, UK: Oxford University Press.
Macniven, J. (Ed.). (2016). Neuropsychological Formulation: A Clinical Casebook.
New York: Springer International Publishing.
Matarazzo, J. D. (1990). Psychological assessment versus psychological testing:
Validation from Binet to the school, clinic, and courtroom. American Psychologist,
45, 999–1017.
Meehl, P. E. (1973). Why I do not attend case conferences. In P. E. Meehl (Ed.),
Psychodiagnosis: Selected Papers (pp. 225–302). Minneapolis, MN: University of
Minnesota Press.
Meehl, P. E. (1997). Credentialed persons, credentialed knowledge. Clinical
Psychology: Science and Practice, 4, 91–98.
Meehl, P. E., & Rosen, A. (1955). Antecedent probability and the efficiency of psycho-
metric signs, patterns, or cutting scores. Psychological Bulletin, 52, 194–216.
Menezes, N. M., Arenovich, T., & Zipursky, R. B. (2006). A systematic review of lon-
gitudinal outcome studies of first-episode psychosis. Psychological Medicine, 36,
1349–1362.
Paul, G. L. (2007). Psychotherapy outcome can be studied scientifically. In S. Lilienfeld
& W. O’Donohue (Eds.), The Great Ideas of Clinical Science: 17 Principles That Every
Mental Health Professional Should Understand (pp. 119–147). New York: Routledge.
Russell, E. W. (2012). The Scientific Foundation of Neuropsychological Assessment: With
Applications to Forensic Evaluation. London: Elsevier.
Sackett, D. L. (1995). Applying overviews and meta-analyses at the bedside. Journal of
Clinical Epidemiology, 48, 61–66.
Scalzo, S. J., Bowden, S. C., Ambrose, M. L., Whelan, G., & Cook, M. J. (2015).
Wernicke-Korsakoff syndrome not related to alcohol use: A systematic review.
Journal of Neurology, Neurosurgery, and Psychiatry, 86, 1362–1368.
Schoenberg, M. R., & Scott, J. G. (Eds.). (2011). The Little Black Book of Neuropsychology:
A Syndrome-Based Approach. New York: Springer.
Sechi, G., & Serra, A. (2007). Wernicke’s encephalopathy: New clinical settings and
recent advances in diagnosis and management. The Lancet. Neurology, 6, 442–455.
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational
Behavior and Human Decision Processes, 53, 252–266.
Shlonsky, A., & Gibbs, L. (2004). Will the real evidence-based practice please stand up?
Teaching the process of evidence-based practice to the helping professions. Brief
Treatment and Crisis Intervention, 4, 137–153.
Sijtsma, K. (2009). Reliability beyond theory and into practice. Psychometrika, 74,
169–173.
Smith, R. (2006). Peer review: A flawed process at the heart of science and journals.
Journal of the Royal Society of Medicine, 99, 178–182.
Soper, H. V., Cicchetti, D. V., Satz, P., Light, R., & Orsini, D. L. (1988). Null hypothesis
disrespect in neuropsychology: Dangers of alpha and beta errors. Journal of Clinical
and Experimental Neuropsychology, 10, 255–270.
Straus, S., Richardson, W. S., Glasziou, P., & Haynes, R. B. (2011). Evidence-Based
Medicine: How to Practice and Teach EBM (4th ed.). Edinburgh, UK: Churchill
Livingstone.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and
methodology. Annual Review of Clinical Psychology, 5, 1–25.
Svanberg, J., Withall, A., Draper, B., & Bowden, S. (2015). Alcohol and the Adult Brain.
London: Psychology Press.
Victor, M. (1994). Alcoholic dementia. The Canadian Journal of Neurological Sciences,
21, 88–99.
Victor, M., Adams, R. D., & Collins, G. H. (1971). The Wernicke-Korsakoff syn-
drome: A clinical and pathological study of 245 patients, 82 with post-mortem
examinations. Contemporary Neurology Series, 7, 1–206.
Victor, M., & Yakovlev, P. I. (1955). S. S. Korsakoff’s psychic disorder in conjunction
with peripheral neuritis: A translation of Korsakoff’s original article with comments
on the author and his contribution to clinical medicine. Neurology, 5, 394–406.
Walsh, K. W. (1985). Understanding Brain Damage: A Primer of Neuropsychological
Evaluation. Edinburgh, UK: Churchill Livingstone.
Wood, J. M., Garb, H. N., & Nezworski, M. T. (2007). Psychometrics: Better measure-
ment makes better clinicians. In S. Lilienfeld & W. O’Donohue (Eds.), The Great
Ideas of Clinical Science: 17 Principles That Every Mental Health Professional Should
Understand (pp. 77–92). New York: Routledge.
Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2003). What’s Wrong
with the Rorschach? Science Confronts the Controversial Inkblot Test. San Francisco,
CA: Jossey-Bass.
Theory as Evidence
ELIZABETH N. RILEY, HANNAH L. COMBS,
HEATHER A. DAVIS, AND GREGORY T. SMITH
in criteria. For example, criteria are often based on some form of judgement,
such as teacher or parent rating, or classification status using a highly imper-
fect diagnostic system like the American Psychiatric Association Diagnostic and
Statistical Manual of Mental Disorders (currently DSM-5: APA, 2013). There are
certainly limits to the criteria predicted by neuropsychological tests. This prob-
lem will not be the focus of this chapter, although it is of course important to
bear in mind.
The second problem is that the criterion validity approach does not facili-
tate the development of basic theory (Cronbach & Meehl, 1955; Smith, 2005;
Strauss & Smith, 2009). When tests are developed for predicting a very specific
criterion, and when they are validated only with respect to that predictive task,
the validation process is likely to contribute little to theory development. As a
result, criterion validity findings tend not to provide a strong foundation for
deducing likely relationships among psychological constructs, and hence for the
development of generative theory. This limitation led the field of psychology to
develop the concept of construct validity and to focus on construct validity in test
and theory validation (for review, see Strauss & Smith, 2009).
In order to develop and test theories of the relationships among psychologi-
cal variables, it is necessary to invoke psychological constructs that do not have
a single criterion reference point (Cronbach & Meehl, 1955). Constructs such
as fluid reasoning, crystallized intelligence, and working memory cannot be
directly observed or tied to a single criterion behavior or action; rather, they
are inferred entities (Jewsbury et al., 2016; McGrew, 2009). We infer the exis-
tence of constructs from data because doing so proves helpful for understanding
psychological processes that lead to important individual differences in real-life
task performance and in real-life outcomes. We consider it important to study
constructs because of their value in understanding, explaining, and predicting
human behaviors (Smith, 2005).
An important challenge is how to measure such inferred or hypothetical enti-
ties in valid ways. It is not possible to rely on successful prediction of a single
criterion, because the inferred meaning of the constructs cannot be operational-
ized in a single task (Cronbach & Meehl, 1955). Instead, one must show that a
measure of a given construct relates to measures of other constructs in system-
atic ways that are predictable from theory. For inferred constructs, there is no
perfect way to show that a measure reflects the construct validly, except to test
whether scores on the measure conform to a theory, of which the target con-
struct is a part.
The process of construct validation is complex and may require multiple
experiments. Suppose we develop a measure of hypothetical construct A. We can
only validate our measure if we have a theoretical argument that, for example,
A relates positively to B, is unrelated to C, and relates negatively to D. If we have
good reason to propose such a theory, and if we have measures of B, C, and D
along with our new measure of A, we can test whether A performs as predicted
by the theoretical argument. Imagine that, over time, tests like this one provide
repeated support for our hypotheses. As that occurs, we become more confident
both that our measure is a valid measure of construct A, and that our theory
relating A, B, C, and D has validity. In a very real sense, each test of the validity of
our measure of A is simultaneously a test of the theoretical proposition describ-
ing relationships among the four constructs. After extensive successful empirical
analysis, our evidence supporting the validity of our measure of A comes to rest
on a larger, cumulative body of empirical support for the theory of which A is
a part.
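To make this logic concrete, the pattern of predicted relationships can be pictured as a simple check of observed correlations against theoretically expected signs. The sketch below is illustrative only and is not drawn from any published validation study: the data file, the column names A, B, C, and D, and the 0.10 threshold for treating a correlation as negligible are hypothetical choices, and in practice such predictions are usually tested with formal models (for example, confirmatory factor analysis) rather than by sign alone.

```python
# Illustrative check of a predicted convergent/discriminant pattern for a
# new measure "A". Column names, file name, and threshold are hypothetical.
import pandas as pd


def check_predicted_pattern(scores: pd.DataFrame,
                            predictions: dict,
                            negligible: float = 0.10) -> pd.DataFrame:
    """Compare observed correlations of measure 'A' with the signs predicted
    by theory: '+' positive, '0' near zero, '-' negative."""
    rows = []
    for other, predicted in predictions.items():
        r = scores["A"].corr(scores[other])  # Pearson correlation
        if predicted == "+":
            consistent = r > negligible
        elif predicted == "-":
            consistent = r < -negligible
        else:  # predicted to be essentially unrelated
            consistent = abs(r) <= negligible
        rows.append({"measure": other, "predicted": predicted,
                     "observed_r": round(r, 2), "consistent": consistent})
    return pd.DataFrame(rows)


# Hypothetical usage: columns A, B, C, and D hold total scores for each measure.
# data = pd.read_csv("validation_sample.csv")
# print(check_predicted_pattern(data, {"B": "+", "C": "0", "D": "-"}))
```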
Then, when it comes to tests of the criterion validity of A with respect to some
important neuropsychological function, the criterion validity experiment repre-
sents a test of an extension of an existing body of empirical evidence organized as
a psychological theory. Because of the matrix of underlying empirical support,
positive tests of criterion validity are less likely to reflect false-positive findings.
In the contrasting case, in which tests of criterion validity are conducted without
such an underlying network of empirical support, there is a greater chance that
positive results may not be replicable.
In general, the content of the obvious items is consistent with emerging theories of
psychopathology, but the content of the subtle items tends not to be. Most of the
many studies testing the replicability of validity findings using the subtle items
have found that those items often do not predict the intended criteria validly and
sometimes even predict in the opposite direction (Hollrah, Schlottmann, Scott,
& Brunetti, 1995; Jackson, 1971; Weed et al., 1990). Although those items were
originally chosen based on their criterion validity (using the criterion-keying
approach: Hathaway & McKinley, 1942), researchers can no longer be confident
that they predict intended criteria accurately. The case of MMPI subtle items is
a case in which criterion validity was initially demonstrated in the absence of
supporting theory, and the criterion validity findings have generally not proven
replicable. The contrasting examples of the Wechsler scales and the MMPI subtle
items reflect the body of knowledge in clinical psychology, which indicates the
greater success derived from predicting criteria based on measures well sup-
ported in theory.
The distinction between the automatic and the voluntary has been highly influential in basic psy-
chological science for more than a century (e.g., James, 1890; Posner & Snyder,
1975; Quantz, 1897).
Development and validation of the modern Stroop Test (Stroop, 1935) as a
clinical measure thus rests on an impressive basis of theoretical and empirical
support. The test has been found to be a relatively robust measure of attention
and interference in numerous experimental studies, and psychologists have
adopted it for use in the clinical setting. The test has become so popular in
clinical settings that several versions and normative datasets have been created.
Ironically, the existence of multiple versions creates a separate set of administra-
tion and interpretation problems (Mitrushina, 2005). The Stroop Test has proven
to be an effective measure of executive disturbance in numerous neurological
and psychiatric populations (e.g., schizophrenia, Parkinson’s disease, chronic
alcoholism, Huntington’s disease, attention-deficit hyperactivity disorder
[ADHD]: Strauss, Sherman, & Spreen, 2006). Impaired attentional abilities asso-
ciated with traumatic brain injury, depression, and bipolar disorder all appear to
influence Stroop performance (Lezak, Howeison, Bigler, & Tranel, 2012; Strauss
et al., 2006). Consistent with these findings, the Stroop has been shown to be
sensitive to both focal and diffuse lesions (Demakis, 2004). Overall, the Stroop
Test is a well-validated, reliable measure that provides significant information
on the cognitive processes underlying various neurological and psychological
disorders. As is characteristic of the iterative nature of scientific development,
findings from use of the test have led to modifications in the underlying theory
(MacLeod, 1991).
Sometimes, researchers attempt to create assessment measures based on
theory but the tests do not, after repeated examination, demonstrate good cri-
terion validity. Generally, one of two things happens in the case of these theory-
driven tests that ultimately tend not to be useful. First, it is possible that the test
being explored was simply not a good measure of the construct of interest. This
can lead to important modifications of the measure so that it more accurately
assesses the construct of interest and can be potentially useful in the future. The
second possibility is that the theory underlying development of the measure is
not fully accurate. When the latter is the case, attempts to use a test to predict
criteria will provide inconsistent results, characterized by failures to replicate.
An example of this problem is illustrated in the long history of the Rorschach
test in clinical psychology. The Rorschach was developed from hypothetical
contentions that one’s perceptions of stimuli reveal aspects of one’s personality
(Rorschach, 1964) and that people project aspects of themselves onto ambiguous
stimuli (Frank, 1939). This test provides an example of a hypothesis-based test
that has produced inconsistent criterion validity results, a test based on a theoret-
ical contention that has since been challenged and deemed largely unsuccessful.
One school of thought is that the hypotheses are sound but that the test scoring
procedures are flawed. As a consequence, there have been repeated productions
of new scoring systems over the years (see Wood, Nezworski, Lilienfeld, & Garb,
2003). To date, each scoring system, including Exner’s (1974, 1978) systems, has
had limited predictive validity (Hunsley & Bailey, 1999; Wood et al., 2003). The
modest degree of support provided by the literature has led to a second inter-
pretation, which is that the hypothesis underlying use of the test lacks validity
(Wood et al., 2003).
We believe there are two important lessons neuropsychologists can draw from
histories of tests such as the Rorschach. The first is to be aware of the validation
history of theories that are the basis for tests one is considering using. Knowing
whether underlying theories have sound empirical support or not can provide
valuable guidance for the clinician. The second is that it may not be wise to invest
considerable effort in trying to alter or modify tests that either lack adequate cri-
terion validity or are not based on a valid, empirically supported theory.
The second manner in which theories develop is through what has been
called a “bootstrapping” process. Cronbach and Meehl (1955) described
bootstrapping in this context as the iterative process of using the results of
tests of partially developed theories to refine and extend the primary theory.
Bootstrapping can lead to further refinement and elaboration, allowing for
stronger and more precise validation tests. When using the bootstrapping
method of theory development in neuropsychology, it is common for research-
ers to discover, perhaps even accidentally, that some sort of neuropsychological
test can accurately identify who has a cognitive deficit based on their inability
to perform the test at hand (e.g., Fuster, 2015; Halstead, 1951). Once researchers
are able to determine that the test is a reliable indicator of neuropsychological
deficit, they can explore what the task actually measures and can begin to for-
mulate a theory about this cognitive deficit based on what they think the test
is measuring. In this manner, the discovery and repeated use of a measure to
identify neuropsychological deficit can lead to better theories about the cogni-
tive deficits identified by the tests.
The Trail Making Test (TMT) is one of the most well-validated and widely
utilized assessments of scanning and visuomotor tracking, divided attention,
and cognitive flexibility (Lezak et al., 2012). Despite the strength and utility of
the test as it stands today, the test was not originally developed to test brain
dysfunction. The TMT is a variation of John E. Partington’s Test of Distributed
Attention, which was initially developed in 1938 to evaluate general intellectual
ability. Now, however, the Trail Making Test is commonly used as a diagnostic
tool in clinical settings. Poor performance is known to be associated with many
types of brain impairment. The TMT remains one of the most commonly used
neuropsychological tests in both research and clinical practice (Rabin, Barr, &
Burton, 2005).
The TMT is an example of a neuropsychological measure that, although it was
not originally based on an established theoretical foundation, is able to function
well in the context of current psychological theory that emphasizes cognitive
flexibility, divided attention, and visuo-motor tracking. The current theory was
developed through the utilization of the TMT, a classic example of the boot-
strapping approach to theory development. TMT performance is associated
with occupational outcome in adulthood after childhood traumatic brain injury
(TBI: Nybo et al., 2004). In addition, TMT parts A and B can predict psychoso-
cial outcome following head injury (Devitt et al., 2006) and are useful for pre-
dicting instrumental activities of daily living in older adults (Boyle et al., 2004;
Tierney et al., 2001). Tierney et al. (2001) reported that the TMT was significantly
related to self-care deficits, use of emergency services, experiencing harm, and
loss of property in a study of cognitively impaired people who live alone. At this
point in the history of the test, one can say that the TMT has a strong theoretical
foundation and an extensive matrix of successful criterion predictions in line
with that theory. Inferences made based on the test today are unlikely to be char-
acterized by repeated false positive conclusions.
Theory-Based Assessment
Our first recommendation is the one we have emphasized throughout this chap-
ter: whenever possible, use assessment tools grounded in empirically supported
theory. Doing so reduces the chances that one will draw erroneous conclusions
from test results. However, we also recognize that neuropsychological tests dem-
onstrating good, theory-driven criterion validity do not exist for every construct
that is of clinical interest and importance. There are sometimes serious issues
needing assessment and care for which there are no assessment instruments that
fit the bill of showing criterion validity based on strong theory. We thus review
here some points we believe will be helpful to practitioners who have to make
difficult decisions about neuropsychological assessment when no clear guide-
lines or evidence exist for that specific clinical context.
More recently, the RBANS (Repeatable Battery for the Assessment of Neuropsychological Status) has been updated to include special group stud-
ies for Alzheimer’s disease, vascular dementia, HIV dementia, Huntington’s
disease, Parkinson’s disease, depression, schizophrenia, and closed head injury
populations (Pearson Clinical, 2016). This change represents an important step
in enhancing the criterion-related validity of this measure, which can now
be used with more confidence in many non-demented populations. Still, clini-
cians who make decisions to use this (or any other) measure on a population
for whom it was not originally created and for which there is not sound validity
evidence should closely examine the body of available empirical literature sup-
porting their decision.
Sometimes the “off-label” use of an assessment tool is necessary because there
are no assessment instruments available for the problem of interest that have
been validated in the population of interest. This is an unfortunately common
gap between the state of the research literature and the state of clinical needs. In
such a case, off-label assessments must be used with extreme caution and with
the knowledge that the use of such instruments is getting ahead of the theory
and research on which these assessments were created.
Because treatments for negative affect and low positive affect differ (Chambless & Ollendick, 2001), it is
crucial to know the degree to which each of those two constructs contributed to
the patient’s score. It is far more efficient and accurate to measure negative and
positive affect separately, and doing so can provide a clearer picture of the actual
clinical experience of the patient. It is important to note, however, that the actual
diagnosis of major depressive disorder in the DSM-5 (APA, 2013) is similarly
heterogeneous and is based on these varied and conflated constructs. Thus, the
construct heterogeneity of the BDI is not merely an assessment problem, and
construct heterogeneity in assessment is not merely a neuropsychological prob-
lem. The problem of construct heterogeneity extends well into the entire field of
clinical psychology and our understanding of health problems as a whole (see
also Chapter 4 of this volume).
For the clinician, it is critical to understand exactly what the assessment
instrument being used is purported to measure and, related to this, whether the
target construct is heterogeneous or homogeneous. A well-validated test that
measures a homogeneous construct is the Boston Naming Test (BNT: Pedraza,
Sachs, Ferman, Rush, & Lucas, 2011). The BNT is a neuropsychological mea-
sure that was designed to assess visual naming ability using line drawings of
everyday objects (Strauss et al., 2006). Kaplan first introduced the BNT as a test
of confrontation naming in 1983 (Kaplan, Goodglass, & Weintraub, 1983). The
BNT is sensitive to naming deficits in patients with left-hemisphere cerebrovas-
cular accidents (Kohn & Goodglass, 1985), anoxia (Tweedy & Schulman, 1982),
and subcortical disease (Henry & Crawford, 2004; Locascio et al., 2003). Patients
with Alzheimer’s disease typically exhibit signs of anomia (difficulty with word
recall) and show impairment on the BNT (Strauss et al., 2006).
In contrast, there are neuropsychological measures that measure multiple cog-
nitive domains, thus measuring a more heterogeneous construct. For example,
the Arithmetic subtest from the WAIS-IV, one of the core subtests of the work-
ing memory index, appears to be heterogeneous (Sudarshan et al., 2016). The
Wechsler manual describes the Arithmetic test as measuring “mental manipula-
tion, concentration, attention, short- and long-term memory, numerical reason-
ing ability, and mental alertness” (p. 15). Since the test measures several areas of
cognition, it is difficult to pinpoint the exact area of concern when an individ-
ual performs poorly (Lezak et al., 2012). Karzmark (2009) noted that, although
Wechsler Arithmetic appears to tap into concentration and working memory, it
is affected by many other factors and has limited specificity as a concentration
measure.
Tests that assess heterogeneous constructs should be used cautiously and
interpreted very carefully, preferably at the facet-level or narrow ability level (see
Chapter 3 of this volume). Of course, it is likely that, for many patients, there
will not be a perfectly constructed and validated test to measure their exact, spe-
cific neuropsychological problem. In these cases, where there is a substantial gap
between rigorous scientific validity (such as validation evidence only for a het-
erogeneous measure) and the necessities of clinical practice (such as identifying
the precise nature of a deficit), it will be even more important for clinicians to
use the tools they do have at their disposal to ensure or preserve whatever level
of validity is available to them.
Ecological Validity
Finally, there is the issue of ecological validity in neuropsychological assess-
ment. An important question for clinicians concerns whether variation in test
scores maps onto variation in performance of relevant cognitive functions in
everyday life. This issue perhaps becomes central when the results of an assess-
ment instrument do not match clinical observation or patient report. When
this occurs, the clinician’s judgement concerning how much faith to place in the
test result should be influenced by the presence or absence of ecological valid-
ity evidence for the test. In the absence of good evidence for ecological validity,
the clinician might be wise to weigh observation and patient report more heav-
ily. When there is good evidence for ecological validity of a test, the clinician
might instead give the test result more weight. There are certainly times when
a neuropsychological test will detect some issue that was previously unknown
or that the patient did not cite as a concern (for a full review of the importance
of ecological validity issues to neuropsychological assessment, see Chaytor &
Schmitter-Edgecombe, 2003).
SUMMARY
In the context of good theory, criterion validity evidence is hugely important
to the practitioner. It provides a clear foundation for sound, scientifically based
clinical decisions. Criterion validity evidence in the absence of theory may well
be useful to practitioners, too, but it provides a riskier basis on which to pro-
ceed. The recommendations we make to practitioners follow from this logic.
Neuropsychological assessments that demonstrate criterion validity and that
were created based on an underlying psychological theory of the construct or
dysfunction of interest are often more useful for clinicians because of the lower
risk of making an error in assessment. Tests that demonstrate criterion validity
evidence in the context of underlying theory often produce results that are more
reliable, stable, and replicable than tests that are not grounded in a solid theoreti-
cal context.
In addition to this core consideration, we recommend the following. Firstly,
when clinicians use measures in the absence of established criterion validity
for the clinical question or population in which the test is applied, then clini-
cians should carefully examine the empirical and theoretical justification for
doing so. Secondly, clinicians should avoid relying on single scores that repre-
sent multiple constructs, because clients with different presentations can achieve
the same score. Thirdly, clinicians should be aware of the evidence concerning
the ecological validity of their assessment tools and use that evidence as a guide
for reconciling discrepancies between test results and observation or patient
report. Because the accuracy with which an individual’s neuropsychological
REFERENCES
Alkemade, N., Bowden, S. C., & Salzman, L. (2015). Scoring correction for MMPI-2
Hs scale in patients experiencing a traumatic brain injury: A test of measurement
invariance. Archives of Clinical Neuropsychology, 30, 39–48.
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental
Disorders (DSM-5®). Washington, DC: American Psychiatric Association.
American Psychological Association. (2010). Ethical Principles of Psychologists and
Code of Conduct. Retrieved from http://apa.org/ethics/code/index.aspx. Date
retrieved: January 19, 2015.
Anastasi, A. (1950). The concept of validity in the interpretation of test scores.
Educational and Psychological Measurement, 10, 67–78.
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression Inventory–II. San
Antonio, TX: Psychological Corporation.
Boyle, P. A., Paul, R. H., Moser, D. J., & Cohen, R. A. (2004). Executive impairments
predict functional declines in vascular dementia. The Clinical Neuropsychologist,
18(1), 75–82.
Butcher, J. N. (1990). The MMPI-2 in Psychological Treatment. New York: Oxford
University Press.
Butcher, J. N. (1995). User’s Guide for The Minnesota Report: Revised Personnel Report.
Minneapolis, MN: National Computer Systems
Butcher, J. N. (2001). Minnesota Multiphasic Personality Inventory–2 (MMPI-2) User’s
Guide for The Minnesota Report: Revised Personnel System (3rd ed.). Bloomington,
MN: Pearson Assessments.
Chambless, D. L., & Ollendick, T. H. (2001). Empirically supported psychological inter-
ventions: Controversies and evidence. Annual Review of Psychology, 52(1), 685–716.
Chapin, J. S., Busch, R. M., Naugle, R. I., & Najm, I. M. (2009). The Family Pictures
subtest of the WMS-III: Relationship to verbal and visual memory following tem-
poral lobectomy for intractable epilepsy. Journal of Clinical and Experimental
Neuropsychology, 31(4), 498–504.
Chaytor, N., & Schmitter-Edgecombe, M. (2003). The ecological validity of neu-
ropsychological tests: A review of the literature on everyday cognitive skills.
Neuropsychology Review, 13(4), 181–197.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52(4), 281–302.
Demakis, G. J. (2004). Frontal lobe damage and tests of executive processing: A meta-
analysis of the category test, Stroop test, and trail-making test. Journal of Clinical
and Experimental Neuropsychology, 26(3), 441–450.
Devitt, R., Colantonio, A., Dawson, D., Teare, G., Ratcliff, G., & Chase, S. (2006).
Prediction of long-term occupational performance outcomes for adults after mod-
erate to severe traumatic brain injury. Disability & Rehabilitation, 28(9), 547–559.
Duff, K., Patton, D., Schoenberg, M. R., Mold, J., Scott, J. G., & Adams, R. L. (2003).
Age-and education-corrected independent normative data for the RBANS in a
community dwelling elderly sample. The Clinical Neuropsychologist, 17(3), 351–366.
Dulay, M. F., Schefft, B. K., Marc Testa, S., Fargo, J. D., Privitera, M., & Yeh, H. S.
(2002). What does the Family Pictures subtest of the Wechsler Memory Scale–III
measure? Insight gained from patients evaluated for epilepsy surgery. The Clinical
Neuropsychologist, 16(4), 452–462.
Exner, J. E. (1974). The Rorschach: A Comprehensive System. New York: John Wiley
& Sons.
Exner Jr., J. E., & Clark, B. (1978). The Rorschach (pp. 147–178). New York: Plenum
Press, Springer US.
Flanagan, D. P., McGrew, K. S., & Ortiz, S. O. (2000). The Wechsler Intelligence Scales
and Gf-Gc Theory: A Contemporary Approach to Interpretation. Needham Heights,
MA: Allyn & Bacon.
Frank, L. K. (1939). Projective methods for the study of personality. The Journal of
Psychology, 8(2), 389–413.
Franzen, M. D. (2000). Reliability and Validity in Neuropsychological Assessment.
New York: Springer Science & Business Media.
Fuster, J. M. (2015). The Prefrontal Cortex. San Diego: Elsevier, Acad. Press.
Gass, C. S. (2002). Personality assessment of neurologically impaired patients. In
J. Butcher (Ed.), Clinical Personality Assessment: Practical Approaches (2nd ed.,
pp. 208–244). New York: Oxford University Press.
Gold, J. M., Queern, C., Iannone, V. N., & Buchanan, R. W. (1999). Repeatable Battery
for the Assessment of Neuropsychological Status as a screening test in schizophre-
nia, I: Sensitivity, reliability, and validity. American Journal of Psychiatry, 156(12),
1944–1950.
Gough, H. G. (1996). CPI Manual: Third Edition. Palo Alto, CA: Consulting
Psychologists Press.
Greene, R. L. (2006). Use of the MMPI-2 in outpatient mental health settings. In
J. Butcher (Ed.), MMPI-2: A Practitioner’s Guide. Washington, DC: American
Psychological Association.
Halstead, W. G. (1951). Biological intelligence. Journal of Personality, 20(1), 118–130.
Hathaway, S. R., & McKinley, J. C. (1942). The Minnesota Multiphasic Personality
Schedule. Minneapolis, MN, US: University of Minnesota Press.
Henry, J. D., & Crawford, J. R. (2004). Verbal fluency deficits in Parkinson’s dis-
ease: A meta-analysis. Journal of the International Neuropsychological Society, 10(4),
608–622.
Hollrah, J. L., Schlottmann, S., Scott, A. B., & Brunetti, D. G. (1995). Validity of the
MMPI subtle items. Journal of Personality Assessment, 65, 278–299.
Hunsley, J., & Bailey, J. M. (1999). The clinical utility of the Rorschach: Unfulfilled
promises and an uncertain future. Psychological Assessment, 11(3), 266–277.
Jackson, D. N. (1971). The dynamics of structured personality tests: 1971. Psychological
Review, 78(3), 229–248.
James, W. (1890). Principles of Psychology. New York: Holt.
Jewsbury, P. A., Bowden, S. C., & Strauss, M. E. (2016). Integrating the switching, inhi-
bition, and updating model of executive function with the Cattell-Horn-Carroll
model. Journal of Experimental Psychology: General, 145(2), 220–245.
Kaplan, E., Goodglass, H., & Weintraub, S. (1983). Boston Naming Test (BNT). Manual
(2nd ed.). Philadelphia: Lea & Febiger.
Karzmark, P. (2009). The effect of cognitive, personality, and background factors on
the WAIS-III arithmetic subtest. Applied Neuropsychology, 16(1), 49–53.
Kaufman, A. S. (1994). Intelligent Testing with the WISC-III. New York: John Wiley
& Sons.
King, T. Z., Fennell, E. B., Bauer, R., Crosson, B., Dede, D., Riley, J. L., … & Roper, S. N.
(2002). MMPI-2 profiles of patients with intractable epilepsy. Archives of Clinical
Neuropsychology, 17(6), 583–593.
Kohn, S. E., & Goodglass, H. (1985). Picture-naming in aphasia. Brain and Language,
24(2), 266–283.
Larson, E. B., Kirschner, K., Bode, R., Heinemann, A., & Goodman, R. (2005).
Construct and predictive validity of the Repeatable Battery for the Assessment of
Neuropsychological Status in the evaluation of stroke patients. Journal of Clinical
and Experimental Neuropsychology, 27(1), 16–32.
Lezak, M. D., Howieson, D. B., Bigler, E. D., & Tranel, D. (2012). Neuropsychological
Assessment. New York: Oxford University Press.
Lichtenberger, E. O., Kaufman, A. S., & Lai, Z. C. (2001). Essentials of WMS-III
Assessment (Vol. 31). New York: John Wiley & Sons.
Lichtenberger, E. O., & Kaufman, A. S. (2009). Essentials of WAIS-IV Assessment (Vol.
50). New York: John Wiley & Sons.
Lichtenberger, E. O., & Kaufman, A. S. (2013). Essentials of WAIS-IV Assessment (2nd
ed.). New York: John Wiley and Sons.
Locascio, J. J., Corkin, S., & Growdon, J. H. (2003). Relation between clinical char-
acteristics of Parkinson’s disease and cognitive decline. Journal of Clinical and
Experimental Neuropsychology, 25(1), 94–109.
Long, C. J., & Kibby, M. Y. (1995). Ecological validity of neuropsychological tests: A look
at neuropsychology’s past and the impact that ecological issues may have on its
future. Advances in Medical Psychotherapy, 8, 59–78.
MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative
review. Psychological Bulletin, 109(2), 163–203.
Matarazzo, J. D. (1990). Psychological testing versus psychological assessment.
American Psychologist, 45, 999–1017.
Mayes, S. D., & Calhoun, S. L. (2007). Wechsler Intelligence Scale for Children third
and fourth edition predictors of academic achievement in children with attention-
deficit/hyperactivity disorder. School Psychology Quarterly, 22(2), 234–249.
McGrath, R. E. (2005). Conceptual complexity and construct validity. Journal of
Personality Assessment, 85, 112–124.
McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing
on the shoulders of giants of psychometric intelligence research. Intelligence,
37, 1–10.
McKay, C., Wertheimer, J. C., Fichtenberg, N. L., & Casey, J. E. (2008). The Repeatable
Battery for the Assessment of Neuropsychological Status (RBANS): Clinical utility
in a traumatic brain injury sample. The Clinical Neuropsychologist, 22(2), 228–241.
Megargee, E. I. (2006). Using the MMPI-2 in Criminal Justice and Correctional Settings.
Minneapolis, MN: University of Minnesota Press.
Megargee, E. I. (2008). The California Psychological Inventory. In J. N. Butcher (Ed.),
Oxford Handbook of Personality Assessment (pp. 323–335). New York: Oxford
University Press.
Mitrushina, M. (Ed.). (2005). Handbook of Normative Data for Neuropsychological
Assessment. New York: Oxford University Press.
Neisser, U., Boodoo, G., Bouchard Jr., T. J., Boykin, A. W., Brody, N., Ceci, S. J., … &
Urbina, S. (1996). Intelligence: Knowns and unknowns. American Psychologist,
51(2), 77–101.
Nelson, J. M., Canivez, G. L., & Watkins, M. W. (2013). Structural and incremental
validity of the Wechsler Adult Intelligence Scale–Fourth Edition with a clinical
sample. Psychological Assessment, 25(2), 618–630.
Nichols, D. S., & Crowhurst, B. (2006). Use of the MMPI-2 in inpatient mental
health settings. In MMPI-2: A Practitioner’s Guide (pp. 195–252). Washington,
DC: American Psychological Association.
Nybo, T., Sainio, M., & Muller, K. (2004). Stability of vocational outcome in adult-
hood after moderate to severe preschool brain injury. Journal of the International
Neuropsychological Society, 10(5), 719–723.
Pedraza, O., Sachs, B. C., Ferman, T. J., Rush, B. K., & Lucas, J. A. (2011). Difficulty and
discrimination parameters of Boston Naming Test items in a consecutive clinical
series. Archives of Clinical Neuropsychology, 26(5), 434–444.
Perry, J. N., Miller, K. B., & Klump, K. (2006). Treatment planning with the MMPI-2.
In J. Butcher (Ed.), MMPI-2: A Practitioner’s Guide (pp. 143–164). Washington,
DC: American Psychological Association.
Posner, M. I., & Snyder, C. R. R. (1975). Facilitation and inhibition in the processing of
signals. Attention and Performance V, 669–682.
Quantz, J. O. (1897). Problems in the psychology of reading. Psychological
Monographs: General and Applied, 2(1), 1–51.
Rabin, L. A., Barr, W. B., & Burton, L. A. (2005). Assessment practices of clinical neu-
ropsychologists in the United States and Canada: A survey of INS, NAN, and APA
Division 40 members. Archives of Clinical Neuropsychology, 20(1), 33–65.
Randolph, C. (2016). The Repeatable Battery for the Assessment of Neuropsychological
Status Update (RBANS Update). Retrieved May, 2016, from http://www.pearsonclinical.com/psychology/products/100000726/repeatable-battery-for-the-assessment-of-neuropsychological-status-update-rbans-update.html#tab-details.
Randolph, C., Tierney, M. C., Mohr, E., & Chase, T. N. (1998). The Repeatable Battery
for the Assessment of Neuropsychological Status (RBANS): Preliminary clinical
validity. Journal of Clinical and Experimental Neuropsychology, 20(3), 310–319.
Rorschach, H. E. (1964). Nuclear relaxation in solids by diffusion to paramagnetic
impurities. Physica, 30(1), 38–48.
Smith, G. T. (2005). On construct validity: Issues of method and measurement.
Psychological Assessment, 17(4), 396–408.
Strauss, E., Sherman, E. M., & Spreen, O. (2006). A Compendium of Neuropsychological
Tests: Administration, Norms, and Commentary. New York: Oxford University Press.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and
methodology. Annual Review of Clinical Psychology, 5, 1–25.
Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of
longitudinal research. Intelligence, 35(5), 401–426.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of
Experimental Psychology, 18(6), 643–662.
Sudarshan, S. J., Bowden, S. C., Saklofske, D. H., & Weiss, L. G. (2016). Age-related
invariance of abilities measured with the Wechsler Adult Intelligence Scale–IV.
Psychological Assessment, online ahead of print: http://dx.doi.org/10.1037/pas0000290
Tierney, M. C., Black, S. E., Szalai, J. P., Snow, W. G., Fisher, R. H., Nadon, G., &
Chui, H. C. (2001). Recognition memory and verbal fluency differentiate prob-
able Alzheimer disease from subcortical ischemic vascular dementia. Archives of
Neurology, 58(10), 1654–1659.
Tweedy, J. R., & Schulman, P. D. (1982). Toward a functional classification of naming
impairments. Brain and Language, 15(2), 193–206.
Wechsler, D. (1997). Wechsler Memory Scale–Third Edition (WMS-III). San Antonio,
TX: Pearson.
Wechsler, D. (2008). Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV). San
Antonio, TX: Pearson.
Wechsler, D. (2014). Wechsler Intelligence Scale for Children–Fifth Edition (WISC-V).
San Antonio, TX: Pearson.
Weed, N. C., Ben-Porath, Y. S., & Butcher, J. N. (1990). Failure of Wiener and Harmon
Minnesota Multiphasic Personality Inventory (MMPI) subtle scales as personal-
ity descriptors and as validity indicators. Psychological Assessment: A Journal of
Consulting and Clinical Psychology, 2(3), 281–285.
Weimer, W. B. (1979). Notes on the Methodology of Scientific Research. Hillsdale,
NJ: John Wiley & Sons.
Whitman, S., Hermann, B. P., & Gordon, A. C. (1984). Psychopathology in epi-
lepsy: How great is the risk? Biological Psychiatry, 19(2), 213–236.
Wilk, C. M., Gold, J. M., Humber, K., Dickerson, F., Fenton, W. S., & Buchanan,
R. W. (2004). Brief cognitive assessment in schizophrenia: Normative data for the
Repeatable Battery for the Assessment of Neuropsychological Status. Schizophrenia
Research, 70(2), 175–186.
Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2003). What’s Wrong
with the Rorschach? Science Confronts the Controversial Inkblot Test. San Francisco,
CA: Jossey-Bass.
PAUL A. JEWSBURY AND STEPHEN C. BOWDEN
Surma-Aho, & Servo, 1996). Critical reviews of popular executive tests (e.g.,
Alvarez & Emory, 2006; Andrés, 2003; Mountain & Snow, 1993) con-
cluded that, while these tests are usually sensitive to frontal lobe lesions, lesions
in other regions can also cause deficits. Early developers of sorting tests, includ-
ing the forerunners of the Wisconsin Card Sorting Test (WCST; Grant & Berg,
1948), noted the sensitivity of these tests to focal lesions in many brain locations
(Goldstein & Scheerer, 1941).
Moreover, as will be shown below, contemporary definitions of executive
function have evolved to include concepts overlapping in important ways with
definitions derived from contemporary models of cognitive ability, as commonly
assessed by intelligence tests. It will be argued later that the concept of executive
function has evolved from clinical case-study research and other forms of neuro-
psychological research to identify many of the same critical cognitive functions
that overlap with the critical cognitive functions identified by parallel research
in educational, developmental, and broader clinical populations. It may be that
there is an opportunity to integrate diverse approaches to cognitive function,
an opportunity highlighted when the broad definition of the executive system
is considered alongside the contemporary definitions of intelligence. Ironically,
rather than disconfirming one or other line of cognitive ability research, the
converging lines of evidence lend support to the construct validity evidence
obtained from diverse research approaches. Take, for example, one of Lezak’s
(1995, p. 42) widely cited definitions of executive function: “The executive func-
tions consist of those capacities that enable a person to engage successfully in
independent, purposive, self-serving behavior.” This definition has clear similar-
ities to Wechsler’s (1944, p. 3) definition of intelligence: “Intelligence is the aggre-
gate or global capacity of the individual to act purposefully, to think rationally
and to deal effectively with his (sic) environment.” The similarities of the two
definitions show that one of the most influential executive function exponents,
and one of the most influential intelligence-test developers, sought to measure
broadly similar constellations of constructs, and the target constructs may have
co-evolved to a similar focus of theoretical and applied research.
A common argument in the neuropsychological literature is that intelligence
tests are limited because they fail to assess important cognitive abilities, espe-
cially the executive system (e.g., Ardila, 1999). This argument is usually made
on the basis of various case-studies of individual frontal lobe–lesioned patients
who appear to have dysfunctional executive or other ability and yet normal intel-
ligence scores (e.g., Ackerly & Benton, 1947; Brickner, 1936; Shallice & Burgess,
1991). There are several issues to consider before it is assumed that this line of
argument provides definitive evidence for the discriminant validity of executive
versus intelligence constructs.
First, the imperfect reliability of test scores in general, and of extreme (low)
scores in particular, is often not considered, although the problem of regression
to the mean is well recognized in cognitive neuropsychology. In single-case and
dissociation studies, test scores and especially less-reliable component or subtest
scores are rarely reported with confidence intervals centered on the appropriate
score, the predicted true score (Nunnally & Bernstein, 1994). As a consequence,
the interpretation of extreme deviation scores often neglects the fact that a
patient’s true score is expected to be closer to the normal range (see Chapter 5,
this volume). Even when the confidence interval is reported, the issue of unreli-
ability is still relevant. In a busy clinic, occurrences of the estimated score falling
outside the confidence interval can accrue, on the basis of false-positive findings
alone. Just reporting extreme cases is potentially uninformative (Chapman &
Chapman, 1983; Strauss, 2001).
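To make the regression-to-the-mean point concrete, the following fragment (a minimal sketch in Python; the test mean, standard deviation, reliability, and observed score are hypothetical values chosen for illustration, not figures from the studies cited above) shows how a confidence interval centered on the predicted true score is obtained under the classical test theory logic referenced in the text (Nunnally & Bernstein, 1994).

# Minimal sketch: confidence interval centered on the predicted true score
# under classical test theory. All numbers below are hypothetical.
import math

mean, sd = 100.0, 15.0     # normative mean and SD of the test score
rxx = 0.80                 # assumed reliability of the (sub)test
observed = 65.0            # an extreme observed score

# Regression to the mean: the predicted true score lies closer to the mean
# than the observed score, in proportion to the reliability.
predicted_true = mean + rxx * (observed - mean)

# Standard error of estimation for true scores.
see = sd * math.sqrt(rxx * (1.0 - rxx))

lo, hi = predicted_true - 1.96 * see, predicted_true + 1.96 * see
print(f"observed = {observed:.0f}, predicted true score = {predicted_true:.1f}")
print(f"95% CI centered on predicted true score: [{lo:.1f}, {hi:.1f}]")

With these assumed values, an observed score of 65 corresponds to a predicted true score of 72, and the interval extends well toward the normal range, illustrating why extreme deviation scores should not be interpreted at face value.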
Second, case-study methods typically assume, implicitly, that clinical clas-
sification of impaired ability is perfect. The probability of a false posi-
tive (FP: see Chapter 10, this volume) diagnosis of impairment on, say, a less
reliable test, in the context of an accurate assessment of no impairment by a
more reliable intelligence test, will be non-zero. That is, any less reliable test
will produce some FPs, and the less reliable the test, the more frequent the FPs
(Strauss, 2001). Such cases may accrue and produce a literature of false-posi-
tive diagnosed “impaired” single-cases with normal intelligence scores. That
is, cases with clinically diagnosed impaired ability but normal intelligence may
represent the failure of accurate clinical diagnosis, rather than insensitivity of
intelligence tests. Hence, imperfect specificity of clinical diagnosis based on less
reliable tests, and imperfect sensitivity of more reliable intelligence tests, are
confounded in descriptive case-studies and dissociation studies (Chapman &
Chapman, 1983; Strauss, 2001).
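The accrual of false-positive “dissociations” can be illustrated with a small simulation (a hypothetical sketch in Python; the reliabilities, cutoff, and sample size are assumptions chosen for illustration, not estimates from the literature). Even when every case is drawn from a healthy population with a single underlying ability, the less reliable measure flags a proportion of cases as impaired while the more reliable measure does not.

# Minimal simulation sketch: with no true deficit in anyone, a less reliable
# test still flags some cases as "impaired" while a more reliable test does
# not, producing apparent dissociations from false positives alone.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_ability = rng.standard_normal(n)         # single common ability, no deficits

def observed(true, rxx):
    # Classical test theory in z-units: observed = sqrt(rxx)*true + sqrt(1-rxx)*error
    return np.sqrt(rxx) * true + np.sqrt(1 - rxx) * rng.standard_normal(n)

exec_test = observed(true_ability, rxx=0.60)  # less reliable "executive" measure
iq_test = observed(true_ability, rxx=0.95)    # more reliable intelligence measure

cutoff = -1.5                                 # impairment cutoff in z-units
apparent_dissociation = (exec_test < cutoff) & (iq_test >= cutoff)
print(f"Cases 'impaired' on the executive test but normal on IQ: "
      f"{apparent_dissociation.mean():.1%} of a healthy sample")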
Both of these points provide independent reasons for why cases of apparently
normal intelligence scores but abnormal clinical diagnoses of cognitive ability
will be expected to occur in the absence of a real difference in the assessed abili-
ties and may be published without regard to the underlying problem of FP find-
ings. Therefore, the existence of cases of clinically diagnosed impaired ability
with normal intelligence (or any other test) scores does not provide conclusive
evidence against psychometric intelligence tests as clinically useful and valid
assessment. Alternative techniques are required to resolve this dissociation
conundrum more rigorously, including latent-variable analysis of the test scores
of interest, as described below.
tests such as the Wechsler scales, different methodology would not be required
for the study of biological intelligence (Larrabee, 2000; Matarazzo, 1990). Rather,
established methods of convergent and discriminant validity would provide
much informative evidence, again including latent variable or factor analysis, as
described in following sections.
The use of factor analysis to investigate convergent and discriminant valid-
ity allows for controlled group studies and an account of measurement error
as well as a priori hypothesis testing of proposed models and plausible alterna-
tives (Strauss & Smith, 2009). The use of factor analysis for executive function
research could close the gap between psychometric research of cognitive abilities
and neuropsychological research on executive system function. In fact, factor
analysis is often used in executive function research, although with limited refer-
ence to the enormous body of previous research on cognitive abilities with factor
analysis. For example, Miyake and colleagues’ (2000) highly cited study of the
executive system employs the methodology most commonly used in intelligence
research—namely, confirmatory factor analysis—and their study was even con-
ducted with a healthy community population similar to those commonly used in
nonclinical psychometric research.
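The logic of the latent-variable approach can be sketched with simulated data (a minimal illustration in Python; the loading pattern, factor correlation, and number of tests are assumptions, and an exploratory factor analysis is used here only as a simple stand-in for the confirmatory analyses described in the text). Scores generated from two correlated latent abilities yield a correlation structure from which the two shared dimensions, as distinct from test-specific variance, are recoverable.

# Minimal sketch with hypothetical data: six tests generated from two
# correlated latent abilities; factor analysis recovers two shared dimensions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 2_000
# Two correlated latent abilities, r = .5 (values assumed for illustration)
latents = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=n)

# Hypothetical loading pattern: tests 1-3 load on factor 1, tests 4-6 on factor 2
loadings = np.array([[.8, 0], [.7, 0], [.6, 0],
                     [0, .8], [0, .7], [0, .6]])
unique_sd = np.sqrt(1 - (loadings ** 2).sum(axis=1))   # unique (test-specific) SDs
scores = latents @ loadings.T + rng.standard_normal((n, 6)) * unique_sd

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(scores)
print(np.round(fa.components_.T, 2))   # recovered loadings, tests by factors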
In summary, the single-case and dissociation approach to cognitive models
often depends on methodology involving tenuous assumptions and is highly
prone to errors of inference, such as confounding measurement error with true
discriminant validity in the identification of double-dissociations (for reviews, see
Strauss, 2001; van Orden et al., 2001). When stronger methodology is employed,
neuropsychology model-building is compatible with the psychometric and intel-
ligence approaches to modelling of cognitive abilities, and there appears to be
some important and as yet incompletely explored convergence between current
definitions of executive functions and contemporary models of intelligence as
multiple-factor cognitive abilities. These questions will be explored in further
detail toward the end of this chapter, after brief consideration of other histori-
cally important approaches to modelling cognition.
ALTERNATIVE PSYCHOMETRIC MODELS
Three psychometric cognitive models are popular in contemporary assessment
psychology. These are the multiple-intelligences theory, the triarchic theory
of intelligence, and the CHC model of cognitive abilities. Although all three
theories were influenced by the early psychometric research reviewed above,
the CHC model is the natural continuation of the general-research approach
that began with Spearman, and is the dominant cognitive-ability theory in
psychometrics (McGrew, 2009). In contrast, the multiple-intelligences theory
and the triarchic theory of intelligence diverge from pure psychometric meth-
odology by placing more importance on cognitive theory and other forms
of evidence. Consequently, constructs in the multiple-intelligences theory
and the triarchic theory of intelligence are conceptually broader but also less
well defined than the factorial constructs in the psychometric approach of the
CHC model.
The multiple-intelligences theory states that intelligence is made up of mul-
tiple intelligences, of which there are “probably eight or nine” (Gardner, 2006,
p. 503). It is hypothesized that each of these intelligences has an independent
information-processing mechanism associated with an independent neural
substrate that is specialized by type of information. Each intelligence also has a
unique symbol system and separate perceptual and memory resources (Gardner,
1983, 1999, 2006). However, the validity of multiple-intelligences theory has
been widely debated, some arguing that support for the theory is not convinc-
ing (Allix, 2000; Sternberg, 1994; Sternberg & Grigorenko, 2004; Waterhouse,
2006) and that the theory needs to generate falsifiable hypotheses to be scien-
tifically fruitful (Visser, Ashton, & Vernon, 2006). Gardner has responded that
“Multiple-intelligences theory does not lend itself easily to testing through
paper-and-pencil assessments or a one-shot experiment” (Gardner & Moran,
2006, p. 230).
Focusing on the aspect of multiple-intelligences theory that is aligned with
factor analysis, multiple-intelligences theory seems to be a mix of intelligence
and personality factors. Carroll (1993) noted that there are similarities between
the Cattell-Horn Gf-Gc model and multiple-intelligences, and suggested that
multiple-intelligences may be useful to suggest new possible areas of intelli-
gence (e.g., bodily-kinesthetic) that can be explored with factor analysis (Carroll,
pp. 641–642). On the other hand, as a factor model, multiple-intelligences theory
makes tenuous claims, in particular, that the intelligences are autonomous and
correspond to uncorrelated factors (Messick, 1992).
The triarchic theory of intelligence was championed by Sternberg in several
influential publications (e.g., Sternberg, 1985, 2005). The theory comprises three
subtheories. The componential subtheory addresses the mental mechanisms that
underlie intelligent human behavior. The componential subtheory is assumed
to describe universal processes and does so in terms of higher-order executive
processes (metacomponents), specialized lower-order processes that are coordi-
nated by the metacomponents (performance components), and finally, learning
1995; Henson & Roberts, 2006; Preacher & MacCallum, 2003; Widaman, 1993).
Nevertheless, exploratory factor analysis in general, and principal components
analysis in particular, continue to be widely reported in the clinical research
literature, giving rise to a proliferation of imprecise and unreplicated models.
However, when used in a careful way, factor analysis readily deals with the issues
of selecting a parsimonious number of unidimensional constructs. First, factor
analysis separates dimensions specific to items (unique variances) from dimen-
sions shared across items (factors), where the factors are expected to be theoreti-
cally fruitful and empirically useful. Second, most researchers do not attempt
to achieve perfectly fitting factor models (e.g., they do not reject a model solely
because the chi-square test of exact fit is significant), but rather accept models that
fit well according to approximate fit indices (Hu & Bentler, 1999; Kline, 2011).
Factor analysis guided by approximate fit indices allows for dimensions that are
major sources of individual differences to be separated from dimensions spe-
cific to each measure as well as dimensions associated with trivial individual-
difference variation (Brown, 2006; Kline, 2011).
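For example, one widely used approximate fit index, the root mean square error of approximation (RMSEA), is computed directly from the model chi-square, its degrees of freedom, and the sample size. The brief sketch below (hypothetical values, chosen only for illustration) shows that a statistically significant exact-fit test can coexist with close approximate fit.

# Minimal sketch of one widely used approximate fit index (RMSEA); the
# chi-square, degrees of freedom, and sample size are hypothetical.
def rmsea(chi2, df, n):
    """Root mean square error of approximation for a fitted factor model."""
    return (max(chi2 - df, 0.0) / (df * (n - 1))) ** 0.5

# A significant chi-square (exact-fit test) can coexist with close approximate fit.
print(rmsea(chi2=180.0, df=120, n=1000))   # about 0.022, conventionally "close" fit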
In summary, the traditional psychometric approach has moved away from
defining validity purely in terms of criterion-related validity and towards a
greater focus on construct validity. Factor-analytic methodology evolved with
early theories of intelligence (Spearman, 1904; Thurstone, 1938, 1947) and
remains central to construct validity in personality and assessment psychology
(Brown, 2006; Marsh et al., 2010; Strauss & Smith, 2009).
The CHC model is a hierarchical model comprising three levels. Stratum III
(general) consists solely of a weaker form of Spearman’s g. Stratum II (broad)
comprises eight to ten abilities comparable to Thurstone’s primary mental abili-
ties and Horn’s mental abilities. Broad constructs describe familiar aspects of
cognition usually derived from factor analysis of cognitive ability batteries (e.g.,
acquired knowledge, fluid reasoning, short-term memory). The fifth edition of
the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American
Psychiatric Association, 2013) classification of “cognitive domains” maps closely
to the CHC broad (Stratum II) constructs, with the major exception of the confla-
tion of several CHC constructs under the DSM category of Executive Function.
CHC broad abilities also correspond closely to many pragmatic classifications of
neuropsychological tests (e.g., Zakzanis, Leach, & Kaplan, 1999).
Definitions of the most commonly reported broad CHC abilities are included
in Table 3.1 (McGrew, 2009). For example, the fourth editions of the Wechsler
Intelligence Scales for adults and children have been described in terms of four
or five of the CHC broad abilities defined in Table 3.1 (Weiss, Keith, Zhu, &
Chen, 2013a, 2013b), namely, Acquired Knowledge (Gc), Fluid Intelligence (Gf),
Visual Processing (Gv), Short-term or Working Memory (Gsm), and Processing
Speed (Gs). With the addition of the Wechsler Memory Scale, a Wechsler bat-
tery includes further assessment of Gsm and, critically, the addition of long-
term memory ability (Glr; for a recent discussion on the interpretation of Glr,
see Jewsbury & Bowden, 2016).
Each broad construct has a number of subsidiary Stratum I (narrow) con-
structs. The narrow constructs are based on the work of Carroll (1993) and are
quite specific to the cognitive task (e.g., the ability to recall lists in any order,
the ability to rapidly state the name of objects, or the ability to solve a maze).
Identifying the narrow construct associated with a test indicates what broad fac-
tor the test should be classified under. The narrow abilities allow the CHC model
to be defined in great detail and the classification of tests to be undertaken more
objectively, but the narrow factors are usually not required to be specified in con-
firmatory factor analyses of common clinical or educational assessment batteries
to achieve good model fit (Keith & Reynolds, 2010). One of the reasons that it is
not necessary to define narrow abilities relates to the statistical definition of fac-
tors. In general, any well-identified factor should have at least three indicators.
But with multiple narrow abilities under each broad factor, statistically identifying
even a few narrow abilities would require many tests per
broad factor, more than could be feasibly administered in any reasonable clinical
or educational assessment protocol.
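The identification arithmetic behind the three-indicator guideline can be illustrated with a simple counting rule (a simplified sketch, assuming a single factor with its variance fixed to one and ignoring mean structure, rather than a full identification analysis): the number of observed variances and covariances must be at least as large as the number of estimated loadings and unique variances.

# Minimal counting sketch for a single-factor model (factor variance fixed to 1).
def df_single_factor(p):
    moments = p * (p + 1) // 2        # observed variances and covariances
    params = p + p                    # p loadings plus p unique variances
    return moments - params

for p in (2, 3, 4):
    print(f"{p} indicators: df = {df_single_factor(p)}")
# 2 indicators: df = -1 (underidentified), 3: df = 0 (just-identified),
# 4: df = 2 (overidentified, and therefore testable)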
As a consequence, most assessment batteries will typically contain only one
or two tests per narrow ability, making it impractical to statistically identify
the narrow abilities. Occasional exceptions are observed when multiple ver-
sions of a test are administered within one assessment battery, multiple scores
are derived from one test as part of a larger battery, or a test is repeated as in
the immediate and delayed subtests of the Wechsler Memory Scale (WMS).
In the latter case, it is common to observe test-format-specific variance in the
Table 3.1 Definitions of the Most Commonly Reported Broad CHC Abilities
Construct Description
Gf The use of deliberate and controlled mental operations to solve novel
problems that cannot be performed automatically. Mental operations
often include drawing inferences, concept formation, classification,
generating and testing hypotheses, identifying relations, comprehending
implications, problem solving, extrapolating, and transforming
information. Inductive and deductive reasoning are generally considered
the hallmark indicators of Gf. Gf has been linked to cognitive complexity,
which can be defined as a greater use of a wide and diverse array of
elementary cognitive processes during performance.
Gc The knowledge of the culture that is incorporated by individuals
through a process of acculturation. Gc is typically described as a
person’s breadth and depth of acquired knowledge of the language,
information, and concepts of a specific culture, and/or the application
of this knowledge. Gc is primarily a store of verbal or language-
based declarative (knowing what) and procedural (knowing how)
knowledge acquired through the investment of other abilities during
formal and informal educational and general life experiences.
Gsm The ability to apprehend and maintain awareness of a limited number
of elements of information in the immediate situation (events that
occurred in the last minute or so). A limited-capacity system that
loses information quickly through the decay of memory traces, unless
an individual activates other cognitive resources to maintain the
information in immediate awareness.
Gv The ability to generate, store, retrieve, and transform visual images
and sensations. Gv abilities are typically measured by tasks (figural or
geometric stimuli) that require the perception and transformation of
visual shapes, forms, or images and/or tasks that require maintaining
spatial orientation with regard to objects that may change or move
through space.
Ga Abilities that depend on sound as input and on the functioning of
our hearing apparatus. A key characteristic is the extent to which
an individual can cognitively control (i.e., handle the competition
between signal and noise) the perception of auditory information.
The Ga domain circumscribes a wide range of abilities involved in
the interpretation and organization of sounds, such as discriminating
patterns in sounds and musical structure (often under background
noise and/or distorting conditions) and the ability to analyze,
manipulate, comprehend, and synthesize sound elements, groups of
sounds, or sound patterns.
Glr* The ability to store and consolidate new information in long-
term memory and later fluently retrieve the stored information
(e.g., concepts, ideas, items, names) through association. Memory
consolidation and retrieval can be measured in terms of information
stored for minutes, hours, weeks, or longer. Some Glr narrow
abilities have been prominent in creativity research (e.g., production,
ideational fluency, or associative fluency).
Gs The ability to automatically and fluently perform relatively easy or
over-learned elementary cognitive tasks, especially when high mental
efficiency (i.e., attention and focused concentration) is required.
Gq The breadth and depth of a person’s acquired store of declarative
and procedural quantitative or numerical knowledge. Gq is largely
acquired through the investment of other abilities, primarily during
formal educational experiences. Gq represents an individual’s
store of acquired mathematical knowledge, not reasoning with this
knowledge.
Adapted from “CHC theory and the human cognitive abilities project: Standing on
the shoulders of the giants of psychometric intelligence research” by K. S. McGrew
(2009). Table used with permission of the author.
*Recent research suggests that encoding and retrieval memory abilities may be better
seen as distinct factors rather than combined as Glr (see Jewsbury & Bowden, 2016;
Schneider & McGrew, 2012).
factor analysis of the WMS, for example, between immediate and delayed recall
of Logical Memory. But because there are only two indicators (scores) for the
“Logical Memory” narrow factor (CHC narrow factor MM under Glr; Schneider
& McGrew, 2012), limited information is available to identify a separate fac-
tor for Glr-MM, and instead Logical Memory is included in the same factor or
broad ability with Verbal Paired Associates (CHC narrow ability MA under Glr).
Instead of a separate factor, a correlated uniqueness is included in the model to
account for the test-specific or narrow-ability variance with greater parsimony
(e.g., Bowden, Cook, Bardenhagen, Shores, & Carstairs, 2004; Bowden, Gregg,
et al., 2008; Tulsky & Price, 2003). Notably, both Logical Memory and Verbal
Paired Associates may share other narrow abilities, including Free Recall (CHC
M6; Schneider & McGrew, 2012), as well as auditory-verbal stimulus content. In
sum, because of the practical limitations on modelling narrow abilities, most
studies of omnibus or comprehensive batteries will model only the broad abili-
ties, with several tests per broad ability, and will not attempt to model the narrow
abilities. Comprehensive descriptions of the CHC model, including definitions
of all the narrow factors, are given elsewhere (e.g., McGrew, 2009; Schneider &
McGrew, 2012).
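A correlated-uniqueness specification of the kind described above might be written as follows in lavaan/semopy-style model syntax (a sketch only; the score names are invented for illustration, and fitting such a model would require actual WMS data and a structural equation modelling package).

# Hypothetical lavaan/semopy-style model description illustrating a correlated
# uniqueness between immediate and delayed recall scores, rather than a
# separate narrow factor; the score names are invented for illustration only.
wms_model = """
# One broad long-term memory (Glr-like) factor
Memory =~ LM_Immediate + LM_Delayed + VPA_Immediate + VPA_Delayed

# Correlated uniquenesses: shared test-specific (narrow-ability) variance
LM_Immediate ~~ LM_Delayed
VPA_Immediate ~~ VPA_Delayed
"""
print(wms_model)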
Similarly, the general intelligence (Stratum III) factor is often included in
confirmatory factor analyses of cognitive ability batteries (e.g., Salthouse, 2005;
Weiss et al., 2013a, 2013b). However, requirements for satisfactory statistical
definition of a general factor dictate that, unless the top level (general) factor
is modeled with more than three subordinate (broad) factors, the hierar-
chical model will be either unidentified or statistically equivalent to the factor
model that includes only the broad factors, provided the interfactor correla-
tion parameters are freely estimated. The statistical reasons for this challenge in
testing hierarchical factor models are beyond the scope of the current chapter
but have been discussed in detail elsewhere (Bowden, 2013; Brown, 2006; Kline,
2011; Rindskopf & Rose, 1988).
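The equivalence with three broad factors follows from a simple parameter count (a simplified sketch of the issue, not a full identification analysis): a general factor replaces the k(k - 1)/2 correlations among k broad factors with k second-order loadings, so no degrees of freedom are gained until k exceeds three.

# Minimal counting sketch: degrees of freedom gained by a higher-order
# general factor relative to a correlated-factors model with k broad factors.
def higher_order_gain(k):
    factor_correlations = k * (k - 1) // 2   # parameters in the correlated-factors model
    second_order_loadings = k                # parameters in the higher-order model
    return factor_correlations - second_order_loadings

for k in (3, 4, 5):
    print(f"{k} broad factors: df gained by the general factor = {higher_order_gain(k)}")
# 3 factors: 0 (equivalent/just-identified), 4 factors: 2, 5 factors: 5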
interpretation of CHC narrow and broad factors is testament to the utility of the
CHC model (Hoelzle, 2008; Jewsbury et al., 2016a,b; Keith & Reynolds, 2010;
Schneider & McGrew, 2012).
The conclusion from these studies, that conventional psychometric and clini-
cal neuropsychological tests measure the same cognitive constructs, is consis-
tent with previous research. Larrabee (2000) reviewed the exploratory factor
analyses in outpatient samples of Leonberger, Nicks, Larrabee, and Goldfader
(1992) and Larrabee and Curtiss (1992) that showed a common factor struc-
ture underlying the Wechsler Adult Intelligence Scale–Revised, the Halstead-
Reitan Neuropsychological Battery, and various other neuropsychological tests.
Larrabee (2000) noted that the results were consistent with Carroll’s (1993) model
of cognitive abilities. The finding that psychometric and clinical tests measure
the same constructs has important implications for rational test selection in
clinical practice, much as the Big-5 model of personality guides the selection of
personality and psychopathology tests (see Chapter 4, this volume). Apart from
theoretical choices in terms of construct coverage, clinicians should compare
tests relevant to the assessment of any particular construct on the basis of how
reliable they are (Chapman & Chapman, 1983), and tests with optimal construct
measurement and reliability should be first-rank choices in clinical assessment.
For executive function, it was found that the factorial representation of puta-
tive executive function measures is complex, but little modification was required
for the CHC model to account for these measures (Jewsbury
et al., 2016a,b). Specifically, the CHC model could both explain datasets used to
derive the highly influential model of executive functions of switching, inhibi-
tion, and updating (Jewsbury et al., 2016b), and account for the most
common clinical measures of executive function (Jewsbury et al., 2016a) without
introduction of additional executive function constructs.
The semantic overlap between various definitions of executive function and
CHC constructs, as well as the available empirical evidence, suggest that execu-
tive function tests are distributed across a number of CHC constructs, rather
than overlapping with a single CHC construct (Jewsbury et al., 2016a,b). This
observation has several important implications. First, the available factor-ana-
lytic data suggest that there is no unitary executive construct underlying all
executive function tests, consistent with arguments by Parkin (1998) based on
neurobiological evidence. Second, executive function should not be treated as
a separate domain of cognition on the same level as, but separate from, well-
defined CHC constructs such as fluid reasoning, working memory, and process-
ing speed. Third, averaging across various executive function tests treated as a
single domain of cognition as in the DSM-5 classification (American Psychiatric
Association, 2013) leads to conceptually confused and clinically imprecise
results. Therefore, it is recommended that executive function not be used as a
single domain of cognition in meta-analyses and elsewhere, but recognized as
an overlapping set of critical cognitive skills that have been defined in paral-
lel and now can be integrated with the CHC model (Jewsbury et al., 2016a,b).
Fourth, the results suggest that equating executive function and Gf (e.g., Blair,
2006; Decker, Hill, & Dean, 2007) does not tell the whole story, as not all execu-
tive function tests are Gf tests. While simply equating Gf with executive func-
tion would be helpful to integrate the two research traditions of psychometrics
and neuropsychology amiably (e.g., Floyd, Bergeron, Hamilton, & Parra, 2010),
it may also lead to confusion due to elements of executive function that do not
conform to Gf (Jewsbury et al., 2016a,b).
CONCLUSIONS
Carefully developed and replicated factor models have great value in simplifying
and standardizing clinical research that is undertaken on the assumption that
tests measure general cognitive constructs. As noted above, the notion that every
test measures a different, unique clinical phenomenon is still encountered in
some clinical thinking but is contradicted by a century of psychometric research
on the latent structure of cognitive ability tests (e.g., Carroll, 1993; Nunnally &
Bernstein, 1994; Schneider & McGrew, 2012; Vernon, 1950). Ignoring the impli-
cations of psychometric construct-validity research risks an unconstrained pro-
liferation of constructs for clinical assessment (Dunn & Kirsner, 1988; 2003; Van
Orden et al., 2001), an approach that is incompatible with evidence-based neu-
ropsychological practice. Evidence-based practice requires an accumulation of
high-quality criterion-related validity evidence derived from the use of scientifi-
cally defensible measures of relevant constructs of cognitive ability. The best way
currently available to provide scientifically defensible measures of cognitive abil-
ity constructs involves the kinds of converging evidence from multiple lines of
research reviewed above, at the center of which is psychometric latent-structure
evidence.
Finally, factor models provide a coherent structure to group tests in meta-
analyses and clinical case-studies. Typically in neuropsychological meta-analy-
ses, tests are grouped in informal domains, and the properties of tests within each
domain are averaged (e.g., Belanger, Curtiss, Demery, Lebowitz, & Vanderploeg,
2005; Irani, Kalkstein, Moberg, & Moberg, 2011; Rohling et al., 2011; Zakzanis et
al., 1999). However, unless these domains are supported by theoretically guided
factor-analysis (Dodrill, 1997, 1999), averaging across the tests within a domain
produces confused results. In fact, many previous classifications of tests conform
more or less closely to a CHC classification, sometimes because authors have
been mindful of the CHC taxonomy, but most often by force of the cumula-
tive factor-analytic research in neuropsychology that has converged on a simi-
lar taxonomy (e.g., American Psychiatric Association, 2013; Rohling et al., 2011;
Zakzanis et al., 1999), so the CHC taxonomy should not be seen as radical or
unfamiliar, but rather as a refinement of long-standing research insights in clinical
neuropsychology.
The CHC theory is not a completed theory, but it continues to evolve (Jewsbury
& Bowden, 2016; McGrew, 2009). However, accumulating evidence suggests that
CHC theory provides an accurate and detailed classification of a wide variety
of neuropsychological tests. Perhaps better than any other model of cognitive
abilities, CHC theory provides a rational and comprehensive basis for refining
evidence-based neuropsychological assessment.
REFERENCES
Ackerly, S. S., & Benton, A. L. (1947). Report of a case of bilateral frontal lobe defect.
Research Publications: Association for Research in Nervous and Mental Disease, 27,
479–504.
Ackerman, P. L., & Lohman, D. F. (2006). Individual differences in cognitive function.
In P. A. Alexander & P. H. Winne (Eds.), Handbook of Educational Psychology (2nd
ed., pp. 139–161). Mahwah, NJ: Erlbaum.
Akhutina, T. V. (2003). L. S. Vygotsky and A. R. Luria: Foundations of neuropsychol-
ogy. Journal of Russian and East European Psychology, 41, 159–190.
Allix, N. M. (2000). The theory of multiple intelligences: A case of missing cognitive
matter. Australian Journal of Education, 44, 272–288.
Allport, A. (1993). Attention and control: Have we been asking the wrong questions?
A critical review of twenty-five years. In D. E. Meyer & S. Kornblum (Eds.), Attention
and Performance (Vol. 14, pp. 183–218). Cambridge, MA: Bradford.
Alvarez, J. A., & Emory, E. (2006). Executive function and the frontal lobes: A meta-
analytic review. Neuropsychology Review, 16, 17–42.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education (2014). Standards for Educational
and Psychological Testing. Washington, DC: American Educational Research
Association.
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental
Disorders, Fifth Edition. Arlington, VA: APA.
American Psychological Association. (1954). Technical recommendations for psycho-
logical tests and diagnostic techniques. Psychological Bulletin Supplement, 51, 1–38.
American Psychological Association. (1966). Standards for Educational and
Psychological Tests and Manuals. Washington, DC: APA.
Andrés, P. (2003). Frontal cortex as the central executive of working memory: Time to
revise our view. Cortex, 39, 871–895.
Andrés, P., & Van der Linden, M. (2001). Supervisory Attentional System in patients
with focal frontal lesions. Journal of Clinical and Experimental Neuropsychology,
23, 225–239.
Andrés, P., & Van der Linden, M. (2002). Are central executive functions working in
patients with focal frontal lesions? Neuropsychologia, 40, 835–845.
Andrewes, D. (2001). Neuropsychology: From Theory to Practice. Hove, UK: Psychology
Press.
Ardila, A. (1999). A neuropsychological approach to intelligence. Neuropsychology
Review, 9, 117–136.
Baddeley, A. D. (1986). Working Memory. Oxford: Oxford University Press.
Baddeley, A. D. (1990). Human Memory: Theory and Practice. Hove, UK: Erlbaum.
Baddeley, A. D., & Hitch, G. J. (1974). Working memory. In G. A. Bower (Ed.), Psychology
of Learning and Motivation (Vol. 8, pp. 47–90). New York: Academic Press.
Bandalos, D. (2008). Is parceling really necessary? A comparison of results from item
parceling and categorical variable methodology. Structural Equation Modeling, 15,
211–240.
Ellis, A. W., & Young, A. (Eds.). (1996). Human Cognitive Neuropsychology. Hove,
UK: Erlbaum.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic
span. Psychological Bulletin, 93, 179–197.
Embretson, S. (1984). A general latent trait model for response processes. Psychometrika,
49, 175–186.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tests:
Application to abstract reasoning. Psychological Method, 3, 380–396.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and
models. Psychometrika, 64, 407–433.
Embretson, S. E. (2007). Construct validity: A universal validity system or just another
test evaluation procedure? Educational Researcher, 36, 449–455.
Embretson, S., & Gorin, J. (2001). Improving construct validity with cognitive psychol-
ogy principles. Journal of Educational Measurement, 38, 343–368.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refine-
ment of clinical assessment instruments. Psychological Assessment, 7, 286–299.
Floyd, R. G., Bergeron, R., Hamilton, G., & Parra, G. R. (2010). How do executive func-
tions fit with the Cattell-Horn-Carroll model? Some evidence from a joint factor
analysis of the Delis-Kaplan executive function system and the Woodcock-Johnson
III tests of cognitive abilities. Psychology in the Schools, 47, 721–738.
Gardner, H. (1983). Frames of Mind: The Theory of Multiple Intelligences. New York:
Basic Books.
Gardner, H. (1999). Intelligence Reframed: Multiple Intelligences for the 21st Century.
New York: Basic Books.
Gardner, H. (2006). On failing to grasp the core of MI theory: A response to Visser
et al. Intelligence, 34, 503–505.
Gardner, H., & Moran, S. (2006). The science of Multiple Intelligences Theory: A response
to Lynn Waterhouse. Educational Psychologist, 41, 227–232.
Gladsjo, J. A., McAdams, L. A., Palmer, B. W., Moore, D. J., Jeste, D. V., & Heaton,
R. K. (2004). A six-factor model of cognition in schizophrenia and related psy-
chotic disorders: Relationships with clinical symptoms and functional capacity.
Schizophrenia Bulletin, 30, 739–754.
Goldberg, L. R. (1971). A historical survey of personality scales and inventories. In
P. McReynolds (Ed.), Advances in Psychological Assessment. (Vol. 2, pp. 293–336).
Palo Alto, CA: Science and Behavior Books.
Goldstein, K., & Scheerer, M. (1941). Abstract and concrete behavior: An experimental
study with special tests. Psychological Monographs, 53, i–151.
Gottfredson, L. S. (2003a). Dissecting practical intelligence theory: Its claims and evi-
dence. Intelligence, 31, 343–397.
Gottfredson, L. S. (2003b). On Sternberg’s “Reply to Gottfredson.” Intelligence, 31,
415–424.
Grant, D. A., & Berg, E. A. (1948). A behavioral analysis of degree of reinforcement and
ease of shifting to a new response in a Weigl-type card-sorting problem. Journal of
Experimental Psychology, 38, 404–411.
Guilford, J. P., & Hoepfner, R. (1971). The Analysis of Intelligence. New York: McGraw-Hill.
Halstead, W. C. (1947). Brain and Intelligence. Chicago: University of Chicago Press.
Hathaway, S. R., & McKinley, J. C. (1967). Minnesota Multiphasic Personality Inventory
Manual (rev. ed.). New York: Psychological Corporation.
Hazy, T. E., Frank, M. J., & O’Reilly, R. (2007). Towards an executive without a
homunculus: Computational models of the prefrontal cortex/basal ganglia system.
Philosophical Transactions of the Royal Society B, 362, 1601–1613.
Hécaen, H., & Albert, M. L. (1978). Human Neuropsychology. New York: Wiley.
Helmes, E., & Reddon, J. R. (1993). A perspective on developments in assessing psy-
chopathology: A critical review of the MMPI and MMPI-2. Psychological Bulletin,
113, 453–471.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published
research: Common errors and some comment on improved practice. Educational
and Psychological Measurement, 66, 393–416.
Hoelzle, J. B. (2008). Neuropsychological Assessment and the Cattell-Horn-Carroll
(CHC) Cognitive Abilities Model. Unpublished doctoral dissertation, University
of Toledo, Toledo, OH.
Horn, J. L. (1965). Fluid and Crystallized Intelligence: A Factor Analytic and
Developmental Study of the Structure Among Primary Mental Abilities.
Unpublished doctoral dissertation, University of Illinois, Urbana, IL.
Horn, J. L. (1986). Intellectual ability concepts. In R. J. Sternberg (Ed.), Advances in the
Psychology of Human Intelligence (Vol. 3, pp. 35–77). Mahwah, NJ: Erlbaum.
Horn, J. L. (1988). Thinking about human abilities. In J. R. Nesselroade (Ed.), Handbook
of Multivariate Psychology (pp. 645–685). New York: Academic Press.
Horn, J. L. (1998). A basis for research on age differences in cognitive abilities. In
J. J. McArdle & R. W. Woodcock (Eds.), Human Cognitive Abilities in Theory and
Practice (pp. 57–92). Mahwah, NJ: Lawrence Erlbaum.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measure-
ment invariance in aging research. Experimental Aging Research: An International
Journal Devoted to the Scientific Study of the Aging Process, 18, 117–144.
Horn, J. L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D. P.
Flanagan, J. L. Gensaft, & P. L. Harrison (Eds.), Contemporary Intellectual
Assessment: Theories, Tests, and Issues (pp. 53–91). New York: Guilford.
Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance struc-
ture analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6, 1–55.
Irani, F., Kalkstein, S., Moberg, E. A., & Moberg, P. J. (2011). Neuropsychological per-
formance in older patients with schizophrenia: A meta-analysis of cross-sectional
and longitudinal studies. Schizophrenia Bulletin, 37, 1318–1326.
Jensen, A. R. (2004). Obituary—John Bissell Carroll. Intelligence, 32, 1–5.
Jewsbury, P. A., & Bowden, S. C. (2016). Construct validity of Fluency, and implications
for the latent structure of the Cattell-Horn-Carroll model of cognition. Journal of
Psychoeducational Assessment, in press. http://jpa.sagepub.com/content/early/2016/
05/11/0734282916648041.abstract
Jewsbury, P. A., Bowden, S. C., & Duff, K. (2016a). The Cattell–Horn–Carroll model
of cognition for clinical assessment. Journal of Psychoeducational Assessment, in
press. http://jpa.sagepub.com/content/early/2016/05/31/0734282916651360.abstract
Jewsbury, P. A., Bowden, S. C., & Strauss, M. E. (2016b). Integrating the switching,
inhibition, and updating model of executive function with the Cattell-Horn-Carroll
model. Journal of Experimental Psychology: General, 145(2), 220–245.
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319–342.
Kaufman, A. S. (2009). IQ Testing 101. New York: Springer.
MacCorquodale, K., & Meehl, P. E. (1948). On a distinction between hypothetical con-
structs and intervening variables. Psychological Review, 55, 95–107.
Marsh, H. W., Lüdtke, O., Muthén, B., Asparouhov, T., Morin, A. J., Trautwein, U., &
Nagengast, B. (2010). A new look at the Big Five factor structure through explor-
atory structural equation modeling. Psychological Assessment, 22, 471–491.
Matarazzo, J. D. (1990). Psychological testing versus psychological assessment.
American Psychologist, 45, 999–1017.
McGrew, K. S. (1997). Analysis of the major intelligence batteries according to a pro-
posed comprehensive Gf–Gc framework. In D. P. Flanagan, J. L. Genshaft, & P. L.
Harrison (Eds.), Contemporary Intellectual Assessment: Theories, Tests, and Issues
(pp. 151−179). New York: Guilford.
McGrew, K. S. (2005). The Cattell-Horn-Carroll theory of cognitive abilities. In D. P.
Flanagan & P. L. Harrison (Eds.), Contemporary Intellectual Assessment: Theories,
Tests, and Issues (2nd ed., pp. 136–181). New York: Guilford Press.
McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing
on the shoulders of giants of psychometric intelligence research. Intelligence,
37, 1–10.
McReynolds, P. (1975). Historical antecedents of personality assessment. In
P. McReynolds (Ed.), Advances in Psychological Assessment (Vol. 3, pp. 477–532).
San Francisco: Jossey-Bass.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance.
Psychometrika, 58, 525–543.
Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invari-
ance. Medical Care, 44, S69–S77.
Messick, S. (1992). Multiple intelligences or multilevel intelligence? Selective empha-
sis on distinctive properties of hierarchy: On Gardner’s Frames of Mind and
Sternberg’s Beyond IQ in the context of theory and research on the structure of
human abilities. Psychological Inquiry, 3, 365–384.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from
persons’ responses and performances as scientific inquiry into score meaning.
American Psychologist, 50, 741–749.
Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-
categorical measures. Multivariate Behavioral Research, 39, 479–515.
Mislevy, R. J. (2007). Validity by design. Educational Researcher, 36, 463–469.
Miyake, A., Friedman, N. P., Emerson, M. J., Witzki, A. H., Howerter, A., & Wager,
T. D. (2000). The unity and diversity of executive functions and their contributions
to complex “frontal lobe” tasks: A latent variable analysis. Cognitive Psychology, 41,
49–100.
Mountain, M. A., & Snow, W. G. (1993). WCST as a measure of frontal pathology:
A review. The Clinical Neuropsychologist, 7, 108–118.
Mungas, D., & Reed, B. R. (2000). Application of item response theory for development
of a global functioning measure of dementia with linear measurement properties.
Statistics in Medicine, 19, 1631–1644.
Norman, D. A., & Shallice, T. (1986). Attention to action: Willed and automatic control
of behavior. In R. J. Davidson, G. E. Schwartz, & D. Shapiro (Eds.), Consciousness
and Self-Regulation: Advances in Research and Theory (Vol. 4, pp. 1–18).
New York: Plenum Press.
Nunnally, J. C. (1978). Psychometric Theory (2nd ed.). New York: McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York:
McGraw-Hill.
Parkin, A. J. (1998). The central executive does not exist. Journal of the International
Neuropsychological Society, 4, 518–522.
Peña-Casanova, J. (1989). A. R. Luria today: Some notes on “Lurianism” and the fun-
damental bibliography of A. R. Luria. Journal of Neurolinguistics, 4, 161–178.
Preacher, K. J., & MacCallum, R. C. (2003). Repairing Tom Swift’s electric factor analy-
sis machine. Understanding Statistics, 2, 13–43.
Rabbitt, P. (1988). Human intelligence. The Quarterly Journal of Experimental
Psychology Section A, 40, 167–185.
Raven, J. C., Court, J. H, & Raven, J. (1992). Manual for Raven’s Progressive Matrices
and Vocabulary Scale. San Antonio, TX: Psychological Corporation.
Reitan, R. M., & Wolfson, D. (1985). The Halstead-Reitan Neuropsychological Test
Battery: Theory and Clinical Interpretation. Tucson, AZ: Neuropsychology Press.
Reynolds, C. R., & Milam, D. A. (2012). Challenging intellectual testing results. In
D. Faust (Ed.), Coping with Psychiatric and Psychological Testimony (6th ed.,
pp. 311–334). Oxford, UK: Oxford University Press.
Reynolds, M. R., Keith, T. Z., Flanagan, D. P., & Alfonso, V. C. (2013). A cross-battery,
reference variable, confirmatory factor analytic investigation of the CHC taxon-
omy. Journal of School Psychology, 51, 535–555.
Rindskopf, D., & Rose, T. (1988). Some theory and applications for confirmatory
second-order factor analysis. Multivariate Behavioural Research, 23, 51–67.
Rohling, M. L., Binder, L. M., Demakis, G. J, Larrabee, G. J., Ploetz, D. M., &
Langhinrichsen-Rohling, J. (2011). A meta-analysis of neuropsychological out-
come after mild traumatic brain injury: Re-analyses and reconsiderations of
Binder et al. (1997), Frencham et al. (2005), and Pertab et al. (2009). The Clinical
Neuropsychologist, 25, 608–623.
Salthouse, T. A. (2005). Relations between cognitive abilities and measures of executive
functioning. Neuropsychology, 19(4), 532–545.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological
research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Schneider, R. J., Hough, L. M., & Dunnette, M. D. (1996). Broadsided by broad
traits: How to sink science in five dimensions or less. Journal of Organizational
Behavior, 17, 639–655.
Schneider, W. J., & McGrew, K. (2012). The Cattell-Horn-Carroll model of intelligence.
In D. Flanagan & P. Harrison (Eds.), Contemporary Intellectual Assessment: Theories,
Tests, and Issues (3rd ed., pp. 99–144). New York: Guilford.
Shallice, T. (1982). Specific impairments of planning. Philosophical Transactions of the
Royal Society of London. Series B, Biological Sciences, 298, 199–209.
Shallice, T. (1988). From Neuropsychology to Mental Structure. New York: Cambridge
University Press.
Shallice, T., & Burgess, P. W. (1991). Deficits in strategy application following frontal
lobe damage in man. Brain, 114, 727–741.
Snow, R. E. (1998). Abilities as aptitudes and achievements in learning situations. In
J. J. McArdle & R. W. Woodcock (Eds.), Human Cognitive Abilities in Theory and
Practice (pp. 93–112). Mahwah, NJ: Erlbaum.
Spearman, C. (1904). “General intelligence,” objectively determined and measured.
The American Journal of Psychology, 15, 201–292.
TAYLA T. C. LEE, MARTIN SELLBOM, AND CHRISTOPHER J. HOPWOOD
2013; Lahey et al., 2011, 2012). Broadly, these results strongly indicate that indi-
viduals with mental disorders do not always fall into an internalizing or exter-
nalizing group. Rather, individuals who are at risk for psychopathology are likely
to have some experience of both internalizing and externalizing symptoms.
Additional support has been garnered for the MCLM, as it has been dem-
onstrated to be similar across diverse cultures (Krueger et al., 1998; Krueger,
Chentsova-Dutton, Markon, Goldberg, & Ormel, 2003; Krueger et al., 2001;
Slade & Watson, 2006; Vollebergh et al., 2001) and across genders (Kramer et al.,
2008; Krueger, 1999). There is also an accumulating body of evidence to suggest
that this model replicates well across age groups (Kramer et al., 2008; Krueger,
1999; Lahey et al., 2008; Vollebergh et al., 2001). Lastly, predispositions for inter-
nalizing and externalizing dysfunction have been demonstrated to be relatively
stable over short periods of time (i.e., one year and three years; Krueger et al.,
1998; Vollebergh et al., 2001).
[Figure: the Externalizing (EXTERN) liability, with indicator groups for aggressiveness and lack of empathy (AGGRESS); impulsivity, irresponsibility, and poor control; and alcohol, marijuana, and other drug use and related problems (SUBS).]
A Third Dimension—Psychosis
Earlier studies on the MCLM were primarily conducted in large, community-
dwelling samples in which psychotic phenomena, such as symp-
toms of schizophrenia and schizotypal personality disorder, were less common,
leaving open to question how these types of difficulties would be best described
by structural models of psychopathology. To begin answering this question,
Kotov and colleagues (Kotov, Chang, et al., 2011; Kotov, Ruggero, et al., 2011) con-
ducted two large-scale studies using clinical inpatient and outpatient samples
in order to determine if these types of symptoms would best be described by
existing internalizing and externalizing liability concepts, or if a third, inde-
pendent factor representing psychotic symptoms would emerge. Results of both
studies suggested psychotic symptoms were not well accounted for by a model
containing only internalizing and externalizing problems. Rather, results indi-
cated that psychotic symptoms were best conceptualized as manifestations of a
third dimension of psychopathology, which they termed “psychosis” (also some-
times referred to as “thought disorders”; Caspi et al., 2014). Subsequent studies in
large epidemiological samples from Australia (Wright et al., 2013), New Zealand
(Caspi et al., 2014), and the United States (Fleming, Shevlin, Murphy, & Joseph,
2014) have supported the inclusion of a Psychosis liability in the externalizing/
internalizing MCLM framework.
The final, best-fitting model in studies that have included a Psychosis liability
is displayed in Figure 4.3. The model has three correlated factors—Externalizing,
Internalizing, and Psychosis—which are displayed in Figure 4.3 with ovals con-
nected by double-headed arrows. Diverse empirical evidence supports linking
to the Psychosis liability such disorders as schizophrenia, schizotypal person-
ality disorder, paranoid personality disorder, and schizoid personality disor-
der (Caspi et al., 2014; Kotov, Chang, et al., 2011; Kotov, Ruggero, et al., 2011).
[Figure 4.3: Correlated Psychosis (PSYCHOSIS), Externalizing (EXTERN), and Internalizing (INTERN) liabilities. Externalizing indicators include antisociality and aggression (AGGRESS), inattention and impulsivity, and drug and alcohol misuse (SUBS); Internalizing is marked by Distress and Fear.]
[Figure: Hierarchical levels of the personality/psychopathology structure, from a General Psychopathology factor, through Internalizing and Externalizing, to levels comprising Fear/Negative Affectivity, Distress/Detachment, Negative Affectivity, Detachment, Psychoticism, Disinhibition, and Aggression.]
2012). These findings make sense, of course, given the construct variations in
these different populations. Overall, these hierarchical analyses are important
as they demonstrate at least two important things: (1) it does not matter which personality model one examines, because the models typically map onto each other and represent different levels of analysis within the broader personality hierarchy; and (2) personality and psychopathology variance map onto one another in important ways from an empirical hierarchy perspective.
Higher-Order Structures
Most of the MMPI-2-RF scales are organized in a hierarchical fashion that maps onto the general hierarchical three-factor structure of psychopathology (Kotov, Chang, et al., 2011; Kotov, Ruggero, et al., 2011; Wright et al., 2012). Specifically, the three higher-order (H-O) scales—Emotional-Internalizing Dysfunction (EID), Thought Dysfunction (THD), and Behavioral-Externalizing Dysfunction (BXD)—all map onto the three broad contemporary psychopathology domains.
The nine RC scales occupy the middle level of the hierarchy, and the 23 specific
problems (SP) scales compose the lowest level with a very narrow, facet-based rep-
resentation of psychopathology, including specific problems not directly assessed
via the broader scales (e.g., suicidal ideation). External to the three-level hierarchy,
but no less important with respect to empirical psychopathology structures, are
the Personality Psychopathology Five (PSY-5; Harkness & McNulty, 1994; see
also Harkness, McNulty, Finn, Reynolds, & Shields, 2014) scales that provide a
dimensional assessment of maladaptive personality traits from a five-factor per-
spective akin to that of DSM-5 Section III (APA, 2013).
Although it certainly would be expected that the H-O scales would map
onto contemporary psychopathology structures, it is noteworthy that these
scales were developed based on factor analysis of the nine RC scales (Tellegen &
Ben-Porath, 2008; see also Sellbom, Ben-Porath, & Bagby, 2008b). Tellegen and
Ben-Porath (2008) conducted a series of exploratory factor analyses in large
outpatient and inpatient samples, and found that three RC scales representing
demoralization (RCd), low positive affectivity (RC2), and high negative affec-
tivity (RC7) consistently loaded on an internalizing factor. RC scales reflecting
antisocial behavior (RC4), hypomanic activation (RC9), and, to a lesser degree,
cynicism (RC3) loaded on an externalizing factor and RC scales indexing per-
secutory ideation (RC6) and aberrant experiences (RC1) were core markers for
a psychoticism factor. Subsequent research has replicated and extended these
findings in a variety of samples from North America and Europe. For instance,
Hoelzle and Meyer (2008) and Sellbom et al. (2008b) independently reported
almost identical findings in large psychiatric samples from the United States
and Canada, respectively. Van der Heijden and colleagues (Van der Heijden,
Rossi, Van der Veld, Derksen, & Egger, 2013a) replicated Sellbom et al.’s
(2008b) findings across five large Dutch clinical and forensic samples, again,
finding that the RC scales adhered to a three-factor higher-order structure.
Most recently, Anderson et al. (2015) conducted a conjoint exploratory fac-
tor analysis using MMPI-2-RF scale sets with the Personality Inventory for
DSM-5 (PID-5; Krueger et al., 2012) in a large Canadian psychiatric sample.
These authors found that the three higher-order domains could be extracted in
analyses using each of the four MMPI-2-RF scale sets in conceptually expected ways. It is noteworthy that, in these latter results using the lower-order scale sets, a fourth factor representing social detachment, introversion, and
low affective arousal consistently emerged. This result is similar to much of
the PAI research (reviewed later) on this topic, as well as the five-factor mod-
els described earlier (e.g., Wright et al., 2012). Thus, quantitative hierarchical
research using the RC scales, which were developed without any particular
diagnostic nosology in mind, but rather were grounded in Tellegen’s theory
of self-reported affect (e.g., Tellegen, 1985; Watson & Tellegen, 1985), strongly
suggests the hierarchical organization of the MMPI-2-RF conforms to the
same structure as identified in the extant psychopathology epidemiology lit-
erature just reviewed.
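To make the factor-analytic logic of these studies concrete, the following toy sketch (not the published analyses) simulates nine hypothetical scale scores that draw on three correlated latent liabilities and then recovers a three-factor solution with an exploratory factor analysis. The simulated loadings, sample size, and use of scikit-learn's FactorAnalysis with varimax rotation are illustrative assumptions only.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 2000

# Three correlated latent liabilities (e.g., internalizing, externalizing, thought dysfunction).
latent_cov = np.array([[1.0, 0.5, 0.4],
                       [0.5, 1.0, 0.3],
                       [0.4, 0.3, 1.0]])
latents = rng.multivariate_normal(np.zeros(3), latent_cov, size=n)

# Nine hypothetical scale scores: three load on each liability, plus unique error.
loadings = np.zeros((9, 3))
loadings[0:3, 0] = 0.8
loadings[3:6, 1] = 0.8
loadings[6:9, 2] = 0.8
scores = latents @ loadings.T + rng.normal(scale=0.6, size=(n, 9))

# Exploratory factor analysis with three factors and varimax rotation.
fa = FactorAnalysis(n_components=3, rotation="varimax").fit(scores)
print(np.round(fa.components_.T, 2))  # rows = scales, columns = recovered factors
```

With a clean simple structure like this, each block of scales marks its own recovered factor, which is the pattern the studies above report for internalizing, externalizing, and psychoticism markers.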
Domain-Specific Structures
More recent research has emerged to indicate that, within each domain, the SP
scales also map onto extant empirically validated structures. Sellbom (2011)
demonstrated that the internalizing SP scales conformed to Watson’s (2005; see
also work by Krueger, 1999; Krueger & Markon, 2006) quantitative hierarchical
structure of emotional disorders. Specifically, MMPI-2-RF scales Suicide/Death
Ideation (SUI), Helplessness/Hopelessness (HLP), Self-Doubt (SFD), Inefficacy
(NFC), and Stress and Worry (STW) loaded on a “distress” factor, whereas
Anxiety (AXY), Behavior-Restricting Fears (BRF), and Multiple Specific Fears
(MSF) loaded on a “fear” factor in a very large outpatient mental health sample.
Sellbom (2010, 2011; see also Sellbom, Marion, et al., 2012) also elaborated on
externalizing psychopathology structures in a variety of community, correc-
tional, and forensic samples. By and large, the research has shown that the four
externalizing SP scales, including Juvenile Conduct Problems (JCP), Substance
Abuse (SUB), Aggression (AGG), and Activation (ACT), load onto a broad exter-
nalizing domain, but also can be modeled in accordance with Krueger et al.’s
(2007) bifactor structure in which residual subfactors of callous-aggression
and substance misuse can be identified. Finally, Sellbom, Titcomb, and Arbisi
(2011; see also Titcomb, Sellbom, Cohen, & Arbisi, under review) found that the
thought-dysfunction items embedded within RC6 (Ideas of Persecution) and
RC8 (Aberrant Experiences) can be modeled according to an overall thought-
dysfunction factor, but also isolating a residual, paranoid-ideation subfactor (in a
bifactor framework) that corresponds to a neuropsychiatric etiology model that
separates paranoid delusions from schizophrenia more broadly (Blackwood,
Howard, Bentall, & Murray, 2001).
Research has also established that MMPI-2-RF scale scores map onto the broader five-factor structure of personality and psychopathology (see, e.g., Bagby et al., 2014; Markon et al., 2005; Wright & Simms, 2014; Wright et al., 2012). McNulty and Overstreet (2014) subjected the entire set of MMPI-2-RF
scales (corrected for item overlap) to factor analyses in very large outpatient and
inpatient mental health samples. They found a six-factor structure, with five
of the factors mirroring the aforementioned PSY-5 domains and a sixth factor
reflecting somatization. The PSY-5 scales in their own right overlap both con-
ceptually and empirically with the personality trait domains listed in DSM-5
Section III, which provides an alternative model for operationalizing personal-
ity disorders (Anderson et al., 2013; see also Anderson et al., 2015). When the
PSY-5 domain scales were subjected to a conjoint factor analysis with the DSM-5
Section III personality-trait facets (as operationalized by the PID-5), the five-
factor higher-order structure emerged, with the PSY-5 loading on their expected
domains (see also Bagby et al., 2014, for analyses with the PSY-5 items). Recent
research has further shown that the PSY-5 scales operate similarly to the DSM-5
Section III model in accounting for variance in the formal Section II personal-
ity disorders (PD; e.g., Finn, Arbisi, Erbes, Polusny, & Thuras, 2014; Sellbom,
Smid, De Saeger, Smit, & Kamphuis, 2014). For instance, Avoidant PD is best
predicted by Negative Emotionality/Neuroticism (NEGE-r) and Introversion/
Low Positive Emotionality (INTR-r), Antisocial PD by Aggressiveness (AGGR-r)
and Disconstraint (DISC-r), Borderline by Negative Emotionality/Neuroticism
(NEGE-r) and Disconstraint (DISC-r), Narcissistic by Aggressiveness (AGGR-r),
and Schizotypal by Psychoticism (PSYC-r).
Internalizing
Research has accumulated to suggest that individual MMPI-2-RF scale scores
map onto specific hierarchical models of internalizing psychopathology in ways
predicted by theory. In one of the first studies in this regard, Sellbom, Ben-Porath,
and Bagby (2008a) examined the utility of Demoralization (RCd), Low Positive
Emotions (RC2), and Dysfunctional Negative Emotions (RC7) as markers of an
expanded model of temperament in predicting “distress” and “fear” disorders
within Watson’s (2005) framework for internalizing disorders. They used both
clinical and nonclinical samples, and via structural equation modeling, showed
that RCd mapped onto the distress disorders, whereas RC7 was preferentially
associated with fear disorders. RC2, as expected, differentiated depression
(within the distress domain) and social phobia (within the fear domain) from
the other disorders. In another study examining PTSD comorbidity, RCd was
the best predictor of internalizing/distress psychopathology (Wolf et al., 2008).
Several recent studies have also specifically examined predictors of PTSD within
contemporary frameworks. Among the RC scales, RCd seems to consistently be
a predictor of global PTSD symptomatology, and in particular, the distress/dys-
phoria factor associated with this disorder, with RC7 being a meaningful predictor
in some studies as well (Arbisi, Polusny, Erbes, Thuras, & Reddy, 2011; Miller et al., 2010;
Sellbom, Lee, Ben-Porath, Arbisi, & Gervais, 2012; Wolf et al., 2008; see also Forbes,
Elhai, Miller, & Creamer, 2010). More specifically, among the SP scales, Anxiety
(AXY) appears to be the best predictor of PTSD symptoms in various clinical and/
or veteran samples (Arbisi et al., 2011; Sellbom, Lee, et al., 2012). Anger Proneness
(ANP) is a good predictor of hyperarousal symptoms, and Social Avoidance (SAV)
of avoidance symptoms (Sellbom, Lee, et al., 2012; see also Koffel, Polusny, Arbisi, &
Erbes, 2012). Finally, Koffel et al. (2012) have begun to identify specific MMPI-2-RF
item markers of DSM-5 PTSD symptoms in a large U.S. National Guard sample, but
these require further validation before their use can be recommended.
Externalizing
Numerous studies have shown good convergent and discriminant validity of
the externalizing RC (RC4 and RC9) and SP (Juvenile Conduct Problems [JCP],
Substance Abuse [SUB], Aggression [AGG], and Activation [ACT]) scales that
make up the externalizing spectrum. As documented in the Technical Manual
and elsewhere, across nonclinical, mental health, and forensic samples, JCP is pref-
erentially associated with crime data and impulsivity. Not surprisingly, SUB is the
most potent predictor of alcohol and drug misuse, but it tends to be a good pre-
dictor of general externalizing and sensation-seeking as well (Johnson, Sellbom,
& Phillips, 2014; Tarescavage, Luna-Jones, & Ben-Porath, 2014; Tellegen & Ben-
Porath, 2008). AGG is more specifically associated with behavioral manifesta-
tions of both reactive (or angry) and instrumental forms of aggression (Tellegen
& Ben-Porath, 2008). ACT is specifically associated with externalizing as reflected
Thought Dysfunction
Four scales of the MMPI-2-RF are particularly relevant for assessing the psy-
chotic dimension that emerges in structural models of psychopathology, namely
the Higher-Order Thought Dysfunction (THD) scale, RC6 (Ideas of Persecution),
RC8 (Aberrant Experiences), and the Psychoticism (PSYC-r) scale (Tellegen &
Ben-Porath, 2008). Research to date has shown that the two main indicators of
specific psychotic symptomatology, RC6 and RC8, are effective in differentiating
between paranoid delusional and non-paranoid psychotic presentations. More
specifically, in a large inpatient psychiatric sample, Arbisi, Sellbom, and Ben-
Porath (2008) and Sellbom et al. (2006) found that RC6 was preferentially asso-
ciated with a history and active presence of delusions (particularly grandiose
and persecutory types), whereas RC8 was a better predictor of hallucinations and
non-persecutory delusions. Handel and Archer (2008) replicated these findings
in another large inpatient sample. Furthermore, research with the PSY-5 PSYC-r
scale has found it to be a good predictor of global thought disturbance and dis-
connection from reality (Tellegen & Ben-Porath, 2008), as well as Schizotypal
Personality Disorder (Sellbom et al., 2014).
Somatization
As discussed earlier, somatization is typically not recognized as its own higher-
order domain in the psychopathology literature. However, such measurement
is featured on many omnibus personality and psychopathology inventories,
including the PAI and MMPI-2-RF, and structural analyses do seem to tenta-
tively support its distinct (from internalizing psychopathology) nature (e.g.,
Kotov, Ruggero, et al., 2011).
The MMPI-2-RF RC1 (Somatic Complaints) scale is the broadest measure of
somatization on the instrument, with five SP scales reflecting preoccupation
with general physical debilitation (MLS), gastrointestinal complaints (GIC),
head pain complaints (HPC), neurological/conversion symptoms (NUC), and
cognitive memory and attention complaints (COG). The hierarchical structure of
these scales has been supported (Thomas & Locke, 2010). The Technical Manual
presents validity evidence that all scales are good predictors of somatic preoc-
cupation with reasonable discriminant validity. Some more specific research has
indicated that NUC is a particularly potent predictor of non-epileptic seizures
in medical settings (Locke et al., 2010). Gervais, Ben-Porath, and Wygant (2009)
have shown that the COG scale is a good predictor of self-report memory com-
plaints in a very large disability sample.
et al., 2005; Tackett et al., 2008; Wright et al., 2012). More specifically, in these
structures, “disinhibition” breaks down into specific representations of uncon-
scientiousness and antagonism.
Research also shows that PAI scales are sensitive and specific to direct mea-
sures of higher-order factors of personality and psychopathology. For instance,
in the initial validation studies, each of the higher-order dimensions of the nor-
mal range NEO Personality Inventory (NEO-PI; Costa & McCrae, 1985) was
specifically correlated with several PAI scales. Research with instruments whose
content focuses on more pathological elements of personality traits, such as the
Personality Inventory for DSM-5 (Krueger et al., 2012), shows similar patterns
(Hopwood et al., 2013). In the remainder of this section, we provide more specific
information about connections between PAI scales and higher-order dimensions
of personality and psychopathology.
Internalizing
A general distress factor with strong loadings on PAI scales such as Depression
(DEP), Anxiety (ANX), Anxiety-Related Disorders (ARD), Borderline Features
(BOR), Suicidal Ideation (SUI), and Stress (STR) is invariably the first factor
extracted across studies examining the structure of the PAI scales. As with the
MMPI-2-RF, validity correlations among these scales provide important informa-
tion about lower-order fear and distress variants of the internalizing dimension.
Numerous studies have demonstrated the sensitivity of PAI scales to distress
disorders (e.g., major depression) and fear disorders (e.g., panic disorder or pho-
bias), as well as the ability of PAI scales to discriminate between these classes of
disorders (e.g., Fantoni-Salvador & Rogers, 1997). Furthermore, Veltri, Williams,
and Braxton (2004) found that PAI DEP correlated .55 with MMPI-2-RF RC7
(Dysfunctional Negative Emotions) and .70 with RC2 (Low Positive Emotions),
whereas PAI ARD correlated .45 with RC2 and .70 with RC7, suggesting that
the PAI and MMPI-2-RF operate similarly in terms of distinguishing fear and
distress disorders.
PAI scales are also available for the targeting of specific constructs within this
domain. The DEP and ANX scales have the strongest loadings on the internaliz-
ing factor. Both of these scales have subscales focused on the affective, cognitive,
and physiological aspects of the constructs, which generally tend to be related to
the distress aspects of that factor. The one exception is the ANX-Physiological
subscale, which, together with the ARD-Phobias subscale, is the most specific
to fear symptoms involving panic and phobias among PAI scales. For instance,
in the initial validation studies, Morey (1991) showed that ANX-Physiological
correlated .62 and ARD-Phobias correlated .60 with the MMPI Wiggins Phobias
Content Scale, whereas the average correlation between the MMPI Phobias scale and the other PAI ANX and ARD subscales was .37.
Several PAI scales can be used to assess other, more specific problems on
the internalizing spectrum. For instance, an emerging literature supports the
validity of the PAI, and particularly the ARD-Traumatic Stress scale, for assess-
ing post-traumatic symptoms (e.g., Edens & Ruiz, 2005). In addition, the ARD-T
scale had a sensitivity of 79% and a specificity of 88% for a PTSD diagnosis based
on the Clinician Administered PTSD Scale in a group of women who had been
exposed to traumatic events (McDevitt-Murphy, Weathers, Adkins, & Daniels,
2005). The PAI Suicidal Ideation (SUI) scale has been shown to be a valid indi-
cator of suicidal behavior (Hopwood, Baker, & Morey, 2008). The PAI Somatic
Complaints (SOM) scale, which typically loads on the internalizing factor of
the PAI, focuses on common health concerns, the somatization of psychological
symptoms, and the presence of bizarre or unlikely symptoms. The SOM scale has
been shown to be sensitive to a variety of health conditions, such as headaches
(Brendza, Ashton, Windover, & Stillman, 2005), pain (Karlin et al., 2006), and
diabetes (Jacobi, 2002).
Externalizing
The second factor that is typically extracted in factor analyses of the PAI has
the strongest loadings on Antisocial Features (ANT), Alcohol Problems (ALC),
and Drug Problems (DRG), implying an underlying externalizing dimension.
The ANT scale has subscales measuring antisocial behaviors directly, as well as subscales that measure psychopathic features, including callous egocentricity
and sensation-seeking. These scales have demonstrated empirical validity in
discriminating disorders from the externalizing spectrum. For instance, Edens,
Buffington-Vollum, Colwell, Johnson, and Johnson (2002) reported an area
under the curve (AUC) of .70 using the ANT scale to indicate categorical psy-
chopathy diagnoses based on the Psychopathy Checklist–Revised (Hare, 2003),
whereas Ruiz, Dickinson, and Pincus (2002) reported an AUC of .84 using the
ALC scale to predict interview-based alcohol dependence.
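For readers less familiar with the AUC statistic reported in these studies, the short sketch below shows how it is computed when a dimensional scale score is used to indicate a categorical diagnosis; the group sizes, score distributions, and use of scikit-learn's roc_auc_score are illustrative assumptions, not the published data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical T-scores on a dimensional scale for 40 diagnosed and 160 non-diagnosed cases.
diagnosed = rng.normal(loc=68, scale=10, size=40)
not_diagnosed = rng.normal(loc=55, scale=10, size=160)

scores = np.concatenate([diagnosed, not_diagnosed])
labels = np.concatenate([np.ones(40), np.zeros(160)])

# AUC: the probability that a randomly chosen diagnosed case scores higher
# on the dimensional scale than a randomly chosen non-diagnosed case.
print(round(roc_auc_score(labels, scores), 2))
```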
Antagonism
A third factor that is often extracted in factor analyses of the PAI scales has
its strongest loadings on Mania (MAN), Dominance (DOM), and Aggression
(AGG).3 Common among these scales is a tendency toward self-importance, con-
trolling behavior, and interpersonal insensitivity, which are collectively similar
to the personality construct Antagonism and the liability toward callousness-
aggression that emerges in structural psychopathology models. The MAN scale
has specific subscales measuring irritability and frustration tolerance, grandios-
ity and self-esteem, and energy or activity. The Aggression (AGG) scale also gen-
erally exhibits strong loadings on the PAI externalizing factors, and has subscales
sampling behaviors related to an aggressive attitude, verbally aggressive behav-
ior, and physically aggressive behavior. An emerging body of research, mostly in
forensic samples, speaks to the validity of AGG as a correlate and predictor of violent and other aggressive behaviors (e.g., Edens & Ruiz, 2005; see also
Gardner et al., 2015, for a meta-analysis).
Detachment
In other samples, such as the PAI community normative sample, the fourth
factor has its highest loadings on high Nonsupport (NON) and low Warmth
(WRM), implying social detachment, disconnection, and introversion, akin to
the social-detachment/introversion factor that emerges from structural mod-
els of psychopathology and personality discussed earlier. In addition to these
scales, numerous PAI scales provide additional information about this aspect of
personality and psychopathology. In particular, the Schizophrenia (SCZ) Social
Detachment subscale provides a direct assessment of disconnection from the
social environment, which is common among psychotic individuals, but is also a
common concern among individuals with other forms of psychopathology.
Thought Dysfunction
Although thought disorder is often identified as a major spectrum of psychopa-
thology and is often reflected in one of the major factors explaining covariation
in MMPI-2-RF scales, it has not been identified in factor analyses of the PAI full
scales. One reason for this finding has to do with participant sampling, insofar as
no factor analyses have been based on samples with a high representation of psy-
chotic individuals. A second reason has to do with the proportion of scales on the
PAI targeting thought dysfunction. Only two full scales, Schizophrenia (SCZ)
and Paranoia (PAR), directly target psychotic content, and several of the sub-
scales of those full scales will tend to relate more strongly to other factors (such
as social detachment, as described above). As with other broadband measures that include relatively few scales tapping thought dysfunction (e.g., the Schedule
for Nonadaptive and Adaptive Personality [SNAP] or Dimensional Assessment
of Personality Pathology [DAPP]), the proportion of content on the PAI may not
be sufficient to yield a robust thought-dysfunction factor. That being said, the
PAI scales have been shown to be empirically effective in distinguishing between
individuals with and without psychotic disorders. For example, Klonsky (2004)
reported an effect size difference of d = 1.29 in distinguishing schizophrenic and
non-schizophrenic patients using PAI SCZ.
results are quite impressive for both inventories, and it is important to note that
none of these analyses was ever rigged or otherwise set up to confirm the extant
structures; rather, exploratory analyses identified them in parallel. Subsequent
validity research on both MMPI-2-RF and PAI scale scores clearly indicates that
the higher-order structures and the scales that compose them reflect constructs
that are located within nomological networks similar to those in the extant litera-
ture. Therefore, clinical neuropsychologists and other mental health practitioners
who use these inventories in practice can rest assured that their scale scores map
onto contemporary and empirically validated models of psychopathology and can be used to generate hypotheses about diagnostic considerations (based on current nosologies) and about standing on individual-difference personality traits.
Even in light of these sanguine conclusions, there is still much work needed in
both empirical investigations of psychopathology structures and the assessment
of these structures as we move forward. As would be expected from the methods
used to conduct latent-structure analyses, many of our current concepts of the
structure of psychopathology, such as that represented in the MCLM, are a result
of the types of symptoms and difficulties that were examined, as well as the types
of individuals who were included in the studies’ samples. We have tried to high-
light in this chapter the effects of these methodological issues with our inclusion
of three-, four-, and five-factor models of psychopathology. However, our under-
standing of psychopathology from a dimensional point of view is ever evolving.
Future research is needed to reconcile current ambiguities (e.g., an independent
somatization factor?) and to further establish a replicable structure that accounts
for more of the dysfunctions practitioners encounter in their diverse practices
(e.g., impulse-control disorders, eating disorders, paraphilias). Equally impor-
tant for future work is ensuring that assessment instruments already in use,
such as the MMPI-2-RF and the PAI, map onto emerging personality/psychopa-
thology structures. Research of this type allows us to bridge the divide between
categorical conceptualizations currently necessary for medical documentation
and financial reimbursements and dimensional models that allow large bodies
of clinical science research to be more easily applied in routine practice. These
efforts will require that both assessment scholars and practitioners move away
from the categorical thinking about psychological dysfunctions we have all
been trained to use, as well as crystallized beliefs about the distinctions between
psychopathology and personality. The previous efforts reviewed in this chapter
represent the first steps toward this type of work, both in terms of central liabili-
ties leading to mental difficulties and how our major assessment instruments
conform to these structures. However, it should be clear that much additional
work is needed, especially concerning mapping existing scales onto more spe-
cific liabilities in personality and psychopathology hierarchies.
AUTHOR NOTES
Tayla T. C. Lee is with the Department of Psychological Science, Ball State
University, Muncie, Indiana; Martin Sellbom is with the Department of
NOTES
1. To our knowledge, the model being described in this chapter has not been given an
official name, perhaps because this would not be in the “model in development” spirit
that pervades conclusions sections in reports of these studies. For ease of reference,
we have given the model a name in this chapter. Although it is a mouthful, we hope
the reader will forgive us for choosing “Multivariate Correlated Liabilities Model.”
We chose this name because the discussed models represent multivariate extensions
of the Correlated Liabilities Model (Klein & Riso, 1993; Neale & Kendler, 1995).
2. The latter abstraction is typically psychoticism/thought dysfunction when sufficient
indicators of such individual differences are included; however, when predominant
measures of the five-factor model are used, openness tends to appear at the fifth level.
3. The nature of the third factor tends to depend on the sample in which the analysis
is conducted (see, for example, Morey, 2007; and Hoelzle & Meyer, 2009).
REFERENCES
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental
Disorders, Fifth Edition. Arlington, VA: APA.
Anderson, J. L., Sellbom, M., Ayearst, L., Quilty, L. C., Chmielewski, M., & Bagby, R. M.
(2015). Associations between DSM-5 Section III personality traits and the Minnesota
Multiphasic Personality Inventory 2–Restructured Form (MMPI-2-RF) scales in a
psychiatric patient sample. Psychological Assessment, 27, 801–815.
Anderson, J. L., Sellbom, M., Bagby, R. M., Quilty, L. C., Veltri, C. O. C., Markon,
K. E., & Krueger, R. F. (2013). On the convergence between PSY-5 domains and
PID-5 domains and facets: Implications for assessment of DSM-5 personality traits.
Assessment, 20, 286–294.
Arbisi, P. A., Sellbom, M., & Ben-Porath, Y. S. (2008). Empirical correlates of the MMPI-
2 Restructured Clinical (RC) scales in an inpatient sample. Journal of Personality
Assessment, 90, 122–128.
Arbisi, P. A., Polusny, M. A., Erbes, C. R., Thuras, P., & Reddy, M. K. (2011). The
Minnesota Multiphasic Personality Inventory–2 Restructured Form in National
Guard soldiers screening positive for posttraumatic stress disorder and mild trau-
matic brain injury. Psychological Assessment, 23, 203–214.
Bagby, R. M., Sellbom, M., Ayearst, L. E., Chmielewski, M. S., Anderson, J. L., & Quilty,
L. C. (2014). Exploring the hierarchical structure of the MMPI-2-RF Personality
Psychopathology Five in psychiatric patient and university student samples. Journal
of Personality Assessment, 96, 166–172.
Ben-Porath, Y. S., & Tellegen, A. (2008). MMPI-2-RF (Minnesota Multiphasic Personality Inventory–2 Restructured Form): Manual for Administration, Scoring, and Interpretation. Minneapolis, MN: University of Minnesota Press.
Blackwood, N. J., Howard, R. J., Bentall, R. P., & Murray, R. M. (2001). Cognitive neu-
ropsychiatric models of persecutory delusions. American Journal of Psychiatry 158,
527–539.
Brady, K. T., & Sinha, R. (2005). Co-occurring mental and substance use disorders: The
neurobiological effects of chronic stress. American Journal of Psychiatry, 162,
1483–1493.
Brendza, D., Ashton, K., Windover, A., & Stillman, M. (2005). Personality Assessment
Inventory predictors of therapeutic success or failure in chronic headache patients.
The Journal of Pain, 6, 81.
Brown, T. A., & Barlow, D. H. (2005). Categorical vs. dimensional classification of men-
tal disorders in DSM-V and beyond. Journal of Abnormal Psychology, 114, 551–556.
Caspi, A., Houts, R. M., Belsky, D. W., Goldman-Mellor, S. J., Harrington, H., Israel, S., … Moffitt, T. E. (2014). The p factor: One general psychopathology factor in the
structure of psychiatric disorders? Clinical Psychological Science, 2, 119–137.
Chmielewski, M., & Watson, D. (2007, October). Oddity: The Third Higher Order Factor
of Psychopathology. Poster presented at the 21st Annual Meeting of the Society for
Research in Psychopathology, Iowa City, IA.
Costa, P. T. Jr., & McCrae, R. R. (1985). NEO Personality Inventory Manual. Odessa,
FL: Psychological Assessment Resources.
Cox, B. J., Clara, I. P., & Enns, M. W. (2002). Posttraumatic stress disorder and the
structure of common mental disorders. Depression & Anxiety, 15(4), 168–171.
Cronbach, L. J, & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52, 281–302.
Eaton, N. R., South, S. C., & Krueger, R. F. (2010). The meaning of comorbidity among
common mental disorders. In T. Millon, R. F. Krueger, & E. Simonsen (Eds.), Contemporary Directions in Psychopathology: Scientific Foundations of the DSM-V
and ICD-11 (pp. 223–241). New York: The Guilford Press.
Edens, J. F., Buffington-Vollum, J. K., Colwell, K. W., Johnson, D. W., & Johnson, J. K.
(2002). Psychopathy and institutional misbehavior among incarcerated sex offend-
ers: A comparison of the Psychopathy Checklist–Revised and the Personality
Assessment Inventory. International Journal of Forensic Mental Health, 1, 49–58.
Edens, J. F., & Ruiz, M. A. (2005). PAI Interpretive Report for Correctional Settings
Professional Manual. Lutz, FL: Psychological Assessment Resources.
Fantoni-Salvador, P., & Rogers, R. (1997). Spanish versions of the MMPI-2 and PAI: An
investigation of concurrent validity with Hispanic patients. Assessment, 4, 29–39.
Finn, J. A., Arbisi, P. A., Erbes, C. R., Polusny, M. A., & Thuras, P. (2014). The MMPI-
2 Restructured Form Personality Psychopathology Five Scales: Bridging DSM-5
Section 2 personality disorders and DSM-5 Section 3 personality trait dimensions.
Journal of Personality Assessment, 96, 173–184.
Fleming, S., Shevlin, M., Murphy, J., & Joseph, S. (2014). Psychosis within dimensional
and categorical models of mental illness. Psychosis, 6, 4–15.
Forbes, D., Elhai, J. D., Miller, M. W., & Creamer, M. (2010). Internalizing and exter-
nalizing classes in posttraumatic stress disorder: A latent class analysis. Journal of
Traumatic Stress, 23, 340–349.
Gardner, B. O., Boccaccini, M. T., Bitting, B. S., & Edens, J. F. (2015). Personality
Assessment Inventory Scores as predictors of misconduct, recidivism, and vio-
lence: A meta-analytic review. Psychological Assessment, 27, 534–544.
Gervais, R. O., Ben-Porath, Y. S., & Wygant, D. B. (2009). Empirical correlates and interpretation of the MMPI-2-RF Cognitive Complaints Scale. The Clinical Neuropsychologist, 23, 996–1015.
Kendler, K. S., Prescott, C. A., Myers, J., & Neale, M. C. (2003). The structure of genetic
and environmental risk factors for common psychiatric and substance use disor-
ders in men and women. Archives of General Psychiatry, 60, 929–937.
Kessler, R. C., Adler, L., Barkley, R., Biederman, J., Conners, C. K., Demler, O., …
Zaslavsky, A. M. (2006). The prevalence and correlates of adult ADHD in the United
States: Results from the National Comorbidity Survey Replication. American
Journal of Psychiatry, 163, 716–723.
Kessler, R. C., Chiu, W. T., Demler, O., & Walters, E. E. (2005). Prevalence, severity,
and comorbidity of twelve-month DSM-IV disorders in the National Comorbidity
Survey Replication (NCS-R). Archives of General Psychiatry, 62, 617–627.
Kessler, R. C., & Wang, P. S. (2008). The descriptive epidemiology of commonly occur-
ring mental disorders in the United States. Annual Review of Public Health, 29,
115–129.
Klein, D. N., & Riso, L. P. (1993). Psychiatric disorders: Problems of boundaries and
comorbidity. In C. G. Costello (Ed.), Basic Issues in Psychopathology (pp. 19–66).
New York: Guilford Press.
Klonsky, E. D. (2004). Performance of Personality Assessment Inventory and
Rorschach indices of schizophrenia in a public psychiatric hospital. Psychological
Services, 1, 107–110.
Koffel, E., Polusny, M. A., Arbisi, P. A., & Erbes, C. R. (2012). A preliminary investiga-
tion of the new and revised symptoms of posttraumatic stress disorder in DSM-5.
Depression and Anxiety, 29, 731–738.
Kotov, R., Chang, S. W., Fochtmann, L. J., Mojtabai, R., Carlson, G. A., Sedler, M. J.,
& Bromet, E. J. (2011). Schizophrenia in the internalizing-externalizing frame-
work: A third dimension? Schizophrenia Bulletin, 37, 1168–1178.
Kotov, R., Ruggero, C. J., Krueger, R. F., Watson, D., Qilong, Y., & Zimmerman,
M. (2011). New dimensions in the quantitative classification of mental illness.
Archives of General Psychiatry, 68, 1003–1011.
Kramer, M. D., Krueger, R. F., & Hicks, B. M. (2008). The role of internalizing and
externalizing liability factors in accounting for gender differences in the prevalence
of common psychopathological syndromes. Psychological Medicine, 38, 51–62.
Krueger, R. F. (1999). The structure of common mental disorders. Archives of General
Psychiatry, 56, 921–926.
Krueger, R. F., Caspi, A., Moffitt, T. E., & Silva, P. A. (1998). The structure and stability
of common mental disorders (DSM-III-R): A longitudinal-epidemiological study.
Journal of Abnormal Psychology, 107, 216–227.
Krueger, R. F., Chentsova-Dutton, Y. E., Markon, K. E., Goldberg, D., & Ormel, J.
(2003). A cross-cultural study of the structure of comorbidity among common psy-
chopathological syndromes in the general health care setting. Journal of Abnormal
Psychology, 112, 437–447.
Krueger, R. F., Derringer, J., Markon, K. E., Watson, D., & Skodol, A. E. (2012). Initial
construction of a maladaptive personality trait model and inventory for DSM–5.
Psychological Medicine, 42, 1879–1890.
Krueger, R. F., Hicks, B. M., Patrick, C. J., Carlson, S. R., Iacono, W. G., & McGue,
M. (2002). Etiologic connections among substance dependence, antisocial behav-
ior, and personality: Modeling the externalizing spectrum. Journal of Abnormal
Psychology, 111, 411–474.
Krueger, R. F., & Markon, K. E. (2006). Reinterpreting comorbidity: A model-based
approach to understanding and classifying psychopathology. Annual Review of
Clinical Psychology, 2, 111–133.
Krueger, R. F., & Markon, K. E. (2014). The role of the DSM-5 personality trait model
in moving toward a quantitative and empirically based approach to classify-
ing personality and psychopathology. Annual Review of Clinical Psychology, 10,
477–501.
Krueger, R. F., Markon, K. E., Patrick, C. J., Benning, S. D., & Kramer, M. D. (2007).
Linking antisocial behavior, substance use, and personality: An integrative quanti-
tative model of the adult externalizing spectrum. Journal of Abnormal Psychology,
116, 645–666.
Krueger, R. F., McGue, M., & Iacono, W. G. (2001). The higher-order structure of com-
mon DSM mental disorders: Internalization, externalization, and their connections
to personality. Personality and Individual Differences, 30, 1245–1259.
Kushner, S. C., Quilty, L. C., Tackett, J. L., & Bagby, R. M. (2011). The hierarchical
structure of the dimensional assessment of personality pathology (DAPP–BQ).
Journal of Personality Disorders, 25, 504–516.
Lahey, B. B., Rathouz, P. J., Van Hulle, C., Urbano, R. C., Krueger, R. F., Applegate, B.,
Garriock, H. A., … Waldman, I. D. (2008). Testing structural models of DSM-IV
symptoms of common forms of child and adolescent psychopathology. Journal of
Abnormal Child Psychology, 36, 187–206.
Lahey, B. B., Applegate, B., Hakes, J. K., Zald, D. H., Hariri, A. R., & Rathouz, P. J.
(2012). Is there a general factor of prevalent psychopathology during adulthood?
Journal of Abnormal Psychology, 121, 971–977.
Lahey, B. B., Van Hulle, C. A., Singh, A. L., Waldman, I. D. & Rathouz, P. J. (2011).
Higher-order genetic and environmental structure of prevalent forms of child and
adolescent psychopathology. Archives of General Psychiatry, 68, 181–189.
Lilienfeld, S. O., & Andrews, B. P. (1996). Development and preliminary validation of
a self-report measure of psychopathic personality traits in noncriminal populations.
Journal of Personality Assessment, 66, 488–524.
Locke, D. E. C., Kirlin, K. A., Thomas, M. L., Osborne, D., Hurst, D. F., Drazkowsi, J. F.,
Sirven, J. I., & Noe, K. H. (2010). The Minnesota Multiphasic Personality Inventory
Restructured Form in the epilepsy monitoring unit. Epilepsy and Behavior, 17,
252–258.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological
Reports, 3, 653–694.
Markon, K. E., Krueger, R. F., & Watson, D. (2005). Delineating the structure of nor-
mal and abnormal personality: An integrative hierarchical approach. Journal of
Personality and Social Psychology, 88, 139–157.
McDevitt-Murphy, M. E., Weathers, F. W., Adkins, J. W., & Daniels, J. B. (2005). Use
of the Personality Assessment Inventory in the assessment of post-traumatic stress
disorder in women. Journal of Psychopathology and Behavior Assessment, 27, 57–65.
McNulty, J. L., & Overstreet, S. R. (2014). Viewing the MMPI-2-RF structure through the Personality Psychopathology Five (PSY-5) lens. Journal of Personality Assessment, 96, 151–157.
Miller, M. W., Wolf, E. J., Harrington, K. M., Brown, T. A., Kaloupek, D. G., & Keane,
T. M. (2010). An evaluation of competing models for the structure of PTSD symp-
toms using external measures of comorbidity. Journal of Traumatic Stress, 23,
631–638.
Morey, L. C. (1991). Professional Manual for the Personality Assessment Inventory.
Odessa, FL: Psychological Assessment Resources.
Morey, L. C. (2007). Professional Manual for the Personality Assessment Inventory,
Second Edition. Lutz, FL: Psychological Assessment Resources.
Morey, L. C., & Hopwood, C. J. (200). The Personality Assessment Inventory and
the measurement of normal and abnormal personality constructs. In S. Strack
(Ed.), Differentiating Normal and Abnormal Personality (2nd ed., pp. 451–471).
New York: Springer.
Neale, M. C., & Kendler, K. S. (1995). Models of comorbidity for multifactorial disor-
ders. American Journal of Human Genetics, 57, 935–953.
O’Connor, B. P. (2002). A quantitative review of the comprehensiveness of the five-
factor model in relation to popular personality inventories. Assessment, 9, 188–203.
Patrick, C. J., & Drislane, L. E. (2015). Triarchic model of psychopathy: Origins, opera-
tionalizations, and observed linkages with personality and general psychopathol-
ogy. Journal of Personality, 83, 627–643.
Phillips, T. R., Sellbom, M., Ben-Porath, Y. S., & Patrick, C. J. (2014). Further devel-
opment and construct validation of MMPI-2-RF indices of Global Psychopathy, Fearless-Dominance, and Impulsive-Antisociality in a sample of incarcerated
women. Law and Human Behavior, 28, 34–46.
Regier, D. A., Farmer, M. E., Rae, D. S., Locke, B. Z., Keith, S. J., Judd, L. L., & Goodwin,
F. (1990). Comorbidity of mental disorders with alcohol and other drug abuse.
Results from the Epidemiologic Catchment Area (ECA) study. Journal of the
American Medical Association, 264, 2511–2518.
Ruiz, M. A., Dickinson, K. A., & Pincus, A. L. (2002). Concurrent validity of the
Personality Assessment Inventory Alcohol Problems (ALC) scale in a college stu-
dent sample. Assessment, 9, 261–270.
Sellbom, M. (2010, March). The MMPI-2-RF Externalizing Scales: Hierarchical Structure
and Links to Contemporary Models of Psychopathology. Paper presented at the 2010
Annual Meeting of the American Psychology-Law Society, Vancouver, BC, Canada.
Sellbom, M. (2011, September). Exploring Psychopathology Structure Empirically
Through an Omnibus Clinical Assessment Tool: Reintegrating the MMPI into
Contemporary Psychopathology Research. Paper presented at the 2011 Annual
Meeting of the Society for Research on Psychopathology, Boston, MA.
Sellbom, M., Bagby, R. M., Kushner, S., Quilty, L. C., & Ayearst, L. E. (2012). The diag-
nostic construct validity of MMPI-2 Restructured Form (MMPI-2-RF) scale scores.
Assessment, 19, 185–195.
Sellbom, M., Ben-Porath, Y. S., & Bagby, R. M. (2008a). On the hierarchical structure
of mood and anxiety disorders: Confirmatory evidence and an elaborated model of
temperament markers. Journal of Abnormal Psychology, 117, 576–590.
Sellbom, M., Ben-Porath, Y. S., & Bagby, R. M. (2008b). Personality and psychopathol-
ogy: Mapping the MMPI-2 Restructured Clinical (RC) scales onto the five-factor
model of personality. Journal of Personality Disorders, 22, 291–312.
Sellbom, M., Ben-Porath, Y. S., & Graham, J. R. (2006). Correlates of the MMPI–2
Restructured Clinical (RC) scales in a college counseling setting. Journal of
Personality Assessment, 86, 88–99.
Sellbom, M., Ben-Porath, Y. S., Lilienfeld, S. O., Patrick, C. J., & Graham, J. R. (2005).
Assessing psychopathic personality traits with the MMPI-2. Journal of Personality
Assessment, 85, 334–343.
Sellbom, M., Ben-Porath, Y. S., & Stafford, K. S. (2007). A comparison of MMPI-2 mea-
sures of psychopathic deviance in a forensic setting. Psychological Assessment, 19,
430–436.
Sellbom, M., Lee, T. T. C., Ben-Porath, Y. S., Arbisi, P. A., & Gervais, R. O. (2012).
Differentiating PTSD symptomatology with the MMPI-2-RF (Restructured Form)
in a forensic disability sample. Psychiatry Research, 197, 172–179.
Sellbom, M., Marion, B. E., Kastner, R. M., Rock, R. C., Anderson, J. L., Salekin,
R. T., & Krueger, R. F. (2012, October). Externalizing Spectrum of Psychopathology:
Associations with DSM-5 Personality Traits and Neurocognitive Tasks. Paper pre-
sented at the 2012 Annual Meeting of the Society for Research on Psychopathology,
Ann Arbor, MI.
Sellbom, M., Smid, W., De Saeger, H., Smit, N., & Kamphuis, J. H. (2014). Mapping
the Personality Psychopathology Five Domains onto DSM-IV personality disor-
ders in Dutch clinical and forensic samples: Implications for the DSM-5. Journal of
Personality Assessment, 96, 185–192.
Sellbom, M., Titcomb, C., & Arbisi, P. A. (2011, May). Clarifying the Hierarchical
Structure of Positive Psychotic Symptoms: The MMPI-2-RF as a Road Map. Paper presented at the 46th Annual Symposium on Recent MMPI-2, MMPI-2-RF, and
MMPI-A Research, Minneapolis, MN.
Slade, T., & Watson, D. (2006). The structure of common DSM-IV and ICD-10 men-
tal disorders in the Australian general population. Psychological Medicine, 35,
1593–1600.
Tackett, J. L., Quilty, L. C., Sellbom, M., Rector, N. A., & Bagby, R. M. (2008). Additional
evidence for a quantitative hierarchical model of the mood and anxiety disorders
for DSM-V: The context of personality structure. Journal of Abnormal Psychology,
117, 812–825.
Tarescavage, A. M., Luna-Jones, L., & Ben-Porath, Y. S. (2014). Minnesota Multiphasic
Personality Inventory–2–Restructured Form (MMPI-2-RF) predictors of violating
probation after felonious crimes. Psychological Assessment, 26, 1375–1380.
Tellegen, A. (1985). Structures of mood and personality and their relevance to assess-
ing anxiety, with an emphasis on self-report. In A. H. Tuma & J. D. Maser (Eds.),
Anxiety and the Anxiety Disorders (pp. 681–706). Hillsdale, NJ: Erlbaum.
Tellegen, A., & Ben-Porath, Y. S. (2008). MMPI-2-RF (Minnesota Multiphasic Personality
Inventory–2 Restructured Form): Technical Manual. Minneapolis, MN: University
of Minnesota Press.
Tellegen, A., Ben-Porath, Y. S., McNulty, J. L., Arbisi, P. A., Graham, J. R., & Kaemmer,
B. (2003). MMPI-2 Restructured Clinical (RC) Scales: Development, Validation, and
Interpretation. Minneapolis, MN: University of Minnesota Press.
Thomas, M. L., & Locke, D. E. C. (2010). Psychometric properties of the MMPI-2-RF
Somatic Complaints (RC1) scale. Psychological Assessment, 22, 492–503.
Titcomb, C., Sellbom, M., Cohen, A., & Arbisi, P. A. (under review). On the Structure of
Positive Psychotic Symptomatology: Can Paranoia Be Modeled as a Separate Liability
Factor? Manuscript submitted for publication.
Van der Heijden, P. T., Egger, J. I. M., Rossi, G., & Derksen, J. J. L. (2012). Integrating
psychopathology and personality disorders conceptualized by the MMPI-2-RF and
the MCMI-III: A structural validity study. Journal of Personality Assessment, 94,
345–347.
Van der Heijden, P. T., Rossi, G. M., Van der Veld, M. M., Derksen, J. J. L., & Egger, J. I.
M. (2013a). Personality and psychopathology: Higher order relations between the
Five Factor Model of personality and the MMPI-2 Restructured Form. Journal of
Research in Personality, 47, 572–579.
Van der Heijden, P. T., Rossi, G. M., Van der Veld, M. M., Derksen, J. J. L., & Egger,
J. I. M. (2013b). Personality and psychopathology: Mapping the MMPI-2-RF on
Cloninger’s psychobiological model of personality. Assessment, 20, 576–584.
Veltri, C. O. C., Williams, J. E., & Braxton, L. (2004, March). MMPI-2 Restructured
Clinical scales and the Personality Assessment Inventory in a veteran sample.
Poster presented at the Annual Meeting of the Society for Personality Assessment,
Chicago, IL.
Vollebergh, W. A. M., Iedema, J., Bijl, R. V., de Graaf, R., Smit, F., & Ormel, J. (2001).
The structure and stability of common mental disorders: The NEMESIS study.
Archives of General Psychiatry, 58, 597–603.
Watson, D. (2005). Rethinking the mood and anxiety disorders: A quantitative hierar-
chical model for DSM–V. Journal of Abnormal Psychology, 114, 522–536.
Watson, C., Quilty, L. C., & Bagby, R. M. (2011). Differentiating bipolar disorder from
major depressive disorder using the MMPI-2-RF: A receiver operating character-
istics (ROC) analysis. Journal of Psychopathology and Behavioral Assessment, 33,
368–374.
Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological
Bulletin, 98, 219–235.
Widiger, T. A., & Clark, L. A. (2000). Toward DSM-V and the classification of psycho-
pathology. Psychological Bulletin, 126, 946–963.
Widiger, T. A., & Sankis, L. M. (2000). Adult psychopathology: Issues and controver-
sies. Annual Review of Psychology, 51(1), 377–404.
Wolf, E. J., Miller, M. W., Orazem, R. J., Weierich, M. R., Castillo, D. T., Milford, J., …
Keane, T. M. (2008). The MMPI-2 Restructured Clinical Scales in the assessment
of posttraumatic stress disorder and comorbid disorders. Psychological Assessment,
20, 327–340.
Wright, A. G. C., Krueger, R. F., Hobbs, M. J., Markon, K. E., Eaton, N. R., & Slade,
T. (2013). The structure of psychopathology: Toward an expanded quantitative
empirical model. Journal of Abnormal Psychology, 122, 281–294.
Wright, A. G. C., & Simms, L. J. (2014). On the structure of personality disorder
traits: Conjoint analyses of the CAT-PD, PID-5, and NEO-PI-3 trait models. Personality Disorders: Theory, Research, and Treatment, 5, 43–54.
Wright, A. G. C., Thomas, K. M., Hopwood, C. J., Markon, K. E., Pincus, A. L., &
Krueger, R. F. (2012). The hierarchical structure of DSM–5 pathological personality
traits. Journal of Abnormal Psychology, 121, 951–957.
Wygant, D. B., & Sellbom, M. (2012). Viewing psychopathy from the perspective of
the Personality Psychopathology Five Model: Implications for DSM-5. Journal of
Personality Disorders, 26, 717–726.
STEPHEN C. BOWDEN AND SUE FINCH
There has never been any mathematical or substantive rebuttal of the main
findings of psychometric theory. (Schmidt & Hunter, 1996, p. 199)
CLINICAL SCENARIO
A 67-year-old male patient is assessed after family physician referral for the
evaluation of possible early dementia. In the context of no other significant ill-
ness, a family informant provides a history of minor everyday memory failures.
Premorbid cognitive ability is judged to be of at least “high average” standing
(corresponding to scores of 110 or above on a test of general cognitive ability,
with a population mean of 100 and a standard deviation of 15; e.g., Wechsler,
2009) estimated from the patient’s educational and occupational attainment. On
a well-validated test of anterograde memory function (or long-term retrieval abil-
ity: see Chapter 3 in this volume), with the same population mean and standard
deviation, the patient scored 115. Assuming a test score reliability of 0.95, a 95%
confidence interval (CI) was constructed centered on the predicted true score of
114 (see section below “The Predicted True Score”) and using the standard error
of estimation (see section “The Family of Standard Errors of Measurement”). The
95% CI ranged from 108 to 121 (see Table 5.1). This CI includes the estimated premorbid range of ability, namely, above an Index score of 110. Hence the examining clinician concluded, with 95% confidence, that there was no objective evidence that anterograde memory function was below the level expected on the basis of premorbid estimates. The clinician recommended a healthy lifestyle with regular cogni-
tive stimulation and a wait-and-see approach to the diagnosis of early dementia.
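The interval in this scenario can be reproduced in a few lines. The sketch below assumes the standard classical test theory formulas that the chapter introduces later: the predicted true score is the population mean plus the reliability times the observed deviation from the mean, and the standard error of estimation is the standard deviation multiplied by the square root of reliability times (1 minus reliability); the variable names are illustrative.

```python
import math

mean, sd = 100, 15        # population mean and standard deviation of the test
observed = 115            # patient's observed memory score
reliability = 0.95        # test score reliability
z = 1.96                  # multiplier for a 95% interval

# Predicted (regressed) true score.
predicted_true = mean + reliability * (observed - mean)           # 114.25, reported as 114

# Standard error of estimation, used for the CI around the predicted true score.
se_estimation = sd * math.sqrt(reliability * (1 - reliability))   # approximately 3.27

lower = predicted_true - z * se_estimation
upper = predicted_true + z * se_estimation
print(round(predicted_true), round(lower), round(upper))          # 114 108 121
```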
Table 5.1 Predicted true scores with 95% confidence intervals (95% CI) for different observed test scores from a range of hypothetical tests with a population mean of 100 and a standard deviation of 15, illustrated for alternative reliability values and corresponding standard errors of estimation (SE estimation) values. The confidence intervals are centered on the predicted true score (bolded) and calculated with the standard error of estimation. The upper and lower limits of the confidence intervals are shown for a range of Intelligence Index scale scores from 55 (z = –3) to 145 (z = +3). Predicted true scores and confidence limits are rounded to the nearest whole number.
Observed   Upper     Lower     Predicted   Upper     Lower     Predicted   Upper     Lower     Predicted   Upper
score      bound     bound     true score  bound     bound     true score  bound     bound     true score  bound
           (95% CI)  (95% CI)              (95% CI)  (95% CI)              (95% CI)  (95% CI)              (95% CI)
 55          92        55         69         82        51         60         68        51         57         64
 60          95        59         72         85        55         64         73        56         62         68
 65          97        62         76         89        60         69         77        60         67         73
 70         100        66         79         92        64         73         82        65         72         78
 75         102        69         83         96        69         78         86        70         76         83
 80         105        73         86         99        73         82         91        75         81         87
 85         107        76         90        103        78         87         95        79         86         92
 90         110        80         93        106        82         91        100        84         91         97
 95         112        83         97        110        87         96        104        89         95        102
100         115        87        100        113        91        100        109        94        100        106
105         117        90        104        117        96        105        113        98        105        111
110         120        94        107        120       100        109        118       103        110        116
115         122        97        111        124       105        114        122       108        114        121
120         125       101        114        127       109        118        127       113        119        125
125         127       104        118        131       114        123        131       117        124        130
130         130       108        121        134       118        127        136       122        129        135
135         132       111        125        138       123        132        140       127        133        140
140         135       115        128        141       127        136        145       132        138        144
145         137       118        132        145       132        141        149       136        143        149
The patient was seen 18 months later. Self-report was unchanged, with the family informant reiterating similar symptoms of memory failures as before, and also reporting increasing social withdrawal. Reassessment with the same test of anterograde
memory function resulted in a score of 95. On the basis of test research data,
practice effects were judged to be negligible over an 18-month retest interval.
To evaluate the consistency of the memory test score at the second assessment
with the performance at the first assessment, a 95% prediction interval using
the standard error of prediction (see section “Prediction Interval for the True
Score Based on the Standard Error of Prediction”) was constructed with the 18-month retest reliability estimate of 0.9 available in the test manual. The prediction interval was centered on the predicted true score at Time 1 of 114, and ranged from 101 to 126 (see Table 5.2). The clinician noted that the Time
2 observed score fell below the lower bound of the prediction interval esti-
mated from the Time 1 assessment, that is, the observed score was outside the
range of predicted scores. The clinician concluded that anterograde memory
function had most likely deteriorated, with 95% confidence. In this chapter,
the methods of interval construction for clinical inference-making will be
explained.
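The prediction interval at reassessment can be reproduced in the same way. The sketch below assumes the usual formula in which the standard error of prediction equals the standard deviation times the square root of (1 minus the squared retest reliability); with the scenario's values it reproduces the 101 to 126 interval and flags the retest score of 95 as falling outside it.

```python
import math

mean, sd = 100, 15
time1_observed = 115
time2_observed = 95
retest_reliability = 0.90
z = 1.96

# Predicted true score based on the Time 1 observation.
predicted_true = mean + retest_reliability * (time1_observed - mean)   # 113.5, reported as 114

# Standard error of prediction for an obtained score on retest.
se_prediction = sd * math.sqrt(1 - retest_reliability ** 2)            # approximately 6.54

lower = predicted_true - z * se_prediction
upper = predicted_true + z * se_prediction
print(round(lower), round(upper))      # 101 126
print(time2_observed < lower)          # True: the retest score falls below the prediction interval
```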
Table 5.2 Predicted true scores with 95% prediction intervals (95% PI) for different observed test scores from a range of hypothetical tests with a population mean of 100 and a standard deviation of 15, illustrated for alternative reliability values and corresponding standard error of prediction (SE prediction) values. The prediction intervals are centered on the predicted true score (bolded) and calculated with the standard error of prediction. The upper and lower limits of the prediction intervals are shown for a range of Intelligence Index scale scores from 55 (z = –3) to 145 (z = +3). Predicted true scores and prediction limits are rounded to the nearest whole number.
[Figure 5.1: Score on Test A (left panel) and Score on Test B (right panel) at Time 1 and Time 2.]
2 as two other people, one person who scored 12 at Time 1 and one person who
scored 1 on the test at Time 1. In this example, the person who scores highest on
the test at Time 1 obtains the same score at Time 2 as a person who obtained the
lowest score at Time 1, indicative of the marked lack of consistency in a score
with a retest reliability of .52.
A low reliability correlation tells us that, in general, test scores are likely to show little consistency from one test occasion to another, both in absolute terms and in standing relative to other people in the population. Tests are used to
generalize about the assessed psychological constructs beyond the immediate
assessment occasion. As Test B in Figure 5.1 shows, a test with low reliability limits
our ability to make precise statements about a person’s likely test score and, by
inference, the assessed psychological construct, from one occasion to another.
In Figure 5.1, reliability was estimated using a Pearson correlation, so it is
important to remember that this assumes a linear relationship between scores,
and when calculating the sample correlation, appropriate random sampling
from the population of interest. For a detailed discussion of the assumptions
underlying use of Pearson correlation, and ways to verify the correct use, see
Gravetter and Wallnau (2013).
Sometimes clinicians assert that their clinical expertise or judgement allows them to overcome the limitations of tests or other clinically relevant information that display poor reliability. This objection is logically indefensible. Unreliable
tests provide poor-quality information, and if interpreted at face-value, they
lead to poor-quality clinical inferences (Nunnally & Bernstein, 1994; Schmidt
& Hunter, 1996). As will be shown later, unreliable test scores, interpreted at
face value, sometimes lead clinicians into logical traps and simple, avoidable
errors in clinical judgement. For example, a clinician who does not interpret test scores or other information in terms of psychometric principles may feel the
need to provide a clinical explanation for changes in scores from one occasion to
another, when this variability may reflect nothing more than the poor reliability
of the test. Similarly, extremely high or low scores derived from unreliable tests
may invite specific explanation. Instead, if the observed scores are interpreted
in terms of the appropriate psychometric principles (see section “The Predicted
True Score”) the perceived “extremeness” of one score relative to another may
disappear, along with the need to provide ad hoc interpretations (Einhorn, 1986;
Faust, 1996; Strauss, 2001; Strauss & Fritsch, 2004).
For further discussion of retest reliability, see Anastasi and Urbina (1997). Test authors bear a responsibil-
ity to ensure that clinicians have access to the appropriate estimates of reliability
corresponding to the ways in which the test is to be used. Conversely, clinicians
should be wary if the only estimate of reliability available for a test is an internal
consistency coefficient, which may overestimate more clinically relevant retest
reliability estimates (Kaufman & Lichtenberger, 2005). As will be shown later, if
no estimate of reliability is available for a test, then clinicians should carefully
consider whether clinical application of a test is premature.
This discussion reflects the “classical” psychometric approach to the consid-
eration of reliability. Advances in psychometrics have provided more detailed
and comprehensive approaches to understanding the reliability of tests scores,
which complement and elaborate the classical approach (Brennan, 2006).
Generalizability theory provides methods to examine alternative sources of
variance that may contribute to reliability in a comprehensive study design
(Shavelson & Webb, 1991). For example, generalizability theory enables exami-
nation of sources of variance due to items, people, alternative examiners, and
multiple assessment occasions in one study. Yet another approach to estimating
reliability is available with item-response theory (IRT), which has been widely
implemented, for example, in educational testing and adaptive clinical testing
(Brennan, 2006; www.nihpromis.org).
In the examples in Figure 5.1, it is assumed there was no practice effect from
Time 1 to Time 2. The presence of practice effects (or other sources of bias) in
retest scores will reduce the consistency in scores, if consistency is expected in
terms of absolute value as well as relative rank of test scores. While some test
publishers provide information about expected practice effects, a retest correla-
tion may not provide the best measure of reliability in terms of absolute value
or absolute agreement. This problem is illustrated in Figure 5.2, using the same
set of scores as were used for Test A in Figure 5.1. However in Figure 5.2, a con-
stant of 10 points has been added to every score at Time 2, to show that the
Pearson reliability correlation stays the same at .93. Adding a constant to one set
of scores that is used to calculate a correlation does not change the correlation
(Gravetter & Wallnau, 2013). In this scenario, the correlation remains high, but
the absolute agreement is poor.
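This invariance is easy to verify numerically; the scores below are simulated purely for illustration, and any retest data with a constant practice effect added would behave the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
time1 = rng.normal(100, 10, size=30)              # simulated Time 1 scores
time2 = time1 + rng.normal(0, 3, size=30)         # Time 2: same ranks, small noise
time2_shifted = time2 + 10                        # constant 10-point practice effect

r_original = np.corrcoef(time1, time2)[0, 1]
r_shifted = np.corrcoef(time1, time2_shifted)[0, 1]
print(round(r_original, 3), round(r_shifted, 3))        # identical Pearson correlations
print(round(float(np.mean(time2_shifted - time1)), 1))  # mean shift of roughly 10 points
```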
In contrast, intra-class correlation coefficients (ICCs) can provide estimates
of reliability, depending on whether the focus is on consistency (relative rank
and position) or absolute agreement (absolute value). Intra-class correlations are
another method for evaluating the reliability of test scores, and, like general-
izability theory, they are based on a variance decomposition approach (just as
analysis of variance decomposes sources of variance due to explanatory vari-
ables and their interactions: Brennan, 2006). Methods for estimating ICCs are
available in most statistical software, and ICCs should be used carefully, because
misreporting is common. For example, the default ICC reported may include
a consistency estimate that assumes the “score” to be interpreted is the sum of
scores from every test occasion included in the analysis, which might be two or
more occasions. However, if the test is to be interpreted on the basis of a single re-
administration, then the simple two-occasion ICC should be reported. The latter
may be lower than the multiple test-occasion reliability estimate. In addition,
ICCs are available for estimates of consistency, in terms described previously as
the correlation between scores on two or more occasions, or as agreement esti-
mates, where the absolute level of scores is also taken into account, and changes
due, say, to practice effects, may lower agreement across test occasions. Again,
the appropriate estimate should be reported.
In the example in Figure 5.2, the ICC for consistency (assuming persons and
test occasions are both random samples of all possible people and test occa-
sions) is equal to .92—high, like the reliability correlation calculated for Test
A in Figure 5.1. However, the ICC for absolute agreement is .58, lower than the
consistency estimate, because the ICC for agreement estimate takes into account
the change in absolute value as well.
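A sketch of the two single-measure, two-way random-effects intra-class correlations (the consistency and absolute-agreement forms), computed from their ANOVA mean squares, is given below. The retest scores are simulated with a 10-point practice effect, so the exact values will differ from the .92 and .58 quoted for Figure 5.2, but the same pattern appears: consistency stays high while agreement drops.

```python
import numpy as np

def icc_consistency_agreement(scores):
    """scores: n_subjects x k_occasions array.
    Returns single-measure, two-way random-effects ICCs for consistency
    and for absolute agreement, computed from the ANOVA mean squares."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                  # between-subjects mean square
    msc = ss_cols / (k - 1)                  # between-occasions mean square
    mse = ss_error / ((n - 1) * (k - 1))     # residual mean square
    icc_c = (msr - mse) / (msr + (k - 1) * mse)
    icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return icc_c, icc_a

rng = np.random.default_rng(1)
t1 = rng.normal(100, 10, size=40)
t2 = t1 + rng.normal(0, 3, size=40) + 10     # retest with a 10-point practice effect
icc_c, icc_a = icc_consistency_agreement(np.column_stack([t1, t2]))
print(round(icc_c, 2), round(icc_a, 2))      # consistency high, agreement noticeably lower
```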
As the preceding examples show, different methods for estimating reliability
convey different kinds of information. Clinicians need to ensure that they use
reliability estimates that are appropriate to the clinical question at hand. Test
publishers should ensure that reliability estimates relevant to the intended use of
the test are reported in test manuals and that the appropriate interpretation of
alternative reliability estimates is communicated effectively to test users.
a CI for the true test score. This concept is analogous to the estimation of a CI
for a population mean (Cumming & Finch, 2005). In the same vein, a CI around
a person’s test score allows estimation of the precision of the estimate of the true
score. Calculating CIs also facilitates testing hypotheses using observed scores,
as illustrated in the clinical scenario at the beginning of this chapter.
In the familiar research scenario of calculating a CI for a population mean, the
center for the CI is the observed sample mean. However, a CI for an individual
true test score is centered, not on the observed test score, but on a value known as
the predicted true score (PTS: Nunnally & Bernstein, 1994). The PTS takes into
account the statistical trend for an observed score obtained on any one occasion
to shift toward the population mean if observed on another occasion. The more
extreme the deviation of the observed score from the population mean, above
or below, the more the adjustment of the observed score toward the population
mean. Formally, the PTS takes into account the statistical phenomenon known
as “regression to the mean.” This regression toward the mean is an important
property of test scores that is well known, for example, in single-case studies and
in personnel selection (Barlow, Nock, & Hersen, 2009; Crawford & Garthwaite,
2007; Hausknecht, Halpert, Di Paolo, & Moriarty Gerrard, 2007). A well-known
source of artifactual “recovery” in single-case studies and in cohort studies with-
out a control group arises from selection of patients on the basis of extremely low
scores at pre-test. Improvement at post-test may be a function of an intervention
but may also be due to regression to the mean. Selection on the basis of predicted
true scores at pre-test, not observed scores, reduces the risk of this misinterpreta-
tion (Hausknecht et al., 2007). Calculation of the PTS is shown as Formula 5.1
(see Dudek, 1979; Nunnally & Bernstein, 1994): PTS = M + rxx × (X − M), where X is
the observed score, M is the population mean, and rxx is the reliability coefficient of
the test score.
Several examples of observed scores and PTS for alternative reliability values are
shown in Table 5.3, on an Intelligence Index scale with a mean of 100. Whether
the observed score is above or below the population mean, the PTS is always
closer to the population mean, and the relative difference between the observed
score and the PTS in relation to the population mean is directly proportional to
the reliability coefficient for the score. Note that the calculation of the PTS does
not require knowledge of the population standard deviation. As can be seen in
Table 5.3, the PTS for an observed score of 70 is much closer to the population
mean if the test score reliability is 0.5 (PTS = 85) than if the test score reliability
is 0.9 (PTS = 73). So, to reiterate, the PTS provides the appropriate center for
the CI for an individual true test score. Note that as a consequence of centering
on the PTS, the CI is not symmetrical around the observed score, but instead is
symmetrical around the respective PTS (Nunnally & Bernstein, 1994). It can be
seen from Table 5.3 that the PTS has an important property. If a test has high
reliability, the PTS will be weighted towards the observed score. If, however, the
test has low reliability, the PTS will be weighted toward the population mean.
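This weighting property can be made explicit by rewriting Formula 5.1 as a reliability-weighted average of the observed score and the population mean; a one-line sketch reproduces the two Table 5.3 values quoted above.

```python
def predicted_true_score(x, mean=100.0, rxx=0.9):
    # Algebraically equivalent form of Formula 5.1:
    # a reliability-weighted average of the observed score and the population mean
    return rxx * x + (1.0 - rxx) * mean

print(predicted_true_score(70, rxx=0.5))  # 85.0, pulled halfway back to the mean
print(predicted_true_score(70, rxx=0.9))  # 73.0, weighted toward the observed score
```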
Table 5.4 Standard error of measurement (SEmeasurement) for selected values of the reliability coefficient (rxx), for a test with a population standard deviation of 15
rxx SEmeasurement
0.0 15.0
0.2 13.4
0.4 11.6
0.6 9.5
0.8 6.7
1.0 0.0
test in the population. This latter circumstance implies that when the reliabil-
ity is zero, the variability in the scores of any one individual is the same as the
variability among individuals in the population. Table 5.4 shows values for the
standard error of measurement for selected values of the reliability coefficient
between 0 and 1 for a test with a standard deviation of 15. As can be seen, when
the reliability of the test score is 0.2, then the SEmeasurement is 13.4, almost as large
as the population standard deviation of 15. Even when the reliability is 0.8, the
value of SEmeasurement is 6.7, almost half the magnitude of the population standard
deviation.
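The Table 5.4 values follow directly from the classical formula for the standard error of measurement, SEmeasurement = SD × √(1 − rxx); a quick check:

```python
import math

def se_measurement(sd, rxx):
    # Classical standard error of measurement: SD * sqrt(1 - reliability)
    return sd * math.sqrt(1.0 - rxx)

for rxx in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(rxx, round(se_measurement(15.0, rxx), 1))
# prints 15.0, 13.4, 11.6, 9.5, 6.7 and 0.0 -- the Table 5.4 values
```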
Another way to illustrate the effects of the reliability coefficient on the size of
the standard error of measurement is shown by the function relating the reliabil-
ity coefficient to the standard error of measurement, where the latter is expressed
as a proportion of the population standard deviation. The curve generated by
this function is shown in the left panel in Figure 5.3 where the x-axis represents
the reliability of the test and the y-axis represents the standard error of measure-
ment (SEmeasurement) as a proportion of the population standard deviation on the
test. As can be seen in Figure 5.3, as the reliability increases toward a value of 1.0
(to the right on the x-axis), the value of the standard error of measurement as a
proportion of the population standard deviation decreases toward zero (toward
the lowest point of the y-axis). In contrast, as the reliability coefficient tends
toward zero, the standard error of measurement as a proportion of the popula-
tion standard deviation tends toward 1.0. For example, it can be seen in Figure
5.3 that if the reliability coefficient is 0.8 (on the x-axis), then the standard error
of measurement as a proportion of the population standard deviation on the test
is 0.45 (on the y-axis), approximately. If the population standard deviation were
15, then the standard error of measurement for that test would be 15 × 0.45 = 6.7.
This is the same value as in the example calculation in Table 5.4.
Note also the shape of the curve for the SEmeasurement in Figure 5.3. The SEmeasurement
does not become a small proportion of the population standard deviation until
the reliability coefficient is close to one. The shape of this function is the reason
why psychometricians suggest that the reliability coefficient for a test score used
for individual assessment should be high, preferably in excess of 0.9 (Nunnally &
Bernstein, 1994). With lower reliabilities, the value of the standard error of mea-
surement is large and too close to the value of population standard deviation, so
the precision in the estimation of a person’s true score on the test is inexact.
Most textbooks on psychological assessment describe only the SEmeasurement, as
we have defined it here, without indicating that there is a family of standard
errors for individual testing, and the best choice of standard error depends on
the clinical assessment question (Dudek, 1979). While choice of standard error
makes little difference if the reliability coefficient for a test is very high, choice
of standard error has a big impact with lower reliabilities, in the range of reli-
abilities reported for many published tests (see Franzen, 2000). The SEmeasurement
(Formula 5.2) described previously, and reported in many textbooks on assess-
ment, is, technically, the standard deviation of the distribution of observed
scores estimated from a fixed (known) true score (Nunnally & Bernstein, 1994).
There are two alternative standard errors that are most useful in clinical assess-
ment and are described in the next sections. The advantage of these alternatives
is that they make better use of the relationship between the observed score and
the true score.
score is estimated from the observed score. Combined with the formula for the
standard error of estimation, we can estimate the average true score from an
observed score and provide a CI around this estimate.
The function relating the reliability of the test to the standard error of estimation
as a proportion of the population standard deviation is shown in the centre panel
in Figure 5.3. Immediately, the reader will notice that this function has a peculiar
property, namely, that the standard error of estimation is smaller, as a proportion
of the population standard deviation, for both higher and lower reliabilities. We
can understand this apparent paradox by noting that the standard error of esti-
mation reflects the error in estimating the true score. When the reliability of the
test is low, the estimate of the true score (PTS) will be heavily weighted towards
the population mean and there will be relatively less error involved than if the
estimate was weighted towards the observed score.
In what context should we use the standard error of estimation? Consider a
clinician who is undertaking a single assessment of a given patient and is wish-
ing to test a hypothesis about that patient’s true score. A CI for the true score is
best calculated with the standard error of estimation (Dudek, 1979; Nunnally &
Bernstein, 1994). This kind of assessment situation will arise when a clinician
wishes to consider the patient’s true score in relation to a known criterion such
as the second percentile in the case of a possible diagnosis of learning difficulty
(intellectual disability), or in relation to a cutting score for identification of sub-
optimal effort on a performance validity test, to cite just two examples.
A 95% confidence interval for the true score, allowing comparison to clinical or other
criteria, is provided by the formula PTS ± 1.96 × SEestimation. More generally,
PTS ± z × SEestimation
Worked example:
For an intelligence test score with a population mean of 100, standard deviation of 15,
and reliability of 0.8, a patient obtains a score of 46. The clinician wishes to determine
whether this patient’s true score falls below the commonly used criterion for learning
difficulty or intellectual disability of 70.
From Formula 5.1 in this chapter, the predicted true score for this patient on this
occasion is 57.
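The rest of the worked example can be sketched as follows, assuming the usual formula SEestimation = SD × √[rxx(1 − rxx)]; the resulting interval lies wholly below the criterion of 70.

```python
import math

mean, sd, rxx, observed, z = 100.0, 15.0, 0.8, 46.0, 1.96

pts = mean + rxx * (observed - mean)               # Formula 5.1: about 57
se_estimation = sd * math.sqrt(rxx * (1.0 - rxx))  # 6.0 for reliability 0.8
lower, upper = pts - z * se_estimation, pts + z * se_estimation
print(round(pts), round(lower), round(upper))      # about 57, 45, 69 -- entirely below 70
```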
104. Notably, in this instance of extremely low test reliability, the observed score
is not included in the 95% CI. This should not be surprising since, if the reliabil-
ity is very poor, the PTS will be heavily weighted towards the population mean.
If instead the reliability of the test score is 0.5, then the CI based on the same
observed score of 55 covers a range of over 29 Index points, bounds from 63–92
and centered on a PTS of 78. Finally, if the reliability of the test score is .95, then
the CI based on an observed score of 55 covers a range of 13 Index points, cen-
tered on a PTS of 57 (see Table 5.1). The varying CI widths reflect varying levels
of precision in the estimate of the true score.
These issues can be illustrated with information in Table 5.1. Suppose the
assessment question is to identify a person with a learning difficulty or intel-
lectual disability, which usually requires an intelligence scale score less than 70.
Suppose, for this example, an observed score of 60 is obtained on a test with a
reliability of 0.5. It can be seen from the second row of Table 5.1 that the lower
and upper limits of the 95% CI are 65 and 95, respectively, centered on a PTS of
80. While the clinician may note that the CI is consistent with a true score below
70, she can also note that the CI is consistent with many true scores much higher
than 70. The width of the CI highlights the relatively poor precision here.
In contrast, suppose the reliability of the test was .95, again with an observed
score of 60. It can be seen in the second row of Table 5.1 that the lower and upper
bounds of the 95% CI are 56 and 68, respectively, centered on a PTS of 62. In this
case, the clinician should infer, with 95% confidence, that the patient’s true score
was clearly below the criterion score of 70. The relatively narrow CI supports this
interpretation.
Obviously, as is illustrated in these examples, test scores with lower reliability
will be associated with CIs that may be inconveniently wide. Then classifica-
tion decisions combining cut-points and the respective upper and lower limits
of the CI are likely to show that clinical interpretation of the test score has low
precision.
REFERENCES
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (2014). Standards for Educational
and Psychological Testing. Washington, DC: American Educational Research
Association.
Anastasi, A., & Urbina, S. (1997). Psychological Testing (7th ed.). Upper Saddle River,
NJ: Prentice Hall.
Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Single Case Experimental
Designs: Strategies for Studying Behavior Change (3rd ed.). Boston, MA: Pearson.
Bates, E., Appelbaum, M., Salcedo, J., Saygin, A. P., & Pizzamiglio, L. (2003).
Quantifying dissociations in neuropsychological research. Journal of Clinical and
Experimental Neuropsychology, 25, 1128–1153.
Brennan, R. L. (Ed.). (2006). Educational Measurement (4th ed.). Westport, CT: Praeger
Publishers.
Chapman, J. P., & Chapman, L. J. (1983). Reliability and the discrimination of normal
and pathological groups. Journal of Nervous and Mental Disease, 171, 658–661.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applica-
tions. Journal of Applied Psychology, 78, 98–104.
Crawford, J. R., & Garthwaite, P. H. (2007). Using regression equations built from
summary data in the neuropsychological assessment of the individual case.
Neuropsychology, 21, 611–620.
Crawford, J. R., Garthwaite, P. H., & Gray, C. D. (2003). Wanted: Fully operational
definitions of dissociations in single-case studies. Cortex, 39, 357–370.
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to
read pictures of data. American Psychologist, 60, 170–180.
Dudek, F. J. (1979). The continuing misinterpretation of the standard error of measure-
ment. Psychological Bulletin, 86, 335–337.
Dunn, J. C., & Kirsner, K. (2003). What can we infer from double dissociations? Cortex,
39, 1–7.
Einhorn, H. J. (1986). Accepting error to make less error. Journal of Personality
Assessment, 50, 387–395.
Faust, D. (1996). Learning and maintaining rules for decreasing judgment accuracy.
Journal of Personality Assessment, 50, 585–600.
Franzen, M. (2000). Reliability and Validity in Neuropsychological Assessment (2nd
ed.). New York: Kluwer Academic/Plenum Publishers.
Gravetter, F. J., & Wallnau, L. B. (2013). Statistics for the Behavioral Sciences (9th ed.).
Belmont, CA: Wadsworth, Cengage Learning.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007).
Retesting in selection: A meta-analysis of coaching and practice effects for tests of
cognitive ability. Journal of Applied Psychology, 92, 373–385.
Kaufman, A. S., & Lichtenberger, E. O. (2005). Assessing Adolescent and Adult
Intelligence. Hoboken, NJ: John Wiley & Sons.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York:
McGraw-Hill.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability Theory. Newbury Park, CA:
Sage Publications.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research:
Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s
alpha. Psychometrika, 74, 107–120.
Spearman, C. (1904). “General intelligence,” objectively determined and measured.
The American Journal of Psychology, 15, 201–292.
Strauss, M. E. (2001). Demonstrating specific cognitive deficits: A psychometric per-
spective. Journal of Abnormal Psychology, 110, 6–14.
Strauss, M. E., & Fritsch, T. (2004). Factor structure of the CERAD neuropsychological
battery. Journal of the International Neuropsychological Society, 10, 559–565.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and
methodology. Annual Review of Clinical Psychology, 5, 1–25.
Van Orden, G. C., Pennington, B. F., & Stone, G. O. (2001). What do double dissocia-
tions prove? Cognitive Science, 25, 111–172.
Wechsler, D. (2009). Wechsler Memory Scale—Fourth Edition. San Antonio, TX:
Pearson.
Wittchen, H. U., Höfler, M., Gloster, A. T., Craske, M. G., & Beesdo, K. (2011). Options
and dilemmas of dimensional measures for DSM-5: Which types of measures
fare best in predicting course and outcome. In D. Regier, W. Narrow, E. Kuhl, &
D. Kupfer (Eds.), The Conceptual Evolution of DSM-5 (pp. 119–146). Washington,
DC: American Psychiatric Publishing.
Anton D. Hinton-Bayre and Karleigh J. Kwapil
ratios. The chapter concludes with a discourse on the limitations of RCI and a
summary of key practice points.
HISTORY OF RELIABLE CHANGE
Reliable change as a concept was first introduced by Jacobson, Follette, and
Revenstorf (1984), and modified after a correction suggested by Christensen
and Mendoza (1986). Reliable change techniques replaced earlier rudimentary
approaches such as the “20% change score” that arbitrarily suggested that a
change of 20% or more from a baseline score could be considered significant,
and the “standard deviation method,” which suggested that a retest score greater
than one standard deviation from a baseline score should be considered signifi-
cant. These approaches have since been, or at least should be, abandoned. They
typically fail to consider important psychometric principles such as practice
effects, measurement error, or test–retest reliability (Collie et al., 2004).
Jacobson and Truax (1991), in their landmark paper, laid the foundations for
the concept of reliable change. RCIs were originally developed to demonstrate
the effect of psychotherapy on measures of marital satisfaction. The Jacobson
and Truax model was proposed for use in situations when there was no practice
effect, and it only used baseline variance of test performance as the error esti-
mate. In the Jacobson and Truax model (RCIJT), a statistically significant change
in scores for an individual was considered to have occurred when a change in
test scores from baseline (X) to retest (Y) exceeded the change expected due to
measurement error or “unreliability” of the test. (A worked example of the RCIJT
is presented in the following section.) Since the RCIJT was introduced, a litany of modifications has
been proposed. A detailed discussion on various RCI models was presented by
Hinton-Bayre (2010).
Although many RCI models exist, they can be reduced to the following
expression:
RCI = (Y − Y′) / SE
    = (Actual Retest Score − Predicted Retest Score) / Standard Error
Table 6.1 Baseline and Retest Raw Data for Case A on Wechsler Indices
The standard error of the difference is the amount of error expected for individual difference scores.
It is dependent on the amount of variability in difference scores and the test–
retest reliability of the measure. Jacobson and Truax initially determined this
value, in the absence of test–retest normative data, according to the following
expression:
SEdiff = √[2SX² × (1 − rXY)]
It should be recognized that the 2SX² (or SX² + SX²) component of the expression
represented the “pooling” or addition of baseline and retest variances, assum-
ing they were equivalent (SX² = SY²). But this assumption of equal variances is
not always met, a circumstance that will be considered later in this chapter.
When retest data are available, a better estimate of the denominator can be used
(Abramson, 2000). Thus, with baseline and retest variability estimates, the fol-
lowing expression can be used:
SEdiff = √[(SX² + SY²) × (1 − rXY)]
The values of variability (SX and SY ) and retest reliability (rXY ) are taken from
appropriate test–retest data. Consider Table 6.2, which shows published test–
retest data for the Wechsler Adult Intelligence Scale–IV (WAIS-IV; Wechsler,
2008) and Wechsler Memory Scale–IV (WMS-IV; Wechsler, 2009) for a sample
of healthy individuals.
For Full Scale IQ (FSIQ), the pooled standard error, or SEdiff, is:
SEdiff = √[(SX² + SY²) × (1 − rXY)]
      = √[(13.8² + 15.0²) × (1 − 0.95)]
      = 4.56
Table 6.2 Published test–retest statistics for WAIS-IV and WMS-IV composite scores in a healthy sample
Scale MX MY SX SY rXY
FSIQ 99.7 104.0 13.8 15.0 0.95
PSI 100.2 104.6 13.5 14.9 0.84
WMI 99.5 102.6 14.0 14.7 0.87
PRI 100.4 104.3 14.3 14.3 0.85
VCI 99.3 101.8 14.4 15.0 0.95
AMI 100.1 111.6 14.1 14.4 0.81
VMI 100.0 112.1 14.8 16.6 0.80
VWMI 99.5 103.8 14.4 15.6 0.82
IMI 99.9 112.3 14.9 15.6 0.81
DMI 100.4 114.1 15.0 15.0 0.79
RCIJT = (Y − Y′) / SE
      = (Y − X) / SEdiff
      = (96 − 104) / 4.56
      = −1.75
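This calculation is easily reproduced; a minimal sketch using the FSIQ statistics from Table 6.2 and the Case A scores (baseline 104, retest 96) follows. Carrying full precision gives −1.76 rather than −1.75, the difference being due only to rounding of SEdiff.

```python
import math

def se_diff(sx, sy, rxy):
    # Pooled standard error of the difference from baseline and retest SDs
    return math.sqrt((sx ** 2 + sy ** 2) * (1.0 - rxy))

def rci_jt(baseline, retest, sx, sy, rxy):
    # Jacobson-Truax style RCI with the pooled error term used in the text
    return (retest - baseline) / se_diff(sx, sy, rxy)

print(round(se_diff(13.8, 15.0, 0.95), 2))           # 4.56
print(round(rci_jt(104, 96, 13.8, 15.0, 0.95), 2))   # -1.76 (-1.75 with rounded SEdiff)
```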
Practice effects can persist despite the use of alternate forms (Beglinger et al., 2005). While the content may
vary from one form to the next, the learning process of how to complete the
task is difficult to remove, with studies indicating that those with higher IQs
are more likely to benefit from practice (Calamia et al., 2012; Rapport, Brines,
Axelrod, & Theisen, 1997). Some authors have advocated prolonging the interval
between assessments. However, this is not always possible or practical. If a clini-
cal trial expects to see a change early, then brief test–retest intervals are required.
If methodological strategies fail to reduce practice, then statistical methods are
always available. Therefore, statistical control of practice effects in RCI models
has received considerable attention (Hinton-Bayre, 2010).
Test–retest control data can be derived from published norms or from pub-
lished or newly acquired control samples. The practice effect (PE) is indexed as
a difference between the baseline control mean (M X ) and retest control mean
(MY ), or PE = (MY – M X). This value will be the same whether the mean dif-
ference score for individual controls (Y – X) is used, or the difference between
means is calculated. For example, for the normative retest data on FSIQ (see
Table 6.2), the difference between means at baseline and retest (MY – M X ) was a
4.3-IQ-point increase. A repeated measures t-test on the FSIQ retest norms data
confirms that this change was statistically significant, t (297) = 15.8, p < 0.01.
Failure to consider, and in some way correct for, practice effects, can bias subse-
quent interpretation of reliable change. Chelune and colleagues (1993) modified
the RCIJT to incorporate a correction for mean practice. It is worth noting that
mean practice effects are exceedingly common in published retest data (Strauss
et al., 2006). When the repeated measures t tests are performed on the published
test–retest normative data, all five composite measures of the WAIS-IV and all
five composite measures of the WMS-IV showed evidence of a significant prac-
tice effect (p < .05). Correction for practice effects should be mandatory for any
performance-based RCI methodology employed.
RCI = (Y − Y′) / SE
    = [Y − (X + PE)] / SEdiff
    = [96 − (104 + 4.3)] / 4.56
    = [96 − 108.3] / 4.56
    = −2.69
One can clearly see that the expected score for an individual on retesting went
from 104 to 108.3, and thus a larger RCI value was obtained under the Chelune
model. An RCI score of –2.69 would be significant with either 90% or 95% con-
fidence intervals, when cut scores are ±1.645 and ±1.96, respectively. Remember
that, as the sample size of n = 298 is large enough to approximate a population from
a statistical perspective, a Z distribution may be employed. It should be appreciated
that failure to correct for a practice effect has the potential of underestimating any
change score obtained over test sessions. To clarify, the error term used is identical
to that used by RCIJT, noting that in the original description by Chelune, SE was
based on baseline variability only (SX). However, a pooled estimate using baseline
(SX) and retest (SY) variability is again preferred, SEdiff = (S X
2
)
+ SY 2 ∗ (1 − rXY )
(see Iverson et al., 2003). It must be noted that, in using the RCI Chelune, the same
adjustment for practice effect is made for all individuals. The same standard
error is also applied to all individuals, assuming that baseline and retest variances
are equal. In other words, a uniform practice effect and standard error are
applied to all individuals under the RCI Chelune. The mean practice approach can
be contrasted to the regression-based models, which make further adjustments to
both the prediction of retest scores (Y′) and the standard error (SE).
In this way, scores falling further from the control mean at baseline have a greater
correction for regression to the mean when estimating retest scores. If the base-
line score for the individual (X) falls below the control baseline mean (M X), a
change greater than the control mean practice effect must be observed to reach
statistical significance. In contrast, if the value of X falls above the control mean,
then a change less than the control mean practice effect must be exceeded to
reach significance. For example, applying the McSweeny RCI method to the case
example using FSIQ, the predicted retest score (Y′) can be obtained using the
least squares regression formula, Y′ = bX + a. The slope of the line (b) can be cal-
culated using FSIQ control test–retest statistics (see Table 6.2), such that:
b = rXY ∗ (SY / SX )
= 0.95 ∗ (15 / 13.8 )
= 1.03
The Y intercept of the regression line (a) can then be calculated as follows:
a = MY − bM X
= 104 − (1.03 ∗ 99.7 )
= 1.05
Once b and a have been calculated, the predicted retest score (Y′) can be
calculated:
Y ’ = bX + a
= (1.03 ∗ 104 ) + 1.05
= 108.2
The reader will note that the predicted FSIQ retest score (Y’) for McSweeny’s
RCI (Y’ = 108.2) is close to, but not the same as, Chelune’s RCI method
(Y’ = 108.3). This is because the test–retest reliability is excellent, rXY = 0.95.
Lesser reliability usually leads to greater discrepancy between RCI Chelune and
RCI McSweeny (Hinton-Bayre, 2010). To continue, once Y’ for RCI McSweeny is
known, the generic RCI expression can be once again employed, RCI = (Y – Y’)/
SE. When using RCI McSweeny, the compatible standard error (SE) term is the
standard error of estimate (SEE). The SEE is effectively the standard deviation of
the regression residuals or error in the control retest data. The value is generated
automatically in any regression analysis software, and is available in the afore-
mentioned spreadsheet, but can be calculated using the expression:
SEE = SY × √(1 − rXY²)
    = 15 × √(1 − 0.95²)
    = 4.68
Thus, to continue the case example from Table 6.1 for FSIQ, RCI McSweeny:
RCI = (Y − Y′) / SEE
    = [Y − (bX + a)] / SEE
    = (96 − 108.2) / 4.68
    = −2.61
Again, irrespective of the confidence interval or cut score used, 90% or 95%,
the case example FSIQ RCI will be significant. The reader will notice that the
score obtained under RCI McSweeny (–2.61) is less than that obtained under
RCI Chelune (–2.69). The reason for this discrepancy is regrettably quite com-
plicated, but nonetheless predictable, and will be touched on in a later section. It
should be appreciated that RCI McSweeny also adjusts for mean practice effect
as is done by RCI Chelune. However, the simplest way of conceptualizing the
difference between the two models is to see that the degree of practice effect
is adjusted depending on the individual’s relative position at baseline and test–
retest reliability.
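The Chelune and McSweeny calculations above can be reproduced side by side in a short sketch using the same FSIQ statistics and Case A scores; small differences in the second decimal place relative to the hand calculations reflect rounding of intermediate values.

```python
import math

# FSIQ test-retest statistics (Table 6.2) and Case A scores (baseline 104, retest 96)
mx, my, sx, sy, rxy = 99.7, 104.0, 13.8, 15.0, 0.95
x, y = 104.0, 96.0

# RCI Chelune: uniform mean-practice correction with the pooled SEdiff
se_diff = math.sqrt((sx ** 2 + sy ** 2) * (1.0 - rxy))
rci_chelune = (y - (x + (my - mx))) / se_diff
print(round(rci_chelune, 2))          # about -2.70 (-2.69 above)

# RCI McSweeny: regression-based prediction of the retest score with SEE as error
b = rxy * (sy / sx)                   # slope
a = my - b * mx                       # intercept
y_pred = b * x + a                    # about 108.4 at full precision (108.2 above)
see = sy * math.sqrt(1.0 - rxy ** 2)  # about 4.68
rci_mcsweeny = (y - y_pred) / see
print(round(rci_mcsweeny, 2))         # about -2.66 (-2.61 above)
```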
Differential practice (unequal baseline and retest variances) can be tested with the
following t statistic, evaluated on n − 2 degrees of freedom:
t = [(SX² − SY²) × √(n − 2)] / [2 × SX × SY × √(1 − rXY²)]
  = [(13.8² − 15.0²) × √(298 − 2)] / [2 × 13.8 × 15.0 × √(1 − 0.95²)]
  = −4.60
Using a regular t distribution, this result is statistically significant (p < .001, two-
tailed) and suggests a significant increase in the retest variability when com-
pared with baseline variability. In the analysis of WAIS-IV, FSIQ, Processing
Speed Index (PSI), Working Memory Index (WMI), and Verbal Comprehension
Index (VCI) all demonstrated differential practice with significant increases in
retest variability (p < .01, two-tailed). On WMS-IV, only Visual Memory Index
(VMI) and Visual Working Memory Index (VWMI) demonstrated differen-
tial practice, again with retest variability exceeding baseline variability (see
Table 6.2). There remains a paucity of consideration of the underlying causes
of differential practice and its implications for clinical-outcome interpretation.
Nonetheless, the RCI McSweeny and its derivatives all make a mathematical cor-
rection to the predicted score based on mean practice, differential practice, and
test–retest unreliability. Several authors have jointly presented outcomes using
both RCI Chelune and RCI McSweeny, yet the understanding of actual differ-
ences between the models is yet to be fully explicated (see Hinton-Bayre, 2016).
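The differential-practice t statistic shown earlier can be computed directly from the Table 6.2 summary statistics; a minimal sketch, assuming the correlated-variances form of the test reconstructed above:

```python
import math

def correlated_variances_t(sx, sy, rxy, n):
    # t test for equality of baseline and retest variances in paired data
    return ((sx ** 2 - sy ** 2) * math.sqrt(n - 2)) / (
        2.0 * sx * sy * math.sqrt(1.0 - rxy ** 2))

print(round(correlated_variances_t(13.8, 15.0, 0.95, 298), 2))  # about -4.60 for FSIQ
```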
MULTIPLE REGRESSION RCI
In the setting of RCI, particularly with regression-based models, rather than
lamenting the lack of test–retest reliability, one can make efforts to determine
whether other factors significantly contribute to the prediction of retest scores.
The extension from simple linear regression to multiple linear regression per-
mits the use of more than just baseline scores to predict retest scores. To this
end, Temkin and colleagues (1999) described a multiple regression RCI model.
Although baseline performance was invariably the best predictor of retest per-
formance, some measures also had statistically significant contributions from
other factors like age and the duration of the test–retest interval. In effect, any
extra significant contribution to prediction under the multiple regression model
is equivalent to improving the test–retest reliability through a greater explana-
tion of retest variance. For example, if the baseline score correlates with retest
scores rXY = 0.70, the coefficient of determination is r² = 0.70² = 0.49. This means
that baseline scores only account for 49% of retest score variability. If rXY = 0.90,
then r² = 0.81, and baseline scores predict 81% of retest score variability in the
control group. If age, independent of baseline scores, correlates rXY = 0.20 with
retest scores, then age would predict 4% of retest variance. If test–retest inter-
val independently correlated rXY = 0.35 with retest scores (r² = 0.123), then the
retest interval accounts for 12.3% of retest score variance. Unique variance attributable
to independent variables can be directly summed, such that the baseline score (r²
= 0.90² = 0.81), age (r² = 0.20² = 0.04), and retest interval (r² = 0.35² = 0.123) can be
combined such that 0.81 + 0.04 + 0.123 = 0.973, or 97.3% of retest variance, might
be explained. If the percentage of variance explained were equal to R² = 0.973,
then the effective reliability coefficient would be r = √0.973 = 0.986, or nearly
perfect reliability. Thus, whenever possible, a multiple regression approach to
predicting retest scores is encouraged. Indeed, this is the methodology employed
in the Advanced Solutions package attached to the Wechsler scales (see Holdnack,
Drozdick, Weiss, & Iverson, 2013).
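The arithmetic in this illustration, which assumes the three predictors contribute non-overlapping variance, is simply:

```python
import math

r2_baseline = 0.90 ** 2           # 0.81
r2_age = 0.20 ** 2                # 0.04
r2_interval = 0.35 ** 2           # about 0.123
r2_total = r2_baseline + r2_age + r2_interval   # about 0.97, assuming no shared variance
effective_reliability = math.sqrt(r2_total)     # about 0.986
print(round(r2_total, 2), round(effective_reliability, 3))
```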
The individualized standard error of prediction (SEP) used in the RCI Crawford model
is given by:
SEP = SEE × √[1 + (1/n) + (X − MX)² / (SX² × (n − 1))]
where the squared deviation of the individual's baseline score from the control baseline
mean appears in the numerator of the final term and the baseline variance appears in
the denominator. The reader should also note the SEP will always exceed the SEE.
The degree to which the SEP exceeds SEE will increase with a smaller sample size
(n), and when the individual baseline score (X) differs more from the baseline
control group mean (MX ). Thus, the SEP will be different for individuals, unless
they have the same baseline score. To continue with the recurring case example
using the Wechsler FSIQ (see Table 6.1):
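The worked numbers for this case can be sketched under the same assumptions used above (FSIQ statistics from Table 6.2; baseline 104, retest 96; n = 298):

```python
import math

mx, my, sx, sy, rxy, n = 99.7, 104.0, 13.8, 15.0, 0.95, 298
x, y = 104.0, 96.0

b = rxy * (sy / sx)
a = my - b * mx
y_pred = b * x + a                       # about 108.4
see = sy * math.sqrt(1.0 - rxy ** 2)     # about 4.68

# Individualized error term: grows with 1/n and with the case's distance from MX
sep = see * math.sqrt(1.0 + 1.0 / n + (x - mx) ** 2 / (sx ** 2 * (n - 1)))
rci_crawford = (y - y_pred) / sep
print(round(sep, 2), round(rci_crawford, 2))   # about 4.69 and -2.65
```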
Thus, the RCI Crawford accounts for mean practice effect, differences between
baseline and retest variance, measurement error due to test–retest unreliability,
and extremeness of the individual case at baseline, by constructing both indi-
vidualized predicted retest scores (Y′) and standard error scores (SEP). As such,
it may be considered the most complete RCI model.
CHOOSING THE RCI MODEL
At present, there is no clear consensus on which is the preferred model, and this
can present a challenge for clinicians in deciding which approach to use when
assessing for change. There is no universally sensitive or conservative reliable
change model, and as will be discussed, the classification bias will vary depend-
ing on the individual case and the nature of the control test–retest parameters.
To recapitulate, all popular reliable change models share a fundamental
structure. The individual’s predicted retest score can be subtracted from their
actual retest score and then divided by a standard error. This will yield a stan-
dardized score, which may be interpreted via a standard Z distribution or t
distribution. With knowledge of test–retest means (M X and MY ), test–retest
standard deviations (SX and SY ), and a test–retest reliability coefficient (rXY ),
essentially any RCI model can be implemented—except the multiple regres-
sion RCI (Hinton-Bayre, 2010). Reliable change models vary in how they derive
predicted retest scores (Y’) and standard error (SE) values. It has also been
shown that knowledge of an inferential statistic, such as t or F, can also be
manipulated to derive a test–retest reliability coefficient and thus expand the
selection of RCI models available (Hinton-Bayre, 2011). This finding means
that the interested researcher or clinician can choose which RCI model they
wish to implement, rather than being constrained by the model provided by
the test manual or reference study.
On review of current techniques, the multiple regression RCI with the indi-
vidualized error term seems the most complete model and is probably preferred.
However, it is important to note that, just because the RCI Crawford model is
the most sophisticated, it is not necessarily the most sensitive. It has been shown
that RCI models are not equivalent (Hinton-Bayre, 2012). Preliminary work fur-
ther suggests an individual’s relative position at baseline compared to controls
(ZX ), and the discrepancy between baseline (SX ) and retest (SY ) variability, can
dramatically influence the magnitude and even direction (positive or negative
change) of RCI scores (Hinton-Bayre, 2016).
COMPARISON OF RCI MODELS
To highlight how RCI models can be influenced by individual score values, two
case examples, B and C, will be provided. In Case B, the individual baseline level
of performance is less than the control mean, in Case C, it is greater than the
control mean. The relative position of an individual to the baseline control mean
can be expressed as a Z score [ZX = (X – M X )/SX], with positive values reflecting
above-average performance and negative values, below-average performance.
This approach presumes that a greater score is associated with better perfor-
mance, and the interpretation can be reversed if the scale is reversed.
Thus, Case B was just over one standard deviation below the control group mean
VMI score at baseline.
RCI Chelune—Mean Practice model:
VMI: Y′ = X + (MY − MX) = 85 + 12 = 97
Using a 90% confidence interval, all three RCI models produced scores exceed-
ing the ±1.645 cut score and would be interpreted as a significant deterioration on VMI.
It is pertinent to note that the RCI Chelune produced a less negative score
when the individual started below the control baseline mean, and retest vari-
ability was significantly greater than baseline variability. Clearly, the choice of
RCI model would not have affected the ultimate interpretation, as the result sug-
gested significant deterioration with any of the presented RCI models. However,
the reader can appreciate the potential for the discrepant results depending on
the RCI model chosen, which will be considered in the next example.
Case C was just over one standard deviation above the control-group mean
VMI score at baseline.
RCI Chelune—Mean Practice model:
Again, using a 90% confidence interval, the RCI Chelune for Case C was iden-
tical to that of Case B. This would be expected, given that a uniform practice
effect and error term is utilized. However, despite the same raw-score difference
(X – Y), a reversal in the relative position to a baseline score above the control
mean subsequently revealed a non-significant result for the regression based
models. It must be recognized that the choice of RCI model can readily affect the
interpretation of the statistical significance of individual change. The examples of
cases B and C should not be taken to suggest that when baseline scores are above
the mean, RCI Chelune will be more sensitive (giving the most negative score), or
that when baseline scores are below the mean, RCI McSweeny will be more sensi-
tive. In fact, this pattern would be reversed if all factors remained equal and the
direction of differential practice were changed. To elaborate: on VMI, the retest
variance exceeded baseline variance (SX < SY ), as occurred with many Wechsler
composite scales. However, this will not always be the case, as many subtests
show differential practice in the opposite direction, with retest variance being
less than the baseline variance (SX > SY ). Under such circumstances, the most
sensitive RCI will also be reversed. Such concepts are complicated and take some
digesting. The interested reader is directed to Hinton-Bayre (2016) for a brief
introduction to this phenomenon. Further elaboration of this finding is in prog-
ress. For the present time, the user of any RCI model needs to appreciate that not
all RCIs are equivalent. They cannot be used interchangeably, and they should
not be averaged thinking this provides a more stable or valid estimate of change.
The complexity of the situation is compounded when one realizes that selecting
an RCI model based on presumed sensitivity will potentially lead to interpreting
different RCI models depending on where the individual fell at baseline and the
differential practice seen on the measures of interest. It is far more tenable, as
well as more conceptually and practically justifiable, to select one RCI model for
all interpretation.
It has already been demonstrated that all RCI models are influenced by one or
more of four factors: (1) mean practice effect (M X – MY ), (2) differential practice
effect (SX – SY ), (3) test–retest reliability (rXY ), and (4) relative position of the
individual compared to the mean of controls at baseline (Z X). Knowledge of
these parameters enables the reader to derive whichever RCI model they wish.
Practically, the discrepancy between RCI models can be minimized through
taking several steps. First, the individual case at hand should match the control
group as closely as possible. It has been suggested that the individual’s baseline
score (X) fall within one standard deviation of the control group baseline mean
(M X ) (Hinton-Bayre, 2005). However, as was seen in comparison of cases B and
C, the interpretation can still be altered under these circumstances. Second,
only use measures with adequate test–retest reliability in that context. Desirable
reliability for measures used to determine whether a change is significant is often
quoted as rXY > 0.90 (see Chapter 5 in the current vol-
ume). It is important to note that even the most well-standardized cognitive
performance test batteries, WAIS-IV and WMS-IV, have the majority of com-
posite measures failing to reach this benchmark, despite the relatively short
retest intervals. Longer retest intervals on these measures have been associated
with reduced mean practice effects along with reduced reliability (Calamia,
Markon, & Tranel, 2013). Recall that reliability depends on how a test is used
and is not an intrinsic property of the test. Moreover, if there are significant
predictors of retest performance over and above baseline performance, a greater
explained variance is equivalent to greater test–retest reliability and can be
incorporated into a multiple regression–based RCI. Third, the extent to which
normative baseline and retest standard deviations approach equality will influ-
ence the extent to which alternative RCI models will also converge. While no
one model is more sensitive to change, there regrettably remains no consensus
or guideline on which RCI model is preferable. However, the multiple regres-
sion RCI model accounts statistically for all of the factors known to affect indi-
vidual change interpretation. In our opinion, the multiple regression RCI model
with the individualized error term, or the multiple regression version of RCI
Crawford, is preferred. Bear in mind that not all test–retest data will have mul-
tivariate predictors of retest performance, in which case the regular univariate
RCI Crawford should be considered.
Given the complexity of the preceding discussion, a series of summary points
is provided:
• An increase in raw scores from baseline to retest will not indicate significant improvement if it
does not exceed the practice effect seen in the control or normative group. For
this reason, RCI analyses should always be interpreted in the context of the mean
difference seen in the comparison group.
between two linearly related continuous variables. Effect sizes should be consid-
ered mandatory, as statistical significance alone does not attest to the importance of
the result (Schulz et al., 2010). Reporting of effect size is easy to reconcile when
one recalls that, with a sufficiently large sample size, essentially any difference or
association can be made statistically significant. As RCI standardizes the differ-
ence score for an individual after a designated effect, the average of such scores
may be considered an effect size for change in repeated-measures clinical out-
come studies.
To continue the cardiac surgery and memory cohort study example, the aver-
age RCI score of the 25 surgery patients can be calculated for each of the five
WMS-IV indices (see Table 6.3). The DMI and VWMI appeared to be most
affected by surgery, demonstrating the most negative scores. If the average RCI
score in the surgical group approximates zero, this indicates that on that index
they do not differ from the average RCI change score seen in the control group.
Thus, the surgical group RCI mean serves as a point-estimate for a treatment
effect. Interval-estimates can also be calculated using a standard error of the
mean with subsequent confidence intervals. The standard deviation (SD) of the
clinical group RCI scores is calculated in the usual manner, with the standard
error of the mean (SEM) being calculated as follows: SEM = SD / √n. When SEM
is multiplied by ±1.96, it will yield a 95% confidence limit, assuming an underly-
ing normal distribution. A multiplication factor of 1.645 provides a 90% confi-
dence limit. The mean RCI, plus and minus the relevant limit, yields a confidence
interval for the effect size. The use of RCI to yield an effect size is a novel applica-
tion and requires further validation. It does, however, provide the potential to
report a generalizable standardized index for comparison across measures and
studies of clinical outcome on continuous variables where individual change is
the ultimate focus.
Figure 6.1 Receiver Operating Characteristic (ROC) curve for Wechsler Memory Scale-
IV (WMS-IV) Reliable Change Index (RCI) composite scores in cardiac surgery and
control patients. Straight diagonal line represents the chance classification. The heavy
line represents the actual ROC curve, with the lighter dotted lines representing 95%
confidence intervals.
Such estimates are only as stable or reliable as the sample and its correspond-
ing size. A classic method to summarize the usefulness of a single or composite
measure to classify individuals is the Youden Index, or the maximal difference
between true-positive and false-positive rates (also known as the Youden J statis-
tic = Sensitivity + Specificity –1). Scores range from 0 to 1, with scores close to
zero suggesting that the cut score is not able to classify better than chance, and a
score of one suggesting a perfect classification, that is, no false positives or false
negatives. For the clinical dataset, a Youden Index J = 0.44 was found, with an
associated criterion of RCI <–0.61 (Sensitivity 52%, Specificity 92%).
Alternatively, the utility of a measure can be expressed in terms of likelihood
ratios (LRs). LRs typically assist in the determination of post-test probabilities of
the condition of interest (e.g., cardiac surgery), and can be positive or negative.
Like Sensitivity and Specificity estimates, the LR will change as the criterion
changes. A Positive LR is the ratio of the proportion of patients who have the
target condition (surgery) and test positive (memory decline) to the proportion
of those without the target condition (controls) who also test positive (memory
decline). In the current example, with an example criterion of <–0.4, the positive
LR was 3.27. In other words, using the WMS-IV RCI composite cut score <–0.4,
those having undergone surgery were 3.27 times more likely to be labelled “dete-
riorated” than controls. A Negative LR is the ratio of the proportion of patients
who have the target condition (surgery) who test negative (memory stable) to the
proportion of those without target condition (controls) who also test negative.
Again, using the WMS-IV RCI composite score <–0.4 cut score, the negative
LR was 0.46. In other words, those who underwent surgery were 0.46 times as likely,
or about half as likely as controls, to show stable memory function.
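Both indices follow directly from sensitivity and specificity. A quick sketch using the values quoted above for the RCI < –0.61 criterion is shown below; note that the likelihood ratios computed here apply to that criterion, not to the < –0.4 criterion used in the preceding paragraph.

```python
def youden_j(sensitivity, specificity):
    # Maximal separation of true-positive and false-positive rates
    return sensitivity + specificity - 1.0

def positive_lr(sensitivity, specificity):
    # Ratio of "deteriorated" classifications in cases versus controls
    return sensitivity / (1.0 - specificity)

def negative_lr(sensitivity, specificity):
    # Ratio of "stable" classifications in cases versus controls
    return (1.0 - sensitivity) / specificity

sens, spec = 0.52, 0.92                      # values reported for the -0.61 criterion
print(round(youden_j(sens, spec), 2))        # 0.44
print(round(positive_lr(sens, spec), 1))     # 6.5
print(round(negative_lr(sens, spec), 2))     # 0.52
```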
ROC curves can also be used to make comparisons between measures. For
example, a composite index may be significantly more accurate than the most
accurate single measure (DMI). Such analyses can also be useful for “value added”
studies, where the new measure should demonstrate a significant improvement
in classification over the existing standard.
truly interpreted as a difference between baseline and retest scores, once relevant
systematic and error variances have been considered. In essence, RCI provides a
result like a repeated measures t-test, but for the individual rather than a group.
Clinically meaningful change is a different, but related, concept that has histori-
cally been difficult to define. In their original description of RCI, Jacobson and
Truax (1991) considered not just the significance of change, but also the transi-
tion from the “clinical” to the “control” group. This “meaningful change” index
was operationalized as moving over the midpoint between the two distributions.
Clinically important change has also been operationalized as the difference
score associated with subjective or clinician judgement of change, which heavily
depends on the outcome measure of interest, the accuracy of the patient’s subjec-
tive judgements, or the expertise of the clinician-judge.
There remains no consensus on what constitutes clinically meaningful change.
However, it should be considered that a clinically meaningful change cannot be
viewed as reliable unless it exceeds the corresponding statistical change score,
or RCI. In other words, the RCI provides the minimal change score required
to be potentially meaningful. A significant RCI score is thus not equivalent to
clinically meaningful change, but a precursor to it. It is also worth considering
that the concept of minimally important difference or change is actually syn-
onymous with a reliable change index that looks at the simple difference between
baseline and retest scores, not accounting for mean practice or regression to the
mean, but pooling baseline and retest variances in the estimate of error variance.
Third, there is no consensus for which RCI is most valid, and it has been demon-
strated that no one model is more sensitive than another (Hinton-Bayre, 2012).
In fact, sensitivity of RCI models is determined by how the particular model
accounts (or does not) for the relative position of the clinical case to controls at
baseline (viz., above or below the mean), the presence of a mean practice effect
or differential practice effect, and also the test–retest reliability. Essentially, it is
possible that a different RCI model will be more or less sensitive across different
individuals and different tests. It seems theoretically untenable to consider an ad
hoc approach to RCI model-selection.
The slowness of clinical outcome research to adopt RCI models is perhaps in
part due to the lack of consensus, which in turn is probably due to the lack of
understanding of how RCI models systematically differ. This latter topic is of
considerable further research interest and would help guide further recommen-
dations for RCI usage. Fourth, the RCI methodology was originally designed for
a pre–post scenario where individuals started at an abnormal level (Jacobson &
Truax, 1991). Using a control group that was also starting at an abnormal level,
the RCI was used to determine whether an individual significantly improved
following a therapeutic intervention. A beneficial effect can be demonstrated
by a significant increase in retest score, using non-treated individuals as a con-
trol group. In contrast, individuals can start at “normal” level prior to a nega-
tive intervening event such as surgery or traumatic brain injury, and then be
reassessed to determine whether there has been a statistically significant dete-
rioration in functioning. The similarity of these two methodologies is that the
individual or clinical group and corresponding control group start at the same
level, either both “abnormal,” or both “normal.” Thus, it should be recognized
that the use of RCI methodology when the clinical individual or group starts at
a different level to the control group has not been well validated. An example
of this would be examining recovery in those suffering from a traumatic brain
injury where baseline data does not exist. When retesting to examine for recov-
ery from an abnormal starting point, the use of an RCI based on an uninjured
control group with retest normative data is not well validated. An option wor-
thy of further consideration in this setting is to use an estimate of premorbid
functioning as the baseline score. The potential validity of this approach will
depend on how accurate the premorbid estimate of functioning is.
or process (e.g., TBI, dementia). The effect size automatically corrects for
systematic and random variability as afforded by the RCI model.
• If the outcome is correct patient selection or diagnostic accuracy, and
change is kept on a continuous scale, ROC curves can be employed. ROC
curves importantly take into account aspects of sensitivity and specificity
and allow direct examination of the tradeoff. Cut scores can be derived
for maximizing sensitivity or specificity or overall group separation, for
example, with Youden’s J index or likelihood ratios.
• If the outcome is the influence of a treatment on reliable change measured
as a dichotomy (changed or not), then NNT or NNH can be used. NNT
and NNH values can also be expressed as point and interval estimate
statistics, for example, as a point-estimate, an NNT = 8 suggests on average
8 individuals would need to be treated in order for one significant reliable
change to occur. Interval-estimates can also be derived, for example, with
a 95% confidence level, between 6.5 to 9.5 individuals would need to be
treated for significant reliable change to occur.
• If the outcome is measured before and after an uncontrolled event, then
RR or OR can be presented. Although RR is preferable, it will usually give
a lower estimate of influence. OR should be reserved for instances where
RR cannot be calculated, or the OR is the outcome produced through
logistic regression.
REFERENCES
Abramson, I. S. (2000). Reliable Change formula query: A statistician’s comments.
Journal of the International Neuropsychological Society, 6, 365.
Beglinger, L. J., Gaydos, B., Tangphao-Daniels, O., Duff, K., Kareken, D. A., Crawford,
J., … Siemers, E. R. (2005). Practice effects and the use of alternate forms in serial
neuropsychological testing. Archives of Clinical Neuropsychology, 20, 517–529.
Calamia, M., Markon, K., & Tranel, D. (2012). Scoring higher the second time
around: Meta-analyses of practice effects in neuropsychological assessment. The
Clinical Neuropsychologist, 26, 543–570.
Calamia, M., Markon, K., & Tranel, D. (2013). The robust reliability of neuropsy-
chological measures: Meta-analyses of test–retest correlations. The Clinical
Neuropsychologist, 27, 1077–1105.
Chelune, G. J., Naugle, R. I., Luders, H., Sedlak, J., & Awad, I. A. (1993). Individual
change after epilepsy surgery: Practice effects and base-rate information.
Neuropsychology, 7, 41–52.
Christensen, L., & Mendoza, J. L. (1986). A method of assessing change in a single sub-
ject: An alteration of the RC index. Behavior Therapy, 17, 305–308.
Cohen, R. J., & Swerdlik, M. (2009). Psychological Testing and Assessment (7th ed.).
Boston, MA: McGraw-Hill.
Collie, A., Maruff, P., Makdissi, M., McStephen, M., Darby, D. G., & McCrory, P. (2004).
Statistical procedures for determining the extent of cognitive change following
concussion. British Journal of Sports Medicine, 38, 273–278.
Crawford, J. R., & Howell, D. C. (1998). Regression equations in clinical neuropsychol-
ogy: An evaluation of statistical methods for comparing predicted and obtained
scores. Journal of Clinical and Experimental Neuropsychology, 20, 755–762.
Crawford, J. R., & Garthwaite, P. H. (2006). Comparing patients’ predicted test scores
from a regression equation with their obtained scores: A significance test and point
estimate of abnormality with accompanying confidence limits. Neuropsychology,
20, 259–271.
Duff, K. (2012). Evidence-based indicators of neuropsychological change in the individ-
ual patient: Relevant concepts and methods. Archives of Clinical Neuropsychology,
27, 248–261.
Heilbronner, R. L., Sweet, J. J., Attix, D. K., Krull, K. R., Henry, G. K., & Hart, R. P.
(2010). Official position of the American Academy of Clinical Neuropsychology
on serial neuropsychological assessment: The utility and challenges of repeat test
administrations in clinical and forensic contexts. The Clinical Neuropsychologist,
24, 1267–1278.
Hinton-Bayre, A. D., Geffen, G. M., Geffen, L. B., McFarland, K., & Friis, P. (1999).
Concussion in contact sports: Reliable change indices of impairment and recovery.
Journal of Clinical and Experimental Neuropsychology, 21, 70–86.
Hinton-Bayre, A. D. (2005). Methodology is more important than statistics when
determining reliable change. Journal of the International Neuropsychological
Society, 11, 788–789.
Hinton-Bayre, A. D. (2010). Deriving reliable change statistics from test–retest norma-
tive data: Comparison of models and mathematical expressions. Archives of Clinical
Neuropsychology, 25, 244–256.
Hinton-Bayre, A. D. (2011). Calculating the test–retest reliability coefficient from
normative retest data for determining reliable change. Archives of Clinical
Neuropsychology, 26, 76–77.
Hinton-Bayre, A. D. (2012). Choice of reliable change model can alter decisions
regarding neuropsychological impairment after sports-related concussion. Clinical
Journal of Sports Medicine, 22, 105–108.
Hinton-Bayre, A. D. (2016). Detecting impairment post-concussion using Reliable
Change indices. Clinical Journal of Sports Medicine, 26, e6–e7.
Holdnack, J. A., Drozdick, L. W., Weiss, L. G., & Iverson, G. L. (2013). WAIS-IV,
WMS-IV, and ACS: Advanced Clinical Interpretation. Waltham, MA: Academic
Press.
Howell, D. C. (2010). Statistical Methods for Psychology (8th ed.). Belmont,
CA: Wadsworth, Cengage Learning.
Iverson, G. L., Lovell, M. R., & Collins, M. W. (2003). Interpreting change on ImPACT
following sport concussion. The Clinical Neuropsychologist, 17, 460–467.
Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome
research: Methods for reporting variability and evaluating clinical significance.
Behavior Therapy, 15, 336–352.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to
defining meaningful change in psychotherapy research. Journal of Consulting and
Clinical Psychology, 59, 12–19.
Kneebone, A. C., Andrew, M. J., Baker, R. A., & Knight, J. L. (1998). Neuropsychologic
changes after coronary artery bypass grafting: Use of reliable change indices.
Annals of Thoracic Surgery, 65, 1320–1325.
Maassen, G. H. (2004). The standard error in the Jacobson and Truax reliable change
index: The classical approach to the assessment of reliable change. Journal of the
International Neuropsychological Society, 10, 888–893.
Maassen, G. H. (2005). Reliable change assessment in the sport concussion
research: A comment on the proposal and reviews of Collie et al. British Journal of
Sports Medicine, 39, 483–488.
Martin, R., Sawrie, S., Gilliam, F., Mackey, M., Faught, E., Knowlton, R., &
Kuzniecky, R. (2002). Determining reliable cognitive change after epilepsy sur-
gery: Development of reliable change indices and standardized regression-based
change norms for the WMS-III and WAIS-III. Epilepsia, 43, 1551–1558.
McSweeny, A. J., Naugle, R. I., Chelune, G. J., & Luders, H. (1993). “T scores for
change”: An illustration of a regression approach to depicting change in clinical
neuropsychology. The Clinical Neuropsychologist, 7, 300–312.
Mitrushina, M., Boone, K. B., Razani, J., & D’Elia, L. F. (2005). Handbook of
Normative Data for Neuropsychological Assessment (2nd ed.). New York: Oxford
University Press.
Rapport, L. J., Brines, D. B., Axelrod, B. N., & Theisen, M. E. (1997). Full scale IQ as
mediator of practice effects: The rich get richer. The Clinical Neuropsychologist, 11,
375–380.
Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated
guidelines for reporting parallel group randomized trials. Annals of Internal
Medicine, 152, 1–7.
Straus, S. E., Glasziou, P., Richardson, W. S., & Haynes, R. B. (2010). Evidence-Based
Medicine: How to Practice and Teach It (4th ed.). Edinburgh: Churchill Livingston.
Strauss, E., Sherman, E. M. S., & Spreen, O. (2006). A Compendium of Neuropsychological
Tests: Administration, Norms, and Commentary (3rd ed.). New York: Oxford University Press.
Streiner, D. L., & Cairney, J. (2007). What’s under the ROC? An introduction to receiver
operating characteristic curves. Canadian Journal of Psychiatry, 52, 121–128.
Suissa, D., Brassard, P., Smiechowski, B., & Suissa, S. (2012). Number needed to
treat is incorrect without proper time-related considerations. Journal of Clinical
Epidemiology, 65, 42–46.
Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant
change in neuropsychological test performance: A comparison of four models.
Journal of the International Neuropsychological Society, 5, 357–369.
Viera, A. J. (2008). Odds ratios and risk ratios: What’s the difference and why does it
matter? Southern Medical Journal, 101, 730–734.
von Elm, E., Egger, M., Altman, D. G., Pocock, S. J., Gotzsche, P. C., & Vandenbroucke,
J. P. (2007). Strengthening the reporting of observational studies in epidemiol-
ogy (STROBE) statement: Guidelines for reporting observational studies. British
Medical Journal, 335, 806–808.
Wechsler, D. (2008). Wechsler Adult Intelligence Scale–Fourth Edition. San Antonio,
TX: Pearson.
Wechsler, D. (2009). Wechsler Memory Scale–Fourth Edition. San Antonio, TX: Pearson.
GORDON J. CHELUNE
While not new, this problem has become acute within current value-
driven, evidence-
based health care systems where outcomes accountability
and the management of individual patients on the basis of epidemiological
HISTORICAL ANTECEDENTS
Although EBM is a relatively recent movement in health care, its historical ante-
cedents can be traced back to antiquity. In their excellent review of the history
and development of EBM, Claridge and Fabian (Claridge & Fabian, 2005) note
that perhaps the first report of a rudimentary controlled study appeared in the
biblical Book of Daniel (Holy Bible, 2011):
8 But Daniel resolved not to defile himself with the royal food and wine, and
he asked the chief official for permission not to defile himself this way….11
Daniel then said to the guard whom the chief official had appointed over
Daniel, Hananiah, Mishael and Azariah, 12 “Please test your servants for ten
days: Give us nothing but vegetables to eat and water to drink. 13 Then com-
pare our appearance with that of the young men who eat the royal food, and
treat your servants in accordance with what you see.” 14 So he agreed to this
and tested them for ten days. 15 At the end of the ten days they looked healthier
and better nourished than any of the young men who ate the royal food. 16 So
the guard took away their choice food and the wine they were to drink and
gave them vegetables instead. (Daniel 1:8–16)
Atlantic, David Sackett and his colleagues in Canada are credited with coining
the term “evidence-based medicine” (Evidence-Based Medicine Working Group,
1992) and providing the first succinct definition for its use in the field (Sackett
et al., 1996). Sackett and associates (Sackett, Straus, Richardson, Rosenberg, &
Haynes, 2000) have subsequently refined the definition of EBM further in their
book Evidence-Based Medicine: How to Practice and Teach EBM as simply “the
integration of best research evidence with clinical expertise and patient values”
(p. 1) with the goal of maximizing clinical outcomes and quality of life for the
patient. Note that the authors do not advocate the blind application of research
data to guide patient care, but give equal importance to the clinical acumen of
the practitioner and to the values of the individual patient. This tripartite com-
position of EBM is at the heart of virtually all descriptions of evidence-based
practice, and its core elements can be seen in the definition of evidence-based
practice adopted by the American Psychological Association (APA Presidential
Task Force on Evidence-Based Practice, 2006): “Clinical decisions [should] be
made in collaboration with the patient, based on the best clinically relevant
evidence, and with consideration for the probable costs, benefits, and available
resources and options” (p. 285).
Like all successful movements, evidence-based practices in health care did
not arise in a social vacuum. Financial considerations were also a major impetus
for the rapid adoption of evidence-based practices in the 1990s (Encyclopedia of
Mental Disorders, 2016). Until the middle of the twentieth century, in the United
States, health care services were provided primarily on a fee-for-service basis or
covered by private indemnity insurance plans, which generally paid for any medi-
cal service deemed necessary by a physician. There were few controls on overuse of
expensive and often unnecessary treatments and diagnostic procedures. In 1965,
the United States government amended the Social Security Act and introduced
Medicare (Title XVIII) and Medicaid (Title XIX) as national insurance policies
for the elderly and individuals who have inadequate income to pay for medical ser-
vices. It quickly became apparent that a significant portion of health care expen-
ditures in the United States was being wasted on redundant and often unproven
or ineffective tests and treatments (Horwitz, 1996), giving rise to an urgent need
for health-care reform. In 1973, the U.S. Congress passed the Health Maintenance
Organization (HMO) Act, and in 1978, Congress increased federal spending to
further develop HMOs and other forms of managed-care payment systems in an
effort to contain costs through largely administrative control over primary health
care services (Encyclopedia of Mental Disorders, 2016). Fortunately, one model of
managed care to emerge was “outcomes management,” a system that proposed
the use of epidemiological information about patient outcomes to assess and iden-
tify medical and surgical treatments that could be objectively demonstrated to
have a positive impact on a patient’s condition in a cost-effective manner (Segen,
2006). Outcomes management rather than administrative management called for
a more value-driven, evidence-based health care system in which procedures and
treatments were seen as having “value” if they could be objectively demonstrated
Table 7.1 Databases and their content
Medline/PubMed: General medical database; many journals not referenced
PsycINFO: General psychological literature, including book chapters
CINAHL: Nursing/Allied Health
Embase: Pharmacological and biomedical database, including international entries
BIOSIS: Biological and biomedical sciences; journal articles, conference proceedings, books, and patents
HSTAT: Health Sciences Technology and Assessment; clinical guidelines, Agency for Healthcare Research and Quality (AHRQ) and National Institutes of Health (NIH) publications
CCRCT: Cochrane Central Register of Controlled Trials
CDSR: Cochrane Database of Systematic Reviews
DARE: Database of Abstracts and Reviews of Effects; critically appraised systematic reviews
Campbell Collaboration: Systematic reviews in education, criminal justice, and social welfare
patient care and the fundamental principles and methods for providing that care
(Evidence-Based Medicine Working Group, 1992). They described the tradi-
tional paradigm of clinical practice as resting on four assumptions:
If one reflects for a moment, one can begin to question how much of our current
day-to-day neuropsychological practice is based on clinical lore and the uncriti-
cal adoption of new cognitive tests and treatment procedures on their face-value
alone (Dodrill, 1997).
The emerging paradigm that Sackett’s group espoused rested on new assump-
tions. First, while acknowledging the value and necessity of clinical experience,
the practice of EBM stresses the importance of making systematic observations
in a clear and reproducible manner to increase confidence about patient out-
comes, the value of diagnostic tests, and efficacy of treatments. Second, under-
standing of basic disease mechanisms is important but not sufficient for clinical
practice, as one must understand symptoms within the context of the individual
patient. Finally, understanding the rules of evidence is necessary to critically
appraise the methods and results of published clinical research. With these
assumptions, a new model of health care emerged in which the clinician’s profi-
ciency and expertise were called upon to integrate a patient’s preferences, values,
and circumstances with critically appraised external research evidence (Haynes,
Devereaux, & Guyatt, 2002).
There are several elements of this definition that are worthy of comment. First,
EBCNP is presented, not as a discrete action or body of knowledge, but as a
process—an ongoing “pattern” of routine clinical practice. This pattern of prac-
tice is “value-driven,” which is distinct from “cost-effective.” “Value-driven”
is used here to indicate that the practitioner’s goal is to provide a service that
uniquely enhances the clinical outcomes of patients in terms of the diagnosis,
management, care, and ultimately quality of life for the patients (Chelune, 2002,
2010), and hence warrants reimbursement. In the context of most patient evalu-
ations, the value of a neuropsychological test or the assessment as a whole can
be judged in terms of its ability to reduce diagnostic uncertainty (Costa, 1983).
As in other areas of evidence-based practice, clinical expertise is paramount.
The evidence-based neuropsychologist uses “best research” to guide his or her
clinical decision-making process. However, the idea of “best” research implies
that the clinician has the knowledge and expertise to first acquire relevant clini-
cal research and then to critically appraise the information for its validity and
applicability to the questions s/he has about the patient (Rosenberg & Donald,
1995). Often research findings are based on group data and group comparisons,
and again the evidence-based practitioner must have skills to transform data
derived from group comparisons into statements that can be directly applied to
specific patients. Thus, as shown in Figure 7.1, EBCNP is an integrative process
[Figure 7.1: the clinician’s expertise and best clinical research.]
that starts with the clinician’s expertise, which encompasses his or her clinical
knowledge and experience, as well as skills in asking answerable patient-focused
questions, acquiring relevant clinical research through informatics, critically
appraising this research, and then applying the information in a way that respects
the patient’s needs and values. When functioning as consultant, the evidence-
based neuropsychologist must also apply the information in ways that answer
the referral source’s “need to know” in order to best manage the patient’s care.
The process of EBCNP should be familiar to all of us, as it parallels that of
the scientific method: hypothesis formation, literature review, study design, and
data collection, analysis, and conclusions. Consider the example of a 59-year-
old college-educated woman who has been working in middle management at a
technology firm. She suffers an ischemic left middle cerebral artery stroke and
presents to the Emergency Room with symptoms of slurred speech, confusion,
and right hemiparesis. She is treated with tissue Plasminogen Activator (tPA)
within two hours of symptom onset and appears to make a good recovery over
the course of two months. This patient may be referred for neuropsychological
evaluation by her outpatient neurologist to determine if she has any significant
residual language or memory deficits that might benefit from speech therapy.
The same patient may be referred by the Employee Assistance Program at her
company to determine whether to let the patient come back to work or move her
from sick-leave to short-term disability. In both situations, the neuropsycholo-
gist needs to:
1. Assessing the patient is the starting point of every clinical evaluation, and
the basics of patient interviewing and deconstructing neuropsychological
referral questions are commonly taught (Lezak, Howieson, Bigler, & Tranel,
2012; Schoenberg & Scott, 2011) and are not unique to evidence-based
In more recent years, others (Heneghan & Badenoch, 2006) have extended the
PICO model of formulating answerable clinical questions to include Type of
Question and Type of Study (PICO-TT).
As seen in Table 7.2, clinical questions can fall into several categories, includ-
ing questions about etiology, diagnosis, prognosis, therapy, cost-effectiveness,
and quality of life. Knowing the type of question we want to ask tells us “what”
to look for, “where” to look, and “what” to expect. In our example, if we have a
patient presenting with possible FTD, and our question is primarily diagnostic,
we might use PubMed to look for diagnostic validity studies that provide infor-
mation on the sensitivity and specificity of fluency procedures in discriminating
FTD from other neurodegenerative conditions such as AD. On the other hand,
if we were interested in the efficacy of using cholinesterase inhibitors to treat
FTD, we might look for treatment studies in the Cochrane Report for random-
ized clinical trials that have attempted to use these medications to treat FTD
compared to AD.
The ability to search databases efficiently is a skill that develops through fre-
quent searches on a regular basis, and there are many shortcuts that one can
learn. Use of Boolean operators such as AND, OR, and NOT, and nesting terms
within parentheses can make searches more precise (APUS Librarians, 2015). If
one expands the PICO question to incorporate the Type of question and/or Type
of study, even more refined advanced searches are possible.
Those who do frequent searches will anticipate that our specific
search question may get different results depending on whether we use
“Frontotemporal” versus “Frontal Temporal” dementia or whether we use
Alzheimer’s “Disease” versus “Dementia.” To simplify our search, we could
use the Boolean operator “OR” to capture both “Frontotemporal” and “Frontal
Temporal” subjects, whereas we might simplify our search for subject groups
labeled Alzheimer’s “Disease” or “Dementia” by simply using “Alzheimer’s,”
without specifying disease or dementia. For example, in the case of our PICO
question where we are interested in diagnostic studies that use verbal fluency to
differentiate FTD and AD, one could put our various terms into the search bar of
a research database such as PsycINFO, using Advanced Search, designating our
subject (SU) fields as (frontal temporal OR frontotemporal) AND Alzheimer’s,
AND the keyword (KW) fluency, to generate a list of eight primary references.
Alternatively, we could go to the home page of PubMed (http://www.ncbi.nlm.nih.
gov/pubmed/) and use the PubMed tool “Clinical Queries” (in the center column
of options) to do a more focused search. Here we can put in our search terms,
and when we enter “Search” we are then prompted with options for Category
(Etiology, Diagnosis, Treatment, Prognosis, or Clinical prediction guidelines)
and Scope of search (Broad or Narrow), with the results segregated into Clinical
Studies and Systematic Reviews. In our example, we would select “Diagnostic”
and perhaps “Broad.” For our search statement, we are interested in both AD and
FTD as variably used in the literature, so nesting terms is helpful, with the search
entry being—“(Frontal temporal OR frontotemporal) AND Alzheimer’s AND
fluency, compare.” This search generated six primary clinical studies and one
meta-analysis, a very manageable number of papers to appraise in our next step.
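For readers who prefer to script such searches, the minimal sketch below submits a comparable nested Boolean query to the public NCBI E-utilities endpoint for PubMed; the query string mirrors the worked example above, but the counts it returns will differ from those reported in the text (which came from the interactive Clinical Queries tool) and will change as the literature grows.

```python
import requests  # third-party HTTP library

# Public NCBI E-utilities search endpoint for PubMed
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Nested Boolean query mirroring the PICO example in the text
query = '("frontal temporal" OR frontotemporal) AND alzheimer* AND fluency'

response = requests.get(
    ESEARCH_URL,
    params={"db": "pubmed", "term": query, "retmode": "json", "retmax": 20},
    timeout=30,
)
response.raise_for_status()
result = response.json()["esearchresult"]

print("Total hits:", result["count"])    # number of matching PubMed records
print("First PMIDs:", result["idlist"])  # up to 20 IDs to retrieve and appraise
```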
Appraise the Evidence
While systematically identifying potentially relevant clinical research is a nec-
essary step in evidence-based practice, it is not, in itself, sufficient. The studies
acquired by our systematic searches must still be appraised as to their merits and
limitations. The popular adage “you are what you eat,” which can be traced to the
nineteenth-century philosopher Brillat-Savarin (Gooch, 2013), stresses the rela-
tionship between a healthy diet and a healthy body. By analogy, one could argue
that our knowledge and ability to practice as neuropsychologists are likewise
dependent on the quality of the research we consume. While neuropsychology
has a robust research literature, many research studies fail to provide key details
that can inform the consumer about the quality and applicability of the inves-
tigational findings to individual clinical decision making. The evidence-based
neuropsychological practitioner must not only develop the skills to acquire rel-
evant clinical research, but also have the skills to critically appraise its validity,
reliability, and applicability once obtained.
There are many ways to consider the quality of evidence acquired by our sys-
tematic searches. As a starting point, one can use the levels of evidence pyramid,
which looks at different types of evidence in terms of their rigor, study design,
and generalizability. As seen in Figure 7.2, the lowest level of evidence consists of
editorials, expert opinions, and chapter reviews such as this one. Evidence at this
level is personally filtered by the author(s) and thus subject to the biases of the
writer. This may be a good starting point to gain an appreciation of a topic, but
the information may be dated and seldom should be the sole basis of evidence-
based clinical decisions. As we move up the evidence pyramid, we encounter
empirical studies that are unfiltered, that is, peer-reviewed studies that report
data from specific investigations. Case series and case reports are typically con-
sidered weak evidence, as they generally have no control groups for comparison,
or provide limited statistical analyses of the findings. Although weak by com-
parison to large group studies, case series and qualitative reports can nevertheless
be useful when the condition of interest is rare in the population, and case series
provide some degree of replication.
Case controlled studies provide a stronger level of evidence and are quite com-
mon in the neuropsychological research literature. These are retrospective stud-
ies in which the investigator compares one or more patient groups that already
have known conditions of interest or exposures, and looks for factors that dif-
ferentiate the groups. For example, an investigator may wish to determine if
cognitive impairment is related to severity of chronic kidney disease (CKD) by
comparing the neuropsychological performances of non–dialysis-dependent
patients (CKD Stages III and IV) with those on hemodialysis for end-stage renal
disease (Kurella, Chertow, Luan, & Yaffe, 2004). Groups are often large, and the
Figure 7.2 Levels of the Evidence Pyramid. EBM Pyramid and EBM Page Generator,
(c) 2006 Trustees of Dartmouth College and Yale University. All rights reserved.
Produced by Jan Glover, David Izzo, Karen Odato, and Lei Wang.
statistical analyses can be quite elegant. However, selection bias can be a sig-
nificant problem in case-controlled studies, and the evidence-based researcher
should look to see whether there are other studies that provide convergent findings. The
astute reader also needs to review the Participants section of the research reports
very carefully. Incomplete or inadequate reporting of subject-selection proce-
dures and research findings can hamper the assessment of the strengths and
weaknesses of studies, possibly leading to faulty conclusions. Take, for example,
one of our early epilepsy-surgery studies (Hermann et al., 1999), where we exam-
ined changes in visual confrontation naming among 217 patients with intractable left
temporal lobe seizures who underwent one of four surgical approaches (tailored
resections with intraoperative mapping, tailored resections with extraoperative
mapping, standard resections with sparing of the superior temporal gyrus, and
standard resections including excision of the superior temporal gyrus). Changes
in naming were standardized against a group of 90 patients with complex partial
seizures who were tested twice but who did not undergo surgery. The abstract
for the paper reported: “Results showed significant decline in visual confronta-
tion naming following left ATL, regardless of surgical technique. Across surgical
techniques, the risk for decline in visual confrontation naming was associated
with a later age of seizure onset and more extensive resections of lateral temporal
neocortex” (p. 3). At first glance, it would appear that any of the four surgical
approaches would be reasonable for a patient with intractable left temporal lobe
seizures. However, on closer examination of the Participants section, we find
that the patient selection was restricted to those without neocortical lesions and
who did not have evidence on MRI of lesions other than hippocampal atrophy.
Hence, the equipoise among the surgical approaches in terms of visual confron-
tation naming morbidity does not extend to patients whose seizures were related
to tumors, cortical dysplasias, hamartomas, or other structural lesions. Also,
because this was a retrospective case-controlled study in which patients were not
randomized to different surgical approaches, we do not know whether the four
surgical approaches were truly equivalent, or whether the apparent equivalency
was due to unsuspected factors such as the skill of the neurosurgeons to select
appropriate patients for a given approach.
Higher on the evidence pyramid are cohort studies, which involve a pro-
spective design. Here the investigator compares a group of individuals with a
known factor (condition, treatment, or exposure) with another group without
the factor to determine whether or not outcomes are different over time. A clas-
sic example of a cohort study would be one designed to study the long-term
deleterious effects of smoking by comparing differences in the incidence of lung
disease between a group of smokers and a comparable group of nonsmokers after
10 years. A neuropsychological example of a prospective cohort study is one
reported by Suchy and colleagues (Suchy, Kraybill, & Franchow, 2011) in which
50 community-dwelling older adults were assessed for reactions to “novelty”
and then followed for over 1.5 years using a comprehensive mental status exam.
Those who displayed a novelty effect at baseline were found to be four times
more likely (LR+ = 3.98) to show a reliable cognitive decline on the mental
status exam than those who did not have the novelty effect. Because cohort stud-
ies are essentially observational studies, they are not as robust as randomized
controlled studies since the groups under consideration may differ on variables
other than the one under investigation.
The most rigorous, unfiltered study design is that of a randomized controlled
trial (RCT). By using randomization and blinding, these studies are carefully
designed, planned experiments in which patients are randomly allocated either to
receive a treatment, intervention, or exposure (the intervention group) or not (the
no-intervention group) in order to study its clinical impact.
Differences in outcomes are measured between the two groups and subjected
to quantitative analysis to determine cause and effect. The recent Systolic Blood
Pressure Intervention Trial (SPRINT)—Memory and Cognition in Decreased
Hypertension (MIND) trial sponsored by the National Institutes of Health
(SPRINT Research Group, 2015) is an example of a two-arm, multicenter RCT
designed to test whether a treatment program aimed at reducing systolic blood
pressure to a lower goal (120 mm Hg) than the currently recommended target
(140 mm Hg) would reduce cardiovascular disease risk and decrease the rate of
incident dementia and cognitive decline in hypertensive individuals over the age
of 50. Follow-up of the 9,361 patients randomized in the trial was stopped early
after three and a half years because the risk of cardiovascular death and mor-
bidity was dramatically lower in the intensive treatment arm. The longitudinal
cognitive outcome data are still being collected.
At the top of the evidence pyramid, we find critically appraised topics, sys-
tematic reviews, and meta-analyses. These are filtered studies, that is, they are
selected by the author on the basis of clearly defined criteria. Critically appraised
topics (CATs) are brief summaries of systematically acquired evidence deemed
by the author to represent the best available research evidence on a specific clini-
cal topic of interest. CATs are patient-centered and designed to answer explicit
clinical questions that arise repeatedly in clinical practice (e.g., “Will my elderly
patient benefit from a computerized ‘brain-training’ program?”). Individual
CATs often have a limited scope and can become outdated as new evidence
emerges in the literature. However, for the busy clinician, CATs can be a rea-
sonable substitute for the more extensive gold standard—the systematic review.
For in-depth treatments of CATs, see the chapters by D. Berry (Chapter 11) and
J. Miller (Chapter 12) in this volume, and the paper by Bowden and associates
(Bowden, Harrison, & Loring, 2014).
Systematic reviews are more comprehensive and rigorous versions of CATs,
and also more labor-intensive to construct. Like CATs, systematic reviews are
focused on specific clinical topics and designed to answer well-defined ques-
tions. However, unlike CATs that may be based on as little as a single, critically
appraised research report, systematic reviews represent extensive literature
searches, often involving multiple databases such as those shown in Table 7.1.
They typically have an explicit and well-articulated search strategy with clear
inclusion and exclusion criteria, and after assessing the quality of the papers
located, the authors present their findings in a systematic fashion to address the
quality, many of these guidelines are accompanied by checklists that readers may
find useful to guide their appraisals of the articles while they are reading the
papers. For our purposes here, I will only mention three guidelines: STROBE,
CONSORT, and STARD.
STROBE stands for Strengthening the Reporting of Observational Studies in
Epidemiology. The STROBE guideline was developed through an international
collaboration of epidemiologists, methodologists, statisticians, researchers, and
journal editors who were actively involved in the conduct and publication of
observational studies (von Elm et al., 2007). The details of the STROBE guide-
line can be found on the website http://www.strobe-statement.org/, along with
several downloadable checklists for case-controlled, cohort, and cross-sectional
studies. While the checklists are intended for researchers who are preparing
manuscripts for publication, they are useful guides for readers as well. In fact,
the STROBE criteria have recently been reviewed by Loring and Bowden to show
how they are applicable to neuropsychological research, and how adherence to
these standards can promote better patient care and enhance the rigor of neuro-
psychology among the clinical neurosciences (Loring & Bowden, 2014).
The CONSORT (Consolidated Standards of Reporting Trials) statement of
2010 (Schulz, Altman, & Moher, 2010), like STROBE, is intended to alleviate some
of the problems that arise from inadequate reporting, but specifically in clinical
trials. CONSORT proposes a minimum set of recommendations for reporting
RCTs, with the goals of improving reporting, and enabling readers to under-
stand trial design, conduct, analysis, and interpretation, and to assess the valid-
ity of its results. The 2010 CONSORT statement contains a 25-item checklist and
flow diagram that are available for review and download at the website: http://
www.consort-statement.org/. Miller and colleagues have reviewed CONSORT
criteria and have discussed how the individual criteria could be implemented
in neuropsychological research paradigms (Miller, Schoenberg, & Bilder, 2014).
The final guideline to be mentioned here is STARD (Standards for the
Reporting of Diagnostic Accuracy Studies). Because many neuropsychological
studies have an implicit, if not explicit, intention to provide diagnostic infor-
mation, STARD is especially relevant for the evidence-based clinical neuropsy-
chologist. The objective of the STARD initiative is to improve the accuracy and
completeness of reporting of studies of diagnostic accuracy so that readers can
assess potential bias in a study and evaluate the generalizability of the findings
(Bossuyt et al., 2003). The STARD guideline consists of a 25-item checklist and
recommends the use of a flow diagram that describes the study design and selec-
tion of patients. Figure 7.3 provides an example of a STARD-inspired flow chart
for a retrospective case-controlled study in which the investigators wanted to
compare the diagnostic utility of verbal fluency patterns to distinguish patients
with prototypical Alzheimer’s versus frontotemporal dementia PET scans. The
two groups each consisted of 45 patients, but to arrive at these groups, nearly
3,100 patients in a clinic registry needed to be considered.
The STROBE, CONSORT, and STARD guidelines can all be useful to have on
hand when appraising evidence, but they are primarily meant for use in preparing
manuscripts for publication. The small volume How to Read a Paper: The Basics
of Evidence-Based Medicine (Greenhalgh, 2006) is specifically intended for the
consumer, and it is an extremely useful guide. The text not only guides the
evidence-based reader on how to set up literature searches, but also walks the reader
through different sections of a typical research report, providing a checklist to
accompany each of the chapters. Since most neuropsychological evaluations are
designed to reduce diagnostic uncertainty (Costa, 1983) and rely heavily on the
diagnostic validity of neuropsychological tests (Ivnik et al., 2001), Chapter 7,
“Papers That Report the Results of Diagnostic or Screening Tests,” is especially
relevant for appraising the quality of diagnostic studies and their applicability
to patient decision making. Common Test Operating Characteristics (TOC)
such as sensitivity, specificity, odds ratio, and, importantly, the likelihood ratio
are discussed. These Bayesian-based indices reflect how a given test operates in
clinical situations, and will form the empirical basis of our discussion of the final
step in evidence-based practice, namely, applying research evidence.
Apply the Evidence
After our clinical assessment of the patient, we have taken our need for informa-
tion, translated it into an answerable question, searched the research literature
to acquire relevant information, and critically appraised it in terms of its valid-
ity and applicability to our patient’s problem. Now we must apply what we have
found to guide our clinical decision making about the patient. For example, our
search to answer the question of whether differences in semantic and phone-
mic fluency could help us differentiate patients with AD from those with FTD
yielded several papers of interest, including a meta-analysis (Henry, Crawford, &
Phillips, 2004). The meta-analysis collapsed 153 studies involving 15,990 patients
and concluded that “semantic, but not phonemic fluency, was significantly more
impaired …” and that “semantic memory deficit in Dementia of the Alzheimer’s
Type [DAT] qualifies as a differential deficit …” (p. 1212). This adds to our clini-
cal confidence that a patient suspected of AD who shows disproportionately poor
semantic relative to phonemic fluency may, in fact, have AD. However, it does not tell us how big a discrepancy
is needed to be confident that the patient has AD. Herein lies the rub. In our
working definition of EBCNP, we proposed that one of the defining features of
evidence-based practice was the ability to integrate the “best research” derived
from the study of populations to inform clinical decisions about individuals. To
move from group data to data that are applicable at the level of the individual,
the evidence-based practitioner in neuropsychology needs to shift how she or he
interprets and uses data.
Most case-controlled studies, and even many cohort studies, report statistical
differences between aggregate or mean levels of performance, with the probabil-
ity (p-value) level denoting whether the difference is reliable and repeatable and
not due to measurement error or chance fluctuations in the test scores. In what
was truly a seminal paper, Matarazzo and Herman pointed out that there is a
difference between statistical significance and clinical significance, with the latter
referring to whether an observed finding is sufficiently rare in a reference popu-
lation (e.g., normal population) such that it is more likely to have been obtained
in a population external to the reference group (e.g., “abnormal”—a group with
a condition of interest: Matarazzo & Herman, 1984). This distinction between
“how much” of a difference is needed to be statistically significant versus “how
many” to be rare and clinically meaningful (an issue of base rates) is important
in outcomes research (Smith, 2002), and central for moving from group data to
data that can be applied at the level of the individual.
For illustration, let us consider the comparison of patients with presumed
mild cognitive impairment (MCI; the condition of interest or COI) and cogni-
tively normal individuals (reference population: RP) on a mental status test such
as the Montreal Cognitive Assessment or MOCA (Nasreddine et al., 2005). The
idealized distributions of MOCA scores for both groups are depicted in Figure
7.4. It is clear that the average performance of the COI group is much lower than
that of the RP (M = 22.1 vs. 27.4), although there is some overlap between the
distributions. Besides looking at the mean level of performance between the
groups, we can also look at the diagnostic efficiency of the MOCA in terms of
base rates. At the point of overlap between the two distributions (position “A”),
we have placed a heavy dotted line representing the optimal cutoff below which
most individuals with the COI score, and above which most of the RP group
scores. For the MOCA, this cutoff score is ≤25 (Nasreddine et al., 2005). Those
with the COI who fall below the cutoff are “True Positives,” and those in the RP
group who score above the cutoff are “True Negatives.” There are some individu-
als with the COI who fall above the cutoff, and they represent “False Negatives,”
and some in the RP group that fall below the cutoff and are considered “False
Positives.” Knowing the number of individuals who fall below and above the cut-
off in each group allows us to calculate, among many other things, the sensitivity
(percentage of true positives in the COI group) and the specificity (percentage of
true negatives in the RP group), given the stated cutoff score.
In our example of the MOCA, Nasreddine and colleagues (2005) reported a
sensitivity of 90% and a specificity of 87% when comparing individuals with
MCI versus cognitively normal controls and using a cutoff of ≤25, that is, 90% of
individuals in their MCI group scored below the cutoff, and 87% of those in the nor-
mal group scored above this cutoff. Of course, individuals in each group showed
a range of performance on the MOCA. We could use a more stringent cutoff of
≤23, shown as position “B” in Figure 7.4. At this cutoff, there would be fewer true
positives identified in the COI group (resulting in lower sensitivity), but more
true negatives in the RP group (increased specificity). This is not necessarily a
bad thing. In their volume on evidence-based medicine, Sackett and colleagues
(Sackett, Straus, Richardson, & Haynes, 2000) describe cutoffs with very high
specificity as a SpPin—a score with high specificity such that a positive result (a
[Figure 7.4: idealized score distributions for the condition of interest (COI) and the reference population (RP), with cutoffs at positions B, A, and C marking the tradeoff between sensitivity (% true positives) and specificity (% true negatives).]
score below the cutoff) rules the COI “in.” We may want a cutoff score
that is a SpPin if we are relying heavily on the test for diagnostic purposes, since
we want to be confident that individuals scoring below the cutoff truly have the
COI. Conversely, we could use a more liberal cutoff score such as ≤28, shown as
position “C” in Figure 7.4. At this cutoff, we would misclassify more of the RP
group (false positives), but very few of the COI group would be expected to score
above this cutoff. This cutoff results in a SnNout—a cutoff with high sensitivity
where a negative result (scores above the cutoff) rules the COI “out” (Sackett et
al., 2000). Cut-off scores that result in a SnNout may be useful for screening pur-
poses since most RP cases are ruled out, and those that remain will be expected
to go on for further evaluation.
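To make the cutoff tradeoff concrete, the sketch below simulates two idealized score distributions like those in Figure 7.4, using the group means reported above but assumed standard deviations (the published values are not restated here), and sweeps the three cutoffs just discussed; the resulting sensitivities and specificities are therefore illustrative rather than the published MOCA figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Idealized continuous score distributions for the condition of interest (COI)
# and the reference population (RP). Means follow the text; SDs are assumed.
coi = rng.normal(loc=22.1, scale=2.5, size=100_000)
rp = rng.normal(loc=27.4, scale=2.0, size=100_000)

for cutoff in (25, 23, 28):  # positions A, B, and C in Figure 7.4
    sensitivity = np.mean(coi <= cutoff)  # true positive rate in the COI group
    specificity = np.mean(rp > cutoff)    # true negative rate in the RP group
    youden_j = sensitivity + specificity - 1
    print(f"cutoff <= {cutoff}: sensitivity {sensitivity:.2f}, "
          f"specificity {specificity:.2f}, Youden J {youden_j:.2f}")
```

Sweeping every possible cutoff in this way, rather than just three, is exactly what generates the ROC curve discussed next.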
It is clear from the discussion above that “tests” themselves do not have sen-
sitivity and specificity. Rather, it is the specific test scores that have sensitivity
and specificity, and these are specific to how the test operates within the con-
text of the specific samples examined. Plotting the sensitivity and specificity of
each score, we could generate a receiver operating characteristic (ROC) curve as well as the
likelihood of the COI with each score. Because the operating characteristics of
a test depend on the samples within which they are used, it is incumbent on
the evidence-based practitioner to critically evaluate the characteristics of the
patient groups to determine their representativeness and adequacy.
When we compare groups in terms of base-rate information, we can cast the
data into a simple 2 x 2 matrix as shown in Figure 7.5, typically with the true
positive cases with the COI placed in the upper left corner. Then, by combin-
ing the cells in different ways, we can calculate a number of useful indexes that
can inform our clinical decision making. Table 7.3 summarizes a number of
these base-rate indexes, and many online calculators are available to calculate
these and other indexes automatically (e.g., http://statpages.info/ctab2x2.html).
For the purpose of our discussion here, we will focus on six common diagnostic
indexes: prevalence, sensitivity, specificity, likelihood ratio, pre-test probability,
and post-test probability, and the online calculator located at http://araw.mede.
uic.edu/cgi-bin/testcalc.pl is particularly helpful, as it allows one to adjust the
expected prevalence of the COI to match local samples and produces a nomo-
gram, a visual illustration of how test results change the post-test likelihood of
detecting the COI, given the prevalence.
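A minimal sketch of these base-rate calculations is given below; the 2 × 2 counts are hypothetical, chosen to mimic a test with roughly 90% sensitivity and 87% specificity at a 50% prevalence, and the online calculators mentioned above perform the same arithmetic.

```python
def diagnostic_indexes(tp, fn, fp, tn):
    """Base-rate indexes from a 2 x 2 table (condition of interest by test result)."""
    prevalence = (tp + fn) / (tp + fn + fp + tn)   # pre-test probability in this sample
    sensitivity = tp / (tp + fn)                   # true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    lr_positive = sensitivity / (1 - specificity)  # positive likelihood ratio
    lr_negative = (1 - sensitivity) / specificity  # negative likelihood ratio
    ppv = tp / (tp + fp)                           # post-test probability given a positive result
    npv = tn / (tn + fn)                           # probability the COI is absent given a negative result
    return {"prevalence": prevalence, "sensitivity": sensitivity,
            "specificity": specificity, "LR+": lr_positive, "LR-": lr_negative,
            "PPV": ppv, "NPV": npv}

# Hypothetical counts: 90 true positives, 10 false negatives, 13 false positives, 87 true negatives
for name, value in diagnostic_indexes(tp=90, fn=10, fp=13, tn=87).items():
    print(f"{name}: {value:.2f}")
```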
[Figure 7.5: 2 × 2 matrix of test results by presence or absence of the condition of interest.]
probability or prevalence of the COI) times a modifier (based on the test results).
Test results are used to adjust a prior distribution to form a new posterior distri-
bution of scores (post-test probabilities). The more that a test result changes the
pre-test probability (what we knew before giving the test), the greater value or
contribution the test has in the diagnostic process, and potentially for contrib-
uting to treatment choices. This shift from pre-test to post-test probabilities is
captured by the positive Likelihood Ratio (LR+), which is the ratio of sensitivity
divided by the quantity (1 minus specificity), or the ratio of the percentage of
“true positives” to percentage of “false positives.” (The negative Likelihood Ratio
[LR–] represents the decrease in likelihood of the COI with a negative test result.)
Because sensitivity and specificity are properties of how a test performs in speci-
fied populations, they are independent of prevalence, and by extension, so is the
LR+. This is important because once research has established a LR+, it can be
used in settings where the prevalence of the COI may be different from the one
in which it was originally derived. Positive Likelihood Ratios of less than 1.0 are
not meaningful, 2–4 are small, 5–9 are considered moderate, and 10 or greater
are large and considered virtually diagnostically certain.
[Figure 7.6: two nomogram panels, each with columns for prior probability, likelihood ratio, and posterior probability.]
By drawing a line from the pre-test probability through the LR in the middle
column, we see our post-test probability of .86 indicated along the right column.
The results of a negative test (LR–) are shown in the bottom line.
The results of the above study (Filoteo et al., 2009) are based on an observed
prevalence rate for PDD of 50%. But what if I work in a specialized memory-disorders
clinic where the majority of patients presenting with symptoms of Parkinsonism
and dementia are the result of DLB, and cases of PDD are relatively rare—about
15%? As an evidence-based practitioner, I know that Likelihood Ratios are inde-
pendent of prevalence. I therefore can use the LR+ finding from the Filoteo et
al. study (2009) and apply it to my setting where I estimate the prevalence of
PDD to be 15% compared to DLB. Again using the online calculator found at
http://araw.mede.uic.edu/cgi-bin/testcalc.pl, I can adjust the local prevalence to
be 0.15 and recalculate the pre-test and post-test probabilities, with the results
depicted in the panel on the right side of Figure 7.6. Sensitivity and specificity are
unchanged, as are the Likelihood ratios. However, because the pre-test probabil-
ity (prevalence) is only 15% in my setting, the post-test probabilities are different.
Given a positive test result in a setting where the prevalence of PDD is only 15%,
an additional 35% of PDD cases (a post-test probability of 51%) can now be iden-
tified. Alternatively, a negative test result has an LR– of 0.29 and a post-test prob-
ability of 0.05, allowing me to now accurately rule out all but 5% of PDD cases.
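The prevalence adjustment just described is straightforward Bayesian arithmetic on the odds scale, sketched below; the LR+ of 6.1 is an assumed illustrative value chosen to be consistent with the post-test probabilities reported above (the exact ratio from Filoteo et al., 2009, is not restated here), whereas the LR– of 0.29 is taken from the text.

```python
def post_test_probability(pretest_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via the odds form of Bayes' rule."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

LR_POSITIVE = 6.1   # assumed illustrative positive likelihood ratio
LR_NEGATIVE = 0.29  # negative likelihood ratio reported in the text

for prevalence in (0.50, 0.15):  # original study setting vs. a local memory-disorders clinic
    positive = post_test_probability(prevalence, LR_POSITIVE)
    negative = post_test_probability(prevalence, LR_NEGATIVE)
    print(f"prevalence {prevalence:.2f}: post-test probability {positive:.2f} "
          f"after a positive result, {negative:.2f} after a negative result")
```

Because the likelihood ratios themselves do not change, only the pre-test odds need to be swapped in for a new setting.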
for a PET scan is open to discussion, but it certainly strengthens the confidence
of the neuropsychologist in making the recommendation for a PET scan. As in
the previous example, if FTD were to be diagnosed in a different sample with
different prevalence rates, post-test probability would need to be re-estimated
accordingly.
The application of clinical research derived from the study of populations to
guide individual clinical decision making is the essence of evidence-based prac-
tice. It basically amounts to answering the question, “If my clinical patient were
in this study, what is the likelihood that they would have the COI, given their
test score(s)?”
SUMMARY
Evidence-based practices in neuropsychology are not esoteric rituals performed
by the academic elite. They represent basic skill sets designed for routine clinical
practice with the goal of adding value to patient care and outcomes in an account-
able manner. Assessing our patients and deconstructing referral questions, ask-
ing answerable questions about our patients, acquiring relevant clinical research
information to answer these questions, critically appraising the information we
gather, and applying this information in an informed and thoughtful way are the
hallmarks of the evidence-based practitioner. These skills are not acquired over-
night, nor are they mastered as discrete events. Evidence-based practice is an
ongoing pattern of activity that becomes streamlined and improves with repeti-
tion and frequency. It is incumbent on us as clinicians to be good consumers of
clinical science and to learn how to best apply this information in a patient-cen-
tered way. Clinical neuropsychology has always prided itself on being an applied
science based on empirical research. Adopting evidence-based practices merely
helps us realize this goal in everyday practice.
REFERENCES
Akobeng, A. K. (2005). Principles of evidence based medicine. Archives of Disease in
Childhood, 90(8), 837–840. doi:10.1136/adc.2005.071761
APA Presidential Task Force on Evidence-Based Practice. (2006). Evidence-based
practice in psychology. American Psychologist, 61(4), 271–285.
APUS Librarians. (2015). What are Boolean operators, and how do I use them?
Retrieved June 27, 2016 from http://apus.libanswers.com/faq/2310.
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., …
de Vet, H. C. (2003). Towards complete and accurate reporting of studies of diagnos-
tic accuracy: The STARD Initiative. Annals of Internal Medicine, 138(1), 40–44.
Bowden, S. C., Harrison, E. J., & Loring, D. W. (2014). Evaluating research for clinical
significance: Using critically appraised topics to enhance evidence-based neuropsy-
chology. Clinical Neuropsychology, 28(4), 653–668.
Chelune, G. J. (2002). Making neuropsychological outcomes research consumer
friendly: A commentary on Keith et al. (2002). [Comment]. Neuropsychology, 16(3),
422–425.
Chelune, G. J. (2010). Evidence-based research and practice in clinical neuro-
psychology. [Review]. Clinical Neuropsychology, 24(3), 454–467. doi:10.1080/
13854040802360574
Cicerone, K. D., Dahlberg, C., Kalmar, K., Langenbahn, D. M., Malec, J. F.,
Bergquist, T. F., … Morse, P. A. (2000). Evidence-based cognitive rehabilita-
tion: Recommendations for clinical practice. Archives of Physical Medicine and
Rehabilitation, 81(12), 1596–1615.
Cicerone, K. D., Dahlberg, C., Malec, J. F., Langenbahn, D. M., Felicetti, T., Kneipp,
S., … Catanese, J. (2005). Evidence-based cognitive rehabilitation: Updated
review of the literature from 1998 through 2002. Archives of Physical Medicine and
Rehabilitation, 86(8), 1681–1692.
Claridge, J. A., & Fabian, T. C. (2005). History and development of evidence-based
medicine. [Historical Article]. World Journal of Surgery, 29(5), 547–553. doi:10.1007/
s00268-005-7910-1
Cochrane, A. L. (1972). Effectiveness and Efficiency: Random Reflections on Health
Services. London, United Kingdom: Nuffield Provincial Hospitals Trust.
Costa, L. (1983). Clinical neuropsychology: A discipline in evolution. Journal of
Clinical Neuropsychology, 5(1), 1–11.
Dodrill, C. B. (1997). Myths of neuropsychology. The Clinical Neuropsychologist,
11(1), 1–17.
Encyclopedia of Mental Disorders. Managed Care. (2016). Available at www.minddisor-
ders.com/Kau-NU/managed-care.html.
Evidence-Based Medicine Working Group. (1992). Evidence-based medicine. A new
approach to teaching the practice of medicine. Journal of the American Medical
Association, 268(17), 2420–2425.
Filoteo, J. V., Salmon, D. P., Schiehser, D. M., Kane, A. E., Hamilton, J. M., Rilling,
L. M., … Galasko, D. R. (2009). Verbal learning and memory in patients with
dementia with Lewy bodies and Parkinson’s disease dementia. Journal of Clinical
and Experimental Neuropsychology, 31(7), 823–834.
Foster, N. L., Heidebrink, J. L., Clark, C. M., Jagust, W. J., Arnold, S. E., Barbas, N. R., …
Minoshima, S. (2007). FDG-PET improves accuracy in distinguishing frontotempo-
ral dementia and Alzheimer’s disease. Brain, 130(Pt 10), 2616–2635.
Gooch, T. (2013). Ludwig Andreas Feuerbach. In E. N. Zalta (Ed.), The Stanford
Encyclopedia of Philosophy (Winter 2013 ed.). Retrieved June 27, 2016 from http://
plato.stanford.edu/archives/win2013/entries/ludwig-feuerbach/.
Greenhalgh, T. (2006). How to Read a Paper: The Basics of Evidence-Based Medicine
(3rd ed.). Malden, MA: Blackwell Publishing.
Haynes, R. B., Devereaux, P. J., & Guyatt, G. H. (2002). Clinical expertise in the era of
evidence-based medicine and patient choice. Evidence-Based Medicine, 7, 36–38.
Heneghan, C., & Badenoch, D. (2006). Evidence-Based Medicine Toolbox (2nd ed.).
Malden, MA: Blackwell Publishing.
Henry, J. D., Crawford, J. R., & Phillips, L. H. (2004). Verbal fluency performance
in dementia of the Alzheimer’s type: A meta-analysis. Neuropsychologia, 42(9),
1212–1222.
Hermann, B. P., Perrine, K., Chelune, G. J., Barr, W., Loring, D. W., Strauss, E., …
Westerveld, M. (1999). Visual confrontation naming following left anterior tempo-
ral lobectomy: A comparison of surgical approaches. Neuropsychology, 13(1), 3–9.
Hill, G. B. (2000). Archie Cochrane and his legacy. An internal challenge to physicians’
autonomy? Journal of Clinical Epidemiology, 53(12), 1189–1192.
Sackett, D. L., Straus, S. E., Richardson, W., & Haynes, R. B. (2000). Evidence-
Based Medicine: How to Practice and Teach EBM (2nd ed.). New York: Churchill
Livingston.
Sackett, D. L., Straus, S. E., Richardson, W. S., Rosenberg, W., & Haynes, R. B.
(2000). Evidence-Based Medicine: How to Practice and Teach EBM (2nd ed.).
New York: Churchill Livingston.
Schoenberg, M. R., & Scott, J. G. (2011). The Little Black Book of Neuropsychology: A
Syndrome-Based Approach. New York: Springer.
Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated
guidelines for reporting parallel group randomized trials. Annals of Internal
Medicine, 152(11), 726–732.
Segen, J. C. (2006). McGraw-Hill Concise Dictionary of Modern Medicine.
New York: McGraw-Hill Companies.
Smith, G. E. (2002). What is the outcome we seek? A commentary on Keith et al. (2002).
[Comment]. Neuropsychology, 16(3), 432–433.
Spector, A., Orrell, M., & Hall, L. (2012). Systematic review of neuropsychological out-
comes in dementia from cognition-based psychological interventions. Dementia
and Geriatric Cognitive Disorders, 34(3–4), 244–255.
SPRINT Research Group. (2015). A randomized trial of intensive versus standard
blood-pressure control. New England Journal of Medicine, 373(22), 2103–2116.
Suchy, Y., Kraybill, M. L., & Franchow, E. (2011). Practice effect and beyond: Reaction
to novelty as an independent predictor of cognitive decline among older adults.
Journal of the International Neuropsychological Society, 17(1), 101–111. doi:10.1017/
S135561771000130X
von Elm, E., Altman, D. G., Egger, M., Pocock, S. J., Gotzsche, P. C., & Vandenbroucke,
J. P. (2007). The Strengthening the Reporting of Observational Studies in
Epidemiology (STROBE) statement: Guidelines for reporting observational stud-
ies. Annals of Internal Medicine, 147(8), 573–577.
ERIN D. BIGLER
neuroimaging did not appear until the 1980s (Bigler, Yeo, & Turkheimer, 1989;
Kertesz, 1983). What this has meant for neuropsychology is that the assess-
ment process mostly developed independently of any type of neuroimaging
metric that potentially could be incorporated into neuropsychological diag-
nostic formulation.
In contrast, since neuropsychological assessment is all about making infer-
ences about brain function, and given the exquisite anatomical and pathologi-
cal detail that can now be achieved with contemporary neuroimaging, clinical
neuropsychology should be using neuroimaging findings. Additionally, given
the currently available technology, it would seem that neuroimaging informa-
tion should be routinely incorporated into the neuropsychological metric in
diagnostic formulation, treatment programming, monitoring, and predicting
outcome. But where to start? This book is on evidence-based practices, so this
chapter will review well-accepted clinical neuroimaging methods that have
relevance to neuropsychological outcome. For example, in various diseases,
especially neurodegenerative, as well as acquired disorders like traumatic brain
injury (TBI), the amount of cerebral atrophy coarsely relates to cognitive func-
tioning. Therefore, this review begins with clinical rating of cerebral atrophy,
but we then will move into specific region of interest (ROI) atrophic changes or
pathological markers.
Neuroimaging studies are routinely performed on most individuals with
a neurocognitive or neurobehavioral complaint or symptom, at least in the
initial stages of making a diagnosis. Incredibly sophisticated image-analysis
methods are available, but all such techniques require sophisticated image-
analysis hardware and software, as well as the expertise to analyze and inter-
pret. For the typical front-line clinician, these resources may not be available,
nor may they be available through the typical clinical radiology laboratory
performing the neuroimaging studies. Thus, another compelling reason for
neuropsychologists to use clinical rating schema for assessing neuroimag-
ing studies is the ease with which these rating methods can be performed.
The future is likely to involve routine advanced image-analysis
methods that are fully automated, with comparisons to normative databases
(see Toga, 2015) performed at the time of initial imaging and integrated with
neuropsychological test findings (see also Bilder, 2011, as to how this might be
implemented), but this may still be a decade or two away from being clinically
implemented.
The discussion that follows assumes that the reader has some fundamental
understanding of basic neuroanatomy and neuroimaging, which can be obtained
from multiple sources, including Schoenberg et al. (2011) and Filley and Bigler
(2017). It is beyond the scope of this chapter to provide elaborate details of the
various clinical rating methods that are available, or of the basics of CT and MR neu-
roimaging. It is recommended that the reader go to the original cited reference
source for each clinical rating discussed in this chapter, should a particular type
of rating be considered.
ETHICAL CAVEAT
While the Houston Guidelines (see http://www.theaacn.org/position_papers/
houston_conference.pdf, or Hannay, 1998) for training in neuropsychology
under Section VI, Knowledge Base, explicitly state that neuropsychologists are
to possess training and knowledge in “functional neuroanatomy” and “neuro-
imaging and other neurodiagnostic techniques,” such a background may not be
sufficient for performing clinical ratings based on CT or MRI. We recommend
that, before embarking on performing clinical ratings as discussed in this chap-
ter, the individual clinician consult with a neuroradiologist who could provide
such training and initial supervision to result in the highest accuracy of rating
possible. Also, all clinical imaging will come with a radiological interpretative
report. The purposes of the neuroimaging evaluation are to provide an over-
all statement about brain anatomy and potential neuropathology, along with
whether any emergency medical circumstance may be present (e.g., neoplasm,
aneurysm, vascular abnormality, etc.). Often the neurobehavioral questions to
be addressed by the neuropsychologist are not necessarily the topic for the initial
referral to obtain a brain scan. So the post-hoc clinical rating applied to the scan
becomes important for neuropsychology when specific symptoms or problems
can be explored in light of the neuroimaging findings.
CLINICAL R ATINGS
Brain Atrophy
Normal aging is associated with some brain volume loss, often best visualized as
increased size of the lateral ventricles and greater prominence of cerebral spinal
fluid (CSF) within the cortical subarachnoid space as shown in Figure 8.1, which
compares the author’s axial mid-lateral ventricle MRI cut at age 66 with that
of his lab director at age 44, to that of an adolescent brain. Note the enlarge-
ment of the ventricular system in the author’s brain compared to the younger
brains. This increase in ventricular size occurs to compensate for the volume loss
of brain parenchyma that occurs through the normal aging process, associated
with neuronal loss. Concomitant with age-related brain volume loss is increased
prominence of cortical sulci and the interhemispheric fissure (see white arrow-
head in Figure 8.1 that points to visibly increased sulcal width in the author’s
older brain compared to the younger brains). Although this is the natural aging
process, it does, in fact, produce cerebral atrophy. So, for cerebral atrophy to be
clinically significant, it must exceed what is expected to be “normal” variation in
brain volume associated with typical aging.
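To make that judgement concrete, one approach is to express an atrophy measure as a standardized deviation from an age-appropriate normative value and flag it only when it exceeds expected aging-related variation. The sketch below illustrates this logic; the ventricle-to-brain-ratio metric, the age bands, and the normative means and standard deviations are hypothetical placeholders, not published norms.

```python
# A minimal sketch of flagging atrophy only when it exceeds normal age-related
# variation. The normative table is hypothetical and for illustration only.

AGE_NORMS = {  # age band -> (mean ventricle-to-brain ratio, SD); invented values
    (18, 44): (0.015, 0.004),
    (45, 64): (0.020, 0.005),
    (65, 89): (0.028, 0.007),
}

def atrophy_z(vbr: float, age: int) -> float:
    """Z-score of a ventricle-to-brain ratio relative to the (hypothetical) age norm."""
    for (low, high), (mean, sd) in AGE_NORMS.items():
        if low <= age <= high:
            return (vbr - mean) / sd
    raise ValueError("no normative band for this age")

def clinically_significant_atrophy(vbr: float, age: int, cutoff_sd: float = 2.0) -> bool:
    """Flag atrophy only when it exceeds the expected 'normal' variation for age."""
    return atrophy_z(vbr, age) > cutoff_sd

print(clinically_significant_atrophy(0.045, 66))  # about 2.4 SD above the 65-89 norm -> True
```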
Wattjes and colleagues (2009) provide some important comparisons and
guidelines for interpreting either CT or MRI in a memory clinic. This is shown
in Figure 8.2. The important point in comparing CT with MRI is that
CT scanning is much faster and less expensive, and it does not have MRI's restrictions
in the case of heart pacemakers, other implanted medical devices, dental or
facial bone implants, or any metal that may produce artifacts. While CT does not
provide the exquisite anatomical detail achieved by MRI, it nonetheless has the
ability to display ventricular size and white matter density along with sulcal and
interhemispheric width. Knowing the anatomical relationship of certain cortical
and subcortical structures to ventricular anatomy does permit inferences about
Figure 8.1 Age-related normal changes in ventricular size. The image on the right
is the author’s at age 66. Arrow points to ventricle. Arrowhead on the side points to
more prominence of the cortical sulci, with the bottom arrowhead pointing to greater
prominence of the interhemispheric fissure in the older brain.
Figure 8.2 Comparison of coronal MRI (top) and CT (bottom) in the same subject,
with the cutting plane perpendicular to the hippocampal formation. CT does not
provide the kind of anatomical detail that MRI does, but note that the ventricular
system morphology can be identified with either CT or MRI. White arrow in the MRI
points to the reduced hippocampal volume and dilated temporal horn of the lateral
ventricular system. The white arrow in the CT image points to the same region as the
MRI. The distinct appearance of the hippocampus cannot be visualized in CT, but
it may be inferred by the dilated temporal horn. Also, the prominence of the Sylvian
fissure (white arrowhead points to the Sylvian fissure) and the temporal lobe gyri
signifies temporal lobe atrophy.
Illustration used with permission from Wattjes et al. (2009) and the Radiological Society of North
America, p. 176.
atrophic findings, even from CT. For example, in the illustration in Figure 8.2,
the MRI clearly shows hippocampal atrophy in association with temporal horn
dilation. The corresponding CT where the image plane is identical to MRI only
shows the temporal horn dilation, as hippocampal anatomy is ill-defined with
CT. Nonetheless, given CT-defined temporal horn dilation, an inference may
be made that hippocampal size has been reduced. Both CT and MRI show the
prominence of the Sylvian fissure (white arrowhead). Accordingly, qualitative
analyses on both CT and MRI brain scanning may be performed.
Figure 8.3, also from Wattjes and colleagues (2009), depicts paired transverse
axial CT and MRI views cut in the same plane, but now in three different
subjects with various levels of cerebral atrophy, from minimal to extensive. Note
that the defining feature, whether CT or MRI, is prominence of the interhemi-
spheric fissure and cortical sulci.
Some of the earliest and most widely used clinical rating methods were devel-
oped by Scheltens and colleagues (see Pasquier et al., 1996; Pasquier et al., 1997;
Scheltens et al., 1993; Scheltens, Barkhof, et al., 1992; Scheltens, Launer, Barkhof,
Weinstein, & van Gool, 1995; Scheltens, Leys, et al., 1992; Scheltens, Pasquier,
Weerts, Barkhof, & Leys, 1997). In the Pasquier et al. (1996) study, four neurora-
diologists rated the degree of cerebral atrophy on a 0–3-point scale. As with any
Figure 8.3 Top images are MRI with each subject’s CT below cut in the same plane as the
MRI. Note that both CT and MRI adequately and similarly depict cortical sulci, which
increase in prominence as a reflection of cortical atrophy. These images are taken from
Wattjes et al. (2009), where global cerebral atrophy or GCA was rated on a 4-point scale,
where 0 was no atrophy, and 3 the highest rating. For these images, the control (a and d)
was rated “0,” with the subject in b and e rated a “1,” and the subject in c and f rated a “2.”
Illustration used with permission from Wattjes et al. (2009) and the Radiological Society of North
America, p. 179.
rating system, inter-observer agreement (mean overall kappa = 0.48) was lower
than intra-observer agreement (mean overall kappa = 0.67) when comparing scan findings
from the same neuroimaging studies interpreted one month apart. Nonetheless,
these kinds of coarse ratings do provide general information about structural brain
integrity with reasonably good agreement for clinical purposes between raters.
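For readers who wish to check agreement in their own rating exercises, the unweighted Cohen's kappa used in such studies can be computed directly from two raters' category assignments. The sketch below is a self-contained illustration; the ten paired 0–3 atrophy ratings are invented example data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical ratings of the same cases."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance-expected agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented 0-3 global atrophy ratings of the same ten scans by two raters
rater_1 = [0, 1, 1, 2, 3, 0, 2, 1, 3, 2]
rater_2 = [0, 1, 2, 2, 3, 1, 2, 1, 2, 2]
print(round(cohens_kappa(rater_1, rater_2), 2))
```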
Figure 8.4, also from the Wattjes et al. (2009) publication on diagnostic imag-
ing of patients in a memory clinic, shows images in the coronal plane from age-
related “normal” appearance (A and E) to various levels of pathological display.
Images on top are MRI, compared to the same plane coronal CT view on the
bottom. Note that in the subject depicted in A and E, there is minimal corti-
cal CSF visualized because the sulci are so narrow, such that most sulci cannot
even be visualized on the CT image. Also note the small size of the lateral and
third ventricle. Furthermore, in A and E, note the homogenous, symmetrical,
and rather uniform appearance of brain parenchyma, characteristic of normal
parenchyma. In contrast, beginning with the case presented in B and F, ventricu-
lar size begins to increase, associated with greater and greater prominence of cor-
tical sulci. Also note in D and H that CT density and the MR signal around the
ventricle also change, suggesting degradation in white matter integrity. In age-
related disorders, this is typically considered to relate to vascular changes, where
the MRI sequence referred to as fluid attenuated inversion recovery (FLAIR) is
particularly sensitive to white matter abnormalities that may be incident to age
and disease, as shown in Figure 8.5. On the FLAIR image sequence (as well as
T2-weighted sequences), a white matter hyperintensity (WMH) shows up, as its
name implies, as a bright white region, distinctly different from the uniform
signal intensity of the surrounding white matter parenchyma. More will be
discussed about white matter ratings in the next section and how changes are
detected with CT and MRI.
Figure 8.5 Fazekas ratings (see Fazekas et al., 1987) for classifying white matter
hyperintensities (WMHs). Fazekas 0: no WMHs. Fazekas 1: focal or punctate lesions
(single lesions ≤9 mm; grouped lesions <20 mm). Fazekas 2: beginning confluent lesions
(single lesions 10–20 mm, or grouped lesions >20 mm in any diameter, with no more than
connecting bridges between individual lesions). Fazekas 3: confluent lesions (single lesions
or confluent areas of hyperintensity ≥20 mm in any diameter).
Illustration adapted and used with permission from Prins and Scheltens (2015) and with
permission from Lancet Neurology, p. 160.
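To make the Figure 8.5 criteria concrete, the sketch below maps measured lesion sizes onto an approximate Fazekas grade. Feeding the rule with manually measured diameters is an illustrative assumption; in practice the grade is assigned visually from the FLAIR images, so this is only a restatement of the criteria in code.

```python
def fazekas_grade(largest_single_mm: float = 0.0,
                  largest_grouped_mm: float = 0.0,
                  confluent: bool = False) -> int:
    """Approximate Fazekas grade (0-3) from WMH measurements (criteria per Figure 8.5)."""
    if largest_single_mm == 0 and largest_grouped_mm == 0:
        return 0                                   # no WMHs
    if confluent or largest_single_mm >= 20:
        return 3                                   # confluent lesions or areas >= 20 mm
    if largest_single_mm >= 10 or largest_grouped_mm > 20:
        return 2                                   # beginning confluent lesions
    return 1                                       # focal or punctate lesions

print(fazekas_grade(largest_single_mm=6))                           # 1
print(fazekas_grade(largest_single_mm=14, largest_grouped_mm=25))   # 2
print(fazekas_grade(largest_single_mm=30))                          # 3
```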
Most rating scales for temporal lobe atrophy and hippocampal volume loss
are based on images obtained in the coronal plane, as shown in Figure 8.4. Any
number of studies have shown that the increasing presence of pathologically
definable, age-adjusted changes in brain structure, such as whole-brain,
temporal lobe, or hippocampal atrophy, relates to neuropsychological
impairment (Jang et al., 2015; van de Pol et al., 2007) and is especially associ-
ated with memory impairment (DeCarli et al., 2007).
More recently, two more comprehensive rating methods have been proposed,
both of which utilize aspects of what has been discussed above. Guo et al. (2014)
introduced an elaborate rating method referred to as the Brain Atrophy and
Lesion Index (BALI), and Jang et al. (2015) introduced the Comprehensive Visual
Rating Scale (CVRS). What is important for neuropsychology about the BALI
study is that a variety of raters were trained including non-physician Ph.D.s,
with all raters capable of achieving a high degree of accuracy. Other studies have
shown that the BALI approach was sensitive in discrimination of Alzheimer’s
disease from mild cognitive impairment and normal aging (Chen et al., 2010;
Song, Mitnitski, Zhang, Chen, & Rockwood, 2013). The CVRS from Jang et al.
(2015) utilizes prior rating methods as introduced by Scheltens and colleagues
as previously described, but it incorporates atrophy and WMH ratings into a
single scale based on either axial or coronal images, as shown in Figure 8.6. What
Figure 8.6 The scoring table of the Comprehensive Visual Rating Scale (CVRS). T1WI, T1-weighted images. The white rectangles mark the brain regions
on which the ratings are focused.
FLAIR, fluid-attenuated inversion recovery; WMH, white matter hyperintensity; D, deep; P, periventricular; MB, microbleeds.
From Jang et al. (2015), p. 1025; used with permission from the Journal of Alzheimer's Disease and IOS Press. For details on using the scale, please refer to the original publication, as the details are
lengthy, elaborate, and beyond the scope of discussion within this chapter.
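The general idea of a composite visual rating, summing several anchored subscale ratings into one total, can be sketched as follows. This is not the published CVRS scoring scheme; the item names, score ranges, and the simple unweighted sum are hypothetical illustrations, and the actual items and rules are given in Jang et al. (2015).

```python
# Hedged sketch of a composite visual rating built from component subscales.
# Item names and score ranges are hypothetical stand-ins, not the published CVRS items.

COMPONENT_RANGES = {
    "global_cortical_atrophy": (0, 3),
    "medial_temporal_atrophy_left": (0, 4),
    "medial_temporal_atrophy_right": (0, 4),
    "periventricular_wmh": (0, 3),
    "deep_wmh": (0, 3),
}

def composite_score(ratings: dict) -> int:
    """Sum validated component ratings into a single visual-rating total."""
    total = 0
    for item, value in ratings.items():
        low, high = COMPONENT_RANGES[item]
        if not low <= value <= high:
            raise ValueError(f"{item} rating {value} outside {low}-{high}")
        total += value
    return total

example = {
    "global_cortical_atrophy": 2,
    "medial_temporal_atrophy_left": 3,
    "medial_temporal_atrophy_right": 2,
    "periventricular_wmh": 1,
    "deep_wmh": 2,
}
print(composite_score(example))  # 10
```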
[Figure 8.8 scatter plot: Scheltens white matter rating total (y-axis, 0–50) plotted against 3MS score (x-axis, 10–110); fitted regression line: Scheltens Rating WM Total = 43.718 – 0.2948 × 3MS score, r = –.33; dotted lines mark the 0.95 confidence interval.]
Figure 8.8 This scatter plot shows that the Scheltens et al. WMH ratings (see
Figure 8.7) inversely relate to cognitive status, based on a modified Mini-Mental State
Exam, using a transformed score ranging from 0–100 (see Tsui et al., 2014). In this same
study, increased WMHs were associated with generalized atrophy and volume loss,
but the WMH ratings showed the more robust correlation with impaired cognitive
functioning.
Image adapted from Tsui et al. (2014), p. 154.
Dotted lines represent the 0.95 confidence interval, with the overall white matter Scheltens rating score negatively
correlated with modified Mini-Mental State (3MS) test performance (r = –.33, p = 0.001).
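Read as a simple linear equation, the fitted line in the figure can be evaluated at a few 3MS scores to show the inverse relationship numerically. The sketch below takes the intercept and slope directly from the figure annotation; the 3MS scores chosen are arbitrary.

```python
def predicted_wm_total(three_ms: float) -> float:
    """Scheltens WM rating total predicted from a 3MS score (line as annotated in Figure 8.8)."""
    return 43.718 - 0.2948 * three_ms

for score in (40, 70, 100):  # arbitrary 3MS scores for illustration
    print(score, round(predicted_wm_total(score), 1))
# Higher (better) 3MS scores yield lower predicted WMH rating totals.
```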
Figure 8.9 These axial images on the top row are from a FLAIR sequence and depict
various levels of white matter hyperintense (WMH) foci compared to their CT
counterpart below. As in Figures 8.3 and 8.4, the CT below is cut at the same level. Note
the superiority of the FLAIR sequence over CT imaging for detecting white matter
abnormalities. Since CT is based on an X-ray beam-density function, a darker image
within what would be expected to be white matter parenchyma reflects less density and
therefore less healthy white matter. Arrows in A and D point to the deep white matter
of the left cerebral hemisphere just above the lateral ventricle. In the MRI, the arrow
points to a punctate region of hyperintense signal; note the darker corresponding image
on CT (arrow). Using the Fazekas et al. four-point scale (see Figure 8.5) where 0 reflects
no identifiable WMHs, and 3, the highest rating, the image on the left was rated by
Wattjes et al. (2009) as a “1.” In B and E, the WMHs are beginning to become confluent
(arrowhead in B and E), garnering a “2” rating, with the images to the right (C and F)
showing confluent WMHs and a “3” rating.
Illustration used with permission from Wattjes et al. (2009) and the Radiological Society of North
America, p. 180.
Corpus Callosum
The corpus callosum (CC) is the largest of the brain commissures, easy to identify
in the mid-sagittal MRI plane, with an equally easy-to-identify normal appearance
and set of landmarks (Georgy, Hesselink, & Jernigan, 1993; see Figure 8.10). Integrity
of the CC represents a window into global white matter status, and the CC
is particularly vulnerable to changes in morphology and MR signal intensity in
disorders like MS and TBI (Bodini et al., 2013; Mesaros et al., 2009). The CC typically
thins with aging as well, but excessive thinning occurs with degenerative
disorders like Alzheimer's disease, probably reflecting not just white matter
pathology but also neuronal loss and secondary axonal degeneration, hence the loss
of white matter (see Zhu, Gao, Wang, Shi, & Lin, 2012; Zhu et al., 2014). Meguro
et al. (2003) showed that CC atrophy based on visual ratings in a mixed typical-
aging and dementia group, including those with probable Alzheimer’s disease,
was associated with greater levels of executive deficits on neuropsychological
testing.
Abnormal corpus callosum morphology occurs with higher frequency in
developmental disorders, as well as in very low birthweight infants assessed later
in life (Skranes et al., 2007), where the rating system by Spencer et al. (2005),
shown in Figure 8.10, related reasonably well to cognitive impair-
ment. Specifically, increased qualitative scores reflecting the presence of an identifi-
able abnormality were associated with lower scores on intellectual assessment
(Skranes et al., 2007). The CC may also be smaller in schizophrenia,
whether assessed qualitatively or quantitatively (see Meisenzahl et al., 1999; Tibbo,
Nopoulos, Arndt, & Andreasen, 1998), with smaller size associated with more negative symp-
toms. Anomalies in CC morphology have been associated with other neuropsy-
chiatric disorders as well (Georgy et al., 1993). Additionally, in neuropsychiatric
Figure 8.10 Corpus callosum (CC) clinical ratings according to Spencer et al. (2005)
in cases of developmental disorder. See also Figure 8.11, which shows three cases with
differing levels of CC thinning and how the Spencer et al. ratings could be performed.
Used with permission from the American Association of Neuroradiology and Williams and Wilkens,
p. 2694.
Figure 8.11 Midsagittal views of the corpus callosum depict the normal appearance of
the corpus callosum in the upper left and three levels of CC atrophy, using the Spencer
et al. ratings from Figure 8.10, in three cases of CC atrophy associated with traumatic
brain injury (TBI). Reduced CC size often is associated with reduced processing speed
on neuropsychological measures.
Figure 8.12 Third (III) ventricle ratings as observed in TBI. Note the illustration in
the lower right corner that depicts where III ventricle width may be determined. The
other measures involve linear distance and the width of the anterior horns of the
lateral ventricle [from Reider-Groswasser et al. (1997) and the American Association
for Neuroradiology, p. 149]. Note how thin the midline appearance is for the control
subject (upper left). As TBI severity increases, accompanied by generalized cerebral
atrophy and reduced brain volume, III ventricle width increases.
Ratings of III ventricle width, as shown in Figure 8.12, may be used to identify III ventricular dilatation and relate that to cogni-
tive and behavioral outcome from acquired brain injury (see also Groswasser
et al., 2002; Reider-Groswasser et al., 1997; Reider et al., 2002). Because of its
midline location, pathology that results in parenchymal volume loss anywhere in
the brain will typically result in III ventricle expansion.
Clinically, III ventricle width has also been examined in schizophrenia and
age-mediated cognitive disorders (Cullberg & Nyback, 1992; Erkinjuntti et al.,
1993; Johansson et al., 2012; Sandyk, 1993; Slansky et al., 1995; Soininen et al.,
1993), even using coarse measures generated by ultrasonography (Wollenweber
et al., 2011). Regardless of the imaging technique used, expansion of the III ven-
tricle is a marker of brain pathology where greater expansion typically relates to
worse neuropsychological outcome.
Hemosiderin Deposition Rating
Regardless of etiology, microhemorrhages, also referred to as “microbleeds,”
reflect vascular pathology of some sort that, as it increases, potentially relates to
neuropsychological outcome. Figure 8.13 shows susceptibility weighted imag-
ing (SWI)–identified regions of hemosiderin deposition. Hemosiderin is an iron-
containing blood-breakdown product of hemorrhage and is therefore readily detected
by iron-sensitive MR techniques such as SWI. Ischemic vascular disease is associated with
not only microhemorrhages but also WMHs. Clinically, microbleeds are typically
just counted, where in vascular disease microbleeds may be positively related to
WMHs (Charidimou, Jager, & Werring, 2012; Goos et al., 2010; Noh et al., 2014),
but this may not be the case in TBI (Riedy et al., 2015). In TBI, an increasing
number of microbleeds has been related to worse neuropsychological outcome
(Scheid & von Cramon, 2010), including pediatric TBI (Beauchamp et al., 2013;
Bigler et al., 2013). Microbleeds tend to occur more frequently within the frontal and
temporal regions of the brain, as shown in Figure 8.14.
than 65 (Finney, Minagar, & Heilman, 2016; Schmand, Eikelenboom, van Gool, &
Alzheimer’s Disease Neuroimaging, 2011; Sitek, Barczak, & Harciarek, 2015).
Another area in neuropsychology where neuroimaging has been particu-
larly influential is in the field of epilepsy assessment. Prior to neuroimaging,
neuropsychological assessment techniques were influential in the diagnostic
decision process of determining lateralization, localization, and lesion detec-
tion, as reviewed by Loring (1991). Neuropsychological assessment remains an
important part of the overall assessment of the patient with epilepsy (Hoppe &
Helmstaedter, 2010), including pre-surgical assessment evaluations (Jones-
Gotman et al., 2010; Loring, 2010). Just as with the neurodegenerative disorders,
the neuropsychological examination in the patient with epilepsy remains the
standard for assessing neurocognitive and neuroemotional functioning, but its
role in lateralization of cognitive functioning and specification as to localization
of function has changed with improved neuroimaging and electrophysiologi-
cal methods (Baxendale & Thompson, 2010). With advances in functional MRI
(fMRI), techniques are being standardized as neurocognitive probes to assess
language, memory, and other cognitive functions in the patient with epilepsy,
which have altered and will continue to influence the traditional role of neuro-
psychology in assessing that disorder (Sidhu et al., 2015).
CONCLUSIONS
We are now well into the twenty-first century, with digitally sophisticated neu-
roimaging routinely being performed in most neurological and neuropsychiat-
ric disorders. It is time for clinical neuropsychological assessment to begin to
routinely utilize information from neuroimaging findings as part of the evalua-
tion metric. This could begin with utilizing clinical rating methods, as outlined
in this chapter. The incorporation of neuroimaging needs to be done with the
highest degree of normative standards that have benefitted clinical neuropsy-
chology as a discipline because, as Weinberger and Radulescu (2016) state, it is
a disservice to all aspects of clinical neuroscience to be uncritically accepting
of neuroimaging findings purportedly related as meaningful neurobehavioral
correlates.
REFERENCES
Baxendale, S., & Thompson, P. (2010). Beyond localization: The role of traditional neu-
ropsychological tests in an age of imaging. Epilepsia, 51(11), 2225–2230. doi:10.1111/
j.1528-1167.2010.02710.x
Beauchamp, M. H., Beare, R., Ditchfield, M., Coleman, L., Babl, F. E., Kean, M., …
Anderson, V. (2013). Susceptibility weighted imaging and its relationship to out-
come after pediatric traumatic brain injury. Cortex; A Journal Devoted to the Study
of the Nervous System and Behavior, 49(2), 591–598. doi:10.1016/j.cortex.2012.08.015
Bigler, E. D. (2015). Structural image analysis of the brain in neuropsychology using
magnetic resonance imaging (MRI) techniques. Neuropsychology Review, 25(3),
224–249. doi:10.1007/s11065-015-9290-0
Bigler, E. D., Abildskov, T. J., Petrie, J., Farrer, T. J., Dennis, M., Simic, N., … Owen
Yeates, K. (2013). Heterogeneity of brain lesions in pediatric traumatic brain injury.
Neuropsychology, 27(4), 438–451. doi:10.1037/a0032837
Bigler, E. D., Kerr, B., Victoroff, J., Tate, D. F., & Breitner, J. C. (2002). White mat-
ter lesions, quantitative magnetic resonance imaging, and dementia. Alzheimer
Disease and Associated Disorders, 16(3), 161–170.
Bigler, E. D., & Maxwell, W. L. (2011). Neuroimaging and neuropathology of TBI.
NeuroRehabilitation, 28(2), 63–74. doi:10.3233/NRE-2011-0633
Bigler, E. D., Yeo, R. A., & Turkheimer, E. (1989). Neuropsychological Function and
Brain Imaging. New York: Plenum Press.
Bilder, R. M. (2011). Neuropsychology 3.0: Evidence-based science and practice. Journal
of the International Neuropsychological Society: JINS, 17(1), 7– 13. doi:10.1017/
S1355617710001396
Bodini, B., Cercignani, M., Khaleeli, Z., Miller, D. H., Ron, M., Penny, S., … Ciccarelli,
O. (2013). Corpus callosum damage predicts disability progression and cognitive
dysfunction in primary-progressive MS after five years. Human Brain Mapping,
34(5), 1163–1172. doi:10.1002/hbm.21499
Boone, K. B., Miller, B. L., Lesser, I. M., Mehringer, C. M., Hill-Gutierrez, E., Goldberg,
M. A., & Berman, N. G. (1992). Neuropsychological correlates of white-matter
lesions in healthy elderly subjects. A threshold effect. Archives of Neurology, 49(5),
549–554.
Chanraud, S., Zahr, N., Sullivan, E. V., & Pfefferbaum, A. (2010). MR diffusion
tensor imaging: A window into white matter integrity of the working brain.
Neuropsychology Review, 20(2), 209–225. doi:10.1007/s11065-010-9129-7
Charidimou, A., Jager, H. R., & Werring, D. J. (2012). Cerebral microbleed detection
and mapping: Principles, methodological aspects and rationale in vascular demen-
tia. Experimental Gerontology, 47(11), 843–852. doi:10.1016/j.exger.2012.06.008
Chen, B., Xu, T., Zhou, C., Wang, L., Yang, N., Wang, Z., … Weng, X. C. (2015).
Individual variability and test-retest reliability revealed by ten repeated resting-
state brain scans over one month. PLoS One, 10(12), e0144963. doi:10.1371/journal.
pone.0144963
Chen, W., Song, X., Zhang, Y., Darvesh, S., Zhang, N., D’Arcy, R. C., … Rockwood, K.
(2010). An MRI-based semiquantitative index for the evaluation of brain atrophy
and lesions in Alzheimer’s disease, mild cognitive impairment and normal aging.
Dementia and Geriatric Cognitive Disorders, 30(2), 121–130. doi:10.1159/000319537
Cullberg, J., & Nyback, H. (1992). Persistent auditory hallucinations correlate with the
size of the third ventricle in schizophrenic patients. Acta Psychiatrica Scandinavica,
86(6), 469–472.
DeCarli, C., Frisoni, G. B., Clark, C. M., Harvey, D., Grundman, M., Petersen, R. C., …
Scheltens, P. (2007). Qualitative estimates of medial temporal atrophy as a predictor
of progression from mild cognitive impairment to dementia. Archives of Neurology,
64(1), 108–115. doi:10.1001/archneur.64.1.108
Dennis, M., Spiegler, B. J., Simic, N., Sinopoli, K. J., Wilkinson, A., Yeates, K. O., …
Fletcher, J. M. (2014). Functional plasticity in childhood brain disorders: When,
what, how, and whom to assess. Neuropsychology Review, 24(4), 389–408.
doi:10.1007/s11065-014-9261-x
Erkinjuntti, T., Lee, D. H., Gao, F., Steenhuis, R., Eliasziw, M., Fry, R., … Hachinski,
V. C. (1993). Temporal lobe atrophy on magnetic resonance imaging in the diagno-
sis of early Alzheimer’s disease. Archives of Neurology, 50(3), 305–310.
Fazekas, F., Chawluk, J. B., Alavi, A., Hurtig, H. I., & Zimmerman, R. A. (1987).
MR signal abnormalities at 1.5 T in Alzheimer’s dementia and normal aging.
AJR: American Journal of Roentgenology, 149(2), 351–356. doi:10.2214/ajr.149.2.351
Filley, C. M., & Bigler, E. D. (2017). Neuroanatomy for the neuropsychologist. In
J. E. Morgan & J. H. Ricker (Eds.), Textbook of Clinical Neuropsychology (2nd
ed.). New York: Taylor & Francis.
Finney, G. R., Minagar, A., & Heilman, K. M. (2016). Assessment of mental status.
Neurologic Clinics, 34(1), 1–16. doi:10.1016/j.ncl.2015.08.001
Gabrieli, J. D., Ghosh, S. S., & Whitfield-Gabrieli, S. (2015). Prediction as a humanitar-
ian and pragmatic contribution from human cognitive neuroscience. Neuron, 85(1),
11–26. doi:10.1016/j.neuron.2014.10.047
Gale, S. D., Johnson, S. C., Bigler, E. D., & Blatter, D. D. (1995). Nonspecific white
matter degeneration following traumatic brain injury. Journal of the International
Neuropsychological Society: JINS, 1(1), 17–28.
Georgy, B. A., Hesselink, J. R., & Jernigan, T. L. (1993). MR imaging of the corpus
callosum. AJR: American Journal of Roentgenology, 160(5), 949–955. doi:10.2214/
ajr.160.5.8470609
Goos, J. D., Henneman, W. J., Sluimer, J. D., Vrenken, H., Sluimer, I. C., Barkhof, F., …
van der Flier, W. M. (2010). Incidence of cerebral microbleeds: A longitudinal
study in a memory clinic population. Neurology, 74(24), 1954–1960. doi:10.1212/
WNL.0b013e3181e396ea
Groswasser, Z., Reider, G., II, Schwab, K., Ommaya, A. K., Pridgen, A., Brown,
H. R., … Salazar, A. M. (2002). Quantitative imaging in late TBI. Part II: Cognition
and work after closed and penetrating head injury: A report of the Vietnam head
injury study. Brain Injury, 16(8), 681–690. doi:10.1080/02699050110119835
Guo, H., Song, X., Schmidt, M. H., Vandorpe, R., Yang, Z., LeBlanc, E., … Rockwood,
K. (2014). Evaluation of whole brain health in aging and Alzheimer’s disease: A stan-
dard procedure for scoring an MRI-based brain atrophy and lesion index. Journal of
Alzheimer’s Disease: JAD, 42(2), 691–703. doi:10.3233/JAD-140333
Hannay, H. J., Bieliauskas, L. A., Crosson, B., Hammeke, T., Hamsher, K. D., &
Koffler, S. (1998). Proceedings of the Houston Conference on Specialty Education
and Training in Clinical Neuropsychology. Archives of Clinical Neuropsychology,
13(Special Issue).
Heo, J. H., Lee, S. T., Kon, C., Park, H. J., Shim, J. Y., & Kim, M. (2009). White mat-
ter hyperintensities and cognitive dysfunction in Alzheimer disease. Journal of
Geriatric Psychiatry and Neurology, 22(3), 207–212. doi:10.1177/0891988709335800
Hoppe, C., & Helmstaedter, C. (2010). Sensitive and specific neuropsychological assess-
ments of the behavioral effects of epilepsy and its treatment are essential. Epilepsia,
51(11), 2365–2366.
Huettel, S. A., Song, A. W., & McCarthy, G. (2014). Functional Magnetic Resonance
Imaging (3rd ed.). Sunderland, MA: Sinauer Associates.
Jang, J. W., Park, S. Y., Park, Y. H., Baek, M. J., Lim, J. S., Youn, Y. C., & Kim,
S. (2015). A comprehensive visual rating scale of brain magnetic resonance imag-
ing: Application in elderly subjects with Alzheimer’s disease, mild cognitive impair-
ment, and normal cognition. Journal of Alzheimer’s Disease: JAD, 44(3), 1023–1034.
doi:10.3233/JAD-142088
Johansson, L., Skoog, I., Gustafson, D. R., Olesen, P. J., Waern, M., Bengtsson, C., …
Guo, X. (2012). Midlife psychological distress associated with late-life brain atro-
phy and white matter lesions: A 32-year population study of women. Psychosomatic
Medicine, 74(2), 120–125. doi:10.1097/PSY.0b013e318246eb10
Jones-Gotman, M., Smith, M. L., Risse, G. L., Westerveld, M., Swanson, S. J.,
Giovagnoli, A. R., … Piazzini, A. (2010). The contribution of neuropsychology to
diagnostic assessment in epilepsy. Epilepsy and Behavior, 18(1–2), 3–12. doi:10.1016/
j.yebeh.2010.02.019
Kennedy, D. P., & Adolphs, R. (2012). The social brain in psychiatric and neuro-
logical disorders. Trends in Cognitive Sciences, 16(11), 559– 572. doi:10.1016/
j.tics.2012.09.006
Kertesz, A. (1983). Localization in Neuropsychology. San Diego: Academic Press.
Koedam, E. L., Lehmann, M., van der Flier, W. M., Scheltens, P., Pijnenburg, Y. A.,
Fox, N., … Wattjes, M. P. (2011). Visual assessment of posterior atrophy develop-
ment of a MRI rating scale. European Radiology, 21(12), 2618–2625. doi:10.1007/
s00330-011-2205-4
Kozel, F. A., & Trivedi, M. H. (2008). Developing a neuropsychiatric functional brain
imaging test. Neurocase, 14(1), 54–58. doi:10.1080/13554790701881731
Lawrie, S. M., Abukmeil, S. S., Chiswick, A., Egan, V., Santosh, C. G., & Best, J. J.
(1997). Qualitative cerebral morphology in schizophrenia: A magnetic resonance
imaging study and systematic literature review. Schizophrenia Research, 25(2), 155–
166. doi:10.1016/S0920-9964(97)00019-4
Leaper, S. A., Murray, A. D., Lemmon, H. A., Staff, R. T., Deary, I. J., Crawford, J. R.,
& Whalley, L. J. (2001). Neuropsychologic correlates of brain white matter lesions
depicted on MR images: 1921 Aberdeen Birth Cohort. Radiology, 221(1), 51–55.
doi:10.1148/radiol.2211010086
Lesser, I. M., Boone, K. B., Mehringer, C. M., Wohl, M. A., Miller, B. L., & Berman,
N. G. (1996). Cognition and white matter hyperintensities in older depressed
patients. The American Journal of Psychiatry, 153(10), 1280–1287.
Lezak, M. D. (1976). Neuropsychological Assessment. New York: Oxford University Press.
Lindeboom, J., & Weinstein, H. (2004). Neuropsychology of cognitive ageing,
minimal cognitive impairment, Alzheimer’s disease, and vascular cognitive
impairment. European Journal of Pharmacology, 490(1–3), 83–86. doi:10.1016/
j.ejphar.2004.02.046
Loring, D. W. (1991). A counterpoint to Reitan’s note on the history of clinical neuro-
psychology. Archives of Clinical Neuropsychology, 6(3), 167–171.
Loring, D. W. (2010). History of neuropsychology through epilepsy eyes. Archives of
Clinical Neuropsychology, 25(4), 259–273. doi:10.1093/arclin/acq024
Meguro, K., Constans, J. M., Shimada, M., Yamaguchi, S., Ishizaki, J., Ishii, H., …
Sekita, Y. (2003). Corpus callosum atrophy, white matter lesions, and frontal execu-
tive dysfunction in normal aging and Alzheimer’s disease. A community-based
study: The Tajiri Project. International Psychogeriatrics/IPA, 15(1), 9–25.
Meisenzahl, E. M., Frodl, T., Greiner, J., Leinsinger, G., Maag, K. P., Heiss, D., …
Moller, H. J. (1999). Corpus callosum size in schizophrenia—a magnetic reso-
nance imaging analysis. European Archives of Psychiatry and Clinical Neuroscience,
249(6), 305–312.
Menendez-Gonzalez, M., Lopez-Muniz, A., Vega, J. A., Salas-Pacheco, J. M., & Arias-
Carrion, O. (2014). MTA index: A simple 2D-method for assessing atrophy of the
Scheltens, P., Barkhof, F., Valk, J., Algra, P. R., van der Hoop, R. G., Nauta, J., &
Wolters, E. C. (1992). White matter lesions on magnetic resonance imaging in clini-
cally diagnosed Alzheimer’s disease. Evidence for heterogeneity. Brain: A Journal of
Neurology, 115(Pt 3), 735–748.
Scheltens, P., Launer, L. J., Barkhof, F., Weinstein, H. C., & van Gool, W. A. (1995).
Visual assessment of medial temporal lobe atrophy on magnetic resonance imag-
ing: Interobserver reliability. Journal of Neurology, 242(9), 557–560.
Scheltens, P., Leys, D., Barkhof, F., Huglo, D., Weinstein, H. C., Vermersch, P., …
Valk, J. (1992). Atrophy of medial temporal lobes on MRI in “probable” Alzheimer’s
disease and normal ageing: Diagnostic value and neuropsychological correlates.
Journal of Neurology, Neurosurgery, and Psychiatry, 55(10), 967–972.
Scheltens, P., Pasquier, F., Weerts, J. G., Barkhof, F., & Leys, D. (1997). Qualitative
assessment of cerebral atrophy on MRI: Inter-and intra-observer reproducibility in
dementia and normal aging. European Neurology, 37(2), 95–99.
Schmand, B., Eikelenboom, P., van Gool, W. A., & Alzheimer’s Disease
Neuroimaging, I. (2011). Value of neuropsychological tests, neuroimaging,
and biomarkers for diagnosing Alzheimer’s disease in younger and older age
cohorts. Journal of the American Geriatrics Society, 59(9), 1705–1710. doi:10.1111/
j.1532-5415.2011.03539.x
Schmand, B., Eikelenboom, P., van Gool, W. A., & Alzheimer’s Disease Neuroimaging,
I. (2012). Value of diagnostic tests to predict conversion to Alzheimer’s disease
in young and old patients with amnestic mild cognitive impairment. Journal of
Alzheimer’s Disease: JAD, 29(3), 641–648. doi:10.3233/JAD-2012-111703
Schoenberg, M. R., Marsh, P. J., & Lerner, A. J. (2011). Neuroanatomy primer: Structure
and function of the human nervous system. In M. R. Schoenberg & J. G. Scott
(Eds.), The Little Black Book of Neuropsychology: A Syndrome-Based Approach.
New York: Springer Science+Business Media LLC.
Sidhu, M. K., Stretton, J., Winston, G. P., Symms, M., Thompson, P. J., Koepp, M. J.,
& Duncan, J. S. (2015). Memory fMRI predicts verbal memory decline after
anterior temporal lobe resection. Neurology, 84(15), 1512– 1519. doi:10.1212/
WNL.0000000000001461
Sitek, E. J., Barczak, A., & Harciarek, M. (2015). Neuropsychological assessment
and differential diagnosis in young-onset dementias. Psychiatric Clinics of North
America, 38(2), 265–279. doi:10.1016/j.psc.2015.01.003
Skranes, J., Vangberg, T. R., Kulseng, S., Indredavik, M. S., Evensen, K. A., Martinussen,
M., … Brubakk, A. M. (2007). Clinical findings and white matter abnormali-
ties seen on diffusion tensor imaging in adolescents with very low birth weight.
Brain: A Journal of Neurology, 130(Pt 3), 654–666. doi:10.1093/brain/awm001
Slansky, I., Herholz, K., Pietrzyk, U., Kessler, J., Grond, M., Mielke, R., & Heiss, W. D.
(1995). Cognitive impairment in Alzheimer’s disease correlates with ventricular
width and atrophy-corrected cortical glucose metabolism. Neuroradiology, 37(4),
270–277.
Soininen, H., Reinikainen, K. J., Puranen, M., Helkala, E. L., Paljarvi, L., & Riekkinen,
P. J. (1993). Wide third ventricle correlates with low choline acetyltransferase
activity of the neocortex in Alzheimer patients. Alzheimer Disease and Associated
Disorders, 7(1), 39–47.
Song, X., Mitnitski, A., Zhang, N., Chen, W., & Rockwood, K. (2013). Dynamics of
brain structure and cognitive function in the Alzheimer’s disease neuroimaging
Weinberger, D. R., & Radulescu, E. (2016). Finding the elusive psychiatric “lesion” with
21st-century neuroanatomy: A note of caution. The American Journal of Psychiatry,
173(1), 27–33. doi:10.1176/appi.ajp.2015.15060753
Wilde, E. A., Bouix, S., Tate, D. F., Lin, A. P., Newsome, M. R., Taylor, B. A., … York,
G. (2015). Advanced neuroimaging applied to veterans and service personnel with
traumatic brain injury: State of the art and potential benefits. Brain Imaging and
Behavior, 9(3), 367–402. doi:10.1007/s11682-015-9444-y
Wilde, E. A., Hunter, J. V., & Bigler, E. D. (2012). A primer of neuroimaging analy-
sis in neurorehabilitation outcome research. NeuroRehabilitation, 31(3), 227–242.
doi:10.3233/NRE-2012-0793
Wollenweber, F. A., Schomburg, R., Probst, M., Schneider, V., Hiry, T., Ochsenfeld, A.,
… Behnke, S. (2011). Width of the third ventricle assessed by transcranial sonog-
raphy can monitor brain atrophy in a time- and cost-effective manner: results
from a longitudinal study on 500 subjects. Psychiatry Research, 191(3), 212–216.
doi:10.1016/j.pscychresns.2010.09.010
Zhu, M., Gao, W., Wang, X., Shi, C., & Lin, Z. (2012). Progression of corpus callosum
atrophy in early stage of Alzheimer’s disease: MRI based study. Academic Radiology,
19(5), 512–517. doi:10.1016/j.acra.2012.01.006
Zhu, M., Wang, X., Gao, W., Shi, C., Ge, H., Shen, H., & Lin, Z. (2014). Corpus callo-
sum atrophy and cognitive decline in early Alzheimer’s disease: Longitudinal MRI
study. Dementia and Geriatric Cognitive Disorders, 37(3–4), 214–222. doi:10.1159/
000350410
MIKE R. SCHOENBERG, KATIE E. OSBORN, AND JASON R. SOBLE
omitted from a majority of published trials (Altman, 2013). Even more concern-
ing, data regarding harmful adverse events frequently have been omitted or mis-
represented in published reports (Ioannidis, 2009). Accordingly, the biomedical
sciences have widely adopted reporting guidelines in order to facilitate higher
quality and transparency among published research manuscripts (Des Jarlais,
Lyles, Crepaz, & Trend-Group, 2014; Moher, Schulz, Altman, & CONSORT-
Group, 2001; von Elm et al., 2007).
Reporting guidelines specify aspects of research that should be included in
published research reports. The particular details of research that need to be
reported vary depending on the type of study, but common aspects include
specifying the hypothesis, methods, primary and secondary outcome variables,
statistical procedures, results, and discussion. Various checklists have been
developed to help authors and editors assure adherence to publication guide-
lines such that the key elements of a particular study are communicated (Chan,
Heinemann, & Roberts, 2014). Comprehensive information about many widely
accepted reporting guidelines is available through the EQUATOR (Enhancing
the Quality and Transparency of Health Research) Network website at http://
www.equator-network.org/reporting-guidelines/.
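To illustrate how a checklist can be applied mechanically during manuscript preparation, the sketch below stores a handful of reporting items as data and lists those a draft has not yet addressed. The item wording is an abbreviated paraphrase and the tracking approach is a hypothetical convenience, not part of any official guideline tooling.

```python
# A hedged sketch: a reporting checklist as data, with a simple completeness check.
# Item texts are abbreviated paraphrases for illustration, not official wording.

CHECKLIST = {
    "hypothesis": "specific objectives or hypotheses stated",
    "participants": "eligibility criteria and recruitment settings described",
    "outcomes": "primary and secondary outcome measures defined",
    "statistics": "statistical methods, including handling of missing data, described",
    "limitations": "sources of potential bias and imprecision discussed",
}

def missing_items(addressed: set) -> list:
    """Return checklist items that the manuscript has not yet addressed."""
    return [text for key, text in CHECKLIST.items() if key not in addressed]

draft_addresses = {"hypothesis", "participants", "outcomes"}
for item in missing_items(draft_addresses):
    print("Missing:", item)
```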
In this chapter, several reporting guidelines relevant to the field of clini-
cal neuropsychology will be described, including the Consolidated Standards
of Reporting Trials (CONSORT), the Standards for Reporting of Diagnostic
Accuracy (STARD, Bossuyt et al., 2003), the Strengthening the Reporting of
Observational Studies in Epidemiology (STROBE), the Preferred Reporting
Items for Systematic Reviews and Meta-Analyses (PRISMA), and the Patient
Reported Outcome Measurement Information System (PROMIS). Following an
overview of the guidelines, the benefits of using the guidelines to identify and
rank quality evidence will be highlighted. These guidelines include the mini-
mum criteria needed for studies to be included in systematic reviews or meta-
analyses. Furthermore, evidence-based practice requires the published study
to include the elements identified by a publication guideline to allow clinicians
to judge for themselves how the implications may be applicable to a particular
patient. This chapter will dovetail with Chapters 11 and 12, where techniques
for the rapid interpretation and application of quality clinical evidence will
be described and will include one or more examples of critical appraisal of sys-
tematic reviews.
Guideline: CONSORT. Purview: randomized clinical trials. Methodological overview: 25-item checklist and flow diagram displaying participants' progression through the trial. Online resources: www.consort-statement.org; www.equator-network.org.
Guideline: STROBE. Purview: observational studies in epidemiology. Methodological overview: six available checklists, varying based on reporting context and study type (i.e., cohort vs. case-control vs. cross-sectional). Online resources: www.strobe-statement.org; www.equator-network.org.
Guideline: STARD. Purview: diagnostic accuracy studies. Methodological overview: 25-item checklist and flow diagram displaying study design and patient flow. Online resources: www.stard-statement.org; www.equator-network.org.
Guideline: PRISMA. Purview: systematic reviews and meta-analyses. Methodological overview: 27-item checklist and a four-phase flow diagram. Online resources: www.prisma-statement.org; www.equator-network.org.
Guideline: PROMIS. Purview: instrument development and validation. Methodological overview: list of nine overriding standards, each with corresponding checklists. Online resources: www.nihpromis.org.
REFERENCES
Alfonso, F., Bermejo, J., & Segovia, J. (2004). [New recommendations of the
International Committee of Medical Journal Editors. Shifting focus: from unifor-
mity in technical requirements to bioethical considerations]. Revista Española de
Cardiología, 57(6), 592–593.
Altman, D. G. (2002). Poor-quality medical research: What can journals do? Journal of
the American Medical Association, 287, 2765–2767.
Altman, D. G. (2013). Transparent reporting of trials is essential. The American Journal
of Gastroenterology, 108, 1231–1235. doi:10.1038/ajg.2012.457
Altman, D. G., & Dore, C. J. (1990). Randomisation and baseline comparisons in clini-
cal trials. Lancet, 335, 149–153.
Altman, D. G., & Simera, I. (2010). Responsible reporting of health research stud-
ies: Transparent, complete, accurate and timely. Journal of Antimicrobial
Chemotherapy, 65(1), 1–3. doi:10.1093/jac/dkp410
Alvarez, F., Meyer, N., Gourraud, P. A., & Paul, C. (2009). CONSORT adoption and
quality of reporting of randomized controlled trials: A systematic analysis in
two dermatology journals. British Journal of Dermatology, 161(5), 1159–1165.
doi:10.1111/j.1365-2133.2009.09382.x
APA Publications and Communications Board Working Group on Journal Article
Reporting Standards. (2008). Reporting standards for research in psychology: Why
do we need them? What might they be? The American Psychologist, 63(9),
839–851. doi:10.1037/0003-066x.63.9.839
Beller, E. M., Glasziou, P. P., Altman, D. G., Hopewell, S., Bastian, H., Chalmers, I.,
Gotzsche, P. C., Lasserson, T., Tovey, D., for the PRISMA for Abstracts Group.
(2013). PRISMA for abstracts: Reporting systematic reviews in journal and confer-
ence abstracts. PLOS Medicine, 10(4), e1001419. doi:10.1371/journal.pmed.1001419
Benitez, A., Hassenstab, J., & Bangen, K. J. (2014). Neuroimaging training among
neuropsychologists: A survey of the state of current training and recommen-
dations for trainees. Clinical Neuropsychology, 28(4), 600– 613. doi:10.1080/
13854046.2013.854836
Bilder, R. M. (2011). Neuropsychology 3.0: Evidence- based science and practice.
Journal of the International Neuropsychological Society, 17, 383–384.
Bossuyt, P. M., Cohen, J. F., Gatsonis, C. A., & Korevaar, D. A. (2016). STARD
2015: Updated reporting guidelines for all diagnostic accuracy studies. Annals of
Translational Medicine, 4(4), 85. doi:10.3978/j.issn.2305-5839.2016.02.06
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M.,
… de Vet, H. C. (2003). Towards complete and accurate reporting of studies of diag-
nostic accuracy: The STARD initiative. British Medical Journal, 326(7379), 41–44.
Bradford-Hill, A. (1965). Reasons for writing. British Medical Journal, 2, 870.
Chan, L., Heinemann, A. W., & Roberts, J. (2014). Elevating the quality of disability
and rehabilitation research: Mandatory use of the reporting guidelines. Canadian
Journal of Occupational Therapy, 81(2), 72–77. doi:10.1177/0008417414533077
Chelune, G. L. (2010). Evidence-based research and practice in clinical neuropsychol-
ogy. The Clinical Neuropsychologist, 24, 454–467.
Des Jarlais, D. C., Lyles, C., Crepaz, N., & Trend-Group. (2014). Improving the report-
ing quality of nonrandomized evaluations of behavioral and public health interven-
tions: The TREND statement. American Journal of Public Health, 94, 361–366.
Fuller, T., Pearson, M., Peters, J., & Anderson, R. (2015). What affects authors’ and edi-
tors’ use of reporting guidelines? Findings from an online survey and qualitative
interviews. PLoS One, 10(4), e0121585. doi:10.1371/journal.pone.0121585
Grant, S. P., Mayo-Wilson, E., Melendez-Torres, G. J., & Montgomery, P. (2013).
Reporting quality of social and psychological intervention trials: A systematic
review of reporting guidelines and trial publications. PLoS One, 8(5), e65442.
doi:10.1371/journal.pone.0065442
Hopewell, S., Dutton, S., Yu, L. M., Chan, A. W., & Altman, D. G. (2010). The quality
of reports of randomised trials in 2000 and 2006: Comparative study of articles
indexed in PubMed. British Medical Journal Open, 340, c723.
International Committee of Medical Journal Editors [homepage on the Internet].
Recommendations for the Conduct, Reporting, Editing and Publication of
Scholarly Work in Medical Journals [accessed 30Aug2016] Available from: http://
www.ICMJE.org.
Ioannidis, J. P. (2005). Why most published research findings are false. PLOS Medicine,
2(8), e124. doi:10.1371/journal.pmed.0020124
Ioannidis, J. P. (2009). Adverse events in randomized trials: Neglected, restricted, dis-
torted, and silenced. Archives of Internal Medicine, 169, 1737–1739.
Knottnerus, A., & Tugwell, P. (2003). The standards for reporting of diagnostic accu-
racy. Journal of Clinical Epidemiology, 56, 1118–1127.
Knottnerus, A., & Tugwell, P. (2008). STROBE—A checklist to strengthen the report-
ing of observational studies in epidemiology. Journal of Clinical Epidemiology,
61, 323.
Lee, G. P. (2016). New Editor-in-Chief introductory comments. Archives of Clinical
Neuropsychology, 31(3), 195–196. doi:10.1093/arclin/acw008
Loring, D. W., & Bowden, S. C. (2014). The STROBE Statement and neuropsy-
chology: Lighting the way toward evidence- based practice. The Clinical
Neuropsychologist, 28(4), 556–574. doi:10.1080/13854046.2012.762552
Marcopulos, B. A., Caillouet, B. A., Bailey, C. M., Tussey, C., Kent, J. A., & Frederick,
R. (2014). Clinical decision making in response to performance validity test fail-
ure in a psychiatric setting. Clinical Neuropsychology, 28(4), 633–652. doi:10.1080/
13854046.2014.896416
Miller, J. B., Schoenberg, M. R., & Bilder, R. M. (2014). Consolidated Standards Of
Reporting Trials (CONSORT): Considerations for neuropsychological research.
The Clinical Neuropsychologist, 28(4), 575–599. doi:10.1080/13854046.2014.907445
Moher, D., Hopewell, S., Schulz, K. F., Montori, V., Gotzsche, P. C., Devereaux, P. J.,
… Altman, D. G. (2010). CONSORT 2010 explanation and elaboration: Updated
guidelines for reporting parallel group randomised trials. British Medical Journal
Open, 340, c869. doi:10.1136/bmj.c869
Moher, D., Jones, A., & Lepage, L. (2001). Use of the CONSORT statement and quality
of reports of randomized trials: A comparative before-and-after evaluation. Journal
of the American Medical Association, 285(15), 1992–1995.
Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & PRISMA Group. (2009). Preferred
reporting items for systematic reviews and meta-analyses: The PRISMA statement.
PLOS Medicine, 6(7), e1000097. doi:10.1371/journal.pmed.1000097
Moher, D., Schulz, K. F., Altman, D. G., & CONSORT-Group. (2001). The CONSORT
statement: Revised recommendations for improving the quality of reports of
parallel-group randomized trials. Journal of the American Medical Association,
285(15), 1987–1991.
Noel-Storr, A. H., McCleery, J. M., Richard, E., Ritchie, C. W., Flicker, L., Cullum, S. J.,
… McShane, R. (2014). Reporting standards for studies of diagnostic test accuracy
in dementia: The STARDdem Initiative. Neurology, 83(4), 364–373. doi:10.1212/
wnl.0000000000000621
Plake, B. S., & Wise, L. L. (2014). What is the role and importance of the revised AERA,
APA, NCME standards for educational and psychological testing? Educational
Measurement: Issues and Practice, 33(4), 4–12. doi:10.1111/emip.12045
Plint, A. C., Moher, D., Morrison, A., Schulz, K. F., Altman, D. G., Hill, C., & Gaboury,
I. (2006). Does the CONSORT checklist improve the quality of reports of ran-
domised controlled trials? A systemic review. Medical Journal of Australia, 185(5),
263–267.
Pocock, S. J., Hughes, M. D., & Lee, R. J. (1987). Statistical problems in the report-
ing of clinical trials. A survey of three medical journals. New England Journal of
Medicine, 317(7), 426–432. doi:10.1056/nejm198708133170706
Sackett, D. L., Rosenberg, W. M., Gray, J. A., Haynes, R. B., & Richardson, W. S. (1996).
Evidence based medicine: What it is and what it isn’t. British Medical Journal,
312(7023), 71–72.
Schoenberg, M. R. (2014). Introduction to the special issue on improving neuropsycho-
logical research through use of reporting guidelines. The Clinical Neuropsychologist,
28(4), 549–555. doi:10.1080/13854046.2014.934020
Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated
guidelines for reporting parallel group randomised trials. PLOS Medicine, 7,
e1000251. doi:10.1371/journal.pmed.1000251
Schulz, K. F., Chalmers, I., Altman, D. G., Grimes, D. A., & Dore, C. J. (1995). The
methodological quality of randomization as assessed from reports of trials in spe-
cialist and general medical journals. The Online Journal of Current Clinical Trials,
Doc. No. 197 (81 paragraphs).
Scott-Lichter, D., & Editorial Policy Committee of Council of Science Editors. (2012).
CSE’s White Paper on Promoting Integrity in Scientific Journal Publications, 2012
Update (3rd revised ed.). Council of Science Editors, Wheat Ridge, CO.
Soble, J. R., Silva, M. A., Vanderploeg, R. D., Curtiss, G., Belanger, H. G., Donnell,
A. J., & Scott, S. G. (2014). Normative data for the Neurobehavioral Symptom
Inventory (NSI) and post-concussion symptom profiles among TBI, PTSD, and
nonclinical samples. Clinical Neuropsychology, 28(4), 614– 632. doi:10.1080/
13854046.2014.894576
Vandenbroucke, J. P. (2009). STREGA, STROBE, STARD, SQUIRE, MOOSE,
PRISMA, GNOSIS, TREND, ORION, COREQ, QUOROM, REMARK … and
CONSORT: For whom does the guideline toll? Journal of Clinical Epidemiology, 62,
594–596. doi:10.1016/j.jclinepi.2008.12.003
Vandenbroucke, J. P., von Elm, E., Altman, D. G., Gotzsche, P. C., Mulrow, C. D.,
Pocock, S. J., … Egger, M. (2007). Strengthening the Reporting of Observational
Studies in Epidemiology (STROBE): Explanation and elaboration. Epidemiology,
18, 805–835. doi:10.1097/EDE.1090b1013e3181577511
von Elm, E., Altman, D. G., Egger, M., Pocock, S. J., Gotzsche, P. C., Vandenbroucke,
J. P., & STROBE-Initiative. (2007). The Strengthening of Reporting of Observational
Studies in Epidemiology (STROBE) statement: Guidelines for reporting of observa-
tional studies. Lancet, 370, 1453–1457.
Welch, V., Petticrew, M., Tugwell, P., Moher, D., O’Neill, J., Waters, E., White, H., the
PRISMA-Equity Bellagio Group. (2012). PRISMA-Equity 2012 extension: Reporting
guidelines for systematic reviews with a focus on health equity. PLOS Medicine,
9(10), e1001333. doi:10.1371/journal.pmed.1001333
10
MARTIN BUNNAGE
A third category of patients will be those who perform above the 7th percen-
tile on the test and do not demonstrate any problem behavior in their everyday
life that would suggest the presence of a memory disorder. These people are our
“true negatives.” The test says “no memory disorder,” and the persons so identi-
fied appear to have no memory disorder.
Finally, a fourth category of patients will be those who perform above the
7th percentile on the test but nonetheless demonstrate behavior in their every-
day life that would suggest the presence of a memory disorder. These people are
“false negatives.” The test says “no memory disorder” but they appear to have a
memory disorder. The association between the results of a test for the “condition
of interest” and the real-life presence of the condition of interest are represented
in Table 10.1.
While it would be excellent for the practice of clinical neuropsychology to rest
firmly on the basis of tests without any false negative or false positive results, this
is unfortunately not the case, nor is it the case for most diagnostic tests (Straus
et al., 2011). All tests are imperfect to some extent, and as a consequence, classifi-
cation accuracy of every test can be quantified in terms of the four cells shown in
Table 10.1, that is, true positives, false positives, false negatives and true negatives
(Straus et al., 2011).
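As an illustration of how those four cells arise, the sketch below tabulates paired observations (test positive or not, condition present or not) into the cells of Table 10.1; the eight paired values are invented example data.

```python
def confusion_cells(test_positive, condition_present):
    """Tabulate paired test results and criterion status into the four cells of Table 10.1."""
    cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for test, condition in zip(test_positive, condition_present):
        if test and condition:
            cells["TP"] += 1          # cell A: true positive
        elif test and not condition:
            cells["FP"] += 1          # cell B: false positive
        elif not test and condition:
            cells["FN"] += 1          # cell C: false negative
        else:
            cells["TN"] += 1          # cell D: true negative
    return cells

# Invented example: "test positive" = memory score below the 7th percentile
test = [True, True, False, False, True, False, False, True]
cond = [True, False, True, False, True, False, False, True]
print(confusion_cells(test, cond))   # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```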
To continue the MCI example, sensitivity reflects how many people with a
memory disorder in their everyday life have a positive test result. In this case, that is the
number of people with a memory disorder in everyday life who have a memory
test score below the 7th percentile. Usually there is an imperfect relationship
between these two sources of classification. Consequently, whilst, with a good
test, most of the people with a memory disorder in everyday life will score below
the 7th percentile on this memory test (i.e., the test is sensitive to the presence of
a memory disorder), there will be some people who have a memory disorder in
everyday life but who score better on the test. This scenario reflects a false nega-
tive (i.e., the test says there is no memory disorder when in fact there is). Also
there will be some people who score below the 7th percentile on the test who have
no apparent memory disorder in everyday life. This scenario reflects a false posi-
tive (i.e., the test says there is a memory disorder when in fact there is not). These
further metrics are shown in Table 10.1.
“Sensitivity” is calculated from the number of true positives as a percentage of
the total number of “positives” in the population. In Table 10.1, this value would
be reflected by the equation A/(A + C).
“Specificity” reflects how many people without a memory disorder in real life
have a negative test result, in this case, defined as a memory score above the 7th
percentile. As before, whilst most people without a memory disorder will score
above the 7th percentile on this memory test, there will also be some people
who do not have a memory disorder but who score poorly on the test. The latter
scenario reflects a false positive error. Specificity is calculated from the number
of true negatives as a percentage of the total number of “negatives” in the popula-
tion. In Table 10.1, this would be reflected by the equation D/(D + B).
Sensitivity and specificity are often expressed as percentages or decimal pro-
portions reflecting the outcomes of the two equations above. Tools for calculat-
ing these values and other values described below are readily available on the
Internet, for example, see http://ktclearinghouse.ca/cebm/practise/ca/calculators/statscalc or http://www.cebm.net.
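As a concrete illustration, the two formulas above can be expressed in a few lines of Python; the cell counts below are hypothetical and chosen only so that the arithmetic is easy to follow.

def sensitivity(a, c):
    """Sensitivity: true positives as a proportion of everyone with the condition, A/(A + C)."""
    return a / (a + c)

def specificity(b, d):
    """Specificity: true negatives as a proportion of everyone without the condition, D/(D + B)."""
    return d / (d + b)

# Hypothetical cell counts for Table 10.1 (A = true positives, B = false positives,
# C = false negatives, D = true negatives); chosen only for illustration.
A, B, C, D = 78, 18, 22, 82

print(f"sensitivity = {sensitivity(A, C):.2f}")   # 78 / (78 + 22) = 0.78
print(f"specificity = {specificity(B, D):.2f}")   # 82 / (82 + 18) = 0.82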
There is usually a tradeoff between sensitivity and specificity for any given test.
No diagnostic test is perfect, consequently, as sensitivity increases, it is usually at
the expense of specificity and vice versa, as shown in Figure 10.1. If a clinician is
trying to capture all the people with the condition of interest, represented by the
darker distribution in Figure 10.1, then the ability of the test to do so increases
as the cut-score used moves from the left to the right in Figure 10.1. That is, from
“A” to “B” and finally to “C”. Using the cut-score of “C,” almost all those in the
darker distribution are below the cut-score and so would be correctly identified
by the test. In this circumstance, the sensitivity of the test is high. However, as
the cut-score changes from “A” to “B” and finally to “C,” it can also be seen that
the number of people within the lighter distribution (which represents people
without the condition) who are correctly classified decreases because more of
their scores fall below the cut-score as it moves toward “C.” That is, as the sensitivity
of the test increases, the number of false positive test results also increases, which
means the specificity of the test decreases.
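The trade-off depicted in Figure 10.1 can also be sketched numerically. The short Python sketch below assumes two purely hypothetical, overlapping normal score distributions (a lower-scoring clinical distribution and a higher-scoring control distribution) and shows how sensitivity rises and specificity falls as the cut-score moves from “A” toward “C”; the means, standard deviations, and cut-scores are arbitrary choices for illustration.

from scipy.stats import norm

# Hypothetical, overlapping score distributions (arbitrary units):
# the clinical group scores lower than the control group, as in Figure 10.1.
clinical = norm(loc=40, scale=10)   # condition of interest (darker curve, left)
control = norm(loc=60, scale=10)    # controls (lighter curve, right)

for label, cut in [("A", 40), ("B", 47), ("C", 55)]:
    sens = clinical.cdf(cut)        # proportion of the clinical group scoring at or below the cut
    spec = 1 - control.cdf(cut)     # proportion of controls scoring above the cut
    print(f"cut-score {label} ({cut}): sensitivity = {sens:.2f}, specificity = {spec:.2f}")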
Figure 10.1 The trade-off between sensitivity and specificity when using three different
cut-scores to distinguish between scores within the condition of interest distribution
(darker curve on the left) versus those within the control group distribution (lighter
curve on the right).
The opposite is also true: as the cut-score changes from “C” to “B” to “A,” the
number of false positives decreases, but so does the number of true positives
identified by the test. In this latter circumstance, where the cutoff score moves
from right to left in Figure 10.1, the specificity of the test increases, but at the
expense of the sensitivity of the test.
Invariably, in clinical practice, the distribution of scores on tests between those
who have, and those who do not have, the condition of interest overlaps. It is also
the case in clinical practice that the cut-scores used to help guide the interpreta-
tion of test results need not be absolute or fixed. Consequently, the relationship
between the sensitivity and specificity of a test result and the condition of inter-
est will vary, depending upon the cut-score that is used. The choice of cut-off can
also be used to help favor either sensitivity or specificity, depending upon the
clinical question that is being asked. When screening, for example,
it is usually more helpful to emphasize sensitivity over specificity (Straus et al.,
2011). The reason for weighting sensitivity is that the goal of screening is usu-
ally to identify all the people who may have the condition of interest. It is more
important not to miss people with the condition of interest (false negative errors)
than it is to minimize potential false positive errors. Subsequently, the people
whose scores are classified as positive at the first screening assessment can be
reassessed with a test with a high specificity. This strategy is known as the “two-
step diagnostic process” (Straus et al., 2011).
Alternatively, in some scenarios, it would be more important to emphasize
specificity rather than sensitivity, that is, for decisions where the costs of false-
positive errors might be high. Such a circumstance might apply with tests used
to help identify people who are potentially feigning their cognitive problems. In
this scenario, given the cost of wrongly diagnosing malingering, test cut-scores are typically set to favor specificity over sensitivity.
The likelihood ratio for a positive test result (LR+) and for a negative test result (LR-) can be expressed in terms of the four quantities in Table 10.1:

LR+ = (A/(A + C))/(B/(B + D))
LR- = (C/(A + C))/(D/(B + D))

These equations can also be written in terms of sensitivity and specificity, namely:

LR+ = sensitivity/(1 - specificity)
LR- = (1 - sensitivity)/specificity
As likelihood ratios for a positive test result increase significantly above 1, there is
an increased probability of the condition of interest being present after a positive
test result is obtained. Conversely, as the likelihood ratio for a negative test result
decreases significantly below 1, there is a decreased probability of the condition
being present after a negative test result is obtained. As positive likelihood ratios
increase above 10, for example, the probability of the condition being present is
greatly increased when a positive test result is obtained. Conversely, as a negative
likelihood ratio decreases below 0.1, for example, the probability of the condition
being present is greatly reduced when the test result is negative.
The positive likelihood ratio is a way of thinking about a positive test result
affecting the base-rate estimate to increase the likelihood of the diagnosis.
Conversely, a negative likelihood ratio is a way of thinking about a negative test
result affecting the base-rate estimate to reduce the likelihood of the diagnosis.
In this way, the pre-test odds (base rate) are changed by the likelihood ratio,
resulting in the post-test odds. Likelihood ratios are interpreted with reference
to an estimated or known pre-test probability (also referred to as the “clinical
prevalence” or “base rate”). A nomogram for interpreting diagnostic test results
is shown in Figure 10.2. In the nomogram, a line is drawn from the pre-test
probability through the likelihood ratio to estimate the post-test probability (see
Fagan, 1975).
An online calculator is also available to estimate post-test probability using
likelihood ratios; see http://araw.mede.uic.edu/cgi-bin/testcalc.pl.
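The arithmetic that the nomogram and the online calculator perform can also be reproduced directly. The sketch below is a minimal Python version, assuming a hypothetical test with sensitivity .90 and specificity .91 (values chosen only to give a likelihood ratio of about 10) and an illustrative 30% pre-test probability.

def likelihood_ratios(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.91)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")          # LR+ = 10.0, LR- = 0.11
print(f"post-test probability after a positive result, "
      f"30% pre-test probability: {post_test_probability(0.30, lr_pos):.2f}")   # about 0.81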
Figure 10.2 Nomogram for applying likelihood ratios (Fagan, 1975). A straight line drawn from the pre-test probability (%) on the left-hand scale through the likelihood ratio on the middle scale gives the post-test probability (%) on the right-hand scale.
The calculations of the positive and negative predictive values of a test are reported in
relation to the four quantities in Table 10.1.
The “positive predictive value” refers to the number of people with a positive
test result who actually have the condition of interest. In the memory example
above, we would be asking how many people who obtain a memory test score
below the 7th percentile when tested have a memory disorder in everyday life.
This can be represented mathematically as A/(A + B) for the values in Table 10.1.
The “negative predictive value” refers to the number of people with a negative
test result who do not actually have the condition of interest. In the memory
example, this would be the number of people with a memory test score above the
7th percentile who do not have a memory disorder in everyday life. This can be
represented mathematically as D/(C + D) in Table 10.1.
The performance of a test in relation to a diagnostic decision is further affected
by a property of the population being tested rather than of the test itself. This
property is the “prevalence” (or base-rate or pre-test probability) of the condi-
tion of interest in the population being tested. In terms of the memory disorder
example, this would be the number of people who actually have a memory disor-
der in everyday life, out of all the people being tested. This percentage is referred
to as the base-rate or prevalence or pre-test probability of the condition of inter-
est in the population.
If we considered a community-based example, the proportion of people with
a memory disorder in everyday life among all those being tested would be much
lower than if we considered a population of people attending a tertiary-referral
dementia diagnosis clinic. In the latter setting, it would be reasonable to assume
the number of people attending with a memory disorder in everyday life would
be higher.
A specific example of the impact of the prevalence of the condition of interest
on the diagnostic performance of a test can be seen in the study of Mioshi et al.
(2006) of the diagnostic validity of the Addenbrooke Cognitive Examination–
Revised. These researchers noted the sensitivity, specificity, likelihood ratios,
and positive and negative predictive values of the test when making a dementia
diagnosis using cut-scores of 82 and 88 on the Addenbrooke test at different lev-
els of prevalence.
Consider the results of Mioshi et al. (2006) in relation to their cut-score of
88. At an estimated prevalence of 40% the positive predictive value of the test
was 0.85, meaning that there was a .85 probability that a person with a score at
88 or below had dementia. This positive predictive value changed dramatically,
however, when the presumed prevalence was 5%. At this level of prevalence, the
positive predictive value of the test became 0.31, meaning that there was only a
.31 probability that a person with a score at 88 or below had dementia. In other
words, as the prevalence of the condition of interest, in this case dementia, less-
ened in the population tested, the validity of a positive test result indicating the
presence of dementia also declined. At 5% prevalence, a positive test score was
diagnostically accurate only 31% of the time. That is, at this 5% prevalence, a
positive test score was a false positive 69% of the time. A positive test result is,
therefore, at this low level of prevalence, much more likely to be incorrect than it
is correct at detecting dementia.
These ideas are further expanded below using hypothetical data applied to our
previously discussed memory disorder example. In that example, the diagnosis
being made was whether or not someone had a memory disorder in everyday life,
and the cut-score on the memory test used was the 7th percentile.
Let’s assume we started from an estimated prevalence of 50%, that is, the origi-
nal research study used to derive the cut-score was made up of two equal-sized
groups that were matched in terms of their demographic and clinical character-
istics other than the presence of memory disorder. If a valid memory test is used,
it might be expected to have a sensitivity of, say, 0.78 and a specificity of 0.82,
to choose two arbitrary hypothetical values. Further suppose that the cut-score
is derived from research to classify people into either the “memory disorder in
everyday life group” (disorder present) or the “no memory disorder in everyday
life group” (disorder not present). Suppose also, in this hypothetical example,
that there are 125 people in each of these two groups. These properties of the
test are represented in Table 10.3. The numbers of people in each cell (A to D) in
Table 10.3 are determined by our hypothetical sensitivity and specificity values.
The sensitivity of the test shown in Table 10.3 is given by A/(A + C), which in
this example is 97/125 = 0.78 (rounded to two decimal places). The specificity
of the test is given by D/(D + B), which in this example is 103/125 = 0.82. In this
example, the base rate of the condition of interest was set to 50%, that is, the
disorder is present in 125 people (A + C) and is not present in 125 people (B + D),
so the base rate is 125/250.
The positive predictive value of the test is given by A/(A + B), which in this
example is 97/119 = 82%. This value indicates that there is a probability of .82
that a positive test result comes from a person with the condition of interest, in
this example, memory disorder. The negative predictive value of the test is repre-
sented by D/(C + D), which in this example is 103/131 = 79%. This value indicates
that there is a probability of .79 that a negative test result comes from a person
without memory disorder.
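The figures in this worked example can be verified with a short script. The sketch below uses the cell counts of Table 10.3: the values 97 and 103 are given in the text, and the false-positive and false-negative counts follow from the two group sizes of 125.

# Cell counts from Table 10.3 (hypothetical data; 50% base rate, 125 people per group).
a, b = 97, 22    # true positives, false positives
c, d = 28, 103   # false negatives, true negatives

print(f"sensitivity = {a / (a + c):.2f}")   # 97/125  = 0.78
print(f"specificity = {d / (d + b):.2f}")   # 103/125 = 0.82
print(f"PPV = {a / (a + b):.2f}")           # 97/119  = 0.82
print(f"NPV = {d / (c + d):.2f}")           # 103/131 = 0.79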
The interpretation of the usefulness of the test changes, however, if the base
rate of the condition changes, especially if the base rate falls to a lower level than
was represented in the research study. Base rates will change if the population
tested is different from the research study population, for example, if all the peo-
ple from a community-based population are tested. When compared with the
research study population, the community-based population would have a lower
base rate of the condition of interest. The base rate is lower because the preva-
lence of the condition of interest will be diluted across many more people than
in the research study, where the condition of interest was deliberately identified
and concentrated into one of the groups tested.
If the same test and cut-off from the original research study are applied to the
community-based population, the interpretation of the test results will be dif-
ferent. This is because, in the community-based population, the base rate of the
condition of interest is lower. Let us suppose, in this example, that a base rate of
9% reflects the frequency of memory disorder in the community-based popula-
tion. A reworking of the preceding calculations, but with the lower base rate, is
shown in Table 10.4.
For the example in Table 10.4, the sensitivity of the test is given by A/(A + C)
and stays the same as in the previous example, that is, it is 97/125 = 0.78. The
specificity of the test is represented by D/(D + B) and stays the same as in the
previous example, that is, it is 1030/1250 = 0.82. However, in Table 10.4, the base
rate of the condition is now 9%, the disorder is present in 125 people and is not
present in 1,250 people, that is, 125/(125 + 1250).
Therefore, the positive predictive value of the test, which is represented by
A/(A + B), is now in this example 97/317 = 31%. That is, the probability of a positive
test result coming from a person with memory disorder is now only .31. To put
it another way, out of all the positive test results obtained, 31% are true positives
and reflect the presence of the condition of interest and 69% are false positives.
The negative predictive value of the test is represented by D/(C + D), which in
this example is now 1030/1058 = 97%, that is, out of all the negative test results
obtained, 97% are true negatives and reflect the absence of the condition of inter-
est and 3% are false negatives.
These calculations and the numbers in Table 10.4 show that, when the base
rate of the condition of interest tends towards zero, we can become more confi-
dent in negative test results but less confident in positive test results. The negative
predictive value increased from 79% to 97% as the base rate decreased in these
two examples. A negative test result was more likely to be accurate, and the number
of false negatives smaller, as the base rate of the condition of interest decreased.
However, the positive predictive value decreased from 82% to 31% as the base
rate decreased from 50% to 9%. That is, a positive test result was less likely to be
accurate and the number of false positives increased in proportion to the true
positives as the base rate of the condition of interest decreased.
So, in general, when the base rate of the condition of interest is low, a test
that appears to have good diagnostic properties (when calculated under the high
base-rate conditions often found in published research studies) can actually per-
form so poorly that a positive test result is more likely to be wrong than it is
right, that is, a positive test result is more likely to be a false positive than a true
positive. The impact of prevalence upon the predictive power of diagnostic tests
is discussed in detail in Baldessarini et al. (1983).
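The effect described by Baldessarini et al. (1983) can be illustrated by holding the hypothetical sensitivity and specificity from the memory example fixed and recomputing the predictive values at several base rates. The sketch below does this with Bayes' rule and approximately reproduces the contrast between the roughly 82% and 31% positive predictive values discussed above; small differences reflect rounding of the cell counts.

def predictive_values(sens, spec, base_rate):
    """Positive and negative predictive values for a given base rate, via Bayes' rule."""
    ppv = sens * base_rate / (sens * base_rate + (1 - spec) * (1 - base_rate))
    npv = spec * (1 - base_rate) / (spec * (1 - base_rate) + (1 - sens) * base_rate)
    return ppv, npv

sens, spec = 0.78, 0.82                       # hypothetical values from the memory example
for base_rate in (0.50, 0.25, 0.09, 0.05):    # base rates chosen for illustration
    ppv, npv = predictive_values(sens, spec, base_rate)
    print(f"base rate {base_rate:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")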
The implication of these calculations is that, when applying these principles to
diagnostic decision-making in clinical practice, it is necessary to have some idea
of the base rate of the condition of interest within the clinical setting in which
the test is being used. This information allows for the calculations necessary to
estimate how the test will perform when taken from the research setting to a
clinical setting with a different base rate. Returning to the Mioshi et al. (2006)
example, while sensitivity and specificity were high using the cut-score of 88, the
positive predictive value of the test was shown to be poor at low base rates; it was
0.31 at a 5% base rate. In other words, at this low base rate, a positive test result
was much more likely to be a false positive than a true positive. At the higher
base rate of 40%, the positive predictive value was much higher, at 0.85, indicat-
ing a much reduced likelihood of false positive test results. Before using this test
at the prescribed cut-offs to make diagnostic decisions, any clinician would be
wise to estimate the base rate of the condition of interest in the population being
tested to avoid the error of interpreting a false positive as a true positive.
Putting these ideas together allows for the calculation of the post-test prob-
ability at any specified base rate.
The post-test probability is calculated as follows: the pre-test probability (base rate) is converted to pre-test odds, the pre-test odds are multiplied by the likelihood ratio to give the post-test odds, and the post-test odds are converted back into a probability:

pre-test odds = prevalence/(1 - prevalence)
post-test odds = pre-test odds x likelihood ratio
post-test probability = post-test odds/(1 + post-test odds)
In the example described here, with a prevalence of 9%, the post-test probability
of the person with a memory test score below the 7th percentile having a memory
problem in everyday life is calculated as follows, when using the data presented
in Table 10.4. LR+ = 4.41, Prevalence = 0.09, Post-Test Probability = 30% prob-
ability of a memory disorder in everyday life. When the prevalence is assumed to
be 50%, the post-test probability becomes: LR+ = 4.41, Prevalence = 0.5, Post-Test
Probability = 82% probability of a memory disorder in everyday life. See http://
araw.mede.uic.edu/cgi-bin/testcalc.pl for an online calculator.
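The two post-test probabilities just quoted can be checked with the same three-step calculation, as in the brief sketch below.

def post_test_prob(prevalence, lr):
    """Prevalence to pre-test odds, multiply by the likelihood ratio, convert back to a probability."""
    odds = prevalence / (1 - prevalence) * lr
    return odds / (1 + odds)

lr_positive = 4.41                                   # LR+ derived from Table 10.4
print(f"{post_test_prob(0.09, lr_positive):.0%}")    # about 30% at a 9% base rate
print(f"{post_test_prob(0.50, lr_positive):.0%}")    # about 82% at a 50% base rate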
RELIABILITY OF MEASUREMENT
When considering the likely diagnostic validity of a test procedure, it is also cru-
cial to consider the reliability of the test result. When thinking about diagnostic
decision-making, we are essentially attempting to arrive at a decision, namely, is
the condition of interest present or not? The reliability of our test result has a cru-
cial bearing on the confidence we have in the decision we are making. Put simply,
nothing can be more valid than it is reliable. On average, a test result cannot
correlate better with some diagnostic outcome than it can correlate with itself.
If a test score is not relatively reliable, it cannot have high validity and therefore
cannot be of high diagnostic utility (Schmidt & Hunter, 1996).
When thinking about diagnostic validity as described above, it is necessary to
turn the score on a test into a decision. This is usually achieved by considering
whether the obtained test score falls above or below a cut-score that has been
empirically derived to maximize the accuracy of classification. For example,
when considering performance validity during cognitive testing, Schroeder et al.
(2012) highlighted cut-scores of less than or equal to 6 or 7 on the Reliable Digit
Span measure as being optimal, depending upon the population being tested.
Any decision about whether or not a score falls above or below a specific cut-off
needs to consider the reliability of the score itself. The reliability of the Digit
Span subtest is high, which encourages confidence in the obtained result. If,
for example, the test score were very unreliable, one could have only limited
confidence that the score obtained at one assessment would reflect the person’s
true score. If the test score is unreliable and a patient is tested again, their score
might vary, and thus the patient could be classified as being above or below the
cut-off at different points in time merely as a consequence of poor measurement
reliability. Such measurement unreliability fundamentally undermines the diag-
nostic validity possible with any test of lower reliability. Reliability of test scores
is examined in detail in Chapter 5 of the current volume.
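To illustrate how limited reliability can translate into inconsistent classification around a cut-score, the following sketch uses the standard error of measurement (SEM = SD x sqrt(1 - r)) together with a purely hypothetical observed score, standard deviation, cut-off, and set of reliability values; treating retest scores as normally distributed around the observed score with the SEM as their standard deviation is a simplification adopted only for illustration.

from math import sqrt
from scipy.stats import norm

def same_side_on_retest(observed, cut, sd, reliability):
    """Probability that a retest score falls on the same side of the cut-off as the
    observed score, treating retest scores as normal around the observed score with
    the standard error of measurement as their standard deviation (a simplification)."""
    sem = sd * sqrt(1 - reliability)
    p_below = norm.cdf(cut, loc=observed, scale=sem)
    return max(p_below, 1 - p_below)

# Hypothetical example: observed score of 6 on a measure with SD = 3 and a cut-off of 6 or below.
for r in (0.90, 0.70, 0.50):
    p = same_side_on_retest(observed=6, cut=6.5, sd=3, reliability=r)
    print(f"reliability = {r:.2f}: probability of the same classification on retest = {p:.2f}")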
REFERENCES
Albert, M. S., DeKosky, S. T., Dickson, D., Dubois, B., Feldman, H. H., Fox, N. C.,
… Phelps, C. H. (2011). The diagnosis of mild cognitive impairment due to
Alzheimer’s disease: Recommendations from the National Institute on Aging–
Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s dis-
ease. Alzheimer’s & Dementia, 7(3), 270–279.
Baldessarini, R. J., Finklestein, S., & Arana, G. W. (1983). The predictive power of diag-
nostic tests and the effect of prevalence of illness. Archives of General Psychiatry,
40, 569–573.
Bowden, S. C., Harrison, E. J., & Loring, D. W. (2013). Evaluating research for clinical
significance: Using critically appraised topics to enhance evidence-based neuropsy-
chology. The Clinical Neuropsychologist, 28(4), 653–668.
Faust, D. (2003). Alternatives to four clinical and research traditions in malingering
detection. In P. W. Halligan, C. Bass, & D. A. Oakley (Eds.), Malingering and Illness
Deception (pages 107–121). Oxford: Oxford University Press.
Fagan, T. J. (1975). Nomogram for Bayes’s theorem. New England Journal of Medicine,
293(5), 257.
Frederick, R. I., & Bowden, S. C. (2009). The test validation summary. Assessment,
16(3), 215–236.
Gervais, R. O., Rohling, M. L., Green, P., & Ford, W. (2004). A comparison of WMT,
CARB, and TOMM failure rates in non-head injury disability claimants. Archives
of Clinical Neuropsychology, 19, 475–487.
Grimes, D. A., & Schulz, K. F. (2002). An overview of clinical research: The lay of the
land. Lancet, 359(9300), 57–61.
Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective,
impressionistic) and formal (mechanical, algorithmic) prediction procedures: The
clinical-statistical controversy. Psychology, Public Policy and Law, 2(2), 293–323.
Haynes, R. B., Devereaux, P. J., & Guyatt, G. H. (2002). Clinical expertise in the era of
evidence-based medicine and patient choice. Evidence Based Medicine, 7, 36–38.
Mioshi, E., Dawson, K., Mitchell, J., Arnold, R., & Hodges, J. R. (2006). The
Addenbrooke’s Cognitive Examination–Revised (ACE-R): A brief cognitive test
battery for dementia screening. International Journal of Geriatric Psychiatry, 21,
1078–1085.
Ruscio, J. (2007). The clinician as subject. Practitioners are prone to the same judgment
errors as everyone else. In S. O. Lilienfeld & W. T. O’Donohue (Eds.), The Great
Ideas of Clinical Science: 17 Principles That Every Mental Health Professional Should
Understand (pages 29–48). New York: Routledge.
Schoenberg, M. R., & Scott, J. G. (Eds.). (2011). The Little Black Book of Neuropsychology: A
Syndrome-Based Approach. New York: Springer.
Schroeder, R. W., Twumasi-Ankrah, P., Baade, L. E., & Marshall, P. S. (2012). Reliable
digit span: A systematic review and cross-validation study. Assessment, 19(1), 21–30.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological
research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199–223.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and
methodology. Annual Review of Clinical Psychology, 5, 1–25.
Straus, S. E., Glasziou, P., Richardson, W. S., & Haynes, R. B. (2011). Evidence-Based
Medicine. How to Practice and Teach It (4th ed.). Churchill Livingstone Elsevier.
Wechsler, D. (2010). Wechsler Memory Scale–Fourth UK Edition. Administration and
Scoring Manual. London: Pearson Education, Pearson Assessment.
FURTHER READING
Straus, S. E., Glasziou, P., Richardson, W. S., & Haynes, R. B. (2011). Evidence-Based
Medicine. How to Practice and Teach It (4th ed.). Churchill Livingstone Elsevier.
11
DAVID T. R. BERRY, JORDAN P. HARP, LISA MASON KOEHL,
AND HANNAH L. COMBS
Rogers, Sewell, Martin, & Vitacco, 2003) as well as invalid approaches to neu-
ropsychological test performance by those taking the tests (Sollman & Berry,
2011; Vickery, Berry, Inman, Harris, & Orey, 2001). These studies focused on
summarizing effect sizes and basic diagnostic statistics. However, many of them
coded methodological characteristics and explored the relationship of these with
the summary statistics. For example, Table 11.1 lists the methodological vari-
ables extracted in a meta-analysis of the accuracy of selected PVTs (Sollman &
Berry, 2011).
These methodological indicators were initially derived from recommenda-
tions by a pioneer in the area of detection of malingered psychological deficits,
Richard Rogers (1988). These guidelines were intended to improve research qual-
ity in the area and were updated in Rogers (2012), and adapted in reviews such as
the previously noted Sollman and Berry (2011). Given this provenance, it is not
surprising that they are oriented much more toward research than toward clinical application. Although
there is typically a degree of arbitrariness in devising scoring criteria used for
quality evaluations, it is striking how little attention to clinical generalizability
of results appears in the criteria listed in Table 11.1. An additional issue of note
is the amount of coverage given to simulator malingering designs (i.e., normal
subjects asked to feign deficits on testing), an approach that would seem uncom-
mon in the medical diagnostic literature. However, malingering is a psychologi-
cal construct, and analog production of psychological phenomena is viewed as
a legitimate procedure for understanding the construct validity of psychological
tests (Anastasi & Urbina, 1997). For example, administration of a putative test
of anxiety before and after experimental induction of the emotion should docu-
ment a predictable difference in scores on the two occasions. Similarly, instruct-
ing participants to feign psychological or neuropsychological deficits during
administration of test batteries has been useful in investigating the sensitivity of
malingering-detection instruments, although of course an appropriate clinical
control group is required to determine specificity (Rogers et al., 2003; Sollman &
Berry, 2011).
Column 2 of Table 11.2 provides a summary of the STARD criteria. In con-
sidering these entries, it is important to note that the “index test” is the proce-
dure undergoing evaluation, whereas the “reference test” is the procedure used
as the “gold standard” or “external validity criterion” (Anastasi & Urbina, 1997).
It can be seen that there is remarkably little overlap between the STARD criteria
and the methodological variables evaluated in Table 11.1, with the former cover-
ing many important technical aspects of the procedures employed in diagnostic
studies. This suggests that extant meta-analyses of PVTs are probably not of the
quality recommended for contemporary evaluations of medical diagnostic tests,
a conclusion that is likely to be of concern to clinicians who utilize these proce-
dures. This issue may also hold for standard neuropsychological tests.
A point that seems important to emphasize is that the STARD criteria do not
function simply as “quality control” standards for diagnostic studies (although
this is a useful application). In addition, they allow a reader to understand the
nature of the clinical population to which the respective study results may be
Table 11.1 Methodological Characteristic Scoring System from a Meta-Analysis of Performance Validity Tests

Table 11.2 STARD Criteria and Their Application to Schipper et al. (2008)
1a. Clearly identified as a diagnostic accuracy study?
Application: Not explicitly stated. Determination of “Operating Characteristics” of the manual LMT noted in Abstract (p. 345). Sensitivity, Specificity, and Area Under the Curve (AUC) data presented on p. 347, 4th paragraph.

1b. Are keywords sensitivity & specificity used?
Application: Not included as keywords but appear in text (p. 347, 4th & 5th paragraphs).

2. Does research question include diagnostic accuracy of the index test?
Application: Not explicitly, although text states goal is to compare the computerized and manual forms of the test (p. 346, 1st paragraph), with Sensitivity, Specificity, and AUC compared on p. 347, last paragraph.

3. Does study population have appropriate:
• Inclusion/exclusion criteria?
• Setting?
• Location?
• Spectrum of disease?
Application: Inclusion criteria appear on p. 346, 2nd & 3rd paragraphs. Exclusion criteria appear in 3rd paragraph of this page. Setting and location not specified. Spectrum of disease not specified except in terms of means & standard deviations for Glasgow Coma Scale scores and duration of loss of consciousness (p. 346, 2nd paragraph).

4. Was subject recruitment accomplished by:
• Presenting symptoms?
• Results from previous tests?
• Fact that patients had already received index test or reference standard?
• Obtaining appropriate spectrum of disease?
• Excluding other conditions in controls?
Application: Archival study (p. 346, 2nd paragraph). Participants selected because they had received both index and reference tests. Spectrum of TBI not entirely clear, but wide range implied by GCS and duration of loss of consciousness (p. 346, paragraph 2). Estimated base rate of malingering (20%, p. 347, 1st paragraph) toward the lower end of published prevalence rates. No statement regarding exclusion on the basis of comorbid conditions present.

5. Was sampling:
• Consecutive?
• Or were additional criteria used?
Application: Archival study. Requirements were: Administered full neuropsychological battery, received all 3 PVTs, failed none or >2 of them (p. 346, 2nd and 3rd paragraphs).

6. Was data collection:
• Prospective or retrospective?
Application: Archival study; retrospective data collection (p. 346, 2nd paragraph).

7. Were reference standard and rationale described?
Application: Reference standards described on p. 346, 3rd paragraph, but no rationale provided. Rationale and description provided for index test (p. 346, first paragraph, and p. 345, first paragraph).

8. Were technical specifications for index & reference tests described, including:
• Materials?
• Methods?
• Timing?
• Citations?
Application: PVT citations on p. 346, 1st and 2nd paragraphs. Methods and timing of administration not specified.
9. Were definition & rationale given for index & reference tests’:
• Units?
• Cutoffs?
• Categories of result?
• Above predetermined?
Application: Units implied (percentage correct) for index test (LMT, p. 347, 4th paragraph). Units for Reference tests not explicitly described. Cutoff for LMT given on p. 347, 4th paragraph. Categories of results described on p. 346, 3rd paragraph. Cutoffs were described as recommended cutting scores for TOMM, DMT (p. 346, 3rd paragraph) and LMT (p. 347, 4th paragraph).

10. Were index & reference test administrators & readers described, including:
• Number?
• Training?
• Expertise?
Application: Not described.

11. For index & reference tests, were:
• Readers blind?
• Any other data known to readers?
Application: Not described.

12.
• How was diagnostic accuracy calculated?
• How was uncertainty quantified?
Application: Sensitivity, specificity, and AUC calculated and presented on p. 347, paragraph 4. Only AUC included 95% confidence intervals.

13. Was test reproducibility (reliability) reported?
Application: Not provided.

14. Were start & stop dates of recruitment provided?
Application: Not provided.

15. Were participant characteristics described, including:
• Clinical?
• Demographic?
Application: Demographic, TBI, and compensation-seeking characteristics for entire sample described on p. 346, 2nd paragraph. Probable Feigners and Honest characteristics were compared on p. 347, 2nd paragraph.

16. Was flow diagram with number included and excluded, along with reasons, provided?
Application: Flow diagram not provided. However, p. 346, 3rd paragraph, tracks assignments of all participants.

17. Were the following details provided:
• Time between Index and Reference tests?
• Any interventions in interval?
Application: Time between tests not explicitly addressed, but likely that index test and tests contributing to reference standard were completed during same evaluation. Possible interventions not addressed, but seem unlikely.
18. Were the following details addressed:
• Severity/spectrum of disease appropriate in target patients with criteria?
• Other diagnoses in controls?
• Specify how above 2 defined?
Application: Not explicitly stated, but likely severity and spectrum of TBI comparable to that found in outpatient neuropsychological assessment practice. As noted earlier, base rate of 20% feigning in this sample was lower than reported in many other compensation-seeking samples. Comorbid conditions not detailed. Criteria for TBI not explicitly addressed. Criteria for feigning and honest group assignments addressed on p. 346, 3rd paragraph.

19. Is there a:
• Table with cross-tabulation of results from Index and Reference tests?
• Are indeterminate and technical failures included in table?
Application: Not present. Indeterminate classification described on p. 346, 3rd paragraph, but not included in a table.

20. Any adverse results from testing described?
Application: Not discussed, but unlikely.

21. Are there estimates of diagnostic accuracy with confidence intervals?
Application: Estimated sensitivity, specificity, and AUC presented on p. 347, 4th paragraph. Only AUC provided standard error value.

22. Was handling of index test results in terms of following described:
• Indeterminate?
• Missing responses?
• Outliers?
Application: Indeterminate subjects were excluded from diagnostic accuracy calculations. Missing responses and outliers not mentioned.

23. Was variability of accuracy across following variables addressed:
• Sites?
• Subgroups?
• Readers?
Application: Not addressed.

24. Was test reproducibility addressed in terms of stability and rater agreement?
Application: Not addressed.

25. Was clinical applicability of findings discussed?
Application: Only briefly discussed on p. 347, last paragraph.
In the simulation design, a group of normal participants is instructed to feign deficits on test batteries that include PVTs, and results are
compared to those from a group with a known pathology, such as traumatic
brain injury (TBI), that is tested in a context with no external motivation for
faking deficits. The Known-Groups design (which is thought to maximize exter-
nal validity, or the extent to which results are generalizable to other settings,
populations, etc., and is much closer to typical studies of medical diagnostic
tests) usually involves administering previously validated PVTs (reference tests)
as well as a new PVT (index test) to a series of patients with a target disorder,
such as TBI. The previously validated PVTs, and sometimes other criteria (e.g.,
Slick, Sherman, & Iverson, 1999), are then used to classify each patient as honest
or feigning. Results from the new PVT are compared in the two groups classi-
fied on the basis of their performances on previously validated PVTs. Both these
designs allow determination of effect sizes as well as estimated sensitivity (per-
cent of those known to have a target condition who have a positive test sign) and
specificity (percent of those known not to have a target condition who have a
negative test sign). Together with known or estimated base rates of the condition
in question, these two statistics can be used to estimate Positive Predictive Power
(percentage of those with a positive test sign who actually have the target condi-
tion) and Negative Predictive Power (percentage of those with a negative test sign
who do not have the target condition).
Vickery et al. (2001) meta-analyzed results from published studies of the accu-
racy of the most well-studied PVTs and reported an average effect size (Cohen’s d)
of 1.1, as well as average sensitivity of .56 and specificity of .96. Sollman and
Berry (2011) reported on a similar review of the most commonly studied PVTs
published since the previous meta-analysis and found a mean effect size of 1.5,
sensitivity of .69, and specificity of .90. Along with many other publications
in the area, these results supported routine clinical use of PVTs, especially in
cases where compensation-seeking, such as litigation or the pursuit of disability
awards, was present.
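Because positive and negative predictive power depend on the base rate, the pooled sensitivity and specificity values just cited can be translated into predictive values for any assumed prevalence of feigning. The sketch below does this for both sets of meta-analytic estimates; the base rates themselves are assumptions chosen for illustration, not values reported in the meta-analyses.

def predictive_power(sens, spec, base_rate):
    """Positive and negative predictive power implied by sensitivity, specificity, and base rate."""
    ppp = sens * base_rate / (sens * base_rate + (1 - spec) * (1 - base_rate))
    npp = spec * (1 - base_rate) / (spec * (1 - base_rate) + (1 - sens) * base_rate)
    return ppp, npp

pooled = {
    "Vickery et al. (2001)": (0.56, 0.96),
    "Sollman & Berry (2011)": (0.69, 0.90),
}
for study, (sens, spec) in pooled.items():
    for base_rate in (0.10, 0.35, 0.50):   # assumed base rates of feigning
        ppp, npp = predictive_power(sens, spec, base_rate)
        print(f"{study}, base rate {base_rate:.0%}: PPP = {ppp:.2f}, NPP = {npp:.2f}")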
In 2005, the National Academy of Neuropsychology published a policy state-
ment on PVTs (Bush et al., 2005), indicating that “the assessment of symptom
validity is an essential part of a neuropsychological evaluation. The clinician
should be prepared to justify a decision not to assess symptom validity as part
of a neuropsychological evaluation” (p. 421). Along with similar statements by
other professional organizations (Heilbronner et al., 2009), it is clear that PVTs
are a well-established part of the fabric of neuropsychological assessment.
foils. Across nine blocks of trials, face difficulty was manipulated by increasing
the number of consonant letters in the stimulus varying from 3 to 4 to 5, and by
increasing the number of choices on the recognition trial from 2 to 3 to 4. Orey,
Cragar, and Berry (2000) described a manual form of the test with stimulus and
recognition materials printed on 3″ X 5″ index cards. Schipper, Berry, Coen, and
Clark (2008) published a cross-validation of the manual form of the test, which
will be the target article for application of the STARD criteria here.
The third column of Table 11.2 summarizes application of the STARD criteria
to the manuscript by Schipper et al. (2008). It may be seen that a large num-
ber of the STARD criteria were not addressed, or only briefly covered, in this
paper. Thus, if these criteria were followed in evaluating the paper for publica-
tion, it might not have been accepted in a journal that required adherence to
STARD guidelines for reports on diagnostic tests. More substantively, several
of the omitted STARD criteria in the paper raise the possibility that various
types of bias that might have affected results were not addressed, and these will
be covered next. Parenthetically, it should be noted that several other publica-
tions support the validity of the LMT (Alwes, Clark, Berry, & Granacher, 2008;
Dearth et al., 2005; Inman & Berry, 2002; Inman et al., 1998; Vagnini et al., 2006;
Vickery et al., 2004).
For reviewing evidence on the accuracy of diagnostic test procedures, “bias”
has been defined as “systematic error or deviation from the truth, either in results
or in their inferences” (Reitsma et al., 2009, p. 4). The information missing from
the report by Schipper et al. (2008) raises the possibility of the presence of several
types of bias, including, but not limited to, the following: The minimal descrip-
tion of selection criteria for participants means that the spectrum of patients for
whom the results are generalizable is unclear, and thus it is uncertain to which
groups the findings may apply. In terms of the reference standard, the description
of the accuracy of the combination of the Test of Memory Malingering (TOMM)
and the Digit Memory Test (DMT) on pp. 346–347 is helpful, but it does not
entirely rule out bias due to the criterion variables. In other words, the extent to
which application of these reference tests may have resulted in inaccurate clas-
sifications is ambiguous. The brief description of injury severity for TBI patients
on p. 346 is insufficient to understand completely their standing on disease pro-
gression and recovery variables, which again obscures the appropriate popula-
tion to which results might generalize. Although the selection of participants’
data is described as an “archival” approach, the lack of description of the process
by which participants were referred for neuropsychological evaluations, as well
as failure to specify the criteria for choice of PVTs administered in a given exam-
ination, raise the possibility of partial verification as well as differential verifica-
tion biases. These biases arise in cases in which the group used to determine
sensitivity undergoes different procedures from the group used to determine
specificity values, which may affect generalizability of results. Failure to specify
whether index test administrators and interpreters were blind to reference test
results raises the possibility of diagnostic review bias, commonly termed “cri-
terion contamination” in the psychometric literature. Criterion contamination,
in which findings from the criterion variables may affect interpretation of pre-
dictor variables, typically results in an overestimation of validity. The number
of indeterminate outcomes (only one PVT failed) was described on p. 346. As
results from these participants were excluded from the calculations of sensitivity
and specificity values, it is likely that one or both of these parameters is overes-
timated in the present study. The issue of withdrawals from neuropsychological
evaluations is not addressed in the report, which also may affect the accuracy
of test parameters. Together, these potential sources of bias suggest that results
from this study should be applied only cautiously to new patient groups. This
exercise illustrates how systematic application of the STARD criteria to review of
an article directs our attention to important issues that may limit the applicabil-
ity of reported results. Of course, similar issues may arise in published studies of
other PVTs as well as standard neuropsychological tests.
The data presented in Table 11.2 suggest that STARD criteria may be readily
applied to a published study of a PVT. Although STARD provides a meticulous
assessment of the quality and generalizability of an evaluation of the diagnostic
validity of a PVT, this application may be too time-consuming for a busy clinician
confronted with a novel clinical problem. Therefore, it may be helpful to illustrate
the application of a briefer, but still acceptable, set of criteria to a common clinical
concern. This will be addressed by setting out a hypothetical assessment scenario
and working through the application of these criteria to the problem.
CLINICAL PROBLEM
A 40-year-old man presents to a psychological provider with memory com-
plaints approximately a year following a motor vehicle accident with a reported
mild concussion. The patient relates that he was rear-ended while stopped at an
intersection. He reports problems with attention and prospective memory, stat-
ing that he must rely on lists to function at work and in his personal life since
the accident. In passing, the patient mentions that he is considering pressing
charges and possibly pursuing civil litigation against the other driver involved
in the accident.
The relevant question, then, is whether or not, in the case of a 40-year-old male
with a mild TBI and subjective memory symptoms, there is a PVT that pro-
vides reliable diagnostic information beyond clinical impression to determine if
a compensation-seeking evaluee is putting forth his best effort.
Critically appraised topic (CAT) worksheets are a condensed subset of the STARD criteria. Thus, these work-
sheets allow the relatively rapid implementation of major STARD criteria. There
are CAT worksheets designed to allow clinicians to analytically evaluate avail-
able evidence on diagnosis, prognosis, harm, and therapy. For most psycholo-
gists and neuropsychologists, the diagnostic CAT worksheet provides the most
utility as it provides a systematic way to evaluate research evidence on measures
relevant to clinical practice in this area. Furthermore, the CAT approach allows
clinicians to relate this evidence to individual patients in specific diagnostic set-
tings. Although these methods may not be necessary for clinical questions that
fall under a clinician’s expertise, the framework can be used to find and interpret
evidence for novel clinical questions or to update an area of knowledge.
The subsequent diagnostic CAT example is based on the worksheet accessible
online at http://ktclearinghouse.ca/cebm/practise/ca and is used to illustrate the
clinical utility of the CAT method for PVTs. Although CAT worksheets are most
valuable when dealing with novel clinical questions, the TBI problem described above is used here for illustration.
Source: Schipper, L. J., Berry, D. T. R., Coen, E. M., & Clark, J. A. (2008). Cross-
validation of a manual form of the Letter Memory Test using a known-groups
methodology. The Clinical Neuropsychologist, 22, 345–349.
Table 11.3.2 CAT Worksheet Example, Part 2: Are the valid results of this diagnostic study important?

EXAMPLE CALCULATIONS

1. Diagnostic test result (LMT) versus target disorder (Probable Cognitive Feigning):

                            Present        Absent         Totals
Positive (<93% on LMT)      8 (a)          1 (b)          9 (a + b)
Negative (≥93% on LMT)      2 (c)          38 (d)         40 (c + d)
Totals                      10 (a + c)     39 (b + d)     49 (a + b + c + d)

Test-Based Operating Characteristics
2. Sensitivity = a/(a + c) = 8/10 = 80%
3. Specificity = d/(b + d) = 38/39 = 97.5%
4. Likelihood ratio for a positive test result = LR+ = sensitivity/(1 - specificity) = 80%/2.5% = 32
5. Likelihood ratio for a negative test result = LR- = (1 - sensitivity)/specificity = 20%/97.5% = 0.21

Sample-Based Operating Characteristics
6. Sample positive predictive value = a/(a + b) = 8/9 = 88.9%
7. Sample negative predictive value = d/(c + d) = 38/40 = 95%
8. Sample base rate = (a + c)/(a + b + c + d) = 10/49 = 0.20

Population-Based Operating Characteristics (Revised Estimates)
9. Pre-test probability (estimated population base rate) = 0.35 (mid-range compensation-seeking mild TBI base rate; Greve, Bianchini, & Doane, 2006)
10. Pre-test odds = prevalence/(1 - prevalence) = 35%/65% = 0.54
11. Post-test odds for a positive test result = pre-test odds x LR+ = 0.54 x 32.0 = 17.3
12. Post-test odds for a negative test result = pre-test odds x LR- = 0.54 x 0.21 = .11
13. Post-test probability if test sign is positive = 17.3/18.3 = .95
14. Post-test probability if test sign is negative = 0.11/1.11 = .10

Conclusion: Yes, the LMT has moderately strong sensitivity and strong specificity, suggesting that it would be potentially useful.
Note: data in the table and statistics above are based on p. 347, paragraph 4, Schipper et al. (2008).
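Because the operating characteristics in Table 11.3.2 follow mechanically from the 2 x 2 cell counts, they can be recomputed, and re-run under other assumed base rates, with a few lines of code. The sketch below reproduces the worksheet's values from the counts reported by Schipper et al. (2008); small differences from the table reflect rounding (for example, 38/39 is 97.4%, which the worksheet rounds to 97.5%, giving LR+ = 32 rather than about 31).

a, b, c, d = 8, 1, 2, 38                  # LMT vs. probable cognitive feigning, Table 11.3.2

sens = a / (a + c)                        # 0.80
spec = d / (b + d)                        # 0.974 (rounded to 97.5% in the worksheet)
lr_pos = sens / (1 - spec)                # about 31 (32 in the worksheet after rounding)
lr_neg = (1 - sens) / spec                # about 0.21

base_rate = 0.35                          # assumed population base rate (Greve et al., 2006)
pre_odds = base_rate / (1 - base_rate)
p_positive = pre_odds * lr_pos / (1 + pre_odds * lr_pos)   # post-test probability, failed LMT
p_negative = pre_odds * lr_neg / (1 + pre_odds * lr_neg)   # post-test probability, passed LMT

print(f"sensitivity = {sens:.2f}, specificity = {spec:.3f}")
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"post-test probability after a failed LMT = {p_positive:.2f}")   # about .94-.95
print(f"post-test probability after a passed LMT = {p_negative:.2f}")   # about .10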
Table 11.3.3 CAT Worksheet Example, Part 3: Can you apply this valid, important evidence about a diagnostic test in caring for your patient?

1. Is the diagnostic test available, affordable, accurate, and precise in your setting?
Yes, the LMT is available for purchase from its author, has been cross-validated, and appears comparable in operating characteristics to other PVTs. Training as a neuropsychologist or psychometrist entails accuracy and precision in administering neuropsychological tests.

2. Can you generate a clinically sensible estimate of your patient’s pre-test probability (from personal experience, prevalence statistics, practice databases, or primary studies)?
• Are the study patients similar to your own?
• Is it unlikely that the disease possibilities or probabilities have changed since the evidence was gathered?
Yes, and the literature on assessment of mild traumatic brain injury suggests a base rate of about 30–40% for malingered neurocognitive dysfunction when the evaluation includes a forensic component. For current purposes, a pre-test feigning base rate of 35% was assumed, based on reports in the literature. The compensation-seeking subjects in the study are comparable to the current patient. There does not appear to be any reason to believe that the base rate of malingered neurocognitive dysfunction in this population has changed over time.

3. Will the resulting post-test probabilities affect your management and help your patient?
• Could it move you across a test-treatment threshold?
• Would your patient be a willing partner in carrying it out?
Yes. Failing the LMT yields a post-test probability of about .95 at the assumed 35% base rate (sample positive predictive value = .89), making it likely that Malingered Neurocognitive Dysfunction (MNCD) is present. Passing the LMT yields a post-test probability of about .10 (sample negative predictive value = .95), making it likely that the test results are valid. The LMT is no more uncomfortable to take than most other neuropsychological tests.

4. Would the consequences of the test help your patient?
Failing the LMT provides moderately strong evidence of feigned cognitive deficits. While not necessarily likely to be viewed as helpful by the compensation-seeking patient, such a result could result in early termination of testing. Passing the LMT provides strong evidence of valid test results.

Conclusion: Yes, at the estimated base rate of 35%, failing the LMT raises the probability of feigned deficits from about .35 to roughly .95, and passing the LMT reduces it from about .35 to roughly .10.
REFERENCES
Alwes, Y. R., Clark, J. A., Berry, D. T. R., & Granacher, R. P. (2008). Screening for
feigning in a civil forensic setting. The Journal of Clinical and Experimental
Neuropsychology, 30, 1–8.
Anastasi, A., & Urbina, S. (1997). Psychological Testing (7th ed.). Upper Saddle River,
NJ: Simon & Schuster.
Berry, D. T. R., Baer, R. A., & Harris, M. J. (1991). Detection of malingering on the
MMPI: A meta-analysis. Clinical Psychology Review, 11, 585–598.
Boone, K. B. (2007). Assessment of Feigned Cognitive Impairment: A Neuropsychologic
al Perspective. New York: Guilford Press.
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig,
L. M., … de Vet, H. C. W. (2003). Towards complete and accurate reporting of
studies of diagnostic accuracy: The STARD initiative. British Medical Journal,
326, 41–44.
Bush, S. S., Ruff, R. M., Troster, A. I., Barth, J. T., Koffler, S. P., Pliskin, N. H., … Silver,
C. H. (2005). Symptom validity assessment: Practice issues and medical necessity: NAN
Policy & Planning Committee. Archives of Clinical Neuropsychology, 20, 419–426.
Dearth, C. D., Berry, D. T. R., Vickery, C. D., Vagnini, V. L., Baser, R. E., Orey,
S. A., & Cragar, D. E. (2005). Detection of feigned head injury symptoms on the
MMPI-2 in head injured patients and community controls. Archives of Clinical
Neuropsychology, 20, 95–110.
Greve, K. W., Bianchini, K. J., & Doane, B. M. (2006). Classification accuracy of the
test of memory malingering in traumatic brain injury: Results of a known-groups
analysis. Journal of Clinical and Experimental Neuropsychology, 28, 1176–1190.
Heilbronner, R. L., Sweet, J. J., Morgan, J. E., Larrabee, G. J., Millis, S. R., & Conference
Participants. (2009). American Academy of Clinical Neuropsychology Consensus
Conference statement on the neuropsychological assessment of effort, response
bias, and malingering. The Clinical Neuropsychologist, 23, 1093–1129.
Hsu, J., Brozek, J. L., Terracciano, L., Kries, J., Compalati, E., Stein, A. T., …
Schünemann, H. J. (2011). Application of GRADE: Making evidence-based recom-
mendations about diagnostic tests in clinical practice guidelines. Implementation
Science, 6, 62.
Inman, T. H., & Berry, D. T. R. (2002). Cross-validation of indicators of malinger-
ing: A comparison of nine neuropsychological tests, four tests of malingering, and
behavioral observations. Archives of Clinical Neuropsychology, 17, 1–23.
Inman, T. H., Vickery, C. D., Berry, D. T. R., Lamb, D., Edwards, C., & Smith, G. T.
(1998). Development and initial validation of a new procedure for evaluating ade-
quacy of effort given during neuropsychological testing: The Letter Memory Test.
Psychological Assessment, 10, 128–139.
Larrabee, G. J. (2007). Assessment of Malingered Neuropsychological Deficits.
New York: Oxford University Press.
Larrabee, G. J. (2012). Performance validity and symptom validity in neuropsychologi-
cal assessment. Journal of the International Neuropsychological Society, 18, 625–631.
Lijmer, J. G., Mol, B. W., Heisterkamp, S., Bonsel, G. J., Prinz, M. H., van der Meulen,
J. H. P., & Bossuyt, P. M. M. (1999). Empirical evidence of design-related bias in stud-
ies of diagnostic tests. Journal of the American Medical Association, 282, 1061–1066.
Meyer, G. J., Finn, S. E., Eyde, L. D., Kay, G. G., Moreland, K. L., Dies, R. R., … Reed,
G. M. (2001). Psychological testing and psychological assessment: A review of evi-
dence and issues. American Psychologist, 56, 128–165.
Orey, S. A., Cragar, D. E., & Berry, D. T. R. (2000). The effects of two motivational
manipulations on the neuropsychological performance of mildly head-injured col-
lege students. Archives of Clinical Neuropsychology, 15, 335–348.
Reitsma, J. B., Rutjes, A. W. S., Whiting, P., Vlassov, V. V., Leeflang, M. M. G., Deeks, J. J.
(2009). Chapter 9: Assessing methodological quality. In J. J. Deeks, P. M. Bossuyt,
& C. Gatsonis (Eds.), Cochrane Handbook for Systematic Reviews of Diagnostic Test
Accuracy Version 1.0.0. The Cochrane Collaboration, 2009. Available from: http://
srdta.cochrane.org/.
Rogers, R. (1988). Clinical Assessment of Malingering and Deception (1st ed.).
New York: Guilford Press.
Rogers, R. (2012). Clinical Assessment of Malingering and Deception (3rd ed.).
New York: Guilford Press.
Rogers, R., Sewell, K. W., Martin, M. A., & Vitacco, M. J. (2003). Detection of feigned
mental disorders: A meta-analysis of the MMPI-2 and malingering. Assessment, 10,
160–177.
Schipper, L. J., Berry, D. T. R., Coen, E., & Clark, J. A. (2008). Cross-validation of a
manual form of the Letter Memory Test using a Known-Groups methodology. The
Clinical Neuropsychologist, 22, 345–349.
Slick, D. J., Sherman, E. M., & Iverson, G. L. (1999). Diagnostic criteria for malingered
neurocognitive dysfunction: Proposed standards for clinical practice and research.
The Clinical Neuropsychologist, 13, 545–561.
Sollman, M. J., & Berry, D. T. R. (2011). Detection of inadequate effort on neuro-
psychological testing: A meta-analytic update and extension. Archives of Clinical
Neuropsychology, 26, 774–789.
Straus, S., Richardson, W. S., Glasziou, P., & Haynes, R. B. (2011). Evidence-Based
Medicine: How to Practice and Teach EBM (4th ed.). Edinburgh: Churchill
Livingstone.
Treweek, S., Oxman, A. D., Alderson, P., Bossuyt, P. M., Brandt, L., Brozek, J., …
DECIDE Consortium. (2013). Developing and evaluating communication strategies
to support informed decisions and practice based on evidence (DECIDE): Protocol
and preliminary results. Implementation Science, 8, 6.
Vagnini, V. L., Sollman, M. J., Berry, D. T. R., Granacher, R. P., Clark, J. A., Burton, R.,
… Saier, J. (2006). Known-groups cross-validation of the Letter Memory Test in
a compensation-seeking mixed neurologic sample. The Clinical Neuropsychologist,
20, 289–305.
Vickery, C. D., Berry, D. T. R., Dearth, C. S., Vagnini, V. L., Baser, R. E., Cragar, D. E., &
Orey, S. A. (2004). Head injury and the ability to feign neuropsychological deficits.
Archives of Clinical Neuropsychology, 19, 37–48.
Vickery, C. D., Berry, D. T. R., Inman, T. H., Harris, M. J., & Orey, S. A. (2001).
Detection of inadequate effort on neuropsychological testing: A meta-analytic
review of selected procedures. Archives of Clinical Neuropsychology, 16, 45–73.
Whiting, P. F., Rutjes. A. W., Westwood, M. E., Mallett, S., Deeks, J. J., Reitsma, J. B.,
… the QUADAS-2 Group. (2011). QUADAS-2: A revised tool for the quality assess-
ment of diagnostic studies. Annals of Internal Medicine, 155, 529–536.
12
The primary aim of the present chapter is to review the methods designed
to help with integrating evidence into clinical practice, namely, the methods
of critical appraisal (critically appraised topics; CAT) for systematically evalu-
ating the clinical relevance of empirical evidence, with particular emphasis in
this chapter on neuropsychological intervention studies. Beginning with the
development of a focused clinical question, the process of evaluating an inter-
vention study with CAT analysis—including methods for assessing study qual-
ity, calculating patient-relevant statistics, and moving from group-level data to
individual patient recommendations—will be reviewed in detail, with worked
examples throughout. A secondary aim is to demonstrate the utility, and per-
haps more important, the simplicity of making empirically based recommenda-
tions for patients, and how these data might be utilized and integrated into an
EBP. Ultimately, the goal is to improve patient outcomes and facilitate the merg-
ing of science with clinical practice via a simple systematic method.
(4) determining whether these valid and important results are applicable to a
particular patient. From these determinations, quantifiable metrics are derived
that are then used to provide tailored recommendations as to whether or not the
reviewed intervention is likely to be beneficial. In addition to making informed
treatment recommendations, a further benefit of CAT analysis is that it creates a
condensed summary of the best available evidence, with practice recommenda-
tions that can be stored for later reference. These recommendations can then be
used as an evidence-foundation for development of clinical policies and practice
guidelines. As each step of the CAT methodology is reviewed, the specific pro-
cedures will be demonstrated and applied using a realistic hypothetical clinical
scenario, presented in Box 12.1.
Patient (or Problem), the Intervention, the Comparison, and the Outcome, and
can be applied to a multitude of clinical questions. Appropriate formulation of
PICO-based questions can also assist with identifying relevant search terms and
keywords to be used in the literature reviews. Although it is not entirely neces-
sary to specify each component, the more information and detail are provided,
the more refined the obtained results will be. In most instances, identifying the
patient or problem is straightforward and obvious. In our scenario, for example,
the patient is a 72-year-old college-educated man. Alternatively, identifying the
patient’s concern about increasing forgetfulness would frame the question using
a problem-focused perspective. Either approach would be appropriate, though
focusing on the patient’s problem of interest may aid in establishing search
terms. For even greater depth, describing both the patient and the problem of
interest is feasible.
In the example provided, the patient himself has specified a particular inter-
vention (i.e., computer-based cognitive training) that is of interest to him,
though this is not always the case. This is where good clinical skill and expertise
are needed to help identify potential interventions. Though some patients may
be particularly well informed and come with interest in an identified interven-
tion, more often the clinician is relied upon as the expert to suggest a treatment.
Identifying a comparison condition may also be obvious in some instances
where the patient’s concerns are of a type familiar to the clinician, in others,
the comparison may be implied, but in still others, it may need to be specified.
When comparing medications, for example, a patient may want to know if one
particular drug may be more effective than another, or whether or not a cur-
rent medication has more side-effects than an alternative. In behavioral health
and therapy, similar questions can be asked about the relative benefits of one
type of intervention compared to another (e.g., prolonged exposure therapy vs.
cognitive-processing therapy, mindfulness-based stress reduction vs. anxiolytic
medication). In such instances, the alternative or comparison should be speci-
fied whenever possible. Oftentimes, however, the comparison may be no treat-
ment or intervention at all (i.e., maintain status quo), which is the case in our
example.
The fourth major component of a clinically focused question is to identify the
outcome of interest. Even though this often serves as the starting point in formu-
lating a question (e.g., What’s the goal?), it is important to ensure that an articu-
lated, operationalized, and realistic goal has been clearly specified. Doing so not
only clarifies the goal, but also helps frame the scope and trajectory of interven-
tion. For example, the outcome of interest in one study may be delay of cogni-
tive decline or prevention of dementia, whereas in another study, the outcome
of interest may be restoration of cognitive functioning or increasing functional
independence. Each of these goals may be applicable to similar patient popula-
tions and comparable interventions, but reflect inversely related outcomes (i.e.,
reducing odds of adverse outcome vs. increasing odds of beneficial outcome). As
will be shown later, this has particular relevance for the metrics derived from
CAT analysis.
[Figure 12.1 (not reproduced here) depicts the evidence pyramid: quality of evidence increases from critically appraised individual articles (article synopses), through critically appraised topics (evidence syntheses), to systematic reviews at the apex. All three levels constitute filtered information, and the TRIP Database searches them simultaneously.]
Figure 12.1 Levels of empirical evidence. EBM Pyramid and EBM Page Generator,
(c) 2006 Trustees of Dartmouth College and Yale University. All rights reserved.
Produced by Jan Glover, David Izzo, Karen Odato, and Lei Wang.
among published studies, and these data can oftentimes prove elusive. For this
reason, several specific sets of reporting guidelines have been developed, each
of which targets a specific type of publication. Perhaps most relevant to neuro-
psychological interventions is the Consolidated Standards of Reporting Trials
(CONSORT) statement, which includes a participant flowchart (Schulz, Altman,
& Moher, 2010) and has been reviewed for neuropsychological studies (Miller,
Schoenberg, & Bilder, 2014). Also of interest, the Preferred Reporting Items for
Systematic Reviews and Meta-Analysis (PRISMA) statement (Moher, Liberati,
Tetzlaff, & Altman, 2009) includes relevant points for inclusion in systematic
reviews and meta-analysis. These checklists contain the minimum necessary
criteria for transparent reporting of research findings, and are increasingly becoming required components for publication. In order to complete a CAT
analysis, for example, categorical outcome data must be readily available or able
to be determined, as these data provide the basis for calculation of risk-reduction
(RR), number-needed-to-treat (NNT), and related critical appraisal metrics.
Adherence to the CONSORT statement will ensure that these data are readily
reported, as a participant flowchart is required. In some studies, these data will
be readily available and explicitly reported. In others, these values may need to
be calculated (e.g., when the outcome is reported as a percentage of participants).
Hence, when identifying evidence, seeking out studies that adhere to report-
ing guidelines can facilitate CAT analysis. Additional information on reporting
guidelines is available through the Equator Network (www.equator-network.
org), and readers are strongly encouraged to familiarize themselves with this
information.
As has been previously mentioned, a core component of EBP is that the best
available evidence is used. Figure 12.1 provides a general hierarchical frame-
work, and several grading systems have been developed to further facilitate
rapid assessment of evidence quality. Perhaps one of the most well-known is that
from the Oxford Centre for Evidence-Based Medicine (CEBM), which provides
a five-step model for various clinical questions (Howick et al., 2011). For all but
prevalence studies, a systematic review is considered Level 1 evidence (http://
www.cebm.net/ocebm-levels-of-evidence/). For an intervention study, a single
RCT is considered Level 2 data, and a non-randomized trial is considered Level
3. The Grading of Recommendations Assessment, Development and Evaluation
(GRADE) Working Group is another example of an international collaboration
that has led to the development of a well-established method for grading the
quality of evidence (www.gradeworkinggroup.org/index.htm). It is important to
use the highest level of evidence available when conducting a CAT analysis, as
the quality of output from critical appraisal is directly tied to the quality of input.
However, as an aside, a common misunderstanding of the CAT process is that a
successful CAT is only one that finds relevant, high-quality evidence. In fact, this
is not the only purpose of a CAT. A successful CAT finds and evaluates the best
available evidence, whether that evidence is derived from a high-quality study or
a poor-quality study, or even no study at all. The outcome of the CAT is then to
provide the clinician undertaking the CAT with an informed evaluation of the
best available evidence, which may include the conclusion that there is no good
evidence bearing on a particular question. Alternatively, it may be concluded
that the best available evidence is derived from a poor-quality study that should
not be relied upon to guide clinical practice.
To continue with our example, an initial search of the Cochrane Library
returned no results (searched: June 20, 2015). Turning to PubMed to find a high-
quality, well-reported systematic review of computerized cognitive training
returned several results, though none were sufficiently reported to allow CAT
analysis. Working down the evidence ladder, a single RCT was sought next.
Search terms included: cognitive training AND older adults (note the use of the
Boolean operator “and” to refine the search to results including both terms).
Search filters applied were: clinical trial, published in the last 5 years, studying
humans, and published in English. These search criteria returned 291 results.
Titles and abstracts were initially reviewed, and promising articles were reviewed
in detail (exemplifying the importance of a good title!). The 10-year update of
the Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE)
cognitive training trial by Rebok et al. (2014) was identified as a relevant study
and was selected for further CAT analysis. It is considered Level 2 evidence per
the Oxford CEBM system, which, given the lack of a systematic review, was the
highest level of evidence available at the time of the search (http://www.cebm.
net/ocebm-levels-of-evidence/). The ACTIVE trial was one of the largest RCTs
to date of cognitive training in older adults (n = 2,832; ages 65–94). Targeted
cognitive domains included in the intervention were memory, processing speed,
and reasoning and the comparison group was a no-contact control group (Ball
et al., 2002). Further review of the paper found that categorical outcomes could
be calculated from the data presented, thereby making this study useful for full
CAT analysis.
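To make the search strategy concrete, the query above could be entered in PubMed roughly as follows. The field-tagged version is only a sketch of commonly used PubMed syntax, not a record of the exact search that was run:

cognitive training AND older adults
(filters: Clinical Trial; published in the last 5 years; Humans; English)

or, with explicit field tags, approximately:

(cognitive training) AND (older adults) AND clinical trial[pt] AND humans[mh] AND english[la]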
four cells. Note that each cell represents a unique group of participants; there
should be no overlap in membership between cells. As previously mentioned,
this can be far easier said than done, as these data are not always explicitly pre-
sented. Even for studies with high-quality methods, should these data not be
available in the report, this would be a valid stopping point for further CAT
analysis. Thus, without proper and thorough reporting, studies rated highly for
methods quality can still produce less interpretable data.
Although these data may be directly reported in some papers, in others they
may be more elusive. For example, in the paper reviewed here, the research par-
ticipants meeting outcome criteria are reported in terms of a percentage (found
in Rebok et al., 2014, their Table 2, p. 21), thereby requiring working backwards
to calculate the exact number of participants who met outcome criteria (note
that, for this particular purpose, however, reporting percentages in this manner
can facilitate CAT analysis, discussed later). When determining these values, it
is very important to keep in mind the specific outcome of interest, and whether
or not it is beneficial or harmful. In the Rebok et al. paper, the reported values
relate to the categorical outcome variable of the number of people who remained
“at or above baseline” (i.e., remained stable) over the course of the 10-year inter-
val, which would be considered a benefit. If, however, the outcome of interest
was cognitive decline (as is the case in the presently worked example), a further
calculation would be necessary to establish the risk of harm, defined as cognitive
decline. This highlights the importance of remaining mindful of the outcome of
interest.
For both the treatment group and the control group, event rates need to be cal-
culated for the outcome of interest. These are simply the proportion of individu-
als demonstrating the outcome of interest out of the total number of participants
in the respective group, expressed as a percentage (this is why the use of percent-
ages in the Rebok et al., 2014, paper is beneficial; see their Table 2, p. 21) or deci-
mal proportion. In the case of an ITT analysis, the event rates are the proportion
of people showing the outcome out of the total number of people randomized to
that treatment condition. Referencing Table 12.1, the value in cell C represents
the number of individuals in the control group who satisfy outcome criteria.
Dividing this value by the total number of individuals in the control group ren-
ders the Control Event Rate (CER). An analogous value is calculated for the treat-
ment, or experimental, group, referred to as the Experimental Event Rate (EER).
The corresponding value in Table 12.1 would be the value of cell A divided by
the total number of individuals in the experimental group (A/(A + B)). The CER
and EER serve as the foundation of CAT analysis for a treatment study and are
utilized to derive the remaining statistics. In any given intervention study, there
will be an event rate associated with each of the treatment conditions, as well
as for each control condition. The Rebok et al. (2014) paper, for example, would
yield three EERs and three CERs (one experimental event rate and control event
rate for each of the memory training, reasoning training, and processing-speed
training conditions).
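As a minimal sketch of the event-rate arithmetic just described (using hypothetical cell counts laid out as in Table 12.1, not values from the ACTIVE trial):

# Hypothetical 2 x 2 counts: A = treated participants with the outcome, B = treated without;
# C = control participants with the outcome, D = control without.
A, B, C, D = 30, 170, 50, 150

EER = A / (A + B)  # Experimental Event Rate = 0.15
CER = C / (C + D)  # Control Event Rate = 0.25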
Using the calculated event rates, the extent of reduction in risk of the outcome
can then be calculated in both absolute and relative terms, which is then used to
determine the number of individuals to treat before an outcome is expected (for
details of formula, see Straus et al., 2011). The Absolute Risk Reduction (ARR) is
simply the extent to which an individual’s risk of the outcome is reduced from
receiving the treatment. ARR is the absolute value of the difference between the
EER and the CER, that is, |CER – EER| (Straus et al., 2011). The Relative Risk
Reduction (RRR) is an extension of the ARR, calculated as a function of the over-
all prevalence of the outcome in the specific control group studied. RRR is calcu-
lated by dividing the ARR by the CER and represents the reduced risk associated
with the specific intervention in comparison to those who received the alternative
treatment. Depending on the outcome of interest, relative and absolute risk can
also be thought of as “beneficial” (i.e., relative benefit increase; absolute benefit
increase). This would be applicable for desirable outcomes in which the standard
of care is improvement in functioning (e.g., maintenance of functional indepen-
dence), as opposed to prevention of an adverse event (e.g., transition to assisted
living). In the example study, if we were to utilize the event rates reported directly
in the paper as being good outcomes (i.e., the percentage of individuals remaining
at or above baseline at follow-up), the resulting event rates would lead to calcula-
tion of the relative benefit increase and absolute benefit increase.
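Expressed as a calculation, and again using hypothetical event rates rather than the ACTIVE data, the two risk-reduction statistics are:

CER, EER = 0.25, 0.15   # hypothetical control and experimental event rates
ARR = abs(CER - EER)    # Absolute Risk Reduction = 0.10
RRR = ARR / CER         # Relative Risk Reduction = 0.40 (the ARR as a proportion of the control risk)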
The Number Needed to Treat (NNT) is the inverse of the ARR (i.e., dividing 1
by the ARR) and provides an indication of the number of people who would need
to receive the treatment in order to prevent one additional harmful or unde-
sirable outcome (Straus et al., 2011). The NNT is reported as a whole number
and is expressed relative to the length of time represented by the study. In the
ACTIVE trial, for example, the resulting NNT would reflect the number of indi-
viduals who would need to be treated in a 10-year period (as this was the length
of follow-up) in order to prevent one additional undesirable outcome defined in
terms of deterioration from baseline. The duration of intervention introduces
an added calculation when comparing CAT analyses and NNTs between mul-
tiple interventions. Unless the studies were completed in the same amount of
time, one of the two time intervals will need to be adjusted by using the shorter
period as a scale factor in an effort to mathematically account for the difference
in length. As shown in Straus et al. (2011; p. 84): NNT (hypothetical) = NNT
(observed) x (observed time/hypothetical time). In general, smaller NNT val-
ues are preferable, though there is no universal standard for what is considered
acceptable. Utilizing clinical expertise, individual costs and burdens associated
with the treatment, and the severity of the outcome that is to be prevented are all
factors that should be considered when evaluating NNTs, and doing so requires
careful consideration.
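A corresponding sketch for the NNT and for the time adjustment described by Straus et al. (2011), continuing the hypothetical values above:

ARR = 0.10
NNT = round(1 / ARR)   # number needed to treat = 10, reported as a whole number

# Rescaling an NNT observed over 10 years to a hypothetical 5-year horizon, using
# NNT (hypothetical) = NNT (observed) x (observed time / hypothetical time):
observed_years, hypothetical_years = 10, 5
NNT_rescaled = NNT * (observed_years / hypothetical_years)   # = 20 over the shorter period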
As with most neuropsychological measures, the risk reduction statistics and
NNT values calculated should not be considered precise estimates, but rather
a foundation on which a confidence interval (CI) can be constructed that is likely to contain the true value. CIs can and should be calculated for both event rates, as
well as the NNT, for clinical decision-making. These are typically calculated as a
95% CI and interpreted much the same way as any other CI in clinical practice.
When determining the significance of an effect, if the 95% CI built around RRR
or ARR includes zero, it cannot be said with any certainty that the treatment is
associated with any reliable reduction in risk over the control group. The formula
for calculating the CI is long and complex, and it is advisable to use an online cal-
culator or CAT maker (for example: www.cebm.net/catmaker-ebm-calculators)
to calculate the CIs in order to reduce the possibility of calculation errors.
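For readers who want to check a calculator's output, one standard large-sample approximation for the ARR interval is sketched below; the NNT interval is obtained by inverting the ARR limits. This is an illustration of the usual normal-approximation approach, and a given CAT maker may use a slightly different formula:

import math

def arr_with_ci(cer, eer, n_control, n_treated, z=1.96):
    """Absolute risk reduction and its approximate 95% CI (normal approximation)."""
    arr = abs(cer - eer)
    se = math.sqrt(cer * (1 - cer) / n_control + eer * (1 - eer) / n_treated)
    return arr, arr - z * se, arr + z * se

# The NNT interval runs from 1/(upper ARR limit) to 1/(lower ARR limit);
# it is undefined when the ARR interval includes zero (no reliable risk reduction).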
To calculate the relevant statistics for the ACTIVE trial, we will focus ini-
tially on the memory intervention training, as this is an intervention of primary
interest for our hypothetical patient. As previously mentioned, the reporting of
percentages by Rebok et al. facilitates CAT analysis, as the event rates for cog-
nitive stability are reported directly. For our purposes, we are using cognitive
decline as the outcome of interest to facilitate interpretation as a reduction in
risk. Thus, we simply need to use the remaining percentage of participants as our
event rates, which reflect the proportion of people with memory decline in the
respective treatment and control groups.
to calculate the event rate, not, as was done in the preceding sections, as a
proportion of those who completed the study, but as a proportion of the people
who were randomized at the commencement of the study. At the beginning
of the study, 704 people were randomized to the control condition and 712 to
the Speed Training condition (see Rebok et al., 2014, Figure 1). Therefore, an
ITT analysis that assumed that all those lost to follow-up were treatment suc-
cesses (stable) could estimate the CER as 149 declined out of 704 randomized
to the control condition, or .212. The EER would be 93 declined out of 712
randomized to the Speed Training condition, or .131. These event rates lead
to RRR = .38 (95% CI: .20–.57), an ARR of .081 (95% CI: .042–.120), and an
NNT = 12 (95% CI: 8–24). Fortunately, the ITT analysis still leads to the infer-
ence of an effective treatment, although the NNT of 12 is not as impressive
as the NNT of 4 reported from the Speed Training completers analysis in the
previous section.
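The intention-to-treat figures just quoted can be reproduced directly from the randomization counts reported by Rebok et al. (2014, Figure 1). A minimal sketch, assuming (as above) that everyone lost to follow-up remained stable:

declined_control, randomized_control = 149, 704
declined_speed, randomized_speed = 93, 712

CER = declined_control / randomized_control   # 0.212
EER = declined_speed / randomized_speed       # 0.131
ARR = CER - EER                               # 0.081
RRR = ARR / CER                               # 0.38
NNT = 1 / ARR                                 # about 12 over the 10-year follow-up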
The approach used in the above example, in which all missing data are assumed not to have experienced the event, is just one of several possible “imputation” methods for ITT. In the case of the Rebok et al. study, this approach assumes
that all missing cases were good outcomes (stable) in all conditions, including
the control group. The Oxford CEBM CATmaker calculator estimates an ITT
analysis in the same way. See Gupta (2011) and the Cochrane Handbook (http://
handbook.cochrane.org/chapter_16/16_2_2_intention_to_treat_issues_for_
dichotomous_data.htm) for discussion of pros and cons of various alternative
approaches to ITT analysis.
potential benefits, in the context of the patient’s expected outcomes and prefer-
ences, is also important. Bearing in mind availability of resources and the palat-
ability of an intervention for a patient can readily influence adherence. Failure to
do so may increase the odds of a poor outcome.
An important caveat of the NNT is that it does not account for individual
patient factors affecting risk. The NNT reflects the probabilities of risk-reduc-
tion in a population similar to that included in the study subject to appraisal. To
refine CAT analysis even further, the NNT can be custom-tailored to account
for a patient’s individual level of risk. This is done by comparing the individual
patient to the control group on the basis of individual characteristics, which is
expressed as the decimal f. A value of 1.0 implies that the patient has a chance of
benefit from treatment comparable to that of the average control patient. A value
of 2.0 would indicate double the chances of benefit, and a value of 0.5 signifies
half the chance of benefit. Assigning f values is certainly not a precise science and draws heavily on clinical acumen and expertise. Dividing the study NNT by our patient's f value yields a more refined, individualized NNT for that particular patient. The degree of individualization can only go so far, and there is no clear restriction on the size of the f value. However, if consider-
able adjustment is necessary, it raises the question of whether the reviewed study
is appropriate. Unless the appraised study has previously been determined to
be the absolute best available, if there are drastic differences between a patient
and the control group, it may be best to try to find a study of more comparable
participants.
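The adjustment itself is a single division; a minimal sketch, with the patient-specific factor written here as f:

import math

def individualized_nnt(study_nnt, f):
    # Divide the study NNT by the clinician-assigned factor f and report a whole number.
    return math.ceil(study_nnt / f)

# For example, the Speed Training completers NNT of 4, for a patient judged to be at
# twice the average control risk (f = 2.0), yields an individualized NNT of 2,
# matching the revised value given below.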
In the case of our patient, he is of similar age and educational background to
the average of the control group. However, recall that he has two first-degree rel-
atives with dementia, which increases his risk at least twofold (Cupples, Farrer,
Sadovnick, Relkin, Whitehouse, & Green, 2004). Although typically there would
be no reason to calculate any additional CAT statistics for an intervention that
is not associated with a significant risk reduction (in this case, the memory-
training intervention), we will proceed with our example for continuity and
illustration purposes.
By dividing the NNT from the memory-training group in our study by 2.0
(our patient’s individual chance of benefit), we arrive at an individual NNT of
10. For the processing-speed training, the revised completers NNT is 2. Thus,
we would need to treat 10 higher-risk people like our patient for 10 years with
the study’s memory-training intervention to prevent one of them from declining
in memory. In other words, our patient has a 1 in 10 chance, or 10% likelihood,
of benefitting from the memory treatment. Contrasted to the processing-speed
intervention, where our patient has a 50% chance of maintaining his process-
ing-speed ability should he undergo treatment, the odds that his memory will
improve or remain stable are much, much lower. Our patient clearly has a pref-
erence for cognitive training, and implementation is feasible. Results from our
CAT analysis suggest that there may be benefits for processing speed. However,
our patient’s expectation is to improve his memory functioning, where results
from this study yield equivocal findings. Armed with these data from our
complete CAT analysis, we can now provide our patient with information that is
much more informative (e.g., percentages, odds, NNTs, etc.).
Using Completed CATs
Once a CAT has been completed on a specific study, it should certainly be saved
for future reference. A single-page summary, documenting each of the aforementioned steps, presenting the quantitative outcomes and relevant event rates, and stating the conclusions drawn, should be prepared as part of the appraisal, and format-
ted worksheets are readily available to facilitate archiving (see Oxford CEBM
site referenced previously). In addition to the event rates and NNT values, a suc-
cinct clinical conclusion should be stated, as well as any caveats. It is also impor-
tant that the summary include the citation of the article and who completed
the CAT analysis, as well as an expiration date. Specifying an expiration date
will help ensure that only current evidence is used for clinical decision mak-
ing. By archiving completed CATs, a repository of evidence will accumulate
that can be quickly referenced when a similar clinical scenario is encountered,
facilitating EBP.
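As an illustration of the kind of single-page record worth archiving, the fields might be captured as simply as the following (field names and layout are suggestions only, not a prescribed format):

cat_summary = {
    "citation": "Rebok et al. (2014), Journal of the American Geriatrics Society, 62, 16-24",
    "comparison": "Speed Training vs. no-contact control, 10-year follow-up, ITT analysis",
    "CER": 0.212, "EER": 0.131, "ARR": 0.081, "RRR": 0.38, "NNT": 12,
    "bottom_line": "Speed training reduced the risk of 10-year decline; memory-training findings were equivocal.",
    "caveats": "ITT assumed that participants lost to follow-up remained stable.",
    "appraised_by": "(name of the clinician completing the CAT)",
    "expiration_date": "(date after which the evidence should be re-checked)",
}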
While the current number of published CATs is limited, this is a remediable
situation. With the simplicity of CAT analysis, updating reporting guidelines to
require a completed CAT analysis with each published RCT is one feasible solu-
tion. Doing so would also help move reporting away from the limitations associated with relying solely on alpha levels below a minimum threshold and on mean group differences. Creating
a venue for publishing CATs is also warranted, as doing so would not only facili-
tate publication, it would simultaneously create a searchable repository of com-
pleted CATs that could be used by many clinicians, comparable to searching the
empirical literature.
As an alternative to a repository of completed CATs, a collaborative knowledge-
base of published categorical outcomes could be generated from current and
future clinical trials that would allow individual CATs to be completed on the
basis of patient parameters entered by clinicians. For example, a clinician could
enter the demographic and clinical data for a given patient, and results returned
would include several NNTs that are automatically generated. Individual risk
factors could also be automatically accounted for, and adjusting for time differ-
ences between CATs (i.e., scaling the NNT) automated to facilitate comparison
between interventions. Essentially, such an automated CAT maker would oper-
ate based on individual patient parameters, which could be used to custom-tailor
evidence-based recommendations.
in RCTs, it can be difficult to make the leap from the participants in the study to
our individual patients, who are more often than not going to present with more
complex clinical situations than the ideal patients studied. Use of individual risk
adjustments can help with this problem of generalizability to some extent, but
this method can only do so much. It goes without saying that continued research
on new interventions and with diverse populations is one way to address this
issue, but certainly it is not feasible to study every possible clinical presenta-
tion or intervention available. Improving reporting of current research, however,
would at least allow greater application of CAT methods to more of the research
that does get published.
Another limitation is that there is currently a very limited number of pub-
lished CATs available. Although completing a CAT requires minimal time and
effort, especially with the use of automated calculators, very few individual CATs
have been formally published. This may be due in part to the lack of a publish-
ing forum, but it is also likely to be related to their relatively low utilization,
particularly among the behavioral sciences. Not only does this decrease awareness of critical-appraisal methods, but, without regular dissemination or a searchable repository, practitioners who complete a CAT on a study that has already been appraised are engaging in redundant work.
REFERENCES
APA Presidential Task Force. (2006). Evidence-based practice in psychology. American
Psychologist, 61(4), 271–285. doi:10.1037/0003-066x.61.4.271
Ball, K., Berch, D. B., Helmers, K. F., Jobe, J. B., Leveck, M. D., Marsiske, M., … Willis,
S. L. (2002). Effects of cognitive training interventions with older adults: A ran-
domized controlled trial. Journal of the American Medical Association, 288(18),
2271–2281.
Canese, K. (2006). PubMed celebrates its 10th anniversary! NLM Tech Bulletin,
Sep–Oct(352), e5.
Chambless, D. L., & Hollon, S. D. (1998). Defining empirically supported therapies.
Journal of Consulting and Clinical Psychology, 66(1), 7–18.
Cochrane Collaboration. (2014a, May 26, 2014). Cochrane Database of Systematic
Reviews in Numbers. Retrieved June 20, 2015, from http://community.cochrane.
org/cochrane-reviews/cochrane-database-systematic-reviews-numbers.
Cochrane Collaboration. (2014b, March 31, 2015). Cochrane Organizational
Policy Manual. Retrieved June 20, 2015, from http://community.cochrane.org/
organisational-policy-manual.
Cook, T. D., & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis for
Field Settings. Boston, MA: Houghton Mifflin.
Cupples, L. A., Farrer, L. A., Sadovnick, A. D., Relkin, N., Whitehouse, P., & Green, R. C.
(2004). Estimating risk curves for first-degree relatives of patients with Alzheimer’s
disease: The REVEAL study. Genetics in Medicine, 6(4), 192–196. Retrieved from
http://d x.doi.org/10.1097/01.GIM.0000132679.92238.58.
Gupta, S. K. (2011). Intention-to-treat concept: A review. Perspectives on Clinical
Research, 2(3), 109–112.
Howick, J., Chalmers, I., Glasziou, P., Greenhalgh, T., Heneghan, C., Liberati, A.,
… Thornton, H. (2011). The 2011 Oxford CEBM Levels of Evidence
(Introductory Document). Retrieved June 22, 2015, from http://www.cebm.net/
ocebm-levels-of-evidence/.
National Library of Medicine. (2014a, June 18, 2015). Fact Sheets: MeSH. Retrieved
June 20, 2015, from http://www.nlm.nih.gov/pubs/factsheets/mesh.html.
National Library of Medicine. (2014b, August 26, 2014). Fact Sheets: PubMed. Retrieved
June 20, 2015, from http://www.nlm.nih.gov/pubs/factsheets/pubmed.html.
Masic, I., Miokovic, M., & Muhamedagic, B. (2008). Evidence based medicine—New
approaches and challenges. Acta Informatica Medica, 16(4), 219–225. http://doi.org/
10.5455/aim.
Miller, J. B., Schoenberg, M. R., & Bilder, R. M. (2014). Consolidated Standards of
Reporting Trials (CONSORT): Considerations for neuropsychological research.
The Clinical Neuropsychologist, 28(4), 575–599. doi:10.1080/13854046.2014.907445.
Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. G. (2009). Preferred reporting items
for systematic reviews and meta-analyses: The PRISMA statement. Annals of
Internal Medicine, 151(4), 264–269, W264.
National Academy of Neuropsychology. (2001). Definition of a clinical neuropsycholo-
gist. The Clinical Neuropsychologist, 3(1), 22.
Rebok, G. W., Ball, K., Guey, L. T., Jones, R. N., Kim, H. Y., King, J. W., … Willis,
S. L. (2014). Ten-year effects of the advanced cognitive training for independent and
vital elderly cognitive training trial on cognition and everyday functioning in older
adults. Journal of the American Geriatrics Society, 62(1), 16–24. doi:10.1111/jgs.12607.
Sackett, D. L., Rosenberg, W. M., Gray, J. A., Haynes, R. B., & Richardson, W. S. (1996).
Evidence based medicine: What it is and what it isn’t. British Medical Journal,
312(7023), 71–72.
Sauve, S., Lee, H. N., Meade, M. O., Lang, J. D., Farkouh, M., Cook, D. J., & Sackett,
D. L. (1995). The critically appraised topic: A practical approach to learning critical
appraisal. Annals of the Royal College of Physicians and Surgeons of Canada, 28(7),
396–398.
Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated
guidelines for reporting parallel group randomized trials. Annals of Internal
Medicine, 152(11), 726–732. doi:10.7326/0003-4819-152-11-201006010-00232.
Straus, S. E., Glasziou, P., Richardson, W. S., & Haynes, R. B. (2011). Evidence-Based
Medicine: How to Practice and Teach It (4th ed.). Edinburgh, UK: Churchill
Livingstone.
TRIP Database. (1997). Retrieved June 20, 2015, from https://www.tripdatabase.com/
info/.
Wyer, P. C. (1997). The critically appraised topic: Closing the evidence-transfer gap.
Annals of Emergency Medicine, 30(5), 639–640.
13
STEPHEN C. BOWDEN
terms of what are now well-established, though evolving, criteria for quality. An
evidence-based practitioner also learns to identify quality opinions. For exam-
ple, an expert opinion on any area of actively researched clinical practice that
does not reflect the conclusions from a well-conducted systematic review should
immediately raise questions regarding the scientific status of that expert opin-
ion. Critical reflection on quality of evidence is a set of skills readily acquired by
practitioners and underlies the critical-appraisal techniques described in detail
in Chapters 8, 11, and 12 in this volume. Effective critical-appraisal skills begin
with scrutiny of individual diagnostic-validity or treatment studies (Chapters 11
and 12), then generalize to other forms of study and then to critical-appraisal of
systematic reviews. Clinicians who use critical appraisal skills can be confident
that they are using criteria to evaluate the quality of clinical evidence that have,
in turn, been subject to some of the most intense scientific scrutiny and peer-
review of any quality criteria in the history of health care, and have not been
bettered (Greenhalgh, 2006).
as they typically map onto each other and represent different levels of analy-
ses within the broader personality hierarchy” (p. 75, this volume). Lee and col-
leagues conclude their chapter by showing that some of the best contemporary
psychopathology-assessment instruments incorporate these theoretical innova-
tions to provide clinicians with the kind of theoretically motivated tests that
build on a strong history of criterion-related validity evidence. Together, these
chapters suggest that neuropsychologists have access to robust models of cog-
nition and psychopathology, models with an impressive array of explanatory
power, to guide theoretical refinement and evidence-based assessment practices,
to an extent not available before.
under conditions of high reliability, all recommended RCIs are likely to lead to
similar clinical inferences. However, under conditions of lower reliability and,
in particular circumstances regarding the observed baseline scores, some meth-
ods may be biased to detect change. These authors conclude that more work is
required to identify the optimal application of RCIs.
In Chapter 7, Chelune then shifts the focus to the key skills for knowledge
maintenance and professional development, when a clinician aims to achieve
the highest level of clinical expertise. Chelune describes some of the histori-
cal origins of evidence-based practice, showing that effective evidence-based
practice is a values-driven combination of best-evidence, expertise, and patient
values. Chelune also highlights the distinction from traditional notions of
expertise derived from experiential learning. He then describes the core skills in
evidence-based practice, including the ability to ask answerable questions, find
good study-evidence, then evaluate the quality of evidence for patient impact
and relevance. The goal of finding and evaluating quality evidence with patient
impact involves techniques that are described under critical-appraisal (Straus et
al., 2011), a recurrent theme in this book, and described in detail in Chelune’s
chapter along with worked examples in Chapters 11 and 12. Chelune shows how
the focus of expertise has shifted from the older notions of experiential learn-
ing in seasoned clinicians to the criterion of how well any clinician, younger or
older, is able to access, identify, and interpret quality evidence. Chelune defines
evidence-based clinical neuropsychological practice, “not as a discrete action or
body of knowledge, but as a process—an ongoing ‘pattern’ of routine clinical
practice” (Chapter 7, this volume, p. 160).
One of the most dramatic technological developments over recent decades in
clinical neuroscience has been modern neuroimaging. However, with dramatic
developments in clinical and research techniques, the evidence-based practi-
tioner faces the challenge of discriminating between established techniques
with demonstrated clinical validity, versus techniques that are not yet ready for
clinical application. In Chapter 8, Bigler describes the diversity of modalities
currently available from clinical neuroimaging. Bigler shows the many ways in which clinical imaging modalities, particularly CT and MRI, together with measures of white matter integrity, gross morphology, and blood-product deposition, provide alternative and complementary methods to quantify brain integrity. In addition,
in this chapter, a range of inexpensive, accessible methods with established valid-
ity is highlighted that can be used to improve the objectivity of clinical interpre-
tation of neuroimaging findings. Bigler concludes his chapter by describing the
tremendous innovations in imaging together with newer techniques such as dif-
fusion tensor imaging, noting that these and other research techniques are not
yet the basis for generalizable clinical diagnostic methods.
Improving interpretation of the scientific importance, and patient relevance,
of published clinical research findings would be of less benefit if there were not a
related aspiration by many scientific journal editors to improve the standards of
published studies. This initiative has been evolving for some years, initially known
as the CONSORT consortium (www.equator-network.org/reporting-guidelines/
consort/). There is now a unified endeavor to improve the quality and trans-
parency of reporting standards in health care that falls under the rubric of the
EQUATOR network, incorporating the CONSORT guidelines and many other
publication guidelines besides. In Chapter 9 in this volume, Schoenberg and col-
leagues describe the relevance of these initiatives to the practice of evidence-based
neuropsychology. Schoenberg and colleagues reiterate a point made by Chelune
in Chapter 7, that these widely embraced publication guidelines are beneficial to
both producers and consumers of research. The guidelines provide a comprehen-
sive, though not necessarily exhaustive, checklist of criteria by which to design
and report many types of clinical studies. The guidelines also provide checklists
to guide peer review, as well as the broader framework in which the critical-
appraisal process is nested. Readers familiar with the EQUATOR network guide-
lines will appreciate that the critical-appraisal methods (see Chapters 11 and 12,
this volume) are partly derived from the quality-criteria incorporated within the
guidelines for respective study designs, whether treatment-intervention or diag-
nostic validity.
The STARD criteria (www.equator-network.org/reporting-guidelines/stard/)
for evaluating the validity of a test are one example of the guidelines highlighted
by Schoenberg. The STARD criteria require a careful description of what have
become the two most important classification accuracy statistics in clinical prac-
tice, namely, sensitivity and specificity. Without a good understanding of how
to interpret these criterion-related validity statistics, any clinician is hampered
in her or his ability to understand the scientific process of diagnosis. However,
skillful interpretation of these statistics alone is not enough to be a skilled diag-
nostician. Instead, these statistics interact with a quantity that varies from one
clinical setting to another. The latter quantity is known as the local base-rate
(or prevalence, or the pre-test probability) of the condition to be diagnosed.
In Chapter 10, Bunnage describes these statistics, methods of calculation, and
online and graphical calculation aids. These aids make relatively easy the process
of re-estimating the probability of the patient having the diagnosis after the test
results are known. These skills are so fundamental to clinical practice that no cli-
nician should be unfamiliar with these statistics. The accurate interpretation of
sensitivity and specificity has profound implications for clinical judgement and
constitutes a fundamental skill for a scientific clinician, providing the numerical
content of evidence-based diagnosis.
CONCLUSION
In view of the rapidly evolving scientific basis of neuropsychological practice,
neuropsychologists, like all other health care professionals, need to acquire skills
for continuous learning (Straus et al., 2011). The methods of evidence-based
practice have evolved to address these needs (Straus et al., 2011). As in other areas
of professional expertise, the benefits of training and a good education have a
finite shelf-life. Unless practitioners learn effective skills for maintaining and
updating their knowledge, they are likely to become increasingly out of date. The
need to update knowledge is one reason for mandating professional development
activities in most professional organizations, as in psychology. As was shown in
the first chapter of this volume, expertise in neuropsychology and in health care
in general is no longer defined as a function of the number of patients seen or the
number of years in clinical practice. It is possible to develop and maintain severe
misunderstandings about patients and their clinical conditions through many
years of practice. This is because learning from experience may mislead the clini-
cian, who may fail to correct these misapprehensions. The methods of evidence-
based practice, elements of which are described in this volume, provide one of
the most effective and widely accepted strategies for maintaining and updating
clinical knowledge.
REFERENCES
Alkemade, N., Bowden, S. C., & Salzman, L. (2015). Scoring correction for MMPI-2
Hs scale in patients experiencing a Traumatic Brain Injury: A test of measurement
invariance. Archives of Clinical Neuropsychology, 30, 39–48.
Index
regression to the mean, 37, 107, 124, 131–133. See also predicted true scores
Reider-Groswasser, I., 197, 198
relative risk reduction (RRR). See under risk analysis, for treatment benefit, or harm
reliability
  alternate ways to estimate, 103–106, 105f
  clinical experience and, 103
  confidence interval for true score based on one assessment, 96t, 108–114, 113t
  defined, 98
  of diagnostic measurement and assessment, 235
  family of standard errors of measurement, 96t, 100t, 109–112, 110t, 111f
  impact on interpretation of test validity, 116–117
  internal consistency, 103–105
  many implications for accurate clinical practice, 118
  methods for estimating, 99–103, 102f, 105f
  predicted true scores, 106–108, 108t
  prediction interval for true score based on standard error of prediction, 100t, 108–109, 114–115
  relevance to all patient information, 99
  relevance to evidence-based practice, 115–116
  retest correlations/reliability, 99–106, 102f, 134
  role in test score interpretation, 98–99
  sources of error, 104–105
  testing schedule and, 134
  true score correct center for confidence intervals, 96t, 100t, 108–109
reliable change indices (RCIs)
  alternative models
    Chelune, 130–135, 137–139
    Crawford, 135–138, 140–143
    Jacobson and Truax model, 122, 125–127, 149
    Massen formula, 133
    McSweeny, 131–133, 135, 137–139
    RCISRB, 131
    Temkin formula, 135
  basics and history of reliable change methodology, 125–128, 126t, 127t
  choosing the RCI model, 136–137
  comparison of RCI models, 137–141
  individualized error term RCI, 135–136
  interpreting statistical significance of RCI scores, 128–129, 142–143
  limitations to reliable change analyses, 148–150
  and mean practice effects, 130–131
  multiple regression approach to RCI, 134–135
  number needed to treat and, 147
  practice effects and neuropsychological measures, 129–131, 133–134
  practice points for neuropsychologists, 140–141, 150–152
  and regression to the mean, 131–133
  relative risk and odds ratios, 148
  reliable change and effect size, 143–144
  reliable change and responsiveness, 144–147, 144t
  reporting group analyses for interpretation of change, 141–143
  standard deviation method, 125
  value of in evidence-based practice, 121–125
Repeatable Battery for the Assessment of Neuropsychological Status (RBANS), 23–24
reporting guidelines
  benefits and pitfalls of adopting, 216–218
  checklists for authors, 212
  CONSORT (Consolidated Standards of Reporting Trials), 121–123, 169, 212–214, 213t, 217, 265, 267–270, 274, 285, 286
  EQUATOR (Enhancing the Quality and Transparency of Health Research), 2, 3, 9, 10, 212, 213t, 214–216, 267, 285, 286
  four basic aspects of published research, 211
  overview of, 212, 213t
  PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), 168, 212, 213t, 215, 217, 267
  PROMIS (Patient-Reported Outcome Measurement Information System), 105, 212, 213t, 216
  role in evidence-based practice, 218
  shortcomings of published research, 211–212