Review of Psychological Testing: A Practical Approach To Design and Evaluation

Article in Canadian Psychology · November 2005
DOI: 10.1037/h0087036

Author: Stuart J. McKelvie, Bishop's University

CP 46-4 10/31/05 3:47 PM Page 258

258 Book Reviews

Chair in the Department of Psychiatric Medicine at the University of Virginia. She has published in journals such as the Journal of Clinical Psychiatry and Journal of Sex & Marital Therapy.

Wendy Pullin is Professor of Psychology at Concordia University College of Alberta in Edmonton. She is responsible for the French adaptation and translation of Margaret Matlin’s English (2004 edition), The Psychology of Women, to be published by Éditions DeBoeck Université of Belgium in 2006.

THERESA J.B. KLINE
Psychological Testing: A Practical Approach to Design and Evaluation
Thousand Oaks, CA: Sage Publications, 2005, 368 pages
(ISBN 1-4129-544-3, US$79.95 Hardcover)
Reviewed by STUART J. MCKELVIE

Most textbooks on psychological testing cover purposes, history, classical theory, psychometric properties (norms, reliability, validity), test construction, issues such as group differences, heredity and environmental influences, professional and ethical concerns, and a description of the major maximum and typical performance tests in educational, clinical, and industrial settings. Dr. Theresa Kline’s book differs in that it emphasizes the practical questions of how to construct tests and evaluate them, with a focus on theory, psychometric properties, and test construction. It fills an important niche.

Consistent with its goals, the book is presented sequentially, leading the reader through the logical steps of test construction and evaluation: statistics, construct definition, item writing, required responses, samples for norms (four chapters), classical and modern test theory, reliability and validity (six chapters), and ethical and professional issues and a brief review of selected tests (two chapters).

Chapter 1 describes the problem of measurement, gives a brief history of testing, with an emphasis on the U.S., and a review of statistical concepts (especially correlation and regression), prior knowledge of which is assumed. Although Pearson’s product moment correlation coefficient is thoroughly described, Spearman’s rank order correlation is not, even though it appears later (p. 186). Levels of measurement are presented, because statistical procedures are related to them. However, it has been argued that level of measurement is less important for statistical analysis than for permissible transformations of raw scores and interpretation of results (Gaito, 1980). Levels of measurement are mentioned subsequently (e.g., under attitude scale construction, Chapter 2), but more information would help the test consumer to judge which transformation to use and to decide if interpretation should reflect ordinal or interval differences. The statistics review is clear, although it is stated that power analysis shows that about 20 cases are sufficient to detect a moderate relationship (correlation) between two variables (p. 11). However, as Kline herself shows (p. 82), 85 cases are required to detect a moderate correlation (r = .30) with alpha at .05 and power at .80. Secondly, although many variables are normally distributed, it should be pointed out (p. 22) that converting a raw score to a standard z score does not in itself require or guarantee normality. The chapter ends with a very good discussion of two important questions: establishing the meaning of the construct and its relationship to other constructs and operationally defining it with test items.

Chapter 2 describes how to create items for a test using empirical, theoretical, and rational approaches, and practical advice is given for searching the literature and contacting experts. Although the term “dust bowl” empiricism might have been explained, a useful set of nine guiding rules is presented (albeit without reasons), and advice is offered for the number of items per construct. Useful information is given on attitude scale construction, but an example of a standardized maximum or typical performance test might have helped.

Chapter 3 explains how to decide on the test taker’s response to the items. It covers open- and closed-ended questions, “continuous responses,” ipsative versus normative scales, and statistical problems with difference and change scores. There are details of statistical calculations for closed-ended questions, but it is not always clear which statistics are inferential and which are measures of effect size (pp. 54-55). Otherwise, clear advice is given, particularly for writing distractors to multiple-choice items. Although the section on “continuous responses” actually covers “discrete” scales with only one mention of a true continuum (the straight line graphic scale), it contains a useful discussion of rating scales, particularly Likert scales, including the optimal number of scale points and the optimal labelling system. Variants on Likert scales are referred to as “Likert-type,” which is common in the literature, but because they often differ from the original Likert scale that asks for a judgment of agreement to attitude statements (M. Aftanas, personal communication, December 2, 2004), it might be more accurate to describe them otherwise (e.g., ordered response categories scales). The standard correction for guessing is also presented, but it is not clear if the writer recommends it or would suggest
other approaches to minimizing guessing.

Chapter 4 gives a full account of probability and nonprobability techniques for obtaining participants who will represent the population of interest, along with guidance for survey response rates and the problem of missing data. It might have been pointed out explicitly that the data from these respondents will provide the norms for standardized tests and that raw scores will be transformed to derived scores using this framework (as described earlier, p. 22). The question of sample size is addressed, but the examples given (p. 82) refer to planning for inferential statistics rather than estimating population means, where the optimal number is much greater. Test constructors need to know how many should be in their normative sample.

Chapter 5 contains the classical test theory of true and error scores, along with its limitations, and includes a clear presentation of the important statistical calculations for item analysis. It is stated that longer tests are more reliable, but not that there are diminishing returns (see also p. 175), and factor analysis is mentioned, but is not discussed until Chapter 10. Chapter 6 covers modern test theory in more detail than other standard textbooks in psychological testing. Following an excellent comparison with classical test theory, it describes the one-, two- and three-parameter logistic and multiple response models. This information is accompanied by many boxes and tables that show detailed statistical calculations and examples of computer outputs. The average psychology undergraduate will find this detail quite challenging, and will require some instruction to fully understand it. The chapter ends with parameter estimation, how respondents are scored, how models are tested, and a discussion of the pros and cons of modern test theory.

Chapters 7 and 8 deal with reliability of scores and items and raters, respectively. In Chapter 8, information is more extensive than in most texts, covering various interrater reliability indices and the question of reliability generalization. In Chapter 7, it is not clear why test reliability is referred to as the “bane” of test developers (p. 167), but the standard theory of reliability is clearly presented, along with the three major kinds (test-retest, alternate-form, and internal consistency). Calculation examples are excellent. Full consideration is given to the question of time delay in test-retest reliability, although immediate alternate-form is not mentioned. There is a general discussion of standards for reliability, but no guidelines are given for the three different kinds. It is also stated that sample size is less important than sample representativeness because “the reliability index is descriptive, not inferential.” However, one could argue that the sample reliability coefficient is an estimate of a population value, which is an inferential question. Furthermore, when validity coefficients are discussed (p. 217), inference is accepted and sample size is regarded as important. Why is this true for validity and not reliability? Finally, in contrast to most texts, Dr. Kline correctly observes that confidence intervals should be centred on the estimated true score, not the observed score (p. 178). However, an equation (7-8) is presented in raw scores when it should be in difference scores.

Although the procedures for assessing validity reflect the traditional tripartite division into content, criterion, and construct, Chapter 9 begins with a critical discussion of these distinctions in light of modern approaches (see also the end of Chapter 10). The following topics are nicely covered: consulting test-takers (face validity) and test experts (content validity), and validity coefficients via correlation and regression (criterion validity). Regression includes detailed examples of calculations, with consideration of convergent and discriminant validity, correction for unreliability, restriction of range and decision-making with the standard error of estimate. Group differences and test bias, discriminant function analysis, and individual classification are discussed, although it was not noted that in the Taylor-Russell tables a low validity coefficient may be acceptable if the selection ratio is low. The chapter ends with good descriptions of meta-analysis and synthetic validity. In the former, the distinction between the credibility interval and the confidence interval might have been clearer.

Chapter 10 deals with validity in terms of internal structure, covering the technical topics of principal components analysis, common factor analysis, analysis of covariance structures, and multitrait-multimethod assessments. Again, these techniques are illustrated profusely with calculations and computer outputs, which will be helpful for the serious student. Advice covers sample size and threats to validity from response bias, social desirability, and methods variance.

Although geared towards the U.S. rather than Canada, Chapter 11 covers the important questions of ethical and professional issues, including test administration, integrity testing, computerized testing, effects of coaching, test legislation, test bias, and, finally, language translation issues, which are becoming increasingly important. Because there is a large coaching industry, this issue might have been expanded, differentiating the effects on validity of three kinds of past experience: test familiarity, coaching to the test, and general educational skills (Anastasi, 1981).

The last chapter illustrates the principles in the book with selected examples of maximum and typical
performance tests, rather than complete reviews. Nevertheless, the discussion of the Wechsler scales might have mentioned scorer reliability and the fact that the index scores are based on factor analysis. In addition, most writers classify the Scholastic Assessment Test and Graduate Record Examination as aptitude rather than achievement tests. Given that the aim of the book is to educate the test consumer, it might have included one detailed model test evaluation in which the construct definition, test construction, norms, reliability, and validity of a test were critically examined.

Overall, this book is an interesting departure from the usual text on testing. Some parts could be improved (notably typographical errors in which numbers or an equation are wrong, pp. 19, 109, 178, 219), but Dr. Kline takes great pains to present test theory with working examples of the calculations and computer printouts encountered during test construction and evaluation. With instruction, the book could serve as an undergraduate text and certainly as a graduate text in a practically oriented course. It is ideally suited to professional psychologists wishing to construct or evaluate a psychological test. Dr. Kline’s sensible advice will serve them well.

Theresa J. B. Kline is Professor of Psychology at the University of Calgary. Research interests and areas of specialty include team performance, decision making, training, and organizational assessment. Her most recent book is Teams that Lead: A Matter of Market Strategy, Leadership Skills and Executive Strength (2003, Lawrence Erlbaum).

Stuart J. McKelvie is Professor of Psychology at Bishop’s University. He teaches psychometrics and psychological testing to undergraduates and has published a number of articles in the field.

References
Anastasi, A. (1981). Coaching, test sophistication, and developed abilities. American Psychologist, 36, 1086-1093.
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin, 87, 564-567.

RAY DEV. PETERS, BONNIE LEADBEATER, and ROBERT J. MCMAHON (Eds.)
Resilience in Children, Families and Communities: Linking Context to Practice and Policy.
New York: Kluwer Academic/Plenum Publishers, 2005, 201 pages
(ISBN 0-306-48655-5, US$79.95 Hardcover)
Reviewed by MARGARET K. MCKIM

This book is one of a series of volumes generated following the more than 30 years of the Banff Conferences on Behavioural Science. The conference reflected in this volume was held in the year 2000, although the editors point out that the book’s contents reflect not only the conference proceedings but developments in theory, practice, and policy with respect to resilience since that time. The book is organized into three sections. The first discusses issues of theory, definition, and challenges to understanding the nature of resilience. The second provides descriptive information and some evaluation findings related to intervention programs designed to promote resilience in individuals and families. The third section is focused on programs directed toward positive change at the neighbourhood or whole community level.

As the editors note in the Preface, resilience refers to “a dynamic process encompassing positive adaptation within the context of significant adversity. Implicit in this notion are two critical conditions: 1) exposure to significant threat or severe adversity; and 2) the achievement of positive adaptation despite major insults on the developmental process” (Luthar, Cicchetti, & Becker, 2000, p. 543). Not all the chapters of this book fully reflect this definition, however. This is not particularly a criticism of the book, but rather, a reflection of the present status of the discourse surrounding the construct. As Emmy Werner points out in her well-written chapter reviewing the history and present status of research and practice from a resilience perspective, there remain many challenges. She highlights the challenge of applying what we know from prospective studies of children exposed to a variety of psychosocial risk factors. These studies have identified some consistency in protective factors in the face of adversity (e.g., easy temperament, cognitive competence, sensitive parenting, support in the community), shown that the timing of risk and protective experiences makes a difference, that resilience is a process, and that there are wide individual differences in the responses of high-risk individuals to adversity and opportunity. Turning this knowledge into coherent, effective, and replicable intervention programs is one challenge that the programs described in Sections II and III are still struggling with. It is not that many of the programs described have not shown positive results, it is just that for many, the connection between what is known about resilience, and what is actually happening in prevention and intervention efforts, is not altogether clear. Perhaps that is not a bad thing though. Richard Tremblay’s enjoyable chapter challenges us to consider that the resilience “epidemic” has possibly gone too far. Using the example of the development of disrup-
