A Systematic Analysis of Art Therapy Assessment

Florida State University Libraries
Electronic Theses, Treatises and Dissertations The Graduate School
2005
A Systematic Analysis of Art Therapy

Assessment and Rating Instrument
Literature
Donna J. Betts
Follow this and additional works at the FSU Digital Library. For more information, please contact [email protected]
THE FLORIDA STATE UNIVERSITY
SCHOOL OF VISUAL ARTS AND DANCE
A SYSTEMATIC ANALYSIS OF ART THERAPY ASSESSMENT
AND RATING INSTRUMENT LITERATURE
By
DONNA J. BETTS
A Dissertation submitted to the

Department of Art Education
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Degree Awarded
Spring Semester, 2005
Copyright 2005
Donna J. Betts
All Rights Reserved
The members of the Committee approve the dissertation of Donna J. Betts
defended on April 11, 2005.
Marcia L. Rosal
Professor Directing Dissertation
Susan Carol Losh

Outside Committee Member
David E. Gussak
Committee Member
Penelope Orr
Committee Member
Approved:
Marcia L. Rosal, Chair, Art Education Department
Sally McRorie, Dean, School of Visual Arts and Dance
The Office of Graduate Studies has verified and approved the named committee members.
ii
One must not always think that feeling is everything. Art is nothing without form.
Gustave Flaubert
iii
For my parents
David and Wendy Betts
iv
ACKNOWLEDGEMENTS
Henry Adams said, A teacher affects eternity; he can never tell where his influence
stops. Thank you, Dr. Marcia Rosal, for your mentorship and the expertise you provided as the
chair of my doctoral committee. I am profoundly grateful.
Deep appreciation is extended to my doctoral committee members, Dr. David Gussak,

Dr. Susan Losh, and Dr. Penelope Orr for their collective wisdom and guidance. Thank you to
my statistician, Qiu Wang, for many hours spent analyzing data and educating me about meta-
analysis procedures. Thanks also to mentor Dr. Betsy Becker. The assistance provided by art
therapist Carolyn Brown as a coder of primary studies is very much appreciated.
Gratitude is expressed to the colleagues, mentors, and friends who offered their support
and guidance: Anne Mills, Barry Cohen, Dr. Linda Gantt, Dr. Sue Hacking, and Dr. Tom
Anderson.
To the friends and family members who provided me with continuous motivation through
the course of doctoral study, especially Evelyn Moore, Annette Moll, and Jeffrey Gray thank
you.
I am grateful for the research grant bestowed upon me by the Florida State University
Office of Graduate Studies.
v
TABLE OF CONTENTS
List of Tables ix
List of Figures x
Abstract xi
1. INTRODUCTION 1
Problem to be Investigated 3
Purpose of the Study 3
Justification of the Study 3
Meta-Analysis Techniques 5
Research Questions and Assumptions 7
Brief Overview of Study 8
Definition of Terms 8
Conclusion 14
2. LITERATURE REVIEW 15
Historical Foundations and Development of Art Therapy Assessment Instruments 15

Criticisms of the Original Assessment Tools 16
Development of Art Therapy Assessments and Rating Instruments 20
Art therapy Assessments 20
Rating Instruments 21
Why Art Therapy Assessments Are Important 24
Issues With Art Therapy Assessments 26
Practical Problems 26
Philosophical and Theoretical Issues 29
vi
Avenues For Improving Art Therapy Assessments 31
Computer Technology and the Advancement of Assessment Methods 34
Analyzing The Research on Art Therapy Assessment Instruments 35
Syntheses of Psychological Assessment Research 35
Syntheses of Creative Arts Therapies Assessment Research 38
Conclusion 41
3. METHODOLOGY 43
Approach for Conducting the Systematic Analysis 43

Problem Formulation 43
Criteria for Selection of Art Therapy Assessment Studies 44
The Literature Search Stage 45
Locating Studies 45
Extracting Data and Coding Study Characteristics 46
The Research Synthesis Coding Sheet 46
Categorizing Research Methods 47
Testing of the Coding Sheet 47
Data Evaluation 48
Identifying Independent Comparisons 48
Data Analysis and Interpretation 49
4. RESULTS 50
Descriptive Results 50
Citation Dates 50
Citation Types 51
Art Therapy Assessment Types 52
Patient Group Categories 53
Rater Tallies 53
Inter-Rater Reliability 54
Meta-Analysis of Inter-Rater Reliability 56
Examination of Potential Mediators 57
vii
Concurrent Validity 59
Meta-Analysis of Concurrent Validity 59
Examination of Potential Mediators 61
Conclusion 62
5. CONCLUSIONS 63
Research Questions 63
Question One 63
Methodological Problems Identified in the Primary Studies 64
Problems with Validity and Reliability of the Articles Researched 64
Methods for Improvement of Tools 68
Question Two 72
Question Three 72
Limitations of the Study 73
Validity Issues in Study Retrieval 73
Issues in Problem Formulation 74
Judging Research Quality 74
Issues in Coding 74
Other sources of Unreliability in Coding 74
Validity Issues in Data Analysis 74
Issues in Study Results 75
Recommendations for Further Study 75
Conclusion 76
APPENDICES 78
REFERENCES 126
BIOGRAPHICAL SKETCH 142
viii
LIST OF TABLES
1. Studies Reporting Percent Agreement (Hacking 1999) 28
2. Patient Group Categories 53
3. Number of Raters 54
4. Inter-Rater Reliability 55
5. Kappa Effect Size 57
6. Percentage Effect Size 57
7. Potential Mediating Variables (Inter-Rater Reliability) 59
8. Concurrent Validity Frequencies 59
9. Concurrent Validity Effect Sizes 60
10. Potential Mediating Variables (Concurrent Validity) 61
11. Author/Coder Favor Tally 67
12. Author/Coder Favor Frequencies 68
13. Sampling Flaws 68
14. Rating Systems Needing Revision 71
ix
LIST OF FIGURES
1. Citation Date Histogram 51
2. Citation Type 52
3. Art Therapy Assessment/Rating Instrument Type 52
4. Author-Identified Study Weaknesses 65
x
ABSTRACT
Art-based assessment instruments are used by many art therapists to: determine a clients
level of functioning; formulate treatment objectives; assess a clients strengths; gain a deeper
understanding of a clients presenting problems; and evaluate client progress. To ensure the
appropriate use of drawing tests, evaluation of instrument validity and reliability is imperative.
Thirty- five published and unpublished quantitative studies related to art therapy assessments and
rating instruments were systematically analyzed. The tools examined in the analysis are: A
Favorite Kind of Day (AFKOD); the Birds Nest Drawing (BND); the Bridge Drawing; the
Diagnostic Drawing Series (DDS), the Child Diagnostic Drawing Series (CDDS); and the Person
Picking an Apple from a Tree (PPAT). Rating instruments are also investigated, including the
Descriptive Assessment of Psychiatric Art (DAPA), the DDS Rating Guide and Drawing
Analysis Form (DAF), and the Formal Elements Art Therapy Scale (FEATS).
Descriptive results and synthesis outcomes reveal that art therapists are still in a nascent
stage of understanding assessments and rating instruments, that flaws in the art therapy
assessment and rating instrument literature research are numerous, and that much work has yet to
be done.
The null hypothesis, that homogeneity exists among the study variables identified in art
therapy assessment and rating instrument literature, was rejected. Variability of the concurrent
validity and inter-rater reliability meta-analyses results indicates that the field of art therapy has
not yet produced sufficient research in the area of assessments and rating instruments to
determine whether art therapy assessments can provide enough information about clients or
measure the process of cha nge that a client may experience in therapy.
Based on a review of the literature, it was determined that the most effective approach to
assessment incorporates objective measures such as standardized assessment procedures
(formalized assessment tools and rating manuals; portfolio evaluation; behavioral checklists), as
xi
well as subjective approaches such as the clients interpretation of his or her artwork. Due to the
inconclusive results of the present study, it is recommended that researchers continue to explore
both objective and subjective approaches to assessment.
xii
CHAPTER 1
INTRODUCTION
A wide variety of tests are available for the purpose of evaluating individuals with
cognitive, developmental, psychological, and/or behavioral disorders. Broadly defined, a test is:
a set of tasks designed to elicit or a scale to describe examinee behavior in a
specified domain, or a system for collecting samples of an individuals work in a
particular area. Coupled with the device is a scoring procedure that enables the
examiner to quantify, evaluate, and interpretbehavior or work samples.
(American Educational Research Association [AERA], 1999, p. 25)
The Buros Institute of Mental Measurements, an authority on various assessment
instruments and publishers of the Mental Measurements Yearbook and Tests in Print, publishes
critical reviews of commercially available tests (Buros Institute, Test Reviews Online, n.d.[a]).
They provide an online assessment database that lists hundreds of tests in the following
categories: achievement, behavior assessment, developmental, education, English and language,
fine arts, foreign languages, intelligence and general, aptitude, mathematics, miscellaneous,
neuropsychological, personality, reading, science, sensory-motor, social studies, speech and
hearing, and vocations. In the category of personality tests, roughly 770 instruments are listed.
These are defined as Tests that measure individuals ways of thinking, behaving, and
functioning within family and society and include: anxiety/depression scales, projective and
apperception tests, needs inventories; tests assessing risk-taking behavior, general mental health,
self- image/-concept/-esteem, empathy, suicidal ideation, emotional intelligence,
depressio n/hopelessness, schizophrenia, abuse, coping skills/stress, grief, decision- making, racial
attitudes, eating disorders, substance use/abuse (or propensity for abuse); general motivation,
perceptions, attributions; parenting styles, marital issues/satisfaction, and adjustment (Buros
Institute, Test Reviews Online, n.d.[b]).
1
These personality instruments are typically administered by psychologists, social
workers, counselors, and other mental health professionals. Art therapists, masters- or PhD- level
mental health practitioners, are also often expected to use assessment tools for client evaluation.
Art therapists most often use instruments that are known in the field as art-based assessments or
art therapy assessments. These two terms are used interchangeably throughout this manuscript,
as are the words assessment, instrument, test and tool.
According to the American Art Therapy Association (2004a), assessment is the use of
any combination of verbal, written, and art tasks chosen by the professional art therapist to assess
the individuals level of functioning, problem areas, strengths, and treatment objectives. Art
therapy assessments can be directed and/or non-directed, and can include drawings, paintings,
and/or sculptures (Arrington, 1992). However, for the purposes of this dissertation, an art therapy
assessment instrument is an objective, standardized measure designed by an art therapist (as
opposed to a psychologist), and incorporates drawing materials (as opposed to paint, clay, or
other media).
Referred to by some as projective techniques (Brooke, 1996), art therapy assessments are
alluring with their ability to illustrate concrete markers of the inner psyche (Oster & Gould
Crone, 2004, p. 1). The most practical art therapy assessments are easy to administer, take a
reasonably brief amount of time to complete, are non-threatening for the client, and are easily
interpreted (Anderson, 2001).
An assessment is most useful when the art therapist has solid training in its administration
(Hagood, 2002), and when, over time and with systematic study, he or she achieves mastery of
the technique (Kinget, 1958).
An art-based instrument is only as good as the system used to rate it. A rating manual that
accompanies an assessment should be illustrated and should link the scores to the examples and
the instructions to the rater (Gantt, 2004).
Dozens of art-based tools exist, and they are used with a variety of client populations in
different ways (refer to Appendix A for a comprehensive list of art therapy assessments and
rating instruments; Appendix B for a description of two of the most well-known art therapy
tools; and Appendix C for a description of how an art therapy assessment is developed).
In the American Art Therapy Association Education Standards document (AATA,
2004b), art therapy assessment is listed as a required content area for graduate art therapy
2
programs. The most recent membership survey of the American Art Therapy Association
(Elkins, Stovall, & Malchiodi, 2003) reported that 31% of respondents provide assessment,
evaluation, and testing services as a part of their regular professional tasks. This demographic
represents nearly one-third of art therapists who completed the survey although this ratio is not
staggering, it nevertheless indicates that assessment is a fairly important task in art therapy
practice.
Problems with art therapy assessment tools are plentiful, ranging from validity and
reliability issues, to a lack of sufficient training in assessment use and limited understanding of
the merits and the drawbacks of different instruments. Because of the relatively common use of
art-based tests, there is cause for concern.
Problem to be Investigated
Purpose of the Study
A major problem in the field of art therapy is that many art therapists are using
assessment instruments and are doing so without knowing how applicable these tools are, and
without understanding the implications of poor validity and lack of reliability. Therefore, the
purpose of this study was to examine the existing literature related to art therapy assessments and
rating instruments with the goal of presenting a systematic analysis of these methods.
Justification of the Study
The relatively common use of art therapy assessment instruments addresses the demand
to demonstrate client progress, and can therefore help to improve the quality of treatment that a
client receives (Deaver, 2002; Gantt & Tabone, 2001). Despite the various uses and apparent
benefits of art therapy assessments, however, there are some problems with the ways in which
these tools have been developed and validated.
Much of the research is small in scale. Many studies with children have used matched
groups, but have failed to control for age differences (Hagood, 2002). Hacking (1999) found that
the literature is poorly cumulated, i.e., that there is a lack of orderly development building
directly on the older work (p. 166). Many studies conclude with unacceptable results and call
for further research. Furthermore, some researchers do not publicly admit to flaws in their work
(Hacking, 1999), continue to train others to use their assessments despite these flaws, and are
reluctant to share the details of their research to facilitate improvement in the quality of
assessments and rating scales.
3
Other problems are more theoretical or philosophical: for instance, it has been argued that
art-based instruments are counter-therapeutic and even exploit clients. In addition, many
practitioners believe that the use of art-based assessments is depersonalizing, as these tools
typically fail to incorporate subjective elements such as the clients own verbal account of his or
her artwork. Burt (1996) even criticized quantitative approaches to art therapy assessment
research as being gender biased, asserting that the push for quantitative research excludes
feminist approaches.
Because of the numerous merits and drawbacks of art therapy assessment instruments,
this topic is hotly debated among the most prominent art therapists. A recent issue of Art
Therapy: Journal of the American Art Therapy Association is dedicated to this topic and
includes: Linda Gantts special feature The Case for Formal Art Therapy Assessments (2004, pp.
18-29); a commentary by Scottish art therapist Maralynn Hagood (2004, p. 3); and the Journals
associate editor Harriet Wadesons response to Gantts feature in the same issue (2004, pp. 3-4).
In her opening commentary of another recent issue of Art Therapy, the associate editor
stated that the articles contained therein reflect aspects of the controversy that has surrounded
art-based assessments in recent years (Wadeson, 2003, p. 63). At the foundation of this
controversy are those who support the use of art therapy assessments and those who challenge it.
In 2001, a panel of prominent art therapists argued their divergent viewpoints, which
reflected the general disagreement about the applicability of art therapy assessments and to what
extent such tools help or hinder clients (Horovitz, Agell, Gantt, Jones & Wadeson, 2001).
Art therapy assessment cont inues to be a primary focus of debate in North America.
Since art therapy is more developed in America and in England than it is in other countries, it is
also necessary to consider the status of assessment in England.
Most British art therapists work for the National Health Service and do not need to be
concerned about insurance reimbursement (M. Hagood, personal communication, May 30,
2004). The therapeutic process is their primary focus. British art therapists generally frown upon
the use of artwork for diagnosis (Hagood, 1990), and are disturbed by the common use of art
therapy assessments in the United States (Burleigh & Beutler, 1997). Furthermore, while
assessment is a popular subject on this side of the Atlantic, there is a notable absence of
published literature on the topic overseas. In the British art therapy journal, INSCAPE, it appears
that since its first edition in 1968, only two articles pertaining to assessment were published
4
(Gulliver, circa 1970s; Case, 1998). In the British Journal of Projective Psychology, only three
articles relating to the evaluation of patient artwork have been published, and these focus solely
on graphic indicators: no formal art therapy assessment instruments are discussed (Uhlin, 1978;
Luzzatto, 1987; Hagood, 1992).
In America, some attempts have been made to improve the status of assessment research
via panel presentations at conferences (Cox, Agell, Cohen, Gantt, 1998, 1999; Horovitz et al.,
2001); an informal survey of assessment use in child art therapy (Mills & Goodwin, 1991); and a
literature review of projective techniques for children (Neale & Rosal, 1993). However, to date
no comprehensive review of art therapy assessment or rating instrument literature has been
published. Hackings (1999) small-scale weighted analysis of published studies on art therapy
assessment research comes close in that she applied meta-analysis techniques to amalgamate
published art-based assessment literature. Hackings work thereby provides a foundation for a
more systematic analysis of this research and serves as a rationale for the present study.
Meta-Analysis Techniques
The meta-analysis literature is an appropriate resource for information related to
conducting a systematic analysis of art therapy assessment literature. Cooper (1998) designated
meta-analysis (also known as research synthesis or research review) as the most frequently
used literature review method in the social sciences. A meta-analysis serves to replace those
earlier papers that have been lost from sight behind the research front (Price, 1965, p. 513), and
to give future research a direction so as to maximize the amount of new information produced.
The meta-analysis method entails the gathering of sound data about a particular topic in
order to attempt a comprehensive integration of previous research using measurement procedures
and statistical analysis methods (Cooper, 1998). It requires surveying and analyzing large
populations of studies, and involves a process akin to cluster sampling, treating independent
studies as clusters that classify individuals according to the projects in which they participated
(Cooper, 1998). A meta-analysis can serve as an important contribution to its field of focus. It
can also generate consensus among scholars, focus debate in a constructive manner (Cooper,
1998), and give future research a direction so as to maximize the amount of new information
produced (Price, 1965).
The application of meta-analysis procedures to systematically analyze art therapy
assessment research is conducive to: (1) amalgamating earlier assessment research so as to
5
provide a comprehensive source of integrated information; (2) identifying problems in statistical
methods used by primary researchers and providing remedies; (3) addressing problems with poor
cumulation; and (4) providing information to improve the development of additional art therapy
assessments (Cooper, 1998).
Hackings (1999) research lends support to the application of meta-analysis techniques in
synthesizing art therapy assessment literature. First, Hacking cited a lack of orderly development
building directly on previous research (i.e., poor cumulation [Rosenthal, 1984]) as a rationale for
conducting a synthesis of the research, even despite the emphasis on qualitative studies in this
area. Second, meta-analysis addresses the methodological difficulties identified with the
literature review (Hacking, 1999, p 167):
(1) Selective inclusion of studies is often based on the reviewer's impressionistic view of
the quality of the study;
(2) Differential subjective weighting of studies in the interpretation of a set of findings;
(3) Misleading interpretations of study findings;
(4) Failure to examine characteristics of the studies as potential explanations for disparate
or consistent results across studies;
(5) Failure to examine mediating variables in the relationship.
Furthermore, a systematic analysis of art therapy assessment research is effective in
amalgamating all earlier assessment research so as to provide a comprehensive source of
integrated information, because Given the cumulative nature of science, trustworthy accounts of
past research are a necessary condition for orderly knowledge building (Cooper, 1998, p. 1).
Meta-analysis techniques also enable the identification of problems in statistical methods
used by previous assessment researchers and provision of remedies, as was evidenced in
Hackings (1999) review. This is important because The value of any single study is derived as
much from how it fits with previous work as from the studys intrinsic properties (Cooper,
1998, p. 1).
An in-depth study of art therapy assessments and rating instruments is warranted and
timely. The purpose of assessment research is to discover the predictive ability of a variable
or a set of variables to assess or diagnose a particular disorder or problem profile (Rosal, 1992,
p. 59). Increasing this predictive ability is one purpose of conducting a systematic analysis of
assessment research.
6
Practitioners have developed and used art therapy assessments, and there is every reason
to believe that they will continue to do so. The present study applies meta-analysis techniques to
integrate the research on art therapy assessments and rating instruments.
Finally, the present synthesis assists in addressing some of the unanswered questions
pertaining to this topic: these questions follow in the next section.
Research Questions and Assumptions
The overriding question of the present study is: what does the literature tell us about the
current state of art therapy assessments?
The study hypothesis is: there is no heterogeneity among the study variables identified in
art therapy assessment and rating instrument literature.
The research questions and assumptions are stated as follows:
(1) Question: To what extent are art therapy assessments and rating instruments valid and
reliable for use with clients?
1a) Assumption: Methodological difficulties on previous art therapy assessment
research exist.
1b) Assumption: It is possible to address the problems of validity and reliability,
to improve upon the existing tools, and to develop better tools.
(2) To what extent do art therapy assessments measure the process of change that a client
may experience in therapy?
2a) Assumption: Art therapy assessments measure the process of change that a
client may experience in therapy.
(3) Do objective assessment methods such as standardized art therapy tools give us
enough information about clients?
3a) Assumption: The most effective approach to assessment incorporates
objective measures such as standardized assessment procedures (formalized
assessment tools and rating manuals; portfolio evaluation; behavioral checklists),
as well as subjective approaches such as the clients interpretation of his or her
artwork.
Addressing these questions and assumptions in the present study contributes to the body of
knowledge around this topic, and the results of the systematic analysis are a useful contribution
to the field of art therapy.
7
Brief Overview of Study
In undertaking the tasks of art therapy assessment and rating instrument research data
collection and analysis, Coopers (1998) stages for conducting a meta-analysis provided a useful
guideline: problem formulation; literature search; extracting data and coding study
characteristics; data evaluation; and data analys is and interpretation.
During the problem formulation stage, criteria for inclusion in the present study were
identified. The literature search process entailed locating studies from three sources: informal
channels, formal methods, and secondary channels. Once a satisfactory number of primary
studies were found, it was determined what data should be extracted. A coding sheet was then
designed. Two individuals were trained in coding procedures by the researcher, followed by
evaluation of the data. This stage involved critical assessment of data quality: it was determined
whether the data were contaminated by factors that are irrelevant to the central problem.
For data analysis and interpretation, the researcher consulted with a statistician. This
process involved some of the following steps: (a) simple description of study findings (Glass,
McGaw, & Smith, 1981); (b) correlation of study characteristics and findings; (c) calculation of
mean correlations, variability, and correcting for artifacts (Arthur, Bennett, & Huffcutt, 2001);
(d) decision- making about whether to search for mediators; (e) selection of and testing for
potential mediators; (f) linear analysis of variance models for estimation (Glass et al., 1981); (g)
integration of studies that have quantitative independent variables; and (h) interpretation of
results and making conclusions (Arthur et al., 2001). Finally, the results of the study are included
in the present manuscript.
Definition of Terms
The following terms are referred to directly in this dissertation and/or they are relevant to
the topic.
Adding Zs method: The most frequently applied method (of 16 possible methods) used for
combining the results of inference tests so that an overall test of the null hypothesis can be
obtained (Cooper, 1998: see pp. 120-122 for details). Not all findings have an equal likelihood of
being retrieved by the synthesist, so significant results are more likely to be retrieved than
nonsignificant ones: This implies that the Adding Zs method may produce a probability level
that underestimates the chance of a Type 1 error (p. 123). One advantage of the Adding Zs
method is that it allows the calculation of a Fail-safe N.
8
Aggregate analysis : As opposed to estimating effect magnitude, aggregate analysis is an
approach identified by Rosenthal (1991), whereby descriptive evidence across primary studies is
integrated (Cooper, 1998). If the descriptive statistics can be put on a common metric so that
they can be compared across studies, then they can be related to coded study characteristics
(Hall, Tickle-Degnen, Rosenthal, & Mosteller, 1994).
Art-based assessment : The use of any combination of directed or non-directed verbal, written,
and art tasks chosen by the professional art therapist to: determine a clients level of functioning;
formulate treatment objectives; assess a clients strengths; gain a deeper understanding of a
clients presenting problems; and evaluate client progress.
Chi-square (X2 ): Statisticians refer to X2 as an enumeration statistic (Animated Software
Company, n.d.[a]). Rather than measuring the value of each of a set of items, a calculated value
of Chi square compares the frequencies of various kinds (or categories) of items in a random
sample to the frequencies that are expected if the population frequencies are as hypothesized by
the investigator. Chi square is often used to assess the goodness of fit between an obtained set
of frequencies in a random sample and what is expected under a given statistical hypothesis. For
example, Chi square can be used to determine if there is reason to reject the statistical hypothesis
that the frequencies in a random sample are as expected when the items are from a normal
distribution.
Coding sheet: A sheet created by the meta-analyst to tally information collected from primary
research reports (Cooper, 1998).
Cohens d index: The measure of an effect size used when the means of two groups are being
compared (Cooper, 1998). It is typically used with t tests or f tests based on a comparison of two
conditions. The d index expresses the distance between two group means in terms of their
common standard deviation. Specifically, it is the difference between population means divided
by the average population standard deviation (Rosenthal, 1994).
Combined significance levels : Combined exact probabilities that are associated with the results
of each comparison or estimate of a relation (Cooper, 1998). Becker (1994; see also Rosenthal,
1984) described 16 methods for combining the results of inference tests for obtaining an overall
test of the null hypothesis. When the exact probabilities are used, the combined analysis results
account for the different sample sizes and relationship strengths of each individual comparison.
9
Concurrent validity: The degree to which the scores on an instrument are related to the scores on
another instrument administered at the same time (for the purpose of testing the same construct),
or to some other criterion available at the same time (Fraenkel & Wallen, 2003, p. G-2).
Convergent validity: Refers to the notion that two separate instruments will concur, or converge,
upon similar results. The StanfordBinet IQ test should have convergent validity with the
Wechsler IQ Scales (ACPA, n.d.).
Criterion validity: A type of validity measurement that is used to evaluate the degree to which
one measurement agrees with other approaches for measuring the same characteristic. There are
three types of criterion validity called concurrent, predictive, and known groups (Google, n.d.).
Effect size (ES): The degree to which the phenomenon is present in the population, or the degree
to which the null hypothesis is false (Cohen, 1988, pp. 9-10). ES is a name given to a family of
indices that measure the magnitude of a treatment effect (Becker, 1998). These are: odds ratio;
relative risk; and risk difference. Unlike significance tests, these indices are independent of
sample size. ES measures are the common currency of weighted analysis studies that summarize
the findings from a specific area of research. There is a wide array of formulas used to measure
ES. In general, ES can be measured in two ways: a) as the standardized difference between two
means, or b) as the correlation between the independent variable classification and the individual
scores on the dependent variable. This correlation is called the effect size correlation (Rosnow
& Rosenthal, 1996)
Fail-safe N (or Fail-safe sample size): Answers the question, How many findings totaling to a
null hypothesis confirmation (e.g., Zst = 0) would have to be added to the results of the retrieved
findings in order to change the conclusion that a relation exists? (Cooper, 1998, p. 123).
Rosenthal (1979) called this the tolerance for future null results. The Fail-safe N is a valuable
descriptive statistic (Cooper, 1998). It permits evaluation of the cumulative result of a synthesis
against assessment of the extent to which the synthesist has searched the literature. The Fail-
safe N , however, also contains an assumption that restricts its validity. That is, its user must find
credible the proposition that the sum of the unretrieved studies is equal to an exact null result (p.
124).
File-drawer problem: is one aspect of the more general problem of publication bias (Becker,
1994, p. 228). The set of available studies does not represent the set of all studies ever
10
conducted, and one reason for this is that researchers may have reports in their file drawers
that were never published because their results were not statistically significant.
Fishers Z: In meta-analysis, a statistical procedure used to transform a distribution of rs so as to
make up for the tendency of rs to become skewed when determining effect size estimates. Zr is
distributed nearly normally, and in virtually all meta-analytic procedures r should always be
transformed to Zr (Rosenthal, 1994).
Fixed effects/fixed estimates effects (conditional) model: (vs. Random effects [unconditional]
model) In this model, the universe to which generalizations are made consists of ensembles
of studies identical to those in the study except for the particular people (or primary sampling
units) that appear in the studies (Hedges, 1994, p. 30).
Hedges and Olkin adjustment for small sample sizes: A more precise procedure for adjusting
sample size than merely multiplying each estimate by its sample size and then dividing the sum
of these products by the sum of the sample sizes (Cooper, 1998). The adjustment has many
advantages but also has more complicated calculations (see Hedges and Olkin, 1985).
Homogeneity analysis: Compares the observed variance to that expected from sampling error,
and is the approach used most often by meta-analysts (Cooper, 1998). It includes a calculation
of how probable it is that the variance exhibited by the effect sizes would be observed if only
sampling error was making them different (p. 145).
Inter-coder reliability: The reliability between codings indicated by two or more individuals who
tally the contents of a primary research article on a coding sheet used in a meta-analytic study
(see Orwin, 1994; and Stock, 1994).
Inter-rater reliability: The degree to which different raters or observers give consistent estimates
of the same phenomenon (Yazdani, 2002a).
Inter-relationship between Reliability and Validity: For an instrument or test to be useful, both
reliability and validity must be considered. A test with a high reliability may have low validity:
i.e., results may be very consistent but inaccurate. A reliable but poorly validated instrument is
totally useless. A valid but unreliable instrument can be of some value, thus it is often said that
validity is more important than reliability for a measurement. However, to be truly useful, an
instrument must be both reasonably valid and reasonably reliable.
Interaction effects: Interaction between two factors exists when the impact of one factor on the
response depends on the setting of the other factor.
11
Kappa: The improvement over chance reached by coders: Cohens kappa is measure of
reliability that adjusts for the chance rate of agreement (Cooper, 1998).
Maximum likelihood estimates: Are the model coefficient values that maximize the likelihood of
the observed data (Tate, 1998, p. 342). (See also Pigott, 1994, pp. 170-171.)
Meta-analysis: The process of gathering sound data about a particular topic in order to produce a
comprehensive integration of the previous research (Cooper, 1998).
Mediating variable : A variable that intervenes with a relationship between variables. Formerly
known as moderating variable.
Odds ratio : The ratio of the odds of an event in one group divided by the odds in another group
(Evidence Based Emergency Medicine, n.d.). When the event rate or absolute risk in the control
group is small (less than 20% or so), then the odds ratio is very close to the relative risk.
Poor cumulation: Lack of orderly development building directly on previous research
(Rosenthal, 1984).
Projective technique : A clinical technique qualifies as a projective device if it presents the
subject with a stimulus, or series of stimuli either so unstructured or so ambiguous that their
meaning for the subject must come in part from within himself (Hammer, 1958, p. 169).
Publication bias: Bias that is induced by selective publication, in which the decision to
publish is influenced by the results of the study (Begg, 1994, p. 399).
Q statistic : Used to test whether a set of d indexes is homogeneous: Hedges and Olkin (1985)
identified the Q statistic, or Qt (Cooper, 1998). It has a Chi square distribution with N 1
degrees of freedom, or one less than the number of comparisons (see Cooper, p. 146). Q-between
is used to test whether the average effects from groupings are homogeneous (Cooper, 1998), and
Q-within is used to compare groups of r indexes.
R index: The Pearson product- moment correlation coefficient (Cooper, 1998). It is the most
appropriate effect size metric for expressing an effect size when one wishes to describe the
relationship between two continuous variables.
Random effects (unconditional) model: (vs. Fixed effects/fixed est imates effects model) In this
model, the study sample is presumed to be literally a sample from a hypothetical collection
(or population) of studies. The universe to which generalizations are made consists of a
population of studies from which the study sample is drawn (Hedges, 1994, p. 31).
12
Reliability: Means repeatability or consistency (Yazdani, 2002a). A measure is considered
reliable if it would give us the same result over and over again (assuming that what we are
measuring isnt changing). Because the measurements are taken repetitively to determine
reliability, measurement reliability is often referred to as test-retest reliability.
Rating instrument : A scoring procedure that enables the examiner to quantify, evaluate, and
interpretbehavior or work samples (AERA, 1999, p. 25).
T-test: The t test employs the statistic (t) to test a given statistical hypothesis about the mean of a
population (or about the means of two populations) (Animated Software Company, n.d.[b]).
Test of equivalence of proportion: Indicates the homogeneity of effect size for each variable and
their relation, and is inappropriate when the vast majority of non-significant results are not
available (in which case the assumption of p=1 would create a false disparity between the
significant and non-significant findings, imposing hetereogeneity) (Hacking, 1999).
Type I error: Incorrectly rejecting the null hypothesis when it is true (Tate, 1998).
Validity: A property or characteristic of the dependent variable. An instrument (measuring tool)
is described as being valid when it measures what it is supposed to measure (Yazdani, 2002b). A
test cannot be considered universally valid. Validity is relative to the purpose of testing and the
subjects tested and therefore an instrument is valid only for specified purposes. Also it should be
made clear that validity is a matter of degree. Instruments or tests are described by how valid
they are, not whether they are valid or not.
Vote-counting methods: The simplest methods for combining independent statistical tests
(Cooper, 1998). Vote counts can focus only on the direction of the findings or they can take into
account the statistical significance of findings. However, vote counting has been criticized on
several grounds (Bushman, 1994). (See also Cooper, 1998, pp. 116-120; and Bushman, 1994, pp.
193-214.)
Weighted (wd) technique : Produces an unbiased estimate of effect size for the corrected group
sizes (Wolf, 1986).
Z score: A special application of the transformation rules (Animated Software Company,
n.d.[c]). The Z score for an item indicates how far and in what direction that item deviates from
its distributions mean, expressed in units of its distributions standard deviation. The
mathematics of the Z score transformation are such that if every item in a distribution is
converted to its z score, the transformed scores will necessarily have a mean of zero and a
13
standard deviation of one. Z scores are sometimes called standard scores. The Z score
transformation is especially useful when seeking to compare the relative standings of items from
distributions with different means and/or different standard deviations. Z scores are especially
informative when the distribution to which they refer is normal. In every normal distribution the
distance between the mean and a given Z score cuts off a fixed proportion of the total area under
the curve. Statisticians have created tables indicating the value of these proportions for each
possible Z score.
Conclusion
This chapter presented information about the research problem to be investigated.
Specifically, the purpose of the present study was discussed. The arguments in favor and
opposed to the use of art therapy assessments were explained, and a rationale for analyzing the
research was stated, in an attempt to justify the study. The research questions, assumptions, and a
brief overview were provided. The final section included the Definition of Terms.
The next chapter, the literature review, further anchors the rationale for this body of
work, A Systematic Analysis of Art Therapy Assessment and Rating Instrument Literature.
The review aids in directing the present study and provides increased knowledge in this subject
area, thereby reducing the chance of duplicating others ideas (Fraenkel & Wallen, 2003).
14
CHAPTER 2
LITERATURE REVIEW
In this review of the literature, the foundations of art therapy assessment instruments are
presented to provide information about historical milestones in psychological evaluation that
influenced the development of art therapy tools. Criticisms of the original projective assessment
tests are detailed, followed by a review of published materials in the field of art therapy that had
a further impact on this area. The importance of art therapy assessments is described and
contrasted with a discussion of practical problems and philosophical and theoretical issues.
Avenues for improvement of art therapy assessments are delineated, including a section on
computer technology and the advancement of art the rapy assessment methods, and analyzing the
research on art therapy assessment instruments. This is supported with references from the
literature on syntheses of psychological assessment research and syntheses of creative arts
therapies assessment research.
Historical Foundations and Development of Art Therapy Assessment Instruments
An overview of the foundations of art-based and projective assessment procedures
illustrates their impact on the field of art therapy. Psychologists, psychiatrists, anthropologis ts,
and educators have used artwork in evaluation, therapy, and research for over 100 years
(MacGregor, 1989). From 1885-1920, educators collected and classified childrens art (D. B.
Harris, 1963). In 1887, Corrado Ricci, an art critic with interests in psychology, published the
first known book of childrens art, in which drawings were presented as potential
psychodiagnostic tools (J. B. Harris, 1996). In 1931, Eng published an extensive bibliography of
the literature on childrens art. This review covered English, German, French, and Norwegian
publications from 1892 to 1930, citing early developmental and descriptive studies.
Prior to 1900, a number of scientific articles describing the spontaneous artwork of
mental patients were published in the United States and Europe (Lombroso, 1882; Simon, 1888;
Tardieu, 1886; Hrdlika, 1899). These papers were mostly impressionistic. Several studies related
15
to the psychodiagnostic potential of art followed in 1906 (Klepsch & Logie, 1982), including
Fritz Mohrs work in establishing standardized procedures and methods using drawing tests with
psychiatric patients (Gantt, 1992). Mohr, a German investigator, reviewed the 19th century
literature and dismissed the earlier contributions as merely descriptive, especially those of the
French (MacGregor, 1989, p. 189). Mohrs work served as a foundation for certain psychological
projective instruments such as the House-Tree-Person Test (HTP) (Buck, 1948), and the
Thematic Apperception Test (TAT) (Murray, 1943), as well as some evaluative procedures used
by art therapists. The structural elements of an artwork were of particular concern to Mohr,
because in his experience, they revealed information about the artists thought processes. He
found that the more fragmentary the picture, the more fragmentary the thought process.
Hermann Rorschach published his famous inkblot projective test in 1921, and it is widely
used even today (Walsh & Betz, 2001). In 1929, another popular tool was developed: the
Goodenough Draw-A-Man technique. Draw-A-Man was the first systematized art-based
assessment method for estimating intelligence. Currently known as the Goodenough-Harris
Draw-A-Man Test (D. B. Harris & Roberts, 1972), this instrument is the earliest example of a
class of open-ended drawing tasks called human figure drawings (HFDs) and has since been
incorporated into IQ tests such as the Stanford-Binet. Other work also led to the use of drawings
in nonverbal intelligence tests, including Lowenfelds (1947) research demonstrating that
children pass through a sequence of orderly developmental stages in their drawings (Gantt,
1992).
Criticisms of the Original Assessment Tools
For 50 years, the research on projective drawings has yielded mixed results (Gantt &
Tabone, 1998, p. 8). During the 1970s and 1980s, the use of these tools declined due to
decreased belief in psychoanalytic theory, greater emphasis on situational determinants of
behavior, questions regarding the cost-effectiveness of these tools, and poor reviews about their
validity (Groth-Marnat, 1990). Although projective tests are still popular among psychologists,
several authors have questioned their scientific value, pointing to questionable research findings
(Chapman & Chapman, 1967; Dawson, 1984; Kahill, 1984; Klopfer & Taulbee, 1976; Roback,
1968; Russell- Lacy, Robinson, Benson, & Cranage, 1979; Suinn & Oskamp, 1969; Swensen,
1968; Wadeson & Carpenter, 1976).
16
According to the Buros Institute Test Reviews Online (n.d.[a]), the Draw-A-Person
(DAP) test was last reviewed in the Seventh Mental Measurements Yearbook in 1972. Roback
(1968) examined 18 years (1949-1967) of findings on the Draw-A-Person (DAP) Test. Overall,
the studies cited failed to support Machovers (1949) hypothesis, that drawing a person is a
natural vehicle for the expression of ones body needs and conflicts, and that the figure drawn is
related to the individual artist with the same level of intimacy characterizing that individuals
handwriting, gait, or any other of his or her own expressive actions. It was conc luded that there is
a great need for validated and standardized scales for the use of figure drawings in estimating
personality adjustment.
Swensens 1968 review of human figure drawing studies published since 1957 revealed
that the quality of research in this area had improved considerably. The evidence suggested that
the reliability of a specific aspect of a drawing is directly related to its validity: global ratings
were found to be the most valid and reliable, whereas individual signs were found to be the least
valid and reliable. It was also found that the presence of certain signs was related to the overall
quality of the drawings, and as such, it was suggested that future research should control for the
quality of the drawings. In contrast to Roback (1968), Swensen concluded that his findings
provided support for the use of human figure drawings in assessment.
In their 15-year review of the personality test literature, Suinn and Oskamp (1969) found
only a small amount of evidence relating to the valid ity of even the most popular tests, including
the DAP and the HTP. They summarized their findings as follows:
Reviewing the results of studies of drawing tests, it appears that their validity is
highly tenous [sic]. The major assumption of true body projection in drawings
does not necessarily hold. In addition, artistic skill may be a distinct influence on
drawings. Doubt has been cast on the hypothesis that the sex of the first drawn
figure is a good index of sexual identification, and the scar-trauma hypothesis has
had conflicting results. There is some evidence of the usefulness of the Draw-A-
Person Test in screening adjusted from maladjusted individuals or organics from
functional disorders, but the usefulness of the test in individual prediction is
limited. There has been very little worthwhile research on the ability of the
House-Tree-Person Test to predict diagnoses, personality traits, or specific
outcomes. (pp. 129-130)
17
This study further substantiated the case against the use of these tools.
Klopfer and Taulbee queried, Will this be the last time that a chapter on projective tests
appears in the Annual Review of Psychology? Will the Rorschach be a blot on the history of
clinical psychology? (1976, p. 543). These witty writers reviewed more than 500 journal articles
pertaining to projective techniques from 1971 through 1974, and determined that if projective
techniques are dead, some people dont seem to have gotten the message (p. 543). To justify
their hesitation in conducting a comprehensive literature review, Klopfer and Taulbee stated,
Even if one were to consider research on projective tests as the beating of a dead horse, a lot of
people seem eager to get in on the flogging, so much so that the voluminous nature of the current
literature makes an exhaustive review impossible (pp. 543-544). As such, they explored
problems of validation and stressed the three most widely used projective tests -- the TAT, the
Rorschach, and Human Figure Drawings. These three projectives accounted for more than 70%
of all the references identified. The most distinct contributions of tests were noted, especially
related to the finding that personality and motivation did not fit the behavioral or self- concept
categories. Klopfer and Taulbee concluded that psychologists would probably continue to
develop, use, and rely upon projective instruments as long as they maintain an interest in the
inner person and probing the depths of the psyche.
In 1976, Wadeson and Carpenter published a comparative study of art expression of
patients with schizophrenia, unipolar depression, and bipolar manic-depression. The artworks of
104 adult inpatients with affective psychoses and 62 inpatients with acute schizophrenia were
examined. Little support was provided for the hypotheses, and substantial within-diagnostic
group variability and between-group overlap was seen. However, some trends in the
hypothesized directions were identified, but these disappeared when a subsample of age- matched
patients was compared. Despite these findings, patient artworks and associations to the pictures
were found to be valuable in understanding the patient, regardless of diagnosis.
Russell- Lacy et al. (1979) studied the validity of assessing art productions made by 30
subjects with acute schizophrenia as a differential diagnosis technique. The subjects pictures
were hypothesized to differ from pictures by other acute psychiatric patients and by subjects with
no diagnosis. The only element found to be associated specifically with schizophrenia was
repetition of abstract forms. Factors associated with psychiatric admission, regardless of
diagnosis, included the presence of pictorial imbalance, overelaboration, childlike features,
18
uncovered space, detail, and color variety. It was concluded that the use of art as a technique in
differential psychiatric diagnosis is questionable.
In 1984, Dawson investigated differences between the drawings of depressed and
nondepressed adults. A method for obtaining objective scores for content and structural variables
was developed. Participants were patients of a Veterans Administration Medical Center who
scored either on the high end or the low end of the Beck Depression Inventory. It was
hypothesized that the drawings of depressed subjects would have less color, more empty space,
smaller forms, more missing details, more shading, and fewer extra details than those of
nondepressed subjects. It was also anticipated that specific contents would be found to be more
prevalent in the drawings of subjects who reported suicidal ideation and depressed subjects. A
linear combination of variables was expected to significantly differentiate the drawings of
nondepressed and depressed subjects. The Depressed group left significantly more empty space
in their drawings and included fewer extra details than the Nondepressed group. The difference
between the group means was in the predicted direction but was not significant for the variables:
size, color, missing details, and suicide symbols. A discriminant function analysis of the
variables did not discriminate between the drawings of the depressed and nondepressed subjects
above a chance level. Some support for the hypotheses was found, which provided a rationale for
continued research in this area. It was suggested that future research include the exploration of
other measures of depression as criteria for identifying the groups used to analyze drawing
variables, and the investigation of the structural variables, Empty Space, Size, Color, Extra
Details and Missing Details, in the drawings of other clinical groups.
Kahill (1984) examined the quantitative literature published between 1967 and 1982 on
the validity and reliability of human figure drawing tests used as projectives with adults.
Focusing on the assertions of Machover (1949) and Hammer (1958), Kahill discussed reliability
estimates and evidence pertaining to the body- image hypothesis. Validity of structural and
formal drawing variables (e.g., size, placement, perspective, size, and omission) and the content
of figure drawings (e.g., face, mouth and teeth, anatomy indicators, and gender of first-drawn
figures) was addressed, and the performance of global measures and the influence of
confounding factors was described. It was concluded that establishing the meaning of figure
drawings with any predictability or precision is difficult due to the inadequacies of figure
drawing research.
19
The historical foundations of projective and drawing assessment techniques derived from
the field of psychology established precedence for the development of similar techniques in the
field of art therapy. In the 1950s, when art therapy came about simultaneously in the United
States and in England, it was not long before art therapists identified a need for assessment
methods that could provide the client with a wider variety of fine art materials than merely a
pencil and a small piece of paper.
Development of Art Therapy Assessments and Rating Instruments
Art therapy assessments. Some of earliest standardized art therapy instruments that have
influenced the development of subsequent art therapy tools include the Ulman Personality
Assessment Procedure (Ulman, 1965, 1975, 1992; Ulman & Levy, 1975, 1992), the Family Art
Evaluation (Kwiatkowska, 1975, 1978), and Rawley Silvers tests (1983, 1988/1993, 1990, 1996,
2002).
The Ulman Personality Assessment Procedure (UPAP) had its beginnings in a psychiatric
hospital. In 1959 the hospitals chief psychologist began sending patients to art therapist Elinor
Ulman so that she could use art as a method of providing quick dia gnostic information (Ulman,
1965, 1975, 1992). Ulman developed the first standardized drawing series: materials included
four pieces of gray bogus paper and a set of 12 hard chalk pastels. The patient was asked to
complete the series of four drawings in one single session, and each drawing had a specific
directive. This diagnostic series was very influential in the development of other tools such as the
Diagnostic Drawing Series (DDS) (Cohen, Hammer, & Singer, 1988). Ulman did not develop a
standardized rating system for the UPAP, but she did make recommendations for future research
(Ulman & Levy, 1975, 1992). She suggested that instead of focusing on content of pictures, that
form and its correlation with personal characteristics might enhance reliability in the use of
art-based assessment (p. 402). Gantt and Tabone (1998) noted Ulmans recommendations and
designed a rating system with formal elements.
During her tenure at the National Institute of Mental Health from 1958 until 1973, art
therapist Hanna Yaxa Kwiatkowska (1975, 1978) developed a structured evaluation procedure
for use with families. Kwiatkowska, influenced by Ulmans seminal work, had this to say about
her contemporary: Her (Ulmans) exquisite sensitivity and broad experience allowed her to
provide important diagnostic conclusions drawn from four tasks given to the patients
investigated individually (1978, p. 86). Kwiatkoskas instrument, known as the Family Art
20
Evaluation, consists of a single meeting of all available members of the nuclear family. The
family is asked to produce the following drawings: 1) a free picture; 2) a picture of your family;
3) an abstract family portrait; 4) a picture started with the help of a scribble; 5) a joint family
scribble; 6) a free scribble. Following completion of the drawings, the art therapist facilitates a
discussion with the family about the artwork and the process. Kwiatkowskas evaluation is
significant as one of the earliest standardized evaluation procedures developed by an art
therapist.
Art therapist Rawley Silver became interested in assessment in the 1960s (Silver, 2003).
Her doctoral dissertation, The Role of Art in the Conceptual Thinking, Adjustment, and Aptitudes
of Deaf and Aphasic Children (1966), was influential in the development of Silvers
assessments, The Silver Drawing Test of Cognition and Emotion (SDT) (1983, 1990, 1996,
2002), and the Draw A Story (DAS) (1988, 1993, 2002). The SDT includes three
tasks: predictive drawing, drawing from imagination and from observation. The DAS is a semi-
structured interview technique using stimulus drawings to elicit response drawings, and has had
a considerable impact on the field of art therapy.
While Ulmans, Kwiatkowskas, and Silvers work was valuable in influencing the
development of additional art therapy instruments, more contemporary researchers have made
contributions to the development of systems to rate art-based assessment tools.
Rating instruments. To reiterate the AERA definition, a rating instrument is a scoring
procedure that enables the examiner to quantify, evaluate, and interpretbehavior or work
samples (1999, p. 25). Most art therapy assessment rating instruments are comprised of scales
used to determine the extent to which an element is present in a drawing (such as amo unt of
space used in the picture). A rating scale presents a statement or item with a corresponding
scale of categories, and respondents are asked to make judgments that most clearly approximate
their perceptions (Wiersma, 2000, p. 311).
Rating instruments vary in the types of scales that they use to measure test items.
Generally there are four types of scales, each of which has a different degree of refinement in
measuring test variables: nominal, ordinal, interval, and ratio (Aiken, 1997). The question is
which type of scale is the best to use in a rating instrument. In addition to using references from
the literature to address this question, the author surveyed members of the American
21
Psychological Association Division 5, Measurement, Evaluation and Statistics, via their
listserve.
Nominal and ordinal measures are convenient in describing individuals or groups (Aiken,
1997). With an ordinal scale that forces the rater to choose either good or bad, present or
not present, for example, consistent responses, resulting in higher inter-rater reliability, are
more likely (S. Rock, personal communication, February 21. 2005). However, comparing the
numbers in terms of direction or magnitude is illogical (Aiken, 1997). Furthermore, many things
are not simply yes or no in the real world, but gradations a yes/no answer loses a lot of
information and forces people to set a criterion for judgment. How true does it have to be before
I say yes (N. Turner, personal communication, Februrary 21, 2005)?
Used in conjunction with the Diagnostic Drawing Series (DDS) (Cohen, Hammer &
Singer, 1988), the DDS rating system (Cohen, 1986/1994) (Appendices D and E) is comprised of
23 scales. Many DDS scales are ordinal and force a choice between two items, such as yes or
no (presence or absence of a given item), yet the criteria that are being rated are not trivial (not
superficial and readily ratable) (A. Mills, personal communication, February 21, 2005).
The Descriptive Assessment of Psychiatric Artwork (DAPA) (Hacking, 1999), an
instrument used for rating spontaneous artwork (i.e., any type of drawing or painting), is
comprised of five scales: Color, Intensity, Line, Area, and Emotional Tone (Appendix F). Like
the DDS, the DAPA has ordinal scales.
The most precise level of measurement is the ratio scale, which includes a zero value to
indicate a total absence of the variable being measured (Aiken, 1997). This true zero, coupled
with the equal intervals between numerical values on a ratio scale, enables measurements to be
explained in a meaningful way. However, the more choices the rater has, such as with an interval
or ratio-type scale, the more likely the scores will vary from rating to rating (S. Rock, personal
communication, February 21. 2005). Another advantage of interval/ratio (sometimes referred to
as Likert-type or Graphic) scales is that many variables are gradations rather than just yes
or no (N. Turner, personal communication, Februrary 21, 2005). The trade off is that it takes
people longer to read Likert or interval/ratio scales than nominal/binary or ordinal check lists.
Intervals in measurement scalesare established on the basis of convention and
usefulness. The basic concern is whether the level of measurement is meaningful
and that the implied information is contained in the numerals. The meaning
22
depends on the conditions and variables of the specific study. (Wiersma, 2000, p.
297)
Graphic/interval rating scales are advantageous because they are simple and easy to use,
and they suggest a continuum and equal intervals (Kerlinger, 1986). In addition, these scales can
be structured with variations such as continuous lines, vertical segmented lines, and lines broken
into marked equal intervals. Six or seven choices (such as a scale ranging from very strongly
disagree to very strongly agree, with more moderate items in between) will increase
reliability (Hadley & Mitchell, 1995).
Likert scales (regardless of the number of rating points) are assumed by many to be
essentially interval, reflecting an underlying interval scale of measurement (S. Rock, personal
communication, February 21. 2005). Others argue that, while the underlying dimension is
interval (an even ratio), the scale is at best an ordinal scale. To overcome this ambiguity, many
professionals treat Likert scales as interval scales and move on with their work. The FEATS
(Gantt & Tabone, 1998) (Appendices G and H), developed for rating the Person Picking an
Apple From a Tree (PPAT) assessment drawings (Gantt, 1990), is an example of an equal-
appearing Likert/interval scale (L. Gantt, personal communication, January 31, 2005). The
intervals between the numbers on the FEATS scales cannot be assumed to be exactly equivalent
all along the scale for example, a four cannot be assumed to be exactly twice as much as a two.
A strength of the FEATS is that it provides sample drawings that guide the rater, thereby
increasing the FEATS reliability, but at a cost:
Such sample descriptions lengthen the raters task, particularly when it includes
many ratings. A graphic rating scale if often (but not always) sufficiently clear if
only the end points are given sample behavior descriptions and the idea of equal
intervals is allowed to carry this burden regarding the intermediate scale points.
(Hadley & Mitchell, 1995, p. 329)
Nominal and ordinal measures should not be compared in terms of direction or
magnitude, but they are more likely to produce consistent responses, whereas interval/ratio scales
can be compared in terms of direction or magnitude, but the scores will be more variable. So
which type of scale is better? The choice should be specific to the instruments purpose:
There is no conclusive evidence for using Likert-type versus binary-choice items
in rating instruments. The format that best represents the underlying construct
23
you are trying to measure should guide the selection of format. One must define
the purpose of the scale, weigh the pros and cons of various format options
including score interpretationand make the best choice while being aware of
the limitations of score interpretation due to format. Both methodsand points of
viewhave value as long as we realize their limitations as well. (B. Biskin,
personal communication, February 22, 2005)
Content checklists are sometimes included in an art based rating instrument, separate
from the scales. As the term checklist suggests, these are typically comprised of nominal items
that would be checked on the list as either present or not present in a drawing. For example, in
addition to the 23 DDS scales, there is a content checklist. The FEATS also has a content tally
sheet. The rater is asked to place a checkmark for all items they see in the picture they are rating,
such as whether the orientation of the picture is horizontal or vertical.
The preceding discussion sheds light on the importance of thorough rating instrument
design. An assessment is only as good as the system used to rate it, because well-constructed,
standardized scales for rating artwork are vital in order to validate assessment findings and to
determine the reliability of subjects scores.
Assessment is an imperfect science, and as has been presented in this literature review,
the variety and quality of the research is diverse. An in-depth discussion of this hotly debated
topic is included later in the chapter. First, the broad implications of working with assessment
tools should be considered, beginning with the reasons for their importance.
Why Art Therapy Assessments Are Important
It is the consensus of most mental health professionals, agency administrators, and
insurance companies that regardless of the formality or structure, assessmentand reassessment
at appropriate timesconstitutes the core of good practice (Gantt, 2004, p. 18). Furthermore,
funding for research, treatment, or programming is only provided to those who demonstrate the
efficacy of treatments or interventions. Standardized assessments are fundamental to all
disciplines that deal with intervention and change, including the field of art therapy.
Assessme nts are used in different settings to plan intervention or treatment and to
evaluate results (Gantt, 2004). Some examples of this include: the Federal governments mandate
on the use of assessments in certain facilities; the Joint Commission on Accreditation of
Healthcare Organizations (JCAHO) regular inspection of evaluation and assessment procedures
24
in selected institutions; public schools use of standardized tests for the Individualized Education
Plan (IEP); and the National Council on Agings delineation of standards for assessment.
It is thought that the ongoing use and development of art therapy assessment tools is
important to the advancement of the field. It has been suggested that exploration of original ways
to evaluate clients in art therapy be encouraged, as creative investigation can be fruitful (Betts,
2003, p. 77). Many art therapy practitioners believe that assessments provide increased
understanding of a clients developmental level, emotional status, and psychological framework.
Such tools are also used for formulating treatment goals and gaining a deeper understanding of
the clients presenting problems.
Clinicians are under pressure to demonstrate client progress in therapy. In art therapy, for
instance, an assessment can be administered at the outset of treatment, during the middle phase
of treatment, and again upon termination of services, and the artwork can be compared to
determine the course of client progress. When practitioners and institutions are accountable for
charting and reporting client progress, treatment standards are raised, and this has a trickle-down
effect that tends to improve the quality of treatment a client receives (Deaver, 2002; Gantt &
Tabone, 2001).
Deaver (2002) asserted that research might be beneficial in understanding the efficacy of
various art therapy assessments, techniques, and approaches used with clients. She presented
basic descriptions and examples of qualitative and quantitative approaches to art therapy
research, and put forth ideas to bridge the gap between research and practice, within the context
of providing improved services for clients.
In 2001, Gantt and Tabone presented data on the PPAT and FEATS, to demonstrate how
these instruments assisted them in making clinical decisions and identifying predictor variables.
They found that PPAT drawings served as an effective aid in predicting how patients would
respond to specific treatments, and this resulted in shorter treatment time. Shorter treatment
helped the hospital to increase their efficiency, and to provide patients with improved quality of
treatment.
Advocates of assessment assert that the various instruments help to provide meaningful
information about clients (Rubin, 1999). Art therapy assessments may be beneficial in terms of
observing patterns, generating comparative data, and addressing issues of reliability and validity
(Malchiodi, 1994). Assessments are used to gain insight into a clients mood and psychological
25
state, and to unveil diagnostic information. A clients spontaneous artwork can also be used to
make diagnostic impressions. Art can tell us much not only about what clients feel but also
about how they see life and the world, their unique flow of one feeling into another, and the deep
structure that underlies this flow of feeling (Julliard & Van Den Heuvel, 1999, p. 113). A
clients artwork enables the therapist to perceive the clients socio-cultural reality: the clients
feelings about himself or herself, his or her family, environment, and culture.
Although there are many benefits to justify the use of art therapy assessment techniques,
there are also several problems. Most clinicians have mixed opinions about the applicability of
assessments. These are elucidated in the next section.
Issues With Art Therapy Assessments
Practical Problems
Some of the problems with art therapy assessment instruments are concrete and relate to
lack of scientific rigor. Many tools are generally deficient of data supporting their validity and
reliability, and are not supported by credible psychological theory (McNiff, 1998). Those who
choose to assess clients through art have neglected to convincingly address the essence of
empirical scientific inquiry findings that link character traits with artistic expressions;
replicable results based upon copious and random data; and uniform outcome measures which
justify diagnosis of a client via his or her artwork.
Gantt and Tabone (1998) identified two problematic methods that were used in the
formative years of assessment. Psychologists employed a testing approach, and looked for
nomothetic (group) principles, stressing validity and reliability. Their principles were based on
personality characteristics. The disadvantage to this method was that it used a sign-based
procedure that took material out of context. Conversely, psychoanalysts and art therapists
perceived art as a reflection of mood or progress and attempted to understand the individual
more thoroughly. This approach was faulty in that it lacked scientific rigor.
Interpretation of pictorial imagery is highly subjective (McNiff, 1998). It is a challenge to
maintain objectivity and thereby establish validity in assessing art because artistic tensions
within the total form are images of intuitively felt activity (Julliard & Van Den Heuvel, 1999, p.
114); and because art expresses a state of constant change and growth. Furthermore, unless the
crudest diagnosis patient or normal can be made with sufficient precision, the assumption
that paintings and drawings contain data related in regular ways to psychopathological categories
26
lies open to serious question (Ulman & Levy, 1992, p. 107). For example, how can a patients
drawing accurately reveal whether he or she has schizophrenia? As Golomb (1992) maintained
in her critical review of projective drawing tests relating to the human figure, far-reaching
conclusions and the inconsistent results of numerous replication studies indicate the dubious state
of research in this area. For example, claims that the human figure drawn by an individual
relates intimately to the impulses, anxieties, conflicts and compensations characteristic of that
individual remain difficult to demonstrate due to problems of measurement validity (Swensen,
1968).
Many studies with children have used matched groups, but have failed to control for age
differences (Hagood, 2002). The literature is poorly cumulated, i.e., there is a lack of orderly
development building directly on the older work (Hacking, 1999, p. 166). Many studies
conclude with unacceptable results and call for further research: such studies reflect the
casualness with which many art therapists approach the art therapy assessment process (Gantt,
2004; Mills & Goodwin, 1991; Phillips, 1994).
A major flaw in much of the art therapy assessment research relates to poor methods of
rating pictures, and inappropriate use of statistical procedures for measuring inter-rater
reliability. Hacking (1999) cited several studies that used unsuitable methods to determine inter-
rater reliability. Nine such studies (Bergland & Moore Gonzalez, 1993; Cohen & Phelps, 1985;
Gantt, 1990; Kaplan, 1991; Kirk, & Kertesz, 1989; Langevin, Raine, Day, & Waxer, 1975a;
Langevin, Raine, Day, & Waxer, 1975b; McGlashan, Wadeson, Carpenter, & Levy, 1977;
Wright & Macintyre, 1982) indicated a high value of r interpreted as an indication of good
agreement. However, the correlation was stated to be inappropriate in this context, because the
correlation coefficient is a measure of the strength of linear association between two variables,
not agreement (p. 159). Furthermore, assessing agreement by a statistical method that is highly
sensitive to the choice of the sample of subjects is unwarranted. Hacking cited Kays (1978)
famous study for incorrectly judging agreement by using a Chi square test, which is also a test of
association.
A study by Wadlington and McWhinnie (1973) was criticized by Hacking as using the
comparison of means by a paired t-test, which is a hypothesis test. Similarly, Hacking found that
the Russell- Lacy et al. (1979) study used 60 judges in groups of 10 to rate 5 pictures and
compared the variation between scores of 0-10 agreements between groups, using the category
27
ranking test Friedmans Anova; however, this test is also inappropriate for determining inter-
rater reliability, as it is yet another test of association.
Methods cannot be deduced to agree well because they are not significantly
different. A high scatter of differences may well lead to a crucial difference in
means (bias) being non significant. Using this approach, worse agreement
decreases the chance of finding a significant difference and so increases the
chance that the methods will appear to agree. Despite the authors claims of good
statistical agreement in study 69 (Wadlington & McWhinnie, 1973), most of the
discussion reported their difficulties with the measure seriously affected their
study results and recommended a shorter form for better reliability. (Hacking,
1999, p. 160)
Hacking suggested, the simplest approach is to see how many exact agreements exist
(p. 160). She cited 7 studies that reported percentage agreement by tables of elements or overall
agreement (Table 1).
Table 1. Studies Reporting Percent Agreement (Hacking 1999)

Mills, Cohen & Meneses 1993a 1993b 95.7% agreement for 2 raters decreased to 77% for 29
raters.
Silver & Ellison 1995a and 1995b 94.3% agreement for 2 raters decreased to 61% for 10
raters.
Cohen & Phelps 1985 Good agreement for 2 raters decreased to poor
agreement for 4 raters.
Miljkovitch & Irvine 1982 Reported 0.96 which, it is assumed, represents
percentage agreement as there is no other information.
Percentage agreement figures appeared to be
reasonably high, but these could be unreliable when
more raters are added.
Sims, Bolton, & Dana 1983 Reported 0.97 which, it is assumed, represents
percentage agreement as there is no other information,
percentage agreement figures appeared to be
reasonably high, but these could be unreliable when
more raters are added.
Hackings analysis of the DDS (Cohen, Hammer & Singer, 1988) further illustrates the
problems with art-based assessments:
28
The DDS [Cohen, Hammer, & Singer, 1988] (is) one of few tests which
attempt to validate, reliably rate their instrument and encourage replications.
Described as a standardised [sic] evaluation supported by extensive research
[Cohen, Mills, & Kijak, 1994], only 3 interrater studies have been included in
this analysis: study 48 [Mills, Cohen, & Meneses, 1993a] reports agreement
scores from 77-100% over 23 categories, giving 95.7% overall after 2 months
training of the 2 main authors rating 30 sets of drawings by undescribed
subjects. Study 49 [Mills, Cohen, & Meneses, 1993b] reports only 77%
agreement between 29 naive raters performing the same measurements. Study 52
[Rankin, 1994] reports 96% agreement between raters of 4 details in tree
drawings, by 30 patients with post traumatic dissociative disorder and 30
controls, taken from the DDS rating guide and protocol. Other studies used
peculiar methods and were not included in this analysis. (1999, p. 61)
Two additional weaknesses related to inter-rater reliability were found in the calculation
of agreement (Hacking, 1999): (1) a lack of accounting for where the agreement is located in the
table, and (2) the fact that some agreement between raters is expected by chance. Hacking
suggested that it would be more reasonable to consider agreement in excess of the amount by
chance (p. 162), and found that Langevin and Hutchins (1973) study was the only one which
met this criterion. Hacking concluded that the best approach to this type of problem is that
adopted by Knapp (1994), and McGlashan et al. (1977), the Kappa statistic. Kappa may be
interpreted as the chance corrected proportional agreement, but Hacking emphasized that is
important to show the raw data, which the Knapp and McGlashan studies failed to do. In support
of this statement, Hacking cited Neales (1994) application of the DDS to children as having a
much lower level of reliability than that reported by Mills (1993a, 1993b): only 12 variables
reached significance using the Kappa measure of agreement between 2 raters (Hacking, 1999, p.
162). The multitude of problems with art therapy assessment instruments and supporting
research, particularly related to inter-rater reliability, are evident.
Philosophical and Theoretical Issues
Some art therapists are opposed to the use of art therapy assessments, and have suggested
that efforts to link formal elements in artwork with psychiatric diagnosis be abandoned (Gantt &
Tabone, 1998). These individuals fear that rigid, reductionistic classification robs artwork of its
29
uniqueness and meaningfulness and suggest that there are other ways to look at art. Wadeson
contended that art therapists reliance on assessment instruments reflects a longing for magic
(2002, p. 170). Suspicious of attempts to interpret artwork, Maclagan stated, if there is an art
in this analytic work, then it is all to often a devious, detective art, concerned with un-doing what
the pictorial image is composed of and weaving into it a web of its own devising (1989, p. 10).
Kaplan asserted, a wealth of information can be gathered just by discussing the art with the
client (2003, p. 33), thus inferring that open discussion might elicit information to supplement
the artwork.
Some believe that assessments should be conceptually-based, deriving constructs such as
attachment theory, development theory, etc. (D. Kaiser, personal communication, May 9, 2004).
These individuals assert that focus ing on theory is a more strengths-based framework for
understanding a client in a way that is systematically and contextually informed: Discerning
strengths, developmental position, and attachment security while considering gender, culture,
family form, etc., of the client seems more fitting for shaping art therapy interventions for
therapeutic change. This school of thought is diametrically opposed to the position held by
those who value the medical model and who are tied to the DSM, such as Barry Cohen and Anne
Mills (DDS authors) and Linda Gantt and Carmello Tabone (PPAT and FEATS researchers).
McNiff (1998) further anchored the stance that formal assessment methods, whether
theoretically or medically based, are ineffective:
Searches for pathology in artistic expression will inevitably lead to futile attempts
to categorize the endless variations of expression. The primary assumptions of art
psychopathology theories are unreliable since, as Prinzhorn shows, emotionally
troubled people are capable of producing wondrous artworks which compare
favorably with the creations of artists and children. These analyses of human
differences are incessantly variable whereas the search for more positive and
universal outcomes of artistic activity suggests that mentally ill people can use
creative expression to transform and overcome the limitations of their conditions.
(pp. 99-100)
McNiff believed that categorization of pathological elements in art is a futile
direction to pursue. Is there a more acceptable direction to be taken? Can a compromise
between the two camps be achieved? In looking to the future, and identifying
30
recommendations of previous researchers, perhaps there are ways to improve and
advance this area.
Avenues For Improving Art Therapy Assessments
Practitioners in a variety of mental health professions have developed and used
assessment tools, and there is every reason to believe that they will continue to do so. The
various issues and problems point to a need for improvement of art therapy assessment and
research education, and of methodologies for developing new tools, existing assessments, and
rating procedures. It has been suggested that art therapists study the problems thoroughly and
learn from previous mistakes (Gantt, 2004). Furthermore, some believe that valid and reliable
art-based assessments and rating instruments can and must be developed by art therapists.
D. Kaiser (personal communication, May 9, 2004) asserted that art therapists are a
pluralistic group and that this will serve them well in the long run, providing that they can end
the debate and accept the range of approaches that have evolved. Furthermore, the use of art in
assessment is still in its infancy, and even though valid and reliable tools have yet to be
developed, this should not prevent such tests from being created (Hardiman, Liu, & Zernich,
1992).
Klopfer and Taulbee recommended that research on projective tests account for whether
the variable being investigated is symbolic, conscious, or behavioral, and that until this happens,
such investigations will be like comparing walnuts with peaches and coming up with little other
than fruit salad (1976, p. 544). It was further suggested that acute and chronic phenomena, state
and trait phenomena, and behavioral and symbolic characteristics be distinguished from one
another. The authors stressed that behavior should be the focus of personality assessment, since
this is usually the reason that a patient is referred for treatment. Self-concept was also identified
as a quality that should be evaluated, since it has an impact on an individuals decision- making
process and behavior. Finally, it was suggested that symbolic or private personality traits also be
examined, since people are often motivated by their unconscious drives.
Public forums such as conferences and egroups have enabled art therapists to discuss the
various problems surrounding assessment, and to identify potential solutions. At the annual
national art therapy conferences in recent years, three panel presentations about assessment have
been presented (Cox, Agell, Cohen, Gantt, 1998, 1999; Horovitz et al., 2001). The Cox et al.
(1998) presentation provided a review of the UPAP, the DDS, the PPAT and FEATS
31
instruments. Samples of each of these protocols that were completed by patients with specific
disorders (psychotic, mood, personality, and cognitive) were shown. This prompted a stimulating
discussion between the panelists and the audience about art therapy assessment, and generated
interest in a follow- up panel the subsequent year. Thus, in 1999, Cox et al. came together again
to address the uniqueness of each procedure more specifically. The goal was to demonstrate
when, where, and with whom any one of the three instruments would be most appropriate to use.
Slides of UPAP, DDS, and PPAT drawings collected from one patient with schizophrenia, one
with major depression, and one with a cognitive disorder, were displayed. The panelists then
compared and contrasted the different outcomes of each of the protocols as they pertained to
each patient. This provided the audience with a unique opportunity to learn about and discuss the
benefits and weaknesses of each assessment directly with the art therapists who designed and/or
developed the actual tools.
Other attempts to promote understanding of art therapy assessments and research include
an informal survey of assessment use in child art therapy (Mills & Goodwin, 1991); a literature
review of projective techniques for children (Neale & Rosal, 1993); and a small-scale weighted
analysis of published studies on art therapy assessment research (Hacking, 1999).
Mills and Goodwin (1991) distributed surveys at a national art therapy conference to
determine how art therapists use assessments with children. The 37 of 100 questionnaires that
were returned revealed that participants were more familiar with projective tools than with art
therapy assessments. Most respondents indicated that they preferred instruments that relied on
modifications of existing art therapy techniques and projectives, and unpublished assessments.
The authors concluded that there is a vast diversity among art therapists in training and approach
to assessments, combined with a keen desire to innovate.
Neale and Rosal (1993) reviewed and evaluated 17 empirical studies on the subject of
projective drawing techniques (PDTs) published between the years 1968 and 1991. Studies were
grouped by the type of the test used: human figure drawings (HFDs), House-Tree-Person (HTP),
kinetic family and school drawings, and idiosyncratic PDTs. HFDs were found to be reliable as a
predictor of the performance of learning-related behaviors and as a measure of learning
disabilities. The HTP was free of cultural bias. The kinetic family drawings were found to have
solid concurrent and test-retest reliability, while the kinetic school drawings had strong
concurrent validity when correlated with achievement measures. Idiosyncratic PDTs were found
32
to be the weakest of the tests. The authors identified four research methods that improved the
rigor of the studies (p. 47):
(1) The use of objective criteria on which to score variables;
(2) The establishment of interrater reliability;
(3) The collection of data from a large number of subjects;
(4) The duplication of data collection and appropriate analysis procedures to
establish effectiveness and reliability of previously studied projective drawing
instruments.
It was suggested that adoption of these four methods would help art therapists to improve
the quality of assessment research.
In order to overcome some of the more theoretical and philosophical issues with art-
based instruments, Gantts (1986), Burts (1996), and McNiffs (1998) views may provide some
direction. Gantt (1986) examined alternative models for research design and strategies from the
fields of anthropology, art history, and linguistics, and suggested that these may have useful
implications for art therapy research methods because they concentrate on human behavior and
the products of human behavior (i.e., art, culture, language). An American-based doctoral
program recently mandated that its research curricula be updated to include the teaching of
historical, linguistic, feminist, artistic, and other modes of disciplined inquiry (S. McNiff,
personal communication, May 10, 2004). Burt (1996) emphasized qua litative approaches, which
she considered to be more closely related to postmodern feminist theory.
Gantt (1986) contended that in order to understand clients more fully, their literary and
visual traditions and their cultural rules must be considered in addition to their intra-psychic
processes. McNiff (1998) shared a similar view, asserting that research methods that engage both
artistic and interpersonal phenomena need to be identified, since the art therapy relationship is a
partnership between these elements.
McNiff said that it is important to consider the total context of what a person does, and
not to base an evaluation strictly on an interpretation of isolated images (1998, p. 119). Kaplan
(2003) also asserted that both the clients reactions to engaging in art- making, coupled with
interpretation of global features of the art, can produce significant findings.
An alternative to examining artwork through the traditional approaches is the
phenomenological device of bracketing, which involves the withholding of judgment when
33
approaching objects of inquiry (McNiff, 1998, p. 118). Rubin (1987) described a similar
approach that of looking at the experience of art therapy, until central themes begin to
emerge with an openness that does not categorize.
Phenomenological approaches have been used by Quail and Peavey (1994) and Fenner
(1996). Quail and Peavy (1994) presented a case study of an adult female and described a
subjective experience as it is lived approach to the art therapy experience. Over the course of a
16-week art therapy group, the subject described the process and her feelings through five
unstructured interviews. Quail and Peavy used Colaizzis (1978) method of extracting significant
statements and thereby revealed the subjects progression from preintentional experiencing, to a
fully intentional relationship with the object and patterns, to the formation of meaning in the art-
making process. In Fenners (1996) study, the client was the researcher, and art therapy was the
subject. Over a period of approximately two months, the client engaged in brief drawing
experiences of five minutes per sitting in order to determine whether personal meaning would be
enhanced and therapeutic change would be achieved. In employing a phenomenological
approach, both the Quail and Peavy and Fenner studies offer an alternative method of evaluating
clients, one that is different from the traditional, empirical approach.
The use of portfolio review for assessment purposes in art education has useful
implications for art therapy (McNiff, 1998). Art educators typically review their students
artworks over the course of the school year, in order to determine skill and assign a grade. The
portfolio review in art therapy entails the amassing of artworks created by the client, which
allows for tracking of changes in the artwork over time, and for common themes to emerge.
McNiff suggested that the review could be beneficial because it provides for a comprehensive
amalgamation of the clients interpretations of his or her own artwork, assessments, and
transcripts of sessions. Adapting the art education format of portfolio review and assessment for
use in art therapy would enable the therapist to gain a more comprehensive sense of the clients
presenting problems, evidence of progress, etc. This method would be most appropriate for use
in long-term treatment settings, where clients could amass their art therapy products over the
course of time.
Computer Technology and the Advancement of Assessment Methods
According to the National Visual Art Standards, The art disciplines, their techniques,
and their technologies have a strong historic relationship; each continues to shape and inspire the
34
other (National Art Education Association, 1994, p. 10). It is believed that existing and
emerging technologies influence art education due to its dependence on art media (Orr, 2003),
and the same is likely true for art therapy.
Art therapists are using technology in ways that are likely to advance the area of
assessment. The increasing popularity and user friendliness of computer technology is making
the digital storage of client artwork more practical for art therapists: Computer technology will
revolutionize possibilities for creative analysis, presentation, and communication of art therapy
data (McNiff, 1998, p. 203).
Art therapist researchers Linda Gantt and Carmello Tabone are developing a website for
the PPAT and FEATS (L. Gantt, personal communication March 10, 2004). The site will provide
information about the PPAT assessment and FEATS manual, will make rating sheets and related
materials available for downloading, and will enable art therapists to enter data from the PPATs
they collect. Only those researchers based in America who demonstrate that they are
accumulating the PPATs and adhering to the FEATS scoring instructions will be permitted to
enter their data. This will help to ensure the development of a representative sample that could be
used for norming purposes.
Analyzing The Research on Art Therapy Assessment Instruments
Another avenue to improve art therapy assessment instruments would be to increase
researchers comprehension of former investigations and provide recommendations for
improvements. A synthesis of the existing research, such as the present study, is an effective way
to achieve this end. Since there are currently no published systematic analyses or comprehensive
literature reviews of research on art therapy assessments or rating instruments, an exploration of
reviews and analyses in the related fields of psychology and creative arts therapies is warranted.
Syntheses of psychological assessment research. Meta-analysis techniques have been
used in many studies to synthesize research on psychological assessment (Acton, 1996; Garb,
2000; Garb, Wood, Nezworski, Grove, & Stejskal, 2001; Hiller, Rosenthal, Bornstein, Berry, &
Brunell-Neuleib, 1999; Meyer & Archer, 2001; Parker, Hanson, & Hunsley, 1988; Rosenthal,
Hiller, Bornstein, Berry, & Brunell-Neuleib, 2001; Spangler, 1992; Srivastava, 2002; West,
1998). Of particular interest is Actons (1996) research involving three studies that were carried
out to examine the empirical validity of individual features of human figure drawings as
measures of specific forms of psychopathology. This is a model study because unlike several
35
comprehensive and influential past reviews, it grouped findings by construct and employed
techniques to determine effect sizes and test their significance. In addition, Actons second study
used the results of the previous meta-analysis to develop drawing scales for four specific
constructs of psychopathology: anger/hostility, anxiety, social maladjustment, and thought
disorder. Finally, the third study in Actons research was a full replication of the second using a
new sample of young offenders, and is relevant because the results suggest some potential for
aggregates of individual drawing features to provide valid measures of specific forms of
psychopathology.
Some of the studies identified in the present literature search provide information about
the application of specific meta-analytic techniques. For example, in Rosenthal et al.s (2001)
study of meta-analytic methods, the Rorschach, and the MMPI, the authors asserted that research
synthesists must compute, compare, and evaluate a variety of indices of central tendency, and
they must examine the effects of (mediator) variables (p. 449). Other useful elements in this
article include commentary on the use of Kappa versus phi, combining correlated effect sizes,
and possible hindsight biases.
Garb (2000) reanalyzed Wests (1998) meta-analytic data on the use of projective
techniques, including the Rorschach test and HFDs, to detect child sexual abuse. West had
located 12 studies on detecting sexual abuse and 4 studies on detecting physical abuse, and
excluded nonsignificant results. In reanalyzing the data from Wests 12 studies on sexual abuse,
Garb calculated new effect sizes using the nonsignificant and significant results. In many of the
studies it was found that none of the projective test scores had been well replicated, and that
many of those that had reported validity were actually flawed. It was concluded that projective
techniques should not be used to detect child sexual abuse.
In 2001, Garb et al. wrote a commentary about the articles that encompassed the first
round of the Special Series on the Rorschach. Viglione (1999) and Stricker and Gold (1999)
failed to cite negative findings and praised the Rorschach. Although one of Dawes (1999) data
sets was flawed, he obtained results that provided modest support for the Rorschach. Hiller et al.
(1999) reported the results of a meta-analysis, but there were problems, including the fact that
their coders were not blind to all of the studies results. Hunsley and Bailey (1999) found that
there is no scientific basis for using the Rorschach, and provided ample support for this
conclusion.
36
Hiller et al. (1999) cited the Atkinson, Quarrington, Alp, and Cyr (1986) and Parker et al.
(1988) meta-analytic studies. Average validity coefficients for the Rorschach and the MMPI
were found to have similar magnitudes, but methodological problems in both meta-analyses were
thought to have impeded acceptance of these results (Garb, Florio, & Grove, 1998). Thus, Hiller
et al. conducted a new meta-analysis comparing criterion-related validity evidence for the
Rorschach and the MMPI. The unweighted mean validity coefficients (rs) were .30 for MMPI
and .29 for Rorschach, and they were not reliably different (p = .76 under fixed-effects model, p
= .89 under random-effects model). The Rorschach had larger validity coefficients than the
MMPI for studies using objective criterion variables, whereas the MMPI had larger validity
coefficients than the Rorschach for studies using psychiatric diagnoses and self-report measures
as criterion variables.
Authors of the final article in the Special Series on The Utility of the Rorschach for
Clinical Assessment, Meyer and Archer (2001), provided a summary of this instruments current
status. Global and focused meta-analyses were reviewed, including an expanded analysis of
Parker et al.s (1988) data set. Rorschach, MMPI, and IQ scales were found to have greater
validity for some purposes than for others, but all produced roughly similar effect size. Eleven
salient empirical and theoretical gaps in the Rorschach knowledge base were identified.
Parker, Hanson, and Hunsley (1988) located articles from the Journal of Personality
Assessment and the Journal of Clinical Psychology between 1970 and 1981 on the MMPI, the
Rorschach Te st, and the Wechsler Adult Intelligence Scale (WAIS). The average reliability,
stability, and validity of these instruments was estimated. Validity studies based on prior
research, theory, or both had greater effects than did studies lacking an empirical or theoretical
rationale. The reliability and stability of all three tests was found to be approximately equivalent
and generally acceptable. The convergent- validity estimates for the Rorschach and MMPI were
not significantly different, but both of these were lower than was the WAIS estimate. The
authors concluded that both the MMPI and Rorschach could be considered to have sufficient
psychometric properties if used according to the purpose for which they were designed and
validated.
Two meta-analyses of 105 randomly selected empirical research articles on the TAT and
questionnaires were conducted by Spangler (1992). Correlations between TAT measures of need
for achievement and outcomes were found to be generally positive, and these were quite large for
37
outcomes such as career success measured in the presence of intrinsic, or task-related,
achievement incentives. Questionnaire measures of need for achievement were also found to be
positively correlated with outcomes in the presence of external or social achievement incentives.
On average, TAT-based correlations were found to be larger than questionnaire-based
correlations.
Srivastava (2002) conducted a meta-analysis of all the studies on Somatic Inkblot Series
(SIS-I) published in the Journal of Projective Psychology and Mental Health from 1994-2001.
The purpose was to provide normative data by combining means and standard deviations of
existing studies and to determine whether SIS-I indices could differentiate various groups. For
intergroup comparison, critical ratios were computed on combined mean and standard deviation.
The comparison groups were in fact significantly differentiated by the SIS-I indices.
In 1998, West meta-analyzed 12 studies to assess the efficacy of projective instruments in
discriminating between sexually abused children (CSA) and non-sexually abused children. The
Rorschach, the Hand Test, the TAT, the Kinetic Family Drawing, the Human Figure Drawing,
Draw Your Favorite Kind of Day, the Rosebush: A Visualization Strategy, and HTP were
reviewed. An over-all effect size was determined to be d = .81. Six studies included a clinical
group of distressed non-sexually abused subjects and the effect size lowered to d = .76. The
remaining six studies used a norm group of nonabused children with the sexually abused group,
and the average effect size was d = .87. These effect sizes indicated that projective instruments
could effectively discriminate distressed children from those who were non-distressed. Although
most assessment tools seem to be able to differentiate between a normal group of subjects and
subjects who experienced sexual abuse during childhood, it could only be inferred that an
instrument is able to detect nondistress from some type of distress. The inclusion of the clinical
group with no history of sexual abuse generated the necessary data to support the assertion that
an instrument can discriminate a CSA subject from other types of distressed subjects. The
outcomes are in the medium to large range, despite the fact that the inclusion of the clinical
group tended to result in a lower effect size to discriminate the CSA subjects. The lower ranges
of power that resulted from inclusion of clinical groups were attributed to the fact that symptoms
often associated with CSA become evident in clinical disorders.
Syntheses of creative arts therapies assessment research. Several investigators in the
creative arts therapies have applied meta-analytic techniques to bodies of assessment research
38
(Conard, 1992; Hacking, 1999; Hacking, 2001; Loewy, 1995; Oswald, 2003; Ritter & Low,
1996; Scope, 1999; Sharples, 1992). Hackings work is the most directly relevant: in her analysis
of art-based assessment research, she endeavored to identify the central importance of
developing systematic, content-free assessments of psychiatric patients paintings (2001, p.
165). Hacking conducted an analysis of the literature that revealed the best repeatability in order
to put this literature on equal footing. She used the resulting data to provide a rationale for
developing the Descriptive Assessment for Psychiatric Art (DAPA). Hackings use of meta-
analysis techniques established a foundation for a more comprehensive examination of the art
therapy assessment research.
Loewy (1995) found that music therapists were not making use of published music
therapy assessment tools, and that the tools were being used to measure music primarily in
educational or behavioral terms, rather than to incorporate music as part of a psychodynamic
relationship. In an effort to further understand why music therapy assessments were being used
in this way, Loewy studied psychotherapeutic practices with emotionally handicapped children
and adolescents, and employed a hermeneutic inquiry into a music psychotherapeutic assessment
experience. A panel of five prominent music psychotherapists viewed a 50- minute music
psychotherapy first session assessment video and developed written assessment reports. Each
panelist was interviewed for ninety minutes. The interviews were transcribed, systematized, and
analyzed in conjunction with a preliminary analysis of the panel participants original reports.
Loewy then applied meta-analysis techniques to determine the impact that a therapists musical
background, orientation and training, and personal history have on the way that he or she assigns
meaning to an initial music therapy experience. The final analysis yielded five categories of
semantic agreement: (1) Affect Joy and Anxiety; (2) Structure Time, Basic Beat, and
Boundary; (3) Approval-Seeking Behavior; (4) Creativity Improvisation, Investment/Intent,
and Spontaneity; and (5) Symbolic Associations of Instruments. There were four categories of
semantic difference: (1) Music; (2) Cognition; (3) Singing in Tune; and (4) Rhythmic
Synchrony. Finally, the panel participants areas of specialization were noted: Theme, Success in
the Child, Affect/Congruence, Transference/Countertransference and Horns.
In 1992, Conard conducted a meta-analysis of research that examined the effect of
creative dramatics on the acquisition of cognitive skills. The following areas were investigated:
(1) the achievement of students involved in creative dramatics as compared to traditional
39
instructional methods; (2) the impact of sample and study characteristics on outcomes; and (3)
the effects of methodology and research on outcomes. Each study was weighted independently,
thus accounting for the variety in group size across studies. For studies in which creative
dramatics was applied, a mean effect size of 0.48 was calculated. Creative dramatics was found
to be more effective at the pre-school and elementary level than at the secondary level. Remedial
and regular students appeared to enjoy participating in and benefit from creative dramatics.
Studies that were carried out in private schools produced larger effect sizes than those that took
place in public schools. The quantitative analysis was combined with qualitative reviews, and the
qualitative data enhanced the results of the meta-analysis considerably. However, measurement
characteristics such as validity and reliability, and other details of the dependent measures, were
frequently excluded in the studies. Conard concluded that future research should include more
detailed information about methodology, procedures, and how effects are measured.
Oswald (2003) examined the popularity of meta-analysis in psychological research and
presented information about techniques applicable to the arts in education. Instructions were
provided to assist the researcher in computing the meta-analytic mean and how it should be
interpreted. Guidelines for making statistical artifact corrections in a meta-analysis were
discussed. The statistical power of meta-analysis was investigated with respect to detecting true
variance across a set of study statistics once corrections have already been made. A set of
conceptual issues was presented to address the detection of mediator effects across studies.
Standardized effect sizes for case-control studies of dance/movement therapy (DMT)
were calculated in Ritters and Lows (1996) meta-analysis of 23 studies. Summary statistics
reflecting the average change associated with DMT were produced, and the effectiveness of
DMT in different samples and for varying diagnoses was examined. It was determined that the
methodological problems identified in the DMT research could be addressed with the use of
standardized measures and inclusion of adequate control groups. DMT was found to be an
effective treatment for a variety of patients, especially those coping with anxiety. It was further
concluded that adults and adolescents benefit from DMT more than do children.
In 1999, Scope conducted a meta-analysis of the literature on creativity to examine the
effects of instructional variables (such as time spent on instruction, reviewing previous lessons,
etc.) on increases in creativity in school-aged children. All accessible studies, including
published and non-published, were located. The subjects ranged from preschoolers to high
40
school students. Instruction was found to have a positive effect on the childrens creativity, and
there was a modest positive correlation between creativity and independent practice. However,
the instructional variables of time spent on instruction, structuring, reviewing, questioning, and
responding were not found to have an impact on creativity. Additional variables or combinations
thereof might have caused the increases in creativity. An exploratory, qualitative review of three
exceptional studies revealed that the most successful treatments were motivating for the subjects,
were developmentally appropriate, had high treatment compliance, and had high levels of
teacher-student interactions.
Sharples (1992) reviewed 27 experimental studies published between the years 1970 and
1989 from the fields of art education and psychology, and conducted a qualitative meta-analysis
in order to investigate the relationship between social constraints, intrinsic motivation, and
creative performance.
Conclusion
The foundations of art therapy assessment instruments provided information about
historical milestones in psychological evaluation that influenced the development of art therapy
tools. The review of the literature on some of the first widely used tools indicated that the use of
projective techniques and art-based instruments is questionable due primarily to problems of
measurement validity and reliability.
The section pertaining to the development of art therapy assessments summarized the
influence of the first instruments designed by art therapists, and illuminated the significance of
formal rating systems. Literature emphasizing the importance of art therapy assessments
reflected the use of these tools in different settings to plan intervention or treatment and to
evaluate results; the derivation of meaningful information about clients; and the importance of
assessment in advancing of the field of art therapy.
Issues with art therapy assessments were illustrated with citations from the literature on
practical problems related to validity and reliability, as well as philosophical and theoretical
concerns. These pointed to questions about whether and how art therapy assessments could be
improved. There was some agreement in the literature that the quest for valid and reliable
instruments should be pursued, and several suggestions for improvement were put forth. The use
of computer technology and the application of meta-analysis techniques to examine the literature
were suggested as possible avenues.
41
The information derived from analyses in the fields of psychology and the creative arts
therapies was helpful in formulating ideas and determining appropriate methods and procedures
for A Systematic Analysis of Art Therapy Assessment and Rating Instrument Literature. The
next chapter details the methodology for the present study.
42
CHAPTER 3
METHODOLOGY
Approach for Conducting the Systematic Analysis

The purpose of this chapter is to discuss the methods that were used to address the
research questions of the present research and the issues that were encountered. The overriding
question of the present study was: what does the literature tell us about the current state of art
therapy assessments? The three sub-questions were: (1) To what extent are art therapy
assessments and rating instruments valid and reliable for use with clients?; (2) To what extent do
art therapy assessments measure the process of change that a client may experience in therapy?;
and (3) Do objective assessment methods such as standardized art therapy tools give us enough
information about clients?
The following stages were used as a guideline in the present study: problem formulation;
the literature search stage; extracting data and coding study characteristics; data evaluation; and
data analysis and interpretation (Cooper, 1998).
Problem Formulation
Problem formulation was considered before the present study was conducted. This
involved selection of a topic and later specification of inclusion criteria: The literature search
can begin with a conceptual definition and a few known operations that measure it. Then, as the
synthesist becomes more familiar with the research, the concept and associated operations can
grow more precise (Cooper, 1998, p. 14).
In order to identify the research problem, it was necessary to gather a collection of studies
at the outset (Cooper, 1998). Studies that were later found to be irrelevant were eliminated. This
involved decision- making about which criteria to include in the analysis.
Additional factors in literature selection were considered during the problem formulation
stage (Cooper, 1998). These were: conceptual relevance, study-generated evidence, synthesis-
generated evidence, and the possible existence of previous meta-analyses.
43
Because studies must be judged to be conceptua lly relevant (Cooper, 1998), this
researcher adhered to four primary factors that have been shown to influence the quality of such
judgments: the researcher remained open- minded throughout the problem formulation stage (per
Davidson, 1977); possessed expertise on the topic and consulted with other experts; based
decisions on abstracts (per Cooper & Ribble, 1989); and dedicated a considerable amount of time
(several months) to this process (per Cuadra & Katter, 1967), thereby expending every effort to
ensure quality in judgments about conceptual relevance of primary studies.
Study- generated and synthesis-generated refer to sources of evidence about
relationships within research syntheses. These were considered during problem formulation.
Study- generated evidence is present when a single study contains results that directly test the
relation being considered. Synthesis- generated evidence is present when the results of studies
using different procedures to test the same hypothesis are compared to one another (i.e., meta-
analysis) (Cooper, 1998, p. 22). When using synthesis- generated evidence to study descriptive
statistics or bivariate relationships (and the corresponding interactional hypothesis), this
researcher was alert to Coopers cautioning that social scientists often use different scales to
measure variables, because different scales make it difficult to test bivariate relationships or to
aggregate descriptive statistics.
When a subject has been widely researched, it is likely that previous meta-analyses on
that topic already exist (Cooper, 1998). Syntheses conducted in the past can provide a basis for
creating a new one. As such, this author located one previous meta-analysis on the present topic:
Hackings (1999) analysis on art-based assessment research. Hacking conducted a meta-analysis
of drawing elements (such as color, line, and space). For the present study, the data
available on individual drawing elements was too limited to warrant the application of meta-
analysis techniques. However, Hacking did not conduct a meta-analysis of concurrent validity
and inter-rater reliability statistics, items that were determined to be important in addressing this
authors research questions.
Criteria for selection of art therapy assessment studies. Between February of 2004 and
February of 2005, unpublished and published sources were located for inclusion in the present
study. The following criteria were used for the selection of primary studies:
? Papers from the field of art therapy only (i.e., no psychological art-based assessments,
such as the Draw A Person, House-Tree-Person, etc.).
44
? Assessments that involve drawing only (i.e., any tests that involve response to an
external stimulus were excluded).
? Assessments that require the subject to complete no more than three drawings (i.e.,
tools that encompass a battery of more than three drawings were excluded).
? Features of art rating scales that measure formal elements (as opposed to picture
content, thus content checklists were excluded).
? Studies written in English; and studies conducted within the last 32 years (since
1973).
These criteria were followed upon initiation of the literature search stage.
The Literature Search Stage
For this stage, every effort was made to locate and retrieve published and unpublished
studies on art therapy assessments and rating instruments, in an effort to ensure that the
cumulative outcome of the present study would reflect the results of all previous research (per
Cooper, 1998).
To protect against threats to validity in the literature review stage, Coopers (1998)
guidelines were used: (a) conduct a broad and exhaustive literature search, (b) provide a detailed
description in the research paper about how studies were gathered, (c) present indices of
potential retrieval bias (if available), such as an examination of whether any difference exists in
the results of published versus unpublished studies, and (d) summarize the sample characteristics
of individuals used in separate studies.
Locating studies. As recommended by Cooper (1998), three methods for locating primary
studies were employed for the present study: informal channels, formal methods, and secondary
channels (per Cooper, 1998). Specifically, for the present study, four principal types of informal
channels were used: (a) personal contacts, (b) solicitation letters (sent to colleagues via email),
(c) electronic invisible colleges (i.e., networks of arts therapists who share information with each
other via internet listserves), and (e) the World Wide Web.
Formal techniques for locating studies were also used: (a) professional conference paper
presentations, (b) personal journal libraries (i.e., the authors personal collection: The American
Journal of Art Therapy; Art Therapy, Journal of the American Art Therapy Association; The Arts
in Psychotherapy), and (c) research report reference lists (i.e., reviews of research reports already
acquired) (Cooper, 1998). Awareness was maintained of the potential for peer review and
45
publication biases when using personal journal libraries. Specifically, The scientific rigor of the
research is not the sole criterion for whether or not a study is published. Most notably, published
research is biased toward statistically significant findings (Cooper, 1998, p. 54). In additio n to
this prejudice against the null hypothesis, another source of bias that impacts publication was
considered: collectively confirmatory biases (Nunnally, 1960). Specifically, findings that conflict
with the prevailing beliefs of the day are less likely to be submitted for publication, and are
less likely to be selected for publication, than research which substantiates currently held beliefs.
The secondary channels for locating studies employed in the present study included
bibliographies and reference databases. The bibliographies consisted of those previously
prepared by others (such as an unpublished handout of the DDS bibliography). The reference
databases used to locate studies by this author were: Cambridge Scientific Abstracts and
PsycInfo.
Thirty-nine studies were initially located. Of these, four were excluded: Bowyer (1995)
(was unable to be mailed); Mills, Cohen and Meneses (1993a & 1993b) (is a review of previous
studies); Rankin (1994) (is a review of previous studies); and Teneycke (1998) (was unable to be
located). The remaining 35 studies were retained for inclusion in the final analysis (Appendix J).
Extracting Data and Coding Study Characteristics
Once a satisfactory number of primary studies were found, the synthesist perused each of
these. The next step was to design a coding sheet.
The research synthesis coding sheet. A coding sheet is used by the synthesist to
systematize information from the primary research reports (Cooper, 1998). Abiding by the
procedures set forth by Cooper, this researcher began constructing a coding sheet by pre-
determining some of the data that would be extracted, and then put together a draft coding sheet.
The coding sheet was further solidified after the studies were read. When constructing the coding
sheet, all potentially relevant information was retrieved from the studies.
Stocks (1994) criteria for coding sheet design were also followed. Specifically, six
general categories were incorporated: report identification, citation information, research design,
site variables, participant variables, and statistical information. A category for noting study
quality was also included, and space was made available on the sheet for descriptive notes.
Please refer to Appendix I for the coding sheet that was actually used to code the primary
studies.
46
Categorizing research methods. The synthesist must decide what methodological
characteristics of studies need to be coded (Cooper, 1998, p. 84). While the synthesist should
code all potentially relevant, objective aspects of research design (p. 88), there are threats to
validity that may not be captured by this information alone. As such, the mixed-criteria
approach, known as the optimal strategy for categorizing studies, was employed for the present
analysis. This approach is actually a combination of two a posteriori methods: (a) the threats-to-
validity approach, wherein judgments must be made about the threats to validity that exist in a
study; and (b) the methods-description approach, wherein the objective design characteristics of
a study, as described by the primary researcher, must be detailed.
As Cooper (1998) suggested, any limits on the types of individuals sampled in the
primary studies were recorded, along with information about where and when the studies were
conducted. In addition, it was noted when the dependent variable measurements were taken in
relation to measurement or manipulation of the independent variables.
In order to assess the statistical power of each study, the following items were recorded:
(a) the number of participants; (b) the number of other factors (sources of variance) extracted by
the analyses; and (c) the statistical test used (Cooper, 1998).
Testing of the coding sheet. In order to ensure reliable codings, per Coopers (1998)
recommendations, this synthesist adhered to the rules for developing a thorough and exhaustive
coding sheet, described previously in this chapter. The researcher met with the coders to discuss
the process in detail and to review practice examples. An initial draft of the coding sheet was
created by this author, and then feedback was solicited from colleagues. Following this step,
three studies were selected randomly (Couch, 1994; Kress & Mills, 1992; Wilson, 2004) in order
to pilot-test the sheet. Three people served as coders for the pilot-test. This researcher served as a
coder (Coder 1), and trained two individuals to code the three studies: Coder 2 was an art
therapist and doctoral student in the FSU Art Education department who was blind to the study,
and Coder 3 was the authors Major Professor, who was not blind to the study. The coders met
with the researcher to determine the rate of inter-coder agreement. This was calculated to be 73%
and it was determined that the pilot coding sheet needed to be revised.
In revising the pilot coding sheet, additional categories were incorporated and category
descriptors were defined with more precision, as suggested by Cooper (1998). A second and final
test of the revised coding sheet was conducted. The coders were provided with three new studies
47
that were randomly selected from the pool of 35 and were trained by the researcher in how to use
the revised coding sheet. The primary studies coded for this second test were: Batza (1995),
Cohen & Heijtmajer (1995), and McHugh (1997). Inter-coder agreement among these coding
sheets was 100%, thus the coding sheet was deemed appropriate for use in the present study.
As recommended by Cooper (1998), coder reliability was checked on randomly chosen
studies once coding was in progress. Coders 1 and 2 coded three papers: Francis, Kaiser &
Deaver (2003), Gulbro-Leavitt & Schimmel (1991), and Neale (1994). Inter-rater agreement was
calculated to be 100%.
Data Evaluation
During this stage, the author critically assessed data quality: it was determined whether
the data were contaminated by factors that were irrelevant to the central problem (Cooper, 1998).
Specifically, decisions were made about whether to keep separate or to aggregate multiple data
points (correlations or effect sizes) from the same sample, which involves independence and
nonindependence of data points. In addition, the analyst looked for errors in recording, extreme
values and other indicators that suggest unreliable measurements. The size of relationships or
treatment effects was also investigated.
Criteria for judging the adequacy of data- gathering procedures was established in order to
determine the trustworthiness of individual data points (per Cooper, 1998). Then, any primary
studies found to be invalid or irrelevant to the synthesis were either discarded (a discrete
decision), or weighted differently depending on their relative degree of trustworthiness (a
continuous decision) (p. 79).
Identifying independent comparisons. As delineated by Cooper (1998), each statistical
test was coded as a discrete event: studies included in the final analyses (described later) had two
main statistical comparisons, concurrent validity and inter-rater reliability. Thus, two separate
Microsoft Excel data sheets were made for each pertinent study. This approach, knows as the
shifting- unit of analysis, ensures that for analyses of influences on relationship strengths or
comparisons, a single study can contribute one data point to each of the categories distinguished
by the (mediating) variable (p. 100). A shifting unit of analysis is recommended as an effective
approach, because, although it can be confusing, it allows studies to retain their maximum
information value while keeping to a minimum any violation of the assumption of independence
of statistical tests (p. 100).
48
Data Analysis and Interpretation
The following data analysis steps were followed and are reported in Chapter Four: (a)
simple description of study findings (Glass et al., 1981); (b) correlating study characteristics and
findings; (c) calculating mean correlations, variability, and correcting for artifacts (Arthur et al.,
2001); (d) deciding to search for mediators (e.g., subject variables); (e) selecting and testing for
potential mediators; (f) linear analysis of variance models for estimation (Glass et al., 1981); and
(g) integrating studies that have quantitative independent variables.
Statistical procedures were used to interpret the data, so that systematic patterns could be
distinguished from chance fluctuations (Cooper, 1998). Studies were measured by means of a
correlation coefficient and an effect size (per Glass et al., 1981). Then, methods of tabulating and
describing statistics were applied: averages, frequency distributions, measures of variability, and
so on. These data are reported in Chapter Four of the present study.
The test-criterion relationship was expressed as an effect size for each study included in
the analysis (AERA, 1999). Since the strength of this relationship was found to vary according to
mediator variables (such as the year in which data were collected, whether studies were
published or not, etc.), separate estimated effect size distributions for subsets of studies were
computed, and magnitudes of the influences of situational features on effect sizes were
estimated.
Coopers (1998) stages served as a model for determining the methods and procedures
that were necessary to conduct the systematic analysis of art therapy assessment research and
manage the issues that were encountered. The next chapter reports the present studys results.
49
CHAPTER 4
RESULTS
Of primary interest in the present study is to uncover information about the current state
of art therapy assessments and rating instruments pertaining especially to validity and reliability.
Particular methods were applied in order to address this studys research questions, as was
outlined in Chapter Three. Specifically, the researcher gathered both descriptive data and
computed synthesis outcomes in an effort to reveal (a) to what extent art therapy assessments and
rating instruments are valid and reliable for use with clients; (b) to what extent therapy
assessments measure the process of change that a client may experience in therapy; and (c)
whether objective assessment methods such as standardized art therapy tools give us enough
information about clients.
Thus, in this chapter, descriptive results on the 35 primary research papers selected for
the present study are described (please refer to Appendix J for a bibliography of the 35 papers
and the method of study location). Detailed information about the inter-rater reliability of the
primary studies is provided, including meta-analysis results and an examination of potential
mediating variables. Similarly, concurrent validity data derived from some of the primary studies
are examined, consisting of meta-analysis results and an examination of potential mediators.
Descriptive Results
In this section, the following results are described: citation dates; citation types; art
therapy assessment types; patient group categories; and rater tallies.
Citation dates. Out of 35 total studies used in the analysis, 16 (45.71%) were classified as
published and 19 (54.28%) were unpublished. Figure 1 illustrates frequency data on the citatio n
dates for each of the 35 studies.
50
Citation Date Frequency Total
1973 1
1974
1975
1976
1977
1978
1979
1980
1981 1
1982
1983
1984
1985
1986
1987 1
1988 2
1989 2
1990 1
1991 1
1992 2
1993 3
1994 3
1995 2
1996 2
1997 2
1998 1
1999
2000 3
2001
2002 3
2003 1
2004 4
Figure 1. Citation Date Histogram
Citation types. Six types of citations are used in this study. The frequencies are shown in
Figure 2, and are tallied as follows: 15 journal articles (42.86% of total studies); nine masters
theses (25.71%); five dissertations (14.28%); four unpublished papers (11.43%); one bachelors
thesis (2.86%), and one book chapter (2.86%).
51
Citation Type Frequency Total
Journal article 15
Master's thesis 9
Dissertation 5
Unpublished paper 4
Bachelor's thesis 1
Book chapter 1
Figure 2. Citation Type
Art Therapy Assessment Types. Eight categories of Art Therapy Assessment Type were
tabulated in this study, and frequencies are: 20 DDS studies (57.14%); four Birds Nest Drawing
(BND) studies (11.43%); four PPAT studies (11.43%); two studies were not applicable for this
category as they employed rating scales only (5.71%); two studies analyzed spontaneous art
(5.71%); one used the A Favorite Kind of Day (AFKOD) tool (2.86%); one employed the Bridge
Drawing (2.86%), and the DAPA was used in one article (2.86%). Figure 3 displays these
results.
Art Therapy Assessment Type Frequency Total
DDS/CDDS 20
BND 4
PPAT 4
N/A 2
Spontaneous art 2
A Favorite KOD 1
Bridge Drawing 1
DAPA 1
Figure 3. Art Therapy Assessment/Rating Instrument Type
52
The majority of papers in the present study examined the DDS (57.14%), thus a
substantial amount of information about this tool was gathered. For example, seventeen studies
provided numerical data on specific DDS Drawing Analysis Form (DAF) variables, which may
be of interest to the reader (please see Appendix K).
Patient group categories. Fifty-eight total patient groups were identified in the 35 studies
analyzed by this author. They are classified as follows: Major Mental Illnesses (Bipolar
Disorders, Depressive Disorders, Dual/Multiple Diagnoses, Dissociative Disorders,
Schizophrenia and Other Psychotic Disorders) (27 studies, 77.14% of total); Mental Retardation
(two, or 5.71%); Disorders of Childhood (Adjustment Disorders, Attachment, Communication
Disorders, Conduct Disorders, Problems Related to Abuse or Neglect, SED) (12, or 34.28%);
Eating Disorders, Personality Disorders and Substance-Related Disorders (nine, or 25.71%);
Brain Injury and Organic Mental Syndromes and Disorders (three, or 8.57%), and Unspecified
Diagnosis/Miscellaneous and Normal (five, or 14.28% out of 35 total studies). Table 2
illustrates the tally of these groups.
Table 2. Patient Group Categories

Patient Group Category Citations Frequency Total
Major Mental Illnesses Batza 1995, Brudenell 1989, Coffey 1997, 27

Cohen 1988, Cohen95, Easterling00,
Fowler 2002, Gantt 1990, Gulbro 1988,
Gulbro 1991, Hacking 1996, Hacking 2000,
Johnson 2004, Kress 1992, Mchugh 1997,
Mills 1993, Ricca 1992, Shlagman 1996,
Wadlington 1973
Mental Retardation Batza 1995, Gantt 1990 2
Disorders of Childhood Coffey 1997, Francis 2003, Hyler 2002, 12
Kaiser 1993, Manning 1987, Neale 1994,
Overbeck 2002, Shlagman 1996,
Wadlington 1973, Wilson 2004, Yahnke
2000
Eating Disorders; Personality Bergland 1993, Billingsley 1998, Eitel 9
Disorders; Substance- 2004, Hacking 1996, Hacking 2000,
Related Disorders Kessler 1994, Mills 1989
Brain Injury; Organic Mental Couch 1994, Gantt 1990, Hacking 1996 3
Syndromes and Disorders
Unspecified Eitel 2004, Gussak 2004, Hays 1981, 5
Diagnosis/Miscellaneous, Neale 1994, Wadlington 1973
Normal
53
Nine studies had two or more patient group types: Batza 1995 (two groups), Coffey 1997
(two groups), Eitel 2004 (two groups), Gantt 1990 (three groups), Hacking 1996 (three groups),
Hacking 2000 (two groups), Neale 1994 (two groups), Shlagman 1996 (two groups) and
Wadlington 1973 (three groups).
Rater tallies. As is shown in Table 3, most studies used three people to rate artwork (12,
or 34.28% out of 35 total studies); six (or 17.14%) studies used one rater; five studies used two
raters (or 14.28%); two (or 5.71%) studies used four raters; another two (5.71%) used seven
raters; one (2.86%) used six raters; another study (or 2.86%) used five raters; one (or 2.86%)
used 86 raters, and for five (14.28%) studies the number of raters could not be determined.
Table 3. Number of Raters
Number of Raters Citations Frequency Total
1 Coffey 1997, Cohen 1988, Gussak 2004, 6

Kessler 1994, Kress 1992, Mills 1989
2 Couch 1994, Mchugh 1997, Shlagman 5
1996, Wilson 2004, Yahnke 2000
3 Billingsley 1998, Brudenell 1989, Fowler 12
2002, Francis 2003, Gantt 1990, Hyler
2002, Johnson 2004, Kaiser 1993,
Manning 1987, Neale 1994, Overbeck
2002, Ricca 1992
4 Gulbro 1988, Wadlington 1973 2
5 Cohen 1995 1
6 Bergland 1993 1
7 Hacking 1996, Hacking 2000 2

86 Eitel 2004 1
Cant Determine Batza 1995, Easterling 2000, Gulbro 1991, 5
Hays 1981, Mills 1989
Inter-Rater Reliability
Nineteen of the 35 studies computed inter-rater reliability to determine the proportion of
agreement among the individuals who rated assessment drawings (Table 4). Eitel (2004) was
excluded from the inter-rater reliability analysis because the procedures used in that study were
unclear. A variety of statistical measures for inter-rater reliability were reported in the remaining
54
18 studies. These included: kappa, percentage, and correlation (intra-class correlation, r, rho, or
alpha). The type of statistic and numerical result for each study are provided. The kappa statistic
was used in seven studies, five studies reported percentages for inter-rater reliability, and
correlation was used on nine occasio ns. Three studies that employed more than one method of
calculating inter-rater reliability (Fowler 2002, Johnson 2004, and Mchugh 1997) are shown
twice in the table.
Table 4. Inter-Rater Reliability
Citation Kappa Percentage Correlation*
Bergland 1993 0.91

Billingsley 1998 0.9
Fowler 2002 0.567
Fowler 2002 0.842
Francis 2003 0.66
Gantt 1990 0.79
Gulbro 1988 0.95
Hacking 1996 0.97
Hacking 2000 0.992
Hyler 2002 0.805
Johnson 2004 0.944
Johnson 2004 0.91
Kaiser 1993 0.975
Manning 1987 0.74
Mchugh 1997 0.34
Mchugh 1997 0.9
Neale 1994 0.75
Overbeck 2002 0.75
Shlagman 1996 0.82
Wilson 2004 0.853
Yahnke 2000 0.607
Global Effect 0.625945 0.89999324 0.90491
Size
* Intra-class correlation, Pearsons r, Rho, or Alpha
55
Meta-analysis of inter-rater reliability. A global effect size was computed for each of the
three inter-rater reliability categories (kappa, percentage and correlation). In meta-analysis,
global effect sizes for kappas and percentages of agreement are computed via the weighted
average method. For correlations (r), Fishers Z is used to transform r so that a global effect size
can be obtained.
For the studies that used kappa, a weighted average was calculated by using the sample
size for each study. Specifically, the sum was calculated (each sample size multiplied by each
kappa) then divided by the sum of each sample size, resulting in a global effect size of 0.63
(Table 5). The non-weighed average of all kappas was 0.64.
For the studies that reported percentage agreement, the average of percentages of all
studies was 0.89 (Table 6). The weighed average percentage of all studies was computed as
follows: sum (percentage of each study multiplied by corresponding sample size) divided by sum
(sample size of each study), resulting in a global effect size of 0.9.
Whereas weighted averages were computed to obtain global effect sizes for the kappa
and percentage agreement statistics, Fishers Z was used to compute global effect size for inter-
rater reliability correlations. Nine studies qualified for the meta-analysis of inter-rater reliability
results (Bergland 1993, Gantt 1990, Hacking 1996, Hacking 2000, Johnson 2004, Kaiser 1993,
Manning 1987, Mchugh 1997, Wilson 2004). The global effect size (Fishers Z) for these studies
was 0.91 (Appendix L). A sensitivity analysis was conducted in which two small studies were
excluded (Hacking 1996, N=8; Hacking 2000, N=8). Meta-analysis results are attainable even
with an N as small as seven (Q. Wang, persona l communication, March 18, 2005). Thus, the
global effect size of the remaining seven studies was calculated, and 0.9 was the resulting figure
(Appendix M). This may indicate that inter-rater reliability in art therapy assessment research is
higher when correlations are used.
The global effect size was calculated in order to determine the degree to which the null
hypothesis was false. The number was computed to be very high, 0.91, revealing that there is a
great deal of variability among primary study results. To determine the cause of this variability
(heterogeneity), potential mediator variables were examined: rater trained vs. not trained;
primary author served as rater vs. did not serve as a rater; coder found study supported test vs.
coder neutral on whether study supported test; study published vs. study not published. The
56
original nine studies were retained for the examination of potential mediators (i.e., sources of
variability).
Table 5. Kappa Effect Size
Citation Kappa N (Study Kappa*N

Sample Size)
Fowler 2002 0.567 48 27.216
Francis 2003 0.66 70 46.2
Hyler 2002 0.805 49 39.445
Mchugh 1997 0.34 80 27.2
Neale 1994 0.75 90 67.5
Overbeck 2002 0.75 32 24
Yahnke 2000 0.607 31 18.817
TOTALS 400 250.378
GLOBAL EFFECT SIZE 0.63985714 0.625945

(non-weighted) (weighted)
Table 6. Percentage Effect Size
Citation Percentage N (Study Percentage*N

Sample Size)
Billingsley 1998 0.9 27 24.3
Fowler 2002 0.842 48 40.416
Gulbro 1988 0.95 83 78.85
Johnson 2004 0.944 78 73.632
Shlagman 1996 0.82 60 49.2
TOTALS 296 266.398
GLOBAL EFFECT SIZE 0.8912 0.89999324
(non-weighted) (weighted)
Examination of potential mediators. Among the five studies for which all raters were not
trained, the global effect size (correlation) was 0.89 (Appendix N). In the four studies for which
raters were trained, the global effect size (correlation) was higher: 0.96, which may indicate that
training raters results in higher reliability. Variation among reliabilities of the nine studies was
57
42.73, and the Q-between (used to test whether the average effects from the groupings are
homogeneous [Cooper, 1998]) of 12.94 is fairly low which indicates that this variable does not
really help to explain the heterogeneity among the nine studies. Furthermore, the Q-within (used
to compare groups of R indexes) indicates that there is 29.8 variability that cannot be explained
by whether raters were trained or not.
In six papers, the primary studys author did not serve as a rater. Among these, the global
effect size (correlation) was 0.9 (Appendix O). Authors did serve as raters in the other three
studies, and the global effect size (correlation) for these was higher, at 0.93, which may suggest
that using authors as raters results in higher inter-rater reliability. Variation among the
reliabilities of the nine studies was 42.74, and the low Q-between of 2.00 suggests that this
variable does not help to explain the heterogeneity among the nine studies. Furthermore, the Q-
within indicates that there is 40.74 variability that cannot be exp lained by whether primary
authors served as raters or not.
Among the four papers for which the coder (this author) found that the study supported
the art-based test, the global effect size (correlation) was high at 0.94 (Appendix P). In the five
papers for which the coder was neutral on whether a given study supported the art-based test, the
global effect size (correlation) was lower, at 0.86. These findings may indicate that the coders
opinion about whether a studys results supported a test was slightly more reliable than whether
the coder believed that a studys findings were neutral. Variation among reliabilities of the nine
studies was 42.74, and the Q-between of 14.31 is fairly low which indicates that this variable
does not help to explain the heterogeneity among the nine studies. Furthermore, the Q-within
indicates that there is 28.43 variability that cannot be explained by the Coder favor variable.
Five of the nine studies were not published. Among these, the global effect size
(correlation) was 0.91 (Appendix Q). The remaining four studies were classified as published,
and the global effect size (correlation) for these was lower, at 0.9. This may suggest that whether
a study was published or not, it showed an average reliability of 0.9. Variatio n among the nine
studies reliabilities was 42.74, and the low Q-between of 0.19 suggests that this variable does
not help to explain the heterogeneity among the nine studies. Furthermore, the Q-within indicates
that there is 42.54 variability that cannot be explained by whether a study was published or not.
In summary, of the four potential mediating variables that were examined (rater trained
vs. not trained; primary author served as rater vs. did not serve as a rater; coder found study
58
supported test vs. coder neutral on whether study supported test; study published vs. study not
published), none were found to be helpful in explaining the heterogeneity among the nine
studies, especially primary author rater vs. not and study published vs. not. The variables
rater trained vs. not trained and coder favor vs. neutral were only moderately useful in
explaining the heterogeneity, as is illustrated in Table 7.
Table 7. Potential Mediating Variables (Inter-Rater Reliability)
Q-Between Probability Q-Within Compared to Probability

Q-Between Variation among reliabilities Q-Within
of the nine studies
Rater trained vs. 12.9402 0.000321600* 29.7968 / 42.7370 0.000103462*
not trained
Primary author 1.99688 0.15762 40.7402 / 42.7370 0.000000908*
rater vs. not
Coder favor vs. 14.3060 0.000155369* 28.4310 / 42.7370 0.000183652*
neutral
Study published 0.19383 0.65975 42.5432 / 42.7370 0.000000409*
vs. not
* Significant at the <.001 level
Concurrent Validity
Concurrent validity was attempted in 15 (42.86%) studies, was not attempted in 18
studies (51.43%), and two (5.71%) studies were coded as N/A in this category (Table 8).
Meta-analysis of concurrent validity. Of the 15 studies that attempted concurrent validity
only seven (Brudenell 1989, Gulbro 1988, Gulbro 1991, Johnson 2004, Kaiser 1993, Overbeck
2002, Wilson 2004) used correlations. Two of these studies (Gulbro 1988, Kaiser 1993) used
more than one test to compare with the art-based tool, which resulted in a total of 11 studies for
the final meta-analysis. The correlations were transformed into Fishers Zs. Table 9 displays the
relevant data for the meta-analysis of concurrent validity. Additional data is available in
Appendices R-II. Confidence intervals are included therein, as these provide a method of
visually detecting heterogeneity.
59
Table 8. Concurrent Validity Frequencies
Citations Frequency Total

Concurrent Brudenell 1989, Cohen 1988, Easterling 2000, 15
Validity Attempted Fowler 2002, Francis 2003, Gulbro 1988,
Gulbro 1991, Gussak 2004, Hyler 2002,
Johnson 2004, Kaiser 1993, Mills 1993,
Overbeck 2002, Wilson 2004, Yahnke 2000
Concurrent Batza 1995, Billingsley 1998, Coffey 1997, 18
Validity Not Cohen 1995, Couch 1994, Eitel 2004, Gantt
Attempted 1990, Hacking 1996, Hacking 2000, Hays
1981, Kessler 1994, Kress 1992, Manning
1987, Mchugh 1997, Mills 1989, Neale 1994,
Ricca 1992, Shlagman 1996
N/A Bergland 1993, Wadlington 1973 2
Table 9. Concurrent Validity Effect Sizes

Obs ID Year Test r Manuscript Patient Author Coder Sample Fishers Z
Type Group Favor Favor Size
1 Brudenell 1989 1 0.751 2 1 0 0 7 0.97524
2 Gulbro 1988 1 0 1 1 2 2 83 0.00000
3 Gulbro 1988 1 0.01 1 1 2 2 83 0.01000
4 Gulbro 1988 1 0.12 1 1 2 2 83 0.12058
5 Gulbro 1988 3 -0.16 1 1 2 2 83 -0.16139
6 Gulbro 1991 1 0.3 1 1 2 2 83 0.30952
7 Johnson 2004 3 -0.095485 1 1 1 1 60 -0.09578
8 Kaiser 1993 2 0.6 2 3 1 1 41 0.69315
9 Kaiser 1993 2 0.11 2 3 1 1 41 0.11045
10 Overbeck 2002 2 0.025 2 3 2 2 32 0.02501
11 Wilson 2004 3 -0.116 2 3 0 2 8 -0.11652
Table 8 Key
Category Definition
Test The type of test used for concurrent validity (1=depression measures; 2=attachment
measures; 3=other drawing tests, CAT-R, or MCMI-II).
r The correlations between the art therapy assessment tool and the other test(s) used in the
given study.
Manuscript Type The type of manuscript (1=published or dissertation; 2=unpublished).
Patient Group The patient group catgegory (1=major mental illnesses; 3=disorders of childhood).
Author Favor 0=authors conclusion that findings did not support use of art-based test; 1=authors
conclusion that findings did support use of test; and 2=authors conclusion that findings
neither support nor oppose use of test.
Coder Favor 0=coders conclusion that findings did not support use of art-based test; 1=coders
conclusion that findings did support use of test; and 2=coders conclusion that findings
neither support nor oppose use of test.
Sample Size The treatment or patient group N for each study.
60
The individual effect sizes were synthesized to obtain a global effect size of 0.09.
Heterogeneity (variability) was found among the effect sizes, evidenced by the wide variation in
correlations (rs ranging from 0.16 to 0.751). These correlations are all low and some of them
are even less than 0, which may indicate that these studies have low validity (with the exception
of Brudenell 1989, however this study had a very small sample size). To determine the origin of
the heterogeneity, the aforementioned six categories were treated as potential mediating variables
and were statistically analyzed.
Examination of potential mediators. Table 10 provides a summary of the variables that
were examined as potential sources of variance. The details of the six analyses are available in
Appendices T (p. 101, Author Favor), W (p. 104, Coder Favor), Z (p. 107, Manuscript Type),
CC (p. 110, Patient Group), FF (p. 113, Test), and II (p. 116, Year).
Table 10. Potential Mediating Variables (Concurrent Validity)
Q-Between Probability Q-Within Compared to Probability

Q-Between Variation among Q-Within
reliabilities of the 11
studies
Author Favor 2.55721 0.27843 26.8657 / 29.4230 0.000745726*
Coder Favor 5.06286 0.079545 24.3601 / 29.4230 0.001993794
Manuscript Type 6.47957 0.010912 22.9434 / 29.4230 0.000745726*
Patient Group 4.85927 0.027498 24.5637 / 29.4230 0.003493698
Test 11.9278 0.002569834 17.4951 / 29.4230 0.025347
Year 15.2887 0.000478742* 14.1343 / 29.4230 0.078332
* Significant at the <.001 level
These data reveal that the category Year (year of study completion or publication) is
unique in that, although there is 14.1343 variability that cannot be explained by Year, the
individual groupings may be able to explain the variability. In group 1, studies conducted prior to
1990, validity is low at 0.00443. The group 2 studies, those conducted between 1990-2000, are
most valid at 0.34034. The studies conducted since 2000, group 3, are the least valid at -0.05836.
61
Conclusion
In this chapter, descriptive results were provided. This included data for the following
categories: citation dates; citation types; art therapy assessment types; patient group categories;
and rater tallies. Detailed information about inter-rater reliability was described, including meta-
analysis results and an examination of potential mediating variables. Concurrent validity data
were also examined, consisting of meta-analysis results and an examination of potential
mediators. Conclusions based upon these results are described in the subsequent chapter.
62
CHAPTER 5
CONCLUSIONS
It is the consensus of most mental health professionals, agency administrators, and

insurance companies that regardless of the formality or structure, assessmentand
reassessment at appropriate timesconstitutes the core of good practice (Gantt, 2004, p. 18).
It has been demonstrated in the previous chapters that art therapists use and develop
assessments and rating instruments, and that there is a need to improve the validity and reliability
of these tools. In order to address this need, a systematic analysis of 35 pertinent studies was
conducted. A hypothesis and three research questions were formulated to specifically deal with
the issues at the root of the problem. These are discussed presently, in relation the studys results.
To address research question one, methodological problems identified in the primary
studies are described. Problems with validity and reliability of the primary articles are delineated,
followed by a discussion of methods for the improvement of assessment tools. Subsequently,
questions two and three are addressed: to what extent do art therapy assessments measure the
process of change that a client may experience in therapy?; and, do objective assessment
methods such as standardized art therapy tools give us enough information about clients?
Limitations of the study are presented in the following sections: validity issues in study retrieval;
issues in problem formulation; judging research quality; issues in coding; sources of unreliability
in coding; validity issues in data analysis; and issues in study results. Finally, recommendations
for further study are described.
Research Questions
Question One
To what extent are art therapy assessments and rating instruments valid and reliable for
use with clients? It was assumed that methodological difficulties on previous art therapy
63
assessment research exist, and that this would impact the validity and reliability of assessment
tools.
Methodological problems identified in the primary studies. Twenty-eight of the 35
studies reported methodological weaknesses (Appendix JJ). However, seven studies did not
mention weaknesses, and some failed to include an exhaustive list of problems. The weaknesses
that were reported are tallied in Figure 4. Twenty- four studies acknowledged a total of 39
subject-related flaws. Data collection problems were also frequent, with 16 studies reporting a
total of 18 occasions. Other reported flaws are: (1) rating instrument-related (11 studies); (2)
concerned with inter-rater reliability (eight studies); (3) categorized as other (five studies); and
(4) assessment-related (two studies). Three papers that failed to report study procedures and/or
findings (Cohen, Hammer & Singer, 1988; Hays, 1981; McHugh, 1997) were identified. A more
specific itemization of author- identified study flaws is included in Appendix KK. Coder-
identified weaknesses are included in Appendix LL.
Problems with validity and reliability of the articles researched. As McNiff (1998)
concluded, many tools are generally deficient of data supporting their validity and reliability, and
are not supported by credible psychological theory. Specifically, McNiff found that those who
choose to assess clients through art have neglected to convincingly address the essence of
empirical scientific inquiry: (1) findings that link character traits with artistic expressions; (2)
replicable results based upon copious and random data; and (3) uniform outcome measures
which justify diagnosis of a client via his or her artwork. Furthermore, Gantt and Tabone (1998)
reported that the previous research on projective drawings and assessments has yielded mixed
results. These mixed results are reflected in a summary of the meta-analyses identified in
Chapter Two of the present study.
Garb (2000) found that projective techniques should not be used to detect child sexual
abuse, whereas West (1998) concluded that projective instruments could effectively discriminate
distressed children from those who were non-distressed. Hunsley and Bailey (1999) found that
there is no scientific basis for using the Rorschach, and provided ample support for this
conclusion. Similarly, Meyer and Archer (2001) demonstrated that the Rorschach had greater
validity for some purposes than for others, and identified eleven salient empirical and theoretical
gaps in the Rorschach knowledge base. Conversely, Parker, Hanson, and Hunsley (1988)
64
determined that the Rorschach could be considered to have sufficient psychometric properties if
used according to the purpose for which it was designed and validated.
Wadlington 1973
Weakness
Shlagman 1996
Billingsley 1998
Easterling 2000
Overbeck 2002
Brudenell 1989
Bergland 1993
Johnson 2004
Manning 1987
Hacking 1996
Hacking 2000
Category and Total
Mchugh 1997
Gussak 2004
Yahnke 2000
Francis 2003
Kessler 1994
Gulbro 1988
Gulbro 1991
Wilson 2004
Cohen 1988
Cohen 1995
Fowler 2002
Coffey 1997
Couch 1994
Kaiser 1993
Neale 1994
Batza 1995
Kress 1992
Ricca 1992
Gantt 1990
Hyler 2002
Hays 1981
Number of
Eitel 2004
Mills 1989
Mills 1993
Weaknesses
No weaknesses

identified
(7)
Procedures/findings
not reported (0)
Subject-related

weaknesses

(39)
Data collection

weaknesses (18)
Assessment
-related

weaknesses (2)
Rating instrument

weaknesses (11)
Inter-rater reliability

weaknesses (9)
Other (5)

Figure 4. Author-Identified Study Weaknesses
McNiffs (1998) findings are supported to some extent by the descriptive results of the
present study. That 18 out of 35 papers included in this study did not attempt concurrent validity,
and that 16 out of 35 did not attempt inter-rater reliability, reflects McNiffs conclusion that
many tools are generally deficient of data supporting their validity and reliability.
Concurrent validity tells us about the degree to which the scores on an instrument are
related to the scores on another instrument administered at the same time, or to some other
criterion ava ilable at the same time (Fraenkel & Wallen, 2003). In the current research, of the 15
(42.86%) primary studies that did not attempt concurrent validity, 11 used correlations and were
therefore eligible for the application of meta-analysis techniques. The category Year (year of
65
study completion or publication) was found to be unique in that, although 14.13 variability could
not be explained by Year, the individual groupings might have been able to explain the
variability. In group 1, studies conducted prior to 1990, there was zero validity. The group 2
studies, those conducted between 1990-2000, were most valid at 0.34. The studies conducted
since 2000, group 3 were the least valid at -0.06. However, these results should be interpreted
with caution. Overall, each of the variables that were examined as potential sources of variance
was not found to contribute significantly to the variance. Thus, results are inconclusive.
The previous research on psychological tools has yielded mixed results (Gantt and
Tabone, 1998). For the 35 studies included in the present analysis, results also appear to be
mixed, although the majority (about 94 percent) of studies were neutral or positive. As is
illustrated in Tables 11 and 12, in 19 of 35 studies, the authors concluded that the studys
findings did support the use of the art therapy assessment, whereas in only two of 35 studies did
the authors indicate that their research did not support the use of an assessment. In 14 studies, the
authors concluding remarks revealed that their studies neither supported nor opposed the use of
an assessment (i.e., results were neutral). Coder 1 found that seven studies supported the use of
the art therapy assessment; that 26 studies neither supported nor opposed; and that two did not
support use of the assessment. Thus, for both author and coder judgments, the results were
somewhat mixed, the majority of results found to be either neutral or in favor of the use of the
given assessment.
Six (or 17.14%) studies used one rater, thus inter-rater reliability was not applicable. For
the seven studies that qualified for inclusion in the meta-analysis of inter-rater reliability, the
global effect size was high, at 0.9. However, of the four potential mediating variables that were
examined (rater trained vs. not trained; primary author served as rater vs. did not serve as a rater;
coder found study supported test vs. coder neutral on whether study supported test; study
published vs. study not published), none of these explained the heterogeneity among the nine
studies, especially primary author rater vs. not and study published vs. not. The variables
rater trained vs. not trained and coder favor vs. neutral were only moderately useful in
explaining the heterogeneity. Therefore, it is not appropriate to draw conclusions about inter-
rater reliability based on the current studys findings.
66
Table 11. Author/Coder Favor Tally
Citation ID Author Coder
Favor Favor
Batza 1995 2 2
Bergland 1993 1 1
Billingsley 1998 1 2
Brudenell 1989 0 0
Coffey 1997 1 2
Cohen 1988 2 2
Cohen 1995 1 2
Couch 1994 1 2
Easterling 2000 1 2
Eitel 2004 2 0
Fowler 2002 2 2
Francis 2003 1 1
Gantt 1990 1 2
Gulbro 1988 2 2
Gulbro 1991 2 2
Gussak 2004 1 1
Hacking 1996 1 2
Hacking 2000 1 1
Hays 1981 1 2
Hyler 2002 2 2
Johnson 2004 1 1
Kaiser 1993 1 1
Kessler 1994 1 1
Kress 1992 2 2
Manning 1987 1 2
Mchugh 1997 1 2
Mills 1989 2 2
Mills 1993 1 2
Neale 1994 2 2
Overbeck 2002 2 2
Ricca 1992 2 2
Shlagman 1996 2 2
Wadlington 1973 1 2
Wilson 2004 0 2
Yahnke 2000 2 2
0= conclusion that the studys findings did not

support use of the art -based test
1= conclusion that the studys findings did
support use of the test
2= conclusion that the studys findings neither
support nor oppose use of the test
67
Table 12. Author/Coder Favor Frequencies
Author Favor Frequency Total Coder Favor Frequency Total
Author seemed to conclude 19 Coder concluded that 7

that the studys findings DID the studys findings
support use of the test DID support use of
the test

that the studys findings the studys findings
neither support nor oppose neither support nor
use of the test oppose use of the test

that the studys findings DID the studys findings
NOT support use of the art- DID NOT support use
based test of the art-based test
In summary, question one, To what extent are art therapy assessments and rating
instruments valid and reliable for use with clients? cannot be addressed by the present results.
Therefore, although methodological difficulties on previous art therapy assessment research
appear to exist, details about how these difficulties actually impact the validity and reliability of
assessment tools remain unknown.
Methods for improvement of tools. This studys findings have implications for the
improvement of existing tools as well as those that have yet to be developed. In Chapter Two,
avenues for the improvement of art therapy assessments and rating instruments were discussed.
Findings of the present study support the recommendations for enhancement of existing tools as
well as future research in this realm put forth by previous authors, and also reveal additional
areas in which researchers could make improvements.
As is listed in Table 13, the results of the present study support the recommendations for
improvements identified by previous authors: (1) data should be collected from a large number
of participants (Hagood, 2002; Neale & Rosal, 1993); (2) subjects should be matched (Hagood,
2002); (3) researchers should ensure that they have consulted previous assessment literature prior
to developing their own tools and/or furthering the existing research (Hacking, 1999); (4)
researchers should publicly admit to flaws in their work and make trainees aware of these flaws
while striving for improvement of the assessment tool and rating system; (5) rating systems
68
should incorporate objective criteria on which to score variables (Neale & Rosal, 1993); (6)
interrater reliability should be established; and (7) data collection and appropriate analysis
procedures should be duplicated to establish reliability and effectiveness of previously studied
art-based tests.
The present study was limited by the sampling methods used in the primary research
papers: The sampling methods of 22 (62.86%) studies were flawed, thereby preventing
generalizability of results.
Table 13. Sampling Flaws

Sampling Flaws Citations Frequency Total
Small N (less than Batza 1995, Billingsley 1998, Brudenell 1989, 13
30) Cohen 1995, Couch 1994, Easterling 2000,
Fowler 2002, Gantt 1990, Hacking 1996,
Manning 1987, Mills 1993, Ricca 1992, Wilson
2004.
Subjects not Batza 1995, Brudenell 1989, Coffey 1997, 10
matched Cohen 1995, Couch 1994, Hacking 1996,
Kress 1992, Mchugh 1997, Mills 1989, Neale
1994
Subjects not Billingsley 1998, Francis 2003, Gussak 2004, 6
randomly selected Ove rbeck 2002, Ricca 1992, Shlagman 1996
Sampling methods Batza 1995, Bergland 1993, Brudenell 1989, 24
unable to be Coffey 1997, Cohen 1988, Cohen 1995,
determined Couch 1994, Eitel 2004, Fowler 2002, Gulbro
1988, Gulbro 1991, Hacking 2000, Hays 1981,
Hyler 2002, Johnson 2004, Kaiser 1993,
Kessler 1994, Kress 1992, Mchugh 1997,
Mills 1989, Mills 1993, Wadlington 1973,
Wilson 2004, Yahnke 2000
A breakdown of author- identified study weaknesses in relation to rating systems

illustrates the ways in which existing systems could be improved (Table 14), and thereby holds
implications for systems that have yet to be developed.
Because the majority of primary studies were about the DDS, a great deal of information
about the DDS rating system is available. Table 14 provides a list of suggestions for revision of
the rating guide and scales of the DDS Drawing Analysis Form (DAF). Specifically, Mills
(1989) found that the rating guide shows a lack of consistent theoretical outlook and suggested
69
that it be improved by rewriting for clarification of terms and rating procedures, and for
inclusion of illustrations to augment verbal definitions (p. 133). Johnson (2004) suggested the
addition of two items to DDS checklist: multidirectional movement and unrelated multiple
images. Problems with the dichotomous and categorical rating scales in the DDS Rating Guide
and Drawing Analysis Form were cited (Billingsley 1998, Gulbro 1991, Gulbro 1991, Neale
1994), and some authors reported weaknesses with the DDS Content Checklist and Tree Scale
(Couch 1994, Kress 1992). Ricca (1992) found the DAF to be subjective and recommended that
more objective criteria be incorporated. Finally, Fowler (2002) criticized Mills and Cohens
original inter-rater reliability results.
In 1990, Gantt stated that two FEATS scales (perseveration and rotation) needed
additional refinement in order to be useful, and that the FEATS scales used imprecise
measurements. Francis (2003) found that wording of Birds Nest Drawing directives and graphic
indicators on the checklist could be worded more precisely.
If researchers were to consider the suggestions put forth in Table 14, improvements to
rating systems could be implemented. As discussed previously, researchers should also bear in
mind that interval/ratio scales can be compared in terms of direction or magnitude, but that the
scores will be more variable. Conversely, nominal and ordinal measures should not be compared
in terms of direction or magnitude, but they are more likely to produce consistent responses.
Because there is no conclusive evidence for using Likert-type (interval) versus binary-choice
(nominal) items in rating instruments, the choice should be specific to the instruments purpose
(B. Biskin, personal communication, February 22, 2005). The format that best represents the
underlying construct to be measured should guide the selection of format. Both methods have
value as long as their limitations are realized. Well-constructed, standardized scales for rating
artwork are vital in order to validate assessment findings and to determine the reliability of
subjects scores.
In order to address question one, methodological problems identified in the primary
studies, and problems with their validity and reliability were discussed. Although question one
could not be addressed by the present results, it was determined that this studys findings have
implications for the improvement of existing tools as well as those that have yet to be developed.
Information about flaws in the FEATS, BND and in particular the DDS, was provided. It was
suggested that if weaknesses are better understood by researchers, then this information should
70
help to improve the existing rating systems, and ensure the enhancement of newly developed
systems.
Table 14. Rating Systems Needing Revision
CITATION ID AUTHORS COMMENT PAGE

Studies that
identified the DDS
Rating System as
needing revision:
Billingsley 1998 Stated that the DDS Rating Guide may need revision as categorical scales were found to be p. 124
problematic.
Couch 1994 Reported weakness of DDS Content Checklist and Tree Scale. Need for standardized rating p. 113
criteria on DAF.
Fowler 2002 Criticism of Mills and Cohens inter-rater reliability. p. 224
Gulbro 1988 Weaknesses of the DDS itself (lack of ordinal scales on DAF) p. 123
Gulbro 1991 Dichotomous and categoric scoring variables seriously limit the sensitivity of the instrument. p. 355
Johnson 2004 Author suggested adding two items to DDS checklist: multidirectional movement and
unrelated multiple images.
Kress 1992 Content checklist and Creekmore Tree Scale have not been tested for reliability and validity. p. 23
Mills 1989 The rating guide, created by a number of art therapists from diverse backgrounds, shows a p. 133
lack of consistent theoretical outlook, a weakness attributable to the eclecticism of its creators.
It could be improved by rewriting for clarification of terms and rating procedures, and for
inclusion of illustrations to augment verbal definitions. However, ratings resulting from the
improved system would then differ from the ratings resulting from using the 1986-B format, and
results would not be comparable.
Neale 1994 The Drawing Analysis Form and the Rating Guide may need revision. Because of the p. 126
categorical nature of the rating format, statistical analysis of the data is extremely complicated.
For a less complicated analysis, the rating format should be comprised of variables rated on a
continuous data scale. By using a continuous data scale, the distribution of errors would be
normal and use of a statistical analysis with more power would be possible.
Ricca 1992 Statistical analysis limited to 14 of 23 DAF categories; p. 126
The DAF seemed subjective, therefore more objective criteria is highly suggested. p. 128
Studies that
identified the FEATS
as needing revision:
Gantt 1990 Two FEATS scales (perseveration and rotation) still need additional refinement to be useful. p. 218
The scales use imprecise measurements.
Studies that
identified the BND
Rating System as
needing revision:
Francis 2003 Wording of directives; wording of graphic indicators on checklist should be more precise. p. 135
71
Question Two
It was assumed that art therapy assessments measure the process of change that a client
may experience in therapy. However, the variability of the concurrent validity and inter-rater
reliability meta-analyses results of the present study indicates that the field of art therapy has not
yet produced sufficient information in the area of assessments and rating instruments.
The variation across the primary studies was vast and may have been produced by one of
two factors that are known to cause differences in tests of main effects: (1) sampling error
(chance fluctuations due to the imprecision of sampled estimates), and (2) differences in study
participants or how studies are conducted (Cooper, 1998). Taveggia (1974) underscored the
implications of using sampling techniques and probability theory to make inferences about
populations:
A methodological principle overlooked by writers ofreviews is that research
results are probabilistic. This suggests thatthe findings of any single research
are meaninglessthey have occurred simply by chance. It also follows that if a
large enough number of researches has been done on a particular topic, chance
alone dictates that studies will exist that report inconsistent and contradictory
findings! (pp. 397-398)
Thus, chance fluctuation due to the inexactness of sampled estimates is one possible
source of variance in the present studys results. To address the issue of potential variability
attributable to methodology, Coopers (1998) recommendation to examine substantive
differences between studies was followed. However, the results of the sensitivity analyses
revealed trivial sources of variance none of the mediating variables were found to significantly
contribute to the variance in the results. Thus, due to sampling error and vast differences in study
participants and methodology, the extent to which art therapy assessments measure the process
of change that a client may experience in therapy cannot be determined.
Question Three
Do objective assessment methods such as standardized art therapy tools give us enough
information about clients? Based on the review of the literature, it was assumed that the most
effective approach to assessment incorporates objective measures such as standardized
assessment procedures (formalized assessment tools and rating manuals; portfolio evaluation;
behavioral checklists), as well as subjective approaches such as the clients interpretation of his
72
or her artwork. Due to the inconclusive results of the present study, it is recommended that
researchers continue to explore these objective and subjective approaches to assessment.
Based on the present studys outcomes, question three cannot be addressed, because the
field of art therapy has not yet produced sufficient information in the area of assessments and
rating instruments. Rather, objective assessment methods such as standardized art therapy tools
not only fail to give us enough information about clients, but the previous research fails to
provide adequate information about the tools themselves.
The present study sought to answer the question, What does the literature tell us about
the current state of art therapy assessments? The null hypothesis, that homogeneity exists
among the study variables identified in art therapy assessment and rating instrument literature,
was rejected, thereby demonstrating that art therapists are still in a nascent stage of
understanding assessments and rating instruments.
Limitations of the Study
Seven types of limitations were identified in the present study. These include: validity
issues in study retrieval; issues in problem formulation; judging research quality; issues in
coding; sources of unreliability in coding; validity issues in data analysis; and issues in study
results.
Validity Issues in Study Retrieval
The most serious form of bias enters a meta-analysis at the literature search stage,
because of the difficulty inherent in assessing the influence of a potential bias (Glass et al.,
1981). Furthermore, every study (hence every individual represented therein) used in a synthesis
has an unequal chance of being selected (Cooper, 1998). Although every effort was made to
gather all published and unpublished research for the present study, this synthesist encountered
problems that were beyond her control. These included: (a) primary researchers who reported
data carelessly or incompletely; and (b) individuals and/or libraries that faltered in their services,
failing to provide all potentially relevant documents. It was not possible to locate many theses
and dissertations. Access to more papers would have increased the number of studies included in
the concurrent validity and inter-rater reliability meta-analyses, which may have yielded more
substantial results.
73
Issues in Problem Formulation
This researcher endeavored to uncover variables to demonstrate why results varied in
different studies and to generate notions that would explain these higher-order relations (Cooper,
1998). Despite these efforts, however, it is possible that some variables were not uncovered.
Judging Research Quality
Cooper (1998) cited two sources of variance in the quality judgment of research
evaluators: (a) the relative importance they assign to different research design characteristics,
and (b) their judgments about how well a particular study meets a design criterion. Synthesists
predispositions usually have an impact on methods for evaluating studies, and the a priori
quality judgments required to exclude studies are likely to vary from judge to judge (p. 83).
Although synthesists are usually aware of these biases as well as the outcomes of stud ies as they
collect the research, it is likely that the findings will be tainted by the evaluators predispositions.
Issues in Coding
Low- inference codings are those that require the researcher to locate the needed
information in the primary study and transfer it to the coding sheet (Cooper, 1998, p. 30).
Codings of high- inference, on the other hand, require coders to make inferential judgments about
the studies, which may create problems in reliability. In the present study, high- inference
judgments were a factor in assigning scores to the categories Author Favor and Coder Favor, as
Coder 1 had to decide to what extent she and the primary authors favored the use of the
assessment based upon primary study findings. It is possible that reliability may have been
compromised by these subjective judgments.
Other Sources of Unreliability in Coding
Problems of reliability evolve in a meta-analysis at the coding stage. Issues arise when
different coders fail to see or judge study characteristics in the same way (Glass et al., 1981).
Four sources of unreliability in study coding that may have been a factor in the present study,
include: (a) recording errors; (b) lack of clarity in primary researchers descriptions; (c)
ambiguous definitions provided by the research synthesist, which lead to disagreement about the
proper code for a study characteristic; and (d) coders predispositions, which can lead them to
favor one interpretation of an ambiguous code over another (Cooper, 1998).
Validity Issues in Data Analysis
74
During the data analysis stage, inappropriate use of rules of inference creates a threat to
validity (Cooper, 1998). In order to avoid this, the researcher of the present study worked closely
with a statistician consultant. In addition, every effort was made to avoid misinterpretation of
synthesis-generated results in such a way as to substantiate statements of causality, because any
conclusions based on synthesis-generated evidence are always purely associational (p. 155).
Issues in Study Results
Although it is ideal that the incorporated studies permit generalizations to the particular
individuals or groups of focus (Cooper, 1998), the present study was limited by the sampling
methods used in the primary research papers. Because the sampling methods of 22, or 62.86% of
studies were flawed, generalizability of results is not possible. Furthermore, although a larger
sample size usually increases stability, if more studies could have been retrieved, it is likely that
even more variability would have been found.
It is apparent that relationship strength can be examined in a meta-analysis, but causality
is a separate matter (Cooper, 1998). It is very risky to make statements of causality based on
synthesis-generated results, because it is impossible to determine the true causal agent. When a
relation between correlated characteristics is found to exist, however, this information can be
used to recommend future directions for primary researchers.
Recommendations for Further Study
There are several areas on the topic of assessment in which further study could be
pursued: (1) replication of Hackings study on individual drawing scales; (2) exploration of
amalgamated findings of different assessments; (3) completion of a systematic analysis of patient
groups; and (4) periodic execution of thorough literature reviews and systematic analyses. These
are discussed presently.
By including non-art therapy assessment tools in her meta-analysis, Hacking (1999) was
able to meta-analyze drawing elements (such as color, line, and space). Hacking listed the
items in a large table with their test values and p- values, and standardized the scores to aggregate
them. This was followed by the application of a meta-analytic procedure to each category to
discover whether the non-significant results outweighed the significant. For example, Hacking
aggregated all variables covering 14 drawing areas categorized as form variables, objective
content or subjective content. Subjective variables seemed to produce the largest effect, but there
were demonstrable if small effects for the two other categories. Specific results are as follows:
75
Form variables (77 form variables from 14 tabulated drawing areas); effect size: 0.1977;
confidence limits (all significances p<0.05): 0.1217-0.2736. Objective content (38 subjective
content variables from 14 tabulated drawing areas); effect size: 0.3062; confidence limits (all
significances p<0.05): 0.2105-0.4020. Subjective content (102 subjective content variables from
14 tabulated drawing areas); effect size: 0.4283; confidence limits (all significances p<0.05):
0.3704-0.4863. It would be useful for an art therapist to replicate Hackings study with art
therapy assessment tools, if a larger number of studies that include data on individual drawing
scales become available in the future.
Because the majority of papers located for inclusion in the present study were about the
DDS, the results offer more information about the DDS than any other assessment. In the future
it would be advantageous to explore amalgamated findings of other assessments to the same
extent.
If more studies could be located, it might be worthwhile to conduct a systematic analysis
of patient groups. Such an analysis could enable art therapists to gain a better understanding of
the implications of the use of different assessments with various patient groups. This type of
investigation might also increase the validity of assessment instruments.
The field of art therapy would benefit if thorough literature reviews and systematic
analyses on this topic were to be conducted periodically. This would help researchers to maintain
a sound understanding of the work that has already been accomplished, and what needs to be
done to improve the research.
Conclusion
In order to address the hypothesis and research questions established by this researcher,
the results of a systematic analysis of 35 studies were presented and discussed. To address the
first research question, methodological problems identified in the primary studies were
described. Problems with validity and reliability of the primary articles were discussed, followed
by a delineation of methods for the improvement of assessment tools. Question two could not be
addressed due to sampling error and vast differences in study participants and methodology.
Because the field of art therapy has not yet produced sufficient information in the area of
assessments and rating instruments, question three could not be addressed. The studys
limitations, which included: validity issues in study retrieval; issues in problem formulation;
judging research quality; issues in coding; sources of unreliability in coding; validity issues in
76
data analysis; and issues in study results, were presented. Recommendations for further study
were provided.
Based on the present analysis, the art therapy assessment and rating instrument literature
reveals that flaws in the research are numerous, and that much work has yet to be done. As was
revealed via an extensive review of the literature and elucidation of the debated issues, art
therapists are still in a nascent stage of understanding assessments and rating instruments.
Decisions about treatment and diagnosis are often based upon the results of various
assessment and evaluation techniques. When any form of artwork is used to gain insight about
clients, art therapists need to be aware of the benefits and limitations of their approach and the
tools they use. It is the responsibility of every art therapist to be well versed in the areas of
evaluation, measurement, and research methodology. Art therapists should identify their own
personal philosophy, whether in support of or opposed to the use of assessments. Perhaps the
wisest stance on assessment involves embracing both sides of this issue and moving forward
with the work that needs to be done.
77
APPENDIX A
ART THERAPY ASSESSMENT INSTRUMENTS
This is not an exhaustive list. Reference citations included when available.
Art Therapy Assessment Instruments Included Corresponding Standardized Rating System

in the Systematic Analysis
Birds Nest Drawing (BND) (Kaiser, 1993) Attachment Rating Scale (Kaiser, 1993)
Bridge Drawing (Hays & Lyons, 1981) 12 variables assembled by authors (informal)
Diagnostic Drawing Series (DDS) (Cohen, Hammer Drawing Analysis Form; Content Checklist (Cohen, 1985/1994)
& Singer, 1988)
A Favorite Kind of Day (Manning Rauch, 1987) Aggression Depicted in the AFKD Rating Instrument (A three-item
checklist) (Manning Rauch, 1987)
Person Picking an Apple From a Tree (PPAT) Formal Elements Art Therapy Scale (FEATS) (Gantt & Tabone, 1998)
Art Therapy Assessment Instruments Corresponding Rating System

Arrington Visual Preference Test (AVPT) (Dorris Arrington)
Art Therapy-Projective Imagery Assessment (AT- (EVMS)
PIA) (EVMS)
Dot- to-Dot
Draw-A-Village Task (DAV)
Draw-Yourself Task (DYT)
Face Stimulus Assessment (FSA) (Betts, 2003) FSA Rating Guidelines (Betts, 2003)
Favorite Place Drawings
Levick Emotional and Cognitive Art Therapy (Myra Levick, 2001)
Assessment (LECATA) (Levick)
Mandala Assessment Research Instrument (MARI) Manual containing possible interpretations
Card Test (Kellogg)
Ulman Personality Assessment Procedure (UPAP) A checklist
(Ulman)
Stand-Alone Rating Systems

Descriptive Assessment for Psychiatric Art (DAPA)
Sheppard Pratt Art Rating Scale (SPAR)
(Shoemaker-Beal, 1977; Bergland & Moore Gonzalez, 1993)
Questionnaire NBS (Elbing & Hacking, 2001) for objective picture criteria;
Semantic Differential (Simmat, 1969) for contextual-affective picture
criteria.
Unnamed 18-item rating scale (with 4 additional color-related test items)
(Wadlington & McWhinnie, 1973)
78
APPENDIX B
TWO POPULAR ART THERAPY ASSESSMENTS AND THEIR CLINICAL

APPLICATIONS: THE DDS AND THE PPAT
The Diagnostic Drawing Series (DDS) (Cohen, Hammer, & Singer, 1988)
The DDS s a three-picture art interview that was developed in 1982 by art therapists
Barry Cohen and Barbara Lesowitz in Virginia. It first entered the published literature in 1985.
Drawings are collected and maintained for research purposes in the DDS archive, Alexandria,
VA.
The DDS should be administered on a tabletop during one 50- minute session (most
people finish the Series in about 20 minutes, leaving time for discussion). It can be administered
individually and in groups. Materials are: 18 x 24 white 60 lb or 70 lb drawing paper that has a
slight tooth or texture; standard 12 pack of Alphacolor square chalk pastels (unwrapped) in North
America; Faber Castell elsewhere.
DDS Directives (Cohen & Mills, 2000):
1) Free Picture: Make a picture with these materials. This is the unstructured task of the Series.
It typically reveals a manifestation of the clients defense system (i.e., the amount and type of
information the patient is initially willing to share [Feder & Feder, 1998])
2) Tree Picture: Draw a picture of a tree. (NB: even if a tree was drawn in the first picture.)
This is the structured task of the Series. Deemed to be non-threatening subject (Feder & Feder,
1998).
3) Feeling Picture: Make a picture of how youre feeling using lines, shapes, and colors. This
is a semi- structured task, it asks the client to communicate about his/her affective state directly,
as well as to represent it in abstract form. Patients are rarely fooled by the artifice of projective
tests, which are now part of our popular culture. If we want to know about the patients
experience or self concept, why not simply ask? (Cohen, Mills, & Kijak, 1994)
The Person Picking an Apple From a Tree (PPAT) (Gantt, 1990)

This assessment consists of one drawing that can be administered in an individual or
group art therapy session (Arrington, 1992). The client is asked to Draw a picture of a person
picking an apple from a tree. The materials include one piece of 12 by 18 inch white drawing
paper and a set of 12 Mr. Sketch scented felt-tip markers. The drawings are rated with the
Formal Elements Art Therapy Scale (FEATS) manual (Gantt & Tabone, 1998).
The PPAT drawing was first described by Viktor Lowenfeld (1939, 1947) in a study he
conducted on childrens use of space in art. His instructions were detailed: You are under an
apple tree. On one of its lower branches you see an apple that you particularly admire and that
you would like to have. You stretch out your hand to pick the apple, but your reach is a little
short. Then you make a great effort and get the apple after all. Now you have it and enjoy eating
it. Draw yourself as you are taking the apple off the tree (1947, pp. 75-76). Little else has been
written about this drawing. Greg Furth included examples in his book The Secret World of
Drawings (1988, p. 86-88) but did not discuss his reason for using it.
Gantt and Tabone (1998) credit art therapist Tally Tripp for bringing the idea of the
PPAT to Washington, DC from Georgia. Ta lly had worked with a Georgia art therapist who used
the PPAT frequently.
79
APPENDIX C
HOW AN ART THERAPY ASSESSMENT IS DEVELOPED
As artists, art therapists believe that they possess a unique sensibility for studying
drawings (Hagood, 2002). It follows, therefore, that they attempt to create their own assessment
instruments. However, developing a standardized art therapy assessment is a long and arduous
undertaking (Betts, 2003). A lifetime of effort can be devoted to this endeavor (Kaplan, 2001,
p. 144).
It is common to find an art therapist who has used any number of previously existing
assessments, such as the DDS or the PPAT. However, there are occasions when an art therapist
has used such tools and has found them to be inappropriate for use with his or her specific group
of clients. Perhaps through trying out different ways of working with the clients, and
experimenting with different media and methods, the art therapist may begin to formulate ideas
about creating his or her own unique tool that is tailored for use with a particular client
population.
For example, art therapist Donna Betts (2003) developed the Face Stimulus Assessment
(FSA) while she was employed at a multicultural school in a major metropolitan area. She had
not had much success in using the established art therapy instruments, so she invented a
technique for evaluating her non-verbal clients who had cognitive impairments. Betts clients
were not motivated to draw without a visual stimulus and they were unable to follow directions.
Even a basic directive such as draw a person did not elicit any response from the clients who
had severe mental retardation or developmental delay.
Thus, over time and with several pilot trials, Betts (2003) developed a method that would
uncover her clients strengths through art. Betts experience in creating the FSA is not so unlike
the processes that Ulman (1965, 1975, 1992; Ulman & Levy, 1975, 1992), Kwiatkowska (1975,
1978), and Silver (1978, 1983) undertook in developing their assessments. Ulman, Kwiatkowska,
and Silver also designed tools based on their clinical experience with clients who had unique
challenges, as was discussed in Chapter Two of this prospectus.
Once a new assessment has been piloted, then it must undergo rigorous validity and
reliability studies in order to be established as a tool and used in clinical settings. Lehmann and
Risquez (1953) delineated specific requirements that are necessary for ensuring the credibility of
an art-based assessment:
1) It should be possible to obtain repeated productions that are comparable in order to
obtain a longitudinal view of the variations in the patients graphic expression over a
period of time.
2) The method should allow for the comparison of productions of the same patient and
of different patients at different times by means of a standardized method of rating.
3) It should be possible to obtain valid and useful information on the patients medical
condition through the evaluation of his or her paintings without having to spend
additional time in observing the patient while he or she is painting or in conducting an
interview about the finished product.
80
APPENDIX D
81
APPENDIX E
82
APPENDIX F
Hacking, 1999
83
APPENDIX G
84
85
APPENDIX H
86
87
APPENDIX I
CODING SHEET FOR ART THERAPY ASSESSMENT AND

RATING INSTRUMENT STUDIES
DONNA BETTS 2005
Reviewer/Coder: _____________________________________________ Coder ID: ________

Todays Date: __ __/__ __/__ __
Citation Information:
Citation ID #: __________
Author(s): _______________________________________________________________
Major discipline of the first or primary author (1=Art Therapy; 2=Psychology;
3=Counseling; 4=Social Work; 5=Other; 99=unable to determine): _______
Title:
________________________________________________________________________
________________________________________________________________________
Published Study
Publication:
________________________________________________________________________
Year: ________ Volume: ________ Issue: ________ Pages: ________
Unpublished Study
Masters Thesis Year: ________
Doctoral Dissertation Year: ________
Other: __________ Year: ________
Search Method:
1. Electronic search: ____________ 3. Personal recommendation: ____________
2. Manual search: ______________ 4. Other: ____________________________
Research Design:
Art therapy assessment instrument used in study:
1. Diagnostic Drawing Series (DDS)
2. Person Picking an Apple From a Tree (PPAT)
3. Other: ______________________________________________
Type of rating system used to rate artwork in study:

1. Drawing Analysis Form (corresponds to DDS)
2. Formal Elements Art Therapy Scale (FEATS) (corresponds to PPAT)
3. Other: ________________________________________________________
4. Other: ________________________________________________________
99. Unable to determine
Rating Procedures:
Number of raters used to rate drawings: 1. 2. 3. 4.
99. Unknown
Was the primary researcher one of the raters?
88
1. Yes 2. No 99. Unable to determine
Were raters trained?
Rater 1: 1. Yes 2. No 99. Unable to determine
Design type:
Concurrent validity attempted Concurrent validity not attempted
Test(s) used in comparison to the art therapy assessment tool:
____________________________________________________________
1. Reported 2. Not reported 99. Unable to determine
Statistical measure: ____________ Statistical value: ___________
Criterion validity:
Were the patients/subjects drawings compared to a 2nd set of drawings?
If yes: Normal controls Archival drawings
Were conclusions about the initial drawings backed up by a psychiatrists
or physicians diagnosis of the patient?
Predictive validity:
1. The assessment was developed to predict future behavior (predictive
validity)
What was the researcher attempting to predict? (e.g., assessment
predicted suicidality.)
______________________________________________________
2. The assessment was developed to assess current pathology
What was the researcher attempting to assess?
______________________________________________________
Intra-rater reliability:
Inter-rater reliability:
Test-retest reliability attempted Test-retest reliability not
attempted
Length of time between testing: _________
Equivalent forms reliability (like MMPI has order of questions in test is
changed):
Describe the difference between the two forms of the test:
__________________________________________________________________
__________________________________________________________________
Split- half reliability:
89
What part of the assessment was compared to the other part?
__________________________________________________________________
Site Variables:
Location (use state abbreviations): ______ 99. Unknown
Source of funding:
1. Public 2. Private 99. Unknown
Setting:
1. Hospital 2. School 3. Day Treatment Center
4. Other: ________________________________ 5. Unknown
Participant Variables:
# of sites: ______ List sites: _____________________________________________
Patient Group:
# of participants in tx group: ______
1. Random sample 2. Convenience sample 99. Unable to
determine
Diagnostic selection criteria:
Ss screened for pathology 1. Yes 2. No 99. Unable to determine
Gender: # Female: ______ # Male: ______
Age: Average age (F): ______ Average age (M): ______ Total average
age: ______
Lowest included age: ______ Highest included age: ______
Socioeconomic status: Ethnic group:
1. Lower # or %: ______ 1. White # or %: ______
2. Middle # or %: ______ 2. Black # or %: ______
3. Mixed # or %: ______ 3. Other
___________________
3. Other # or %: ______ # or %: ______
99. Unknown 99. Unknown
Control Group (if applicable):

# of participants in control group: ______
1. Random sample 2. Convenience sample 99. Unable to
determine
Diagnostic selection criteria:
Ss screened for absence of pathology 1. Yes 2. No 99. Unable
to determine
Gender: # Female: ______ # Male: ______
Age: Average age (F): ______ Average age (M): ______ Total average
age: ______
Lowest included age: ______ Highest included age: ______
Socioeconomic status: Ethnic group:
1. Lower # or %: ______ 1. White # or %: ______
90
2. Middle # or %: ______ 2. Black # or %: ______
3. Mixed # or %: ______ 3. Other
___________________
3. Other # or %: ______ # or %: ______
99. Unknown 99. Unknown
Statistical Information:
_______________________________________________________________________
_______________________________________________________________________
Study Quality:
Results Favor Use of This Assessment (according to studys author):
1. Yes 2. No 3. Neutral
Study flaws described by author (please list): (include page #)
_______________________________________________________________________
_______________________________________________________________________
Results Favor Use of This Assessment (according to you, the coder):
1. Yes 2. No 3. Neutral
Additional study flaws not identified by author (please list):
_______________________________________________________________________
_______________________________________________________________________
Notes and Comments:

_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
91
APPENDIX J
PRIMARY STUDIES USED IN THE SYSTEMATIC ANALYSIS
Total Number of Studies: 35
CITATION LOCATION METHOD
Batza Morris, M. (1995). The Diagnostic Drawing Series and the PsycINFO database
Tree Rating Scale: An isomorphic representation of multiple
personality disorder, major depression, and schizophrenia
populations. Art Therapy: Journal of the American Art
Therapy Association, 12(2), 118-128.
Bergland, C., & Moore Gonzalez, R. (1993). Art and madness: Can PsycINFO database
the interface be quantified? The Sheppard Pratt Art Rating
Scale--an instrument for measuring art integration. American
Journal of Art Therapy, 31, 81-90.
Billingsley, G. (1998). The efficacy of the Diagnostic Drawing Series DDS archives
with substance-related disordered clients. Unpublished
doctoral dissertation, Walden University.
Brudenell, T. J. (1989). Art representations as functions of DDS archives
depressive state: Longitudinal studies in chronic childhood
and adolescent depression. Unpublished master's thesis,
College of Notre Dame, Belmont, CA.
Coffey, T. M. (1997). The use of the Diagnostic Drawing Series with DDS archives
an adolescent population. Unpublished paper.
Cohen, B. M., Hammer, J. S., & Singer, S. (1988). The Diagnostic PsycINFO database
Drawing Series: A systematic approach to art therapy
evaluation and research. Arts in Psychotherapy: Special
Research in the creative arts therapies., 15(1), 11-21.
Cohen, B. M., & Heijtmajer, O. (1995). Identification of dissociative DDS archives
disorders: Comparing the SCID-D and dissociative
experiences scale with the Diagnostic Drawing Series.
Unpublished paper.
Couch, J. B. (1994). Diagnostic Drawing Series: Research with older PsycINFO database
people diagnosed with organic mental syndromes and
disorders. Art Therapy: Journal of the American Art Therapy
Association, 11(2), 111-115.
Easterling, C. E. (2000). Art therapy with elementary school students Manual search of FSU
with symptoms of depression: The effects of storytelling, art therapy department
fantasy, daydreams, and art reproductions. Unpublished dissertations
master's thesis, Florida State University, Tallahassee, FL.
Eitel, K., Szkura, L., & Wietersheim, J. v. (2004). Do you see what I Manual search: know the
see? A study about the interrater-reliability in art therapy. author personally
Unpublished paper.
92
Fowler, J. P., & Ardon, A. M. (2002). Diagnostic Drawing Series PsycINFO database
and dissociative disorders: A Dutch study. Arts in
Psychotherapy, 29(4), 221-230.
Francis, D., Kaiser, D, & Deaver, S. P. (2003). Representations of PsycINFO database
attachment security in the bird's nest drawings of clients with
substance abuse disorders. Art Therapy: Journal of the
American Art Therapy Association, 20(3), 125-137.
Gantt, L. M. (1990). A validity study of the Formal Elements Art Manual search: know the
Therapy Scale (FEATS) for diagnostic information in author personally
patients' drawings. Unpublished Dissertation, University of
Pittsburgh, Pittsburgh, PA.
Gulbro Leavitt, C. (1988). A validity study of the Diagnostic DDS archives
Drawing Series as used for assessing depression in children
and adolescents. Unpublished doctoral dissertation,
California School of Professional Psychology, Los Angeles,
CA.
Gulbro-Leavitt, C., & Schimmel, B. (1991). Assessing depression in PsycINFO database
children and adolescents using the Diagnostic Drawing Series
modified for children (DDS-C). Arts in Psychotherapy, 18(4),
353-356.
Gussak, D. (2004). Art therapy with prison inmates: A pilot study. PsycINFO database
Arts in Psychotherapy, 31(4), 245-259.
Hacking, S., Foreman, D., & Belcher, J. (1996). The Descriptive PsycINFO database
Assessment for Psychiatric Art: A new way of quantifying
paintings by psychiatric patients. Journal of Nervous and
Mental Disease, 184(7), 425-430.
Hacking, S., & Foreman, D. (2000). The Descriptive Assessment for PsycINFO database
Psychiatric Art (DAPA): Update and further research.
Journal of Nervous and Mental Disease, 188(8), 525-529.
Hays, R. E., & Lyons, S. J. (1981). The bridge drawing: A projective PsycINFO database
technique for assessment in art therapy. Arts in
Psychotherapy, 8(3-sup-4), 207-217.
Hyler, C. (2002). Children's drawings as representations of Obtained from EVMS
attachment. Unpublished master's thesis, Eastern Virginia
Medical School, Norfolk, VA.
Johnson, K. M. (2004). The use of the Diagnostic Drawing Series in DDS archives
the diagnosis of bipolar disorder. Unpublished Dissertation,
Seattle Pacific University, Seattle, WA.
Kaiser, D. (1993). Attachment organization as manifested in a Obtained from EVMS
drawing task. Unpublished master's thesis, Eastern Virginia
Medical School, Norfolk, VA.
Kessler, K. (1994). A study of the Diagnositic Drawing Series with PsycINFO database
eating disordered patients. Art Therapy: Journal of the
Kress, T., & Mills, A. (1992). Multiple personality disorder and the DDS archives
Diagnostic Drawing Series: Further investigations.
93
Unpublished paper.
Manning, T. M. (1987). Aggression depicted in abused children's PsycINFO database
drawings. Arts in Psychotherapy, 14, 15-24.
McHugh, C. M. (1997). A comparative study of structural aspects of DDS archives
drawings between individuals diagnosed with major
depressive disorder and bipolar disorder in the manic phase.
Unpublished master's thesis, Eastern Virginia Medical
School, Norfolk, VA.
Mills, A. (1989). A statistical study of the formal aspects of the DDS archives
Diagnostic Drawing Series of borderline personality
disordered patients, and its context in contemporary art
therapy. Unpublished master's thesis, Concordia University,
Montreal, PQ.
Mills, A., & Cohen, B. (1993). Facilitating the idenfitication of PsycINFO database
multiple personality disorder through art: The Diagnostic
Drawing Series. In E. S. Kluft (Ed.), Expressive and
functional therapies in the treatment of multiple personality
disorder. Springfield, IL: Charles C Thomas.
Neale, E. L. (1994). The Children's Diagnostic Drawing Series. Art PsycINFO database
Therapy: Journal of the American Art Therapy Association,
11(2), 119-126.
Overbeck, L. B. (2002). A pilot study of pregnant women's drawings. Obtained from EVMS
Unpublished master's thesis, Eastern Virginia Medical
School, Norfolk, VA.
Ricca, D. (1992). Utilizing the Diagnostic Drawing Series as a tool DDS archives
in differentiating a diagnosis between multiple personality
disorder and schizophrenia. Unpublished master's thesis,
Hahnemann University, Philadelphia, PA.
Shlagman, H. (1996). The Diagnostic Drawing Series: A comparison DDS archives
of psychiatric inpatient adolescents in crisis with non-
hospitalized youth. Unpublished master's thesis, College of
Notre Dame, Belmont, CA.
Wadlington, W. L., & McWhinnie, H. J. (1973). The development of PsycINFO database
a rating scale for the study of formal aesthetic qualities in the
paintings of mental patients. Art Psychotherapy, 1, 201-220.
Wilson, K. (2004). Projective drawing: Alternative assessment of Manual search: know the
emotion in children who stutter. Unpublished bachelor's author personally
thesis, Florida State University, Tallahassee, FL.
Yahnke, L. (2000). Diagnostic Drawing Series as an assessment for PsycINFO database
children who have witnessed marital violence. Unpublished
doctoral dissertation, Minnesota School of Professional
Psychology, Minneapolis, MN.
94
APPENDIX K
SUMMARY OF NUMERICAL DATA ON SPECIFIC DDS DRAWING ANALYSIS FORM

(DAF) VARIABLES
DDS Citation ID DDS DAF Variable information Page #s in study Table(s)

where information is containing
located information
Billingsley 1998 Percentage of incidence in all three of the DDS pictures; pp. 85-93 Table 4
Chi-square (to address the research question How are the pp. 155-156
graphic profiles of substance-related disordered clients
admitted to an outpatient treatment program different from
the profiles already generated by the (DDS) with a control
group and other disorders? (p. 6).
Brudenell 1989 All three of the DDS drawings combined and correlated with p. 43 Table 13
CES-D scores.
Coffey 1997 Patient group compared to control group (percentages) pp. 21-24
Cohen 1988 Reported correlations, specifically, the degree of pp. 16-19 Tables 1-4
association of all the explanatory variables with the
diagnostic category
Couch 1994 Reported percentages of occurrence. pp. 112-113
Fowler 2002 Discrimination measures reported. p. 226 Table 5
Gulbro 1988 Reported Frequencies of structural elements across p. 105 Table 25
drawings which are significant or approaching significance.
Gulbro 1991 Variable Idiosyncratic Color reported for Free Drawing only, p. 354
correlated to DSRS, phi=.37; variable People reported for
Free Drawing only, correlated to DSRS, phi=.34; variable
Animals reported for Free Drawing only, correlated to
DSRS, phi=.24.
Johnson 2004-a Summary of significant variables. p. 13 Table 4
Johnson 2004-b Correlations of MCMI-II categories and DDS variables on all Tables 10-12
3 DDS pictures.
Kessler 1994 Percentage of occurrence/non-occurrence of Groundline p. 117 Table 2
element in picture one of the DAF;
Percentage of occurrence/non-occurrence of Falling Apart Table 3
Tree elements in picture two of the DAF.
Mchugh 1997 Inter-rater reliability statistics. pp. 86-87 Tables 1 & 2
Mills 1989-a Tabulation of BPD Sample (percentages) pp. 86-89 Table 5
Tabulation of Character-Disordered Males (percentages); pp. 90-93, Table 6
Tabulation of Character-Disordered Adolescents pp. 94-98 Table 7
(percentages).
Mills 1989-b Statistically Significant Regressors (T-tests). pp. 110-119 Table 8
Mills 1993 Six DAF variables appeared prominently in each of the three
pictures by the MPD subjects and were reported as
percentages of incidence:
Integration, p. 47 Table 4
Abstraction, p. 48 Table 6
Enclosure, p. 49 Table 7
Movement, p. 46 Table 2
Tree, p. 46 Table 3
Tilt p. 47 Table 5
Neale 1994-a Probability Table for Ho1, that the variables of the DAF p. 121 Table 5
applied to the CDDS would significantly discriminate
between the treatment group (DSM-III-R diagnosis) and the
95
control group.
Neale 1994-b Chi-squares for Ho2, that a cluster of DAF variables applied p. 123 Table 6
to the CDDS would emerge as criteria for diagnosing
children with adjustment disorders.
Neale 1994-c Characteristics of Drawings of Children Diagnosed with p. 124 Table 7
Adjustment Disorder: Variables with interrater reliability with
Kappa > .50.
Ricca 1992-a The statistical results from the Likelihood Ratio Chi-Square pp. 66-70 Tables 3-5
test, presenting the collaborated scores of raters one, two
and three; p-values provided.
Ricca 1992-b Cross-analysis of Picture 1s, Picture 2s, Picture 3s pp. 72-74 Tables 6-8
[The statistical results from the Likelihood Ratio Chi-Square
test, presenting the collaborated scores of raters one, two
and three; p-values provided.]
Shlagman 1996-a DAF variables data. pp. 35-93 Tables 7-67
Shlagman 1996-b Variables with 25% (or more) difference between both pp. 67-69 Tables 63-65
sample groups.
Yahnke 2000 Inter-rater reliability between variables on the DDS; p. 30 Table 4
Frequency results from the CDDS Series total; p. 32 Table 5
Variables from the CDDS and their relationship to Primary p. 34 Table 6
Scales on the CBCL
96
APPENDIX L
META-ANALYSIS OF INTER-RATER RELIABILITY

GLOBAL EFFECT SIZE OF NINE STUDIES THAT USED CORRELATIONS
Obs K Q P LL_T_DOT T_DOT UL_T_DOT

1 9 42.7370 .000000984 1.39667 1.49870 1.60074
No. of studies EFFECT SIZE Lower (CI) Fisher Z Upper (CI)
(represents variability
among the validities)
Obs V_T_DOT SE_T_DOT MODV modsd QV qsd

1 .002710027 0.052058 0.10591 0.32543 0.11191 0.33453
Obs LbackZ BackZ UbackZ

1 0.88463 0.90491 0.92178
Fishers Z:
global effect size
Confidence Intervals (these provide a method of visually detecting heterogeneity)
LLIM ULIM Fisher min max

z -0.5 4
*------------------------------------------------------------------*
0.63 1.51 1.07 | | [-----*-----] |
1.30 1.75 1.53 | | [--*--] |
1.87 2.50 2.18 | | [-----*---] |
1.25 1.70 1.47 | | [--*--] |
1.01 1.52 1.27 | | [---*--] |
1.28 1.78 1.53 | | [--*---] |
1.22 2.97 2.09 | | [------------*-----------] |
1.88 3.64 2.76 | | [-----------*------------] |
0.57 1.33 0.95 | | [----*-----] |
*------------------------------------------------------------------*
97
APPENDIX M

GLOBAL EFFECT SIZE OF SEVEN STUDIES THAT USED CORRELATIONS
Obs K Q P LL_T_DOT T_DOT UL_T_DOT

1 7 32.7977 .000011469 1.36944 1.47289 1.57633

1 .002785515 0.052778 0.087086 0.29510 0.089650 0.29942
Obs LBackZ BackZ UBackZ

1 0.87857 0.90013 0.91803
Global effect size
for inter-rater
reliability correlations
Confidence Intervals (these provide a method of visually detecting heterogeneity)

z -0.5 4
*------------------------------------------------------------------*
1.28 1.78 1.53 | | [--*---] |
1.30 1.75 1.53 | | [--*--] |
1.87 2.50 2.18 | | [-----*---] |
0.63 1.51 1.07 | | [-----*-----] |
0.57 1.33 0.95 | | [----*-----] |
1.25 1.70 1.47 | | [--*--] |
1.01 1.52 1.27 | | [---*--] |
*------------------------------------------------------------------*
98
APPENDIX N

POTENTIAL MEDIATING VARIABLE:
RATER TRAINING VS. NO RATER TRAINING
Obs Ratertrng K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 0 5 9.1174 0.058230 1.29672 1.40970 1.52267 0.003322
Some or all 5 studies,
raters not raters not
trained trained
2 1 4 20.6794 0.000123 1.65501 1.89269 2.13038 0.014706
Rater(s)
trained
Obs SE_T_DOT MODV modsd QV qsd

1 0.05764 0.02125 0.14578 0.02173 0.14742
2 0.12127 0.34665 0.58877 0.44037 0.66360

1 0.86088 0.88743 0.90916
Studies that did not train raters Global effect size in studies
have almost equal reliabilities that did not train all raters
2 0.92954 0.95561 0.97217
Global effect size in studies that
did train all raters (higher than
studies that did not)
Obs Q QBetween Pb Qwithin Pw

1 42.7370 12.9402 .000321600 29.7968 .000103462
Variation among With rater training factor, There is still 29.79
all reliabilities of this number explains the variability that cannot
the 9 studies variability more than w/o this be explained by rater
information training
99
APPENDIX O

AUTHOR RATER VS. AUTHOR NOT RATER
Obs AuthorRater K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 0 6 32.5147 0.000005 1.34215 1.45846 1.57476 0.003521
Primary Author
was not a rater
2 1 3 8.2254 0.016363 1.42058 1.63317 1.84576 0.011765
Primary Author
was a rater

1 0.05934 0.11626 0.34097 0.12076 0.34750
2 0.10847 0.10986 0.33145 0.34140 0.58429

1 0.87219 0.89735 0.91778
2 0.88972 0.92651 0.95135
Reliability is a bit higher
when author is rater

1 42.7370 1.99688 0.15762 40.7402 .000000908
This number is low, which
means that whether the author
is rater or not does not help
explain the variability among
the 9 studies
100
APPENDIX P

CODER FOUND STUDY SUPPORTED TEST VS.
NEUTRAL ON WHETHER STUDY SUPPORTED TEST
Obs CodrFav K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 1 4 18.6083 0.000329 1.55438 1.70047 1.84656 .005555556
Coder found
study supported
test
2 2 5 9.8227 0.043522 1.16398 1.30655 1.44912 .005291005
Coder neutral on
whether study
supported test

1 0.074536 0.11562 0.34003 0.13091 0.36181
2 0.072739 0.03851 0.19624 0.04395 0.20965

1 0.91450 0.93547 0.95142
Coder found study
supported test gives higher
reliability
2 0.82233 0.86340 0.89552
Coder neutral, so reliability
goes down a bit.

1 42.7370 14.3060 .000155369 28.4310 .000183652
14/42 means almost 1/3 of variability
is explained by this factor (but 2/3 is
still not explained)
101
APPENDIX Q

STUDY PUBLISHED VS. STUDY NOT PUBLISHED
Obs Publish K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 0 5 24.8186 .000054715 1.39330 1.51258 1.63186 0.003704
Study not
published
2 1 4 17.7246 .000501290 1.26387 1.46085 1.65784 0.010101
Study
published
No. of EFFECT SIZE Lower (CI) Fisher Upper (CI)
studies (represents Z
variability
among the
validities)

1 0.06086 0.09638 0.31046 0.10055 0.31710
2 0.10050 0.19831 0.44532 0.28152 0.53059

1 0.88389 0.90740 0.92633
2 0.85213 0.89782 0.92993

1 42.7370 0.19383 0.65975 42.5432 .000000409
Whether the study is published
or not doesnt tell us very much
about the variability
102
APPENDIX R
CATEGORICAL FIXED-EFFECTS MODEL:

ASSESSMENT ANALYSIS DATA VS AUTHOR FAVOR
OVERALL ANALYSIS AND DESCRIPTIVES
Concurrent Validity: Confidence Intervals for Author Favor

z -1 1.5
*-----------------------------------------------------------------*
-0.00 1.96 0.98 | [------------------------- * >
-0.99 0.76 -0.12 |[----------------------*-|------------------] |
-0.36 0.16 -0.10 | [------*-|---] |
0.38 1.01 0.69 | | [-------*-------] |
-0.21 0.43 0.11 | [----|--*--------] |
-0.22 0.22 0.00 | [----*----] |
-0.21 0.23 0.01 | [----*------] |
-0.10 0.34 0.12 | [-|--*-----] |
-0.38 0.06 -0.16 | [-----*---|] |
0.09 0.53 0.31 | | [----*-----] |
-0.34 0.39 0.03 | [--------|*--------] |
*-----------------------------------------------------------------*
Concurrent Validity: Confidence Intervals for Author Favorestimated (real) validity

LL_ UL_ T_ min max
T_ T_ DOT -0.2 0.8
DOT DOT * ---------------------------------------------------------------------*
0.01 0.17 0.09 | [------*--] |
*---------------------------------------------------------------------*
103
APPENDIX S

OVERALL META-ANALYSIS, ALL STUDIES
Obs K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT SE_T_DOT

1 11 29.4230 .00106392 .008025498 0.090049 0.17207 .001751313 0.041849
No. of EFFECT SIZE Lower (CI) Fisher Z Upper (CI) Variance Standard
studies (represents Error
variability
among the
validities)
Obs MODV modsd QV qsd

1 0.037417 0.19344 0.038640 0.19657
Correlation

1 .008025325 0.089806 0.17039
Lower bound Correlation Upper bound
104
APPENDIX T

MODEL BY SAMPLING, ALL STUDIES
Obs AuthrFav K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 0 2 2.6488 0.10363 -0.28463 0.36871 1.02204 0.11111
2 1 3 14.5153 0.00070 0.01860 0.18855 0.35850 0.00752
3 2 6 9.7017 0.08414 -0.04096 0.05367 0.14829 0.00233

1 0.33333 0.36640 0.60531 0.37098 0.60908
2 0.08671 0.14115 0.37570 0.14409 0.37959
3 0.04828 0.01315 0.11468 0.01334 0.11550
Correlation

1 -0.27718 0.35286 0.77070
Author found that study did not
support test: highest reliability
2 0.01860 0.18635 0.34390
Author found that study did
support test: lower reliability
3 -0.04094 0.05361 0.14722
Author neutral: lowest reliability
The possible variability among the three groups

1 29.4230 2.55721 0.27843 26.8657 .000745726
Variation Indicates that the three There is still Q-within is significant
among groups are 26.8657 variability because Pw<.001.
reliabilities of homogeneous (i.e., not that cannot be Indicates that the
all 11 studies significant, based on Q explained by variability among the
between) Author Favor effect sizes can
contribute to the Q-
within variabilities.
105
APPENDIX U

ASSESSMENT ANALYSIS DATA VS CODER FAVOR
Concurrent Validity: Confidence Intervals for Coder Favor

z -1 1.5
*-----------------------------------------------------------------*
-0.00 1.96 0.98 | [------------------------- * >
-0.36 0.16 -0.10 | [------*-|---] |
0.38 1.01 0.69 | | [-------*-------] |
-0.21 0.43 0.11 | [----|--*--------] |
-0.22 0.22 0.00 | [----*----] |
-0.21 0.23 0.01 | [----*------] |
-0.10 0.34 0.12 | [-|--*-----] |
-0.38 0.06 -0.16 | [-----*---|] |
0.09 0.53 0.31 | | [----*-----] |
-0.34 0.39 0.03 | [--------|*--------] |
-0.99 0.76 -0.12 |[----------------------*-|------------------] |
*-----------------------------------------------------------------*
Concurrent Validity: Confidence Intervals for Coder Favorestimated (real) validity

LL_ UL_ T_ min max
T_ T_ DOT -0.2 0.8
DOT DOT * ---------------------------------------------------------------------*
0.01 0.17 0.09 | [------*--] |
*---------------------------------------------------------------------*
106
APPENDIX V


1 11 29.4230 .00106392 .008025498 0.090049 0.17207 .001751313 0.041849
No. of EFFECT Lower (CI) Fisher Z Upper (CI) Variance Standard
studies SIZE Error
(represents
variability
among the
validities)

1 0.037417 0.19344 0.038640 0.19657
Correlation

1 .008025325 0.089806 0.17039
107
APPENDIX W

Obs CodrFav K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 0 1 0.0000 . -0.004755 0.97524 1.95524 0.25000
2 1 3 14.5153 0.00070 0.018598 0.18855 0.35850 0.00752
3 2 7 9.8448 0.13134 -0.042379 0.05170 0.14579 0.00230
No. of EFFECT SIZE Lower (CI) Fisher Upper (CI)
studies (represents variability Z
among the validities)
NB: The first line (0) can be ignored because there was only one and shouldnt affect the overall data.

1 0.50000 0.00000 0.00000 0.00000 0.00000
2 0.08671 0.14115 0.37570 0.14409 0.37959
3 0.04800 0.01034 0.10166 0.01073 0.10359
Correlation:

1 -0.004755 0.75100 0.96073
2 0.018595 0.18635 0.34390
3 -0.042353 0.05166 0.14476
The possible variability among the three groups:

1 29.4230 5.06286 0.079545 24.3601 .001993794
Variation Indicates that the There is still Q-within is not significant
among three groups are 24.3601 variability because Pw>.001.
reliabilities of all homogeneous (i.e., that cannot be Indicates that the
11 studies not significant, based explained by variability among the
on Q between) Coder Favor effect sizes does not
contribute significantly to
the Q-within variabilities.
NB: Most of the variability among the effect sizes comes from the Q within.
108
APPENDIX X

ASSESSMENT ANALYSIS DATA VS MANUSCRIPT TYPE
Concurrent Validity: Confidence Intervals for Manuscript Type

z -1 1.5
*-----------------------------------------------------------------*
-0.22 0.22 0.00 | [----*----] |
-0.21 0.23 0.01 | [----*------] |
-0.10 0.34 0.12 | [-|--*-----] |
-0.38 0.06 -0.16 | [-----*---|] |
0.09 0.53 0.31 | | [----*-----] |
-0.36 0.16 -0.10 | [------*-|---] |
-0.00 1.96 0.98 | [------------------------- * >
0.38 1.01 0.69 | | [-------*-------] |
-0.21 0.43 0.11 | [----|--*--------] |
-0.34 0.39 0.03 | [--------|*--------] |
-0.99 0.76 -0.12 |[----------------------*-|------------------] |
*-----------------------------------------------------------------*
Concurrent Validity: Confidence Intervals for Manuscript Type estimated (real) validity
LL_ UL_ T_ min max
T_ T_ DOT -0.2 0.8
DOT DOT * ---------------------------------------------------------------------*
0.01 0.17 0.09 | [------*--] |
*---------------------------------------------------------------------*
109
APPENDIX Y


1 11 29.4230 .00106392 .008025498 0.090049 0.17207 .001751313 0.041849
studies SIZE Error
(represents
variability
among the
validities)

1 0.037417 0.19344 0.038640 0.19657
Correlation

1 .008025325 0.089806 0.17039
110
APPENDIX Z

Obs manutyp K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 1 6 10.8215 0.055036 -0.05484 0.03684 0.12853 .002188184
2 2 5 12.1219 0.016468 0.11976 0.30333 0.48690 .008771930
(1) Journal articles and dissertations: low validity

(2) Masters theses: high validity

1 0.046778 0.015286 0.12364 0.01533 0.12379
2 0.093659 0.089056 0.29842 0.10036 0.31679

1 -0.05479 0.03683 0.12783
2 0.11919 0.29436 0.45176
(1) Journal articles and dissertations: low validity

(2) Masters theses: high validity
The possible variability among the two groups:

1 29.4230 6.47957 0.010912 22.9434 .000745726
Variation Indicates that the There is still 22.9434 Q-within is significant
among three groups are variability that because Pw<.001.
reliabilities of all homogeneous cannot be explained Indicates that the
11 studies (i.e., not by Manuscript variability among the
significant, based Type effect sizes can
on Q between) contribute to the Q-
within variabilities.
111
APPENDIX AA

ASSESSMENT ANALYSIS DATA VS PATIENT GROUP
Concurrent Validity: Confidence Intervals for Patient Group

z -1 1.5
*-----------------------------------------------------------------*
-0.00 1.96 0.98 | [------------------------- * >
-0.22 0.22 0.00 | [----*----] |
-0.21 0.23 0.01 | [----*------] |
-0.10 0.34 0.12 | [-|--*-----] |
-0.38 0.06 -0.16 | [-----*---|] |
0.09 0.53 0.31 | | [----*-----] |
-0.36 0.16 -0.10 | [------*-|---] |
0.38 1.01 0.69 | | [-------*-------] |
-0.21 0.43 0.11 | [----|--*--------] |
-0.34 0.39 0.03 | [--------|*--------] |
-0.99 0.76 -0.12 |[----------------------*-|------------------] |
*-----------------------------------------------------------------*
Concurrent Validity: Confidence Intervals for Patient Group estimated (real) validity
LL_ UL_ T_ min max
T_ T_ DOT -0.2 0.8
DOT DOT *---------------------------------------------------------------------*
0.01 0.17 0.09 | [------*--] |
*---------------------------------------------------------------------*
112
APPENDIX BB


1 11 29.4230 .00106392 .008025498 0.090049 0.17207 .001751313 0.041849
studies SIZE Error
(represents
variability
among the
validities)

1 0.037417 0.19344 0.038640 0.19657
Correlation

1 .008025325 0.089806 0.17039
113
APPENDIX CC

Obs Pantgrp K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT

1 1 7 14.3133 0.026325 -0.046300 0.04499 0.13627 .002169197
2 3 4 10.2503 0.016554 0.092022 0.27890 0.46578 .009090909
1=major mental illness: fairly low for group 1

3=disorders of childhood: validity fairly high

1 0.046575 0.021039 0.14505 0.021621 0.14704
2 0.095346 0.087883 0.29645 0.095559 0.30913
Correlation:

1 -0.046267 0.04496 0.13544
2 0.091763 0.27189 0.43478
The possible variability among the groups:

1 29.4230 4.85927 0.027498 24.5637 .003493698

1 29.4230 4.85927 0.027498 24.5637 .003493698
reliabilities of homogeneous (i.e., that cannot be Indicates that the
all 11 studies not significant, explained by variability among the
based on Q Patient Grp effect sizes does not
between) contribute significantly to
114
APPENDIX DD

ASSESSMENT ANALYSIS DATA VS ASSESSMENT TYPE
Concurrent Validity: Confidence Intervals for Assessment Type

z -1 1.5
*-----------------------------------------------------------------*
-0.00 1.96 0.98 | [------------------------- * >
-0.22 0.22 0.00 | [----*----] |
-0.21 0.23 0.01 | [----*------] |
-0.10 0.34 0.12 | [-|--*-----] |
0.09 0.53 0.31 | | [----*-----] |
0.38 1.01 0.69 | | [-------*-------] |
-0.21 0.43 0.11 | [----|--*--------] |
-0.34 0.39 0.03 | [--------|*--------] |
-0.38 0.06 -0.16 | [-----*---|] |
-0.36 0.16 -0.10 | [------*-|---] |
-0.99 0.76 -0.12 |[----------------------*-|------------------] |
*-----------------------------------------------------------------*
Concurrent Validity: Confidence Intervals for Assessment Type estimated (real) validity
LL_ UL_ T_ min max
T_ T_ DOT -0.2 0.8
DOT DOT * ---------------------------------------------------------------------*
0.01 0.17 0.09 | [------*--] |
*---------------------------------------------------------------------*
115
APPENDIX EE


1 11 29.4230 .00106392 .008025498 0.090049 0.17207 .001751313 0.041849
studies SIZE Error
(represents
variability
among the
validities)

1 0.037417 0.19344 0.038640 0.19657
Correlation:

1 .008025325 0.089806 0.17039
116
APPENDIX FF

Obs Test K Q P LL_T_DOT T_DOT UL_T_DOT V_T_DOT SE_T_DOT

1 1 5 7.91905 0.09459 0.01182 0.12071 0.22960 .003086420 0.055556
2 2 3 9.43131 0.00895 0.10645 0.29773 0.48901 .009523810 0.097590
3 3 3 0.14477 0.93017 -0.29795 -0.13347 0.03101 .007042254 0.083918
Test: Recoding of tests into 3 3=Attachment measures 4,5,6

categories: (RQ, ATM) 4=other drawing tests (DAP, FD)
2=Depression measures (original categories: 4,12) (original categories: 9,10)
(DSRS, CES-D, CDI-C, CDI-P) 5=CAT-R Questionnaire (Childrens Attitudes About Talking-
(original categories: 3,6,7,8) Revised) (category 5)
6=Millon Clinical Multiaxial Inventory-II (MCMI-II)
(original category: 11)

1 0.01512 0.12296 0.01600 0.12649
2 0.10616 0.32582 0.10695 0.32703
3 0.00000 0.00000 0.00000 0.00000
Correlation:

1 0.01182 0.12012 0.22564
2 0.10605 0.28923 0.45343
3 -0.28944 -0.13268 0.03100

1 29.4230 11.9278 .002569834 17.4951 0.025347
reliabilities of homogeneous (i.e., that cannot be Indicates that the
all 11 studies not significant, explained by variability among the
based on Q Test Type effect sizes does not
between) contribute significantly to
117
APPENDIX GG

ASSESSMENT ANALYSIS DATA VS YEAR
Concurrent Validity: Confidence Intervals for Year

z -1 1.5
*-----------------------------------------------------------------*
-0.00 1.96 0.98 | [------------------------- * >
-0.22 0.22 0.00 | [----*----] |
-0.21 0.23 0.01 | [----*------] |
-0.10 0.34 0.12 | [-|--*-----] |
-0.38 0.06 -0.16 | [-----*---|] |
0.09 0.53 0.31 | | [----*-----] |
0.38 1.01 0.69 | | [-------*-------] |
-0.21 0.43 0.11 | [----|--*--------] |
-0.36 0.16 -0.10 | [------*-|---] |
-0.34 0.39 0.03 | [--------|*--------] |
-0.99 0.76 -0.12 |[----------------------*-|------------------] |
*-----------------------------------------------------------------*
Concurrent Validity: Confidence Intervals for Yearestimated (real) validity

LL_ UL_ T_ min max
T_ T_ DOT -0.2 0.8
DOT DOT *---------------------------------------------------------------------*
0.01 0.17 0.09 | [------*--] |
*---------------------------------------------------------------------*
118
APPENDIX HH


1 11 29.4230 .00106392 .008025498 0.090049 0.17207 .001751313 0.041849
studies SIZE Error
(represents
variability
among the
validities)

1 0.037417 0.19344 0.038640 0.19657
Correlation:

1 .008025325 0.089806 0.17039
119
APPENDIX II

Obs YR K Q P LL_T_DOT T_DOT UL_T_DOT

1 1 5 7.05288 0.13312 -0.10446 0.00443 0.11332
2 2 3 6.78312 0.03366 0.19755 0.35448 0.51140
3 3 3 0.29826 0.86146 -0.26389 -0.05843 0.14704

1 0.003086 0.05556 0.011778 0.10853 0.012464 0.11164
2 0.006410 0.08006 0.045992 0.21446 0.049586 0.22268
3 0.010989 0.10483 0.000000 0.00000 0.000000 0.00000
Correlation:

1 -0.10408 0.00443 0.11284
The studies conducted before 1990
have very low validity
2 0.19502 0.34034 0.47104
The studies conducted between 1990-
2000 are most valid
3 -0.25793 -0.05836 0.14599
The studies conducted after 2000 are
less valid

1 29.4230 15.2887 .000478742 14.1343 0.078332
among three groups are 14.1343 because Pw>.001.
reliabilities of all heterogeneous variability that Indicates that the
11 studies (i.e., significant, cannot be variability among the
based on Q explained by effect sizes does not
between) Year contribute significantly to
120
APPENDIX JJ
STUDY WEAKNESSES OF INTEREST
Author-identified study flaws Additional Study

Flaws Identified by
Coder
Subject-Related Flaws
Lack random selection Francis 2003, Gussak 2004, Overbeck 2002, Billingsley 1998, Ricca
Shlagman 1996 1992
Small N Batza 1995, Billingsley 1998, Brudenell 1989, Manning 1987, Wilson
Cohen 1995, Couch 1994, Easterling 2000, 2004
Francis 2003, Gantt 1990, Gulbro 1988,
Hacking 1996, Kessler 1994, Overbeck 2002,
Ricca 1992, Yahnke 2000
Ss not matched Batza 1995, Brudenell 1989, Coffey 1997, Brudenell 1989, Hacking
Cohen 1995, Couch 1994, Kress 1992, 1996
Mchugh 1997, Mills 1989, Neale 1994
Clinical symptomatology too weak or varied to detect Brudenell 1989, Gulbro 1988, Hacking 1996,
differences within subjects and/or differentiate between Johnson 2004, Mills 1993, Neale 1994
groups (i.e., confounding variable), thereby preventing
generalization
Data Collection Flaws
Researcher failed to adhere to standardized methods of Couch 1994, Easterling 2000 Brudenell 1989, Gulbro
assessment administration 1988
Study limited in scope and/or methodology Brudenell 1989, Cohen 1995, Easterling
2000, Overbeck 2002
Assessment instrument-related flaws
Assessment unable to detect specific diagnoses Ricca 1992
Problems encountered in placing assessment drawings Fowler 2002
within the graphic profile
Rating instrument flaws
Limited rating instrument Billingsley 1998, Couch 1994, Francis 2003, Hays 1981
Gantt 1990, Gulbro 1988, Gulbro 1991,
Johnson 2004, Kress 1992, Mills 1989, Neale
1994, Ricca 1992
Author's changes to rating system need improvement Mchugh 1997
Inter-rater reliability flaws
Only one rater used Kress 1992 Cohen 1988, Gussak
2004, Kessler 1994
Lacking inter-rater reliability Batza 1995, Brudenell 1989 Coffey 1997, Hays 1981
Unsatisfactory inter-rater reliability Fowler 2002, Neale 1994
Incorrect statistical procedures used to calculate inter- Eitel 2004
rater reliability
Inconsistencies/problems with rating procedures Cohen 1995, Couch 1994, Kress 1992,
Shlagman 1996
121
(7)
(39)
-related
identified
Other (5)
weaknesses
Assessment
Data collection
Number of
Subject-related
not reported (0)
No weaknesses
Weakness
weaknesses (9)
weaknesses (2)
Weaknesses
weaknesses (11)
weaknesses (18)
Rating instrument
Inter-rater reliability
Procedures/findings
Category and Total
Batza 1995
Bergland 1993
Billingsley 1998
Brudenell 1989
Coffey 1997
Cohen 1988
Cohen 1995
Couch 1994
Easterling 2000
Eitel 2004
Fowler 2002
Francis 2003
Gantt 1990
Gulbro 1988
Gulbro 1991
122
Gussak 2004
Hacking 1996
Hacking 2000
APPENDIX KK
Hays 1981
Hyler 2002
Johnson 2004
Kaiser 1993
Kessler 1994
Kress 1992
CODER-IDENTIFIED STUDY WEAKNESSES
Manning 1987
Mchugh 1997
Mills 1989
Mills 1993
Neale 1994
Overbeck 2002
Ricca 1992
Shlagman 1996
Wadlington 1973
Wilson 2004
Yahnke 2000
APPENDIX LL
123
APPENDIX MM
124
APPENDIX NN
125
REFERENCES
Acton, B. (1996). A new look at human figure drawings: Results of a meta-analysis and drawing
scale development. Unpublished Dissertation, Simon Fraser University, British
Columbia, Canada.
Aiken, L. R. (1997). Psychological testing and assessment. (9th ed.). Boston, MA: Allyn and
Bacon.
American Art Therapy Association (2004a). About art therapy. Retrieved 1.31.2004.
http://www.arttherapy.org/aboutarttherapy/about.htm.
American Art Therapy Association. (2004b). Education standards. Retrieved 9.10.2004.

http://www.arttherapy.org/aboutaata/educationstandards.pdf.
American Educational Research Association, American Psychological Association, & National

Council on Measurement in Education (1999). Standards for Educational and
Psychological Testing. Washington, DC: American Educational Research Association.
Anderson, F. (2001). Needed: A major collaborative effort. Art therapy: Journal of the American
Art Therapy Association, 18(2), 74-78.
Animated Software Company, Internet Glossary of Statistical Terms. (n.d.[a]). Chi square.
Retrieved February 3, 2004, from http://www.animatedsoftware.com/statglos/
sgchi_sq.htm.
Animated Software Company, Internet Glossary of Statistical Terms. (n.d.[b]). t test.

sgttest.htm.
Animated Software Company, Internet Glossary of Statistical Terms. (n.d.[c]). t test.

sgzscore.htm.
APCA (n.d.). convergent validity. Retrieved February 28, 2005, from

http://www.acpa.nche.edu/comms/comm09/dragon/dragon-recs.html.
126
Arrington, D. (1992). Art-based assessment procedures and instruments used in research. In H.
Wadeson (Ed.), A Guide to Conducting Art Therapy Research, 141-159. Mundelein, IL:
The American Art Therapy Association.
Atkinson, L., Quarrington, B., Alp, I. E., & Cyr, J. J. (1986). Rorschach validity: An empirical
approach to the literature. Journal of Clinical Psychology, 42(2), 360-362.
Arthur, W., Bennett, W., & Huffcutt, A. I. (2001). Conducting weighted analysis using SAS.
Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Batza Morris, M. (1995). The Diagnostic Drawing Series and the Tree Rating Scale: An
isomorphic representation of multiple personality disorder, major depression, and
schizophrenia populations. Art Therapy: Journal of the American Art Therapy
Association, 12(2), 118-128.
Becker, B. J. (1994). Combining significance levels. In H. Cooper & L. V. Hedges (Eds.), The
handbook of research synthesis (pp. 215-230). New York: Russell Sage Foundation.
Becker, L. A. (1998) Effect size. Retrieved February 3, 2004, from Colorado University,
Colorado Springs Web site: http://www.uccs.edu/~lbecker/psy590/es.htm.
Begg, C. B. (1994). Publication bias. In H. Cooper & L. V. Hedges (Eds.), The handbook of
research synthesis (pp. 399-410). New York: Russell Sage Foundation.
Bergland, C., & Moore Gonzalez, R. (1993). Art & madness: can the interface be quantified?
American Journal of Art Therapy, 31, 81-90.
Betts, D. J. (2003). Developing a projective drawing test: Experiences with the Face Stimulus
Assessment (FSA). Art Therapy: Journal of the American Art Therapy Association,
20(2), 77-82.
Billingsley, G. (1998). The efficacy of the Diagnostic Drawing Series with substance-related
disordered clients. Unpublished doctoral dissertation, Walden University.
Bowyer, K. A. (1995). Research using the Diagnostic Drawing Series and the PPAT with an
elderly population. Unpublished masters thesis, George Washington University,
Washington, DC.
Brooke, S. (1996). A therapists guide to art therapy assessments: Tools of the trade.
Springfield, IL: Charles C Thomas.
Brudenell, T. J. (1989). Art representations as functions of depressive state: Longitudinal studies

in chronic childhood and adolescent depression. Unpublished master's thesis, College of
Buck, J. N. (1948). The H-T-P technique, a qualitative and quantitative scoring manual. Journal
127
of Clinical Psychology Monograph Supplement, 4, 1-120.
Burleigh, L. R., & Beutler, L. E. (1997). A critical analysis of two creative arts therapies. The
Arts in Psychotherapy, 23(5), 375-381.
Buros Institute, Test Reviews Online. (n.d.[a]). Retrieved May 24, 2004, from
http://buros.unl.edu/buros/jsp/category.html.
Buros Institute, Test Reviews Online. (n.d.[b]). Category list of test titles: Personality. Retrieved
May 24, 2004, from http://buros.unl.edu/buros/jsp/clists.jsp?cateid=12&catename
=Personality.
Burt, H. (1996). Beyond practice: A postmodern feminist perspective on art therapy research. Art
Therapy: Journal of the American Art Therapy Association,13(1), 12-19.
Bushman, B. J. (1994). Vote-counting procedures in weighted analysis. In H. Cooper & L. V.

Hedges (Eds.), The handbook of research synthesis (pp. 193-214). New York: Russell
Sage Foundation.
Case, C. (1998). Brief encounters: Thinking about images in assessment. INSCAPE, 3(1), 26-33.
Chapman, L. J., & Chapman, J. P. (1967). Genesis of popular but erroneous psychodiagnostic
observations. Journal of Abnormal Psychology. 72(3), 193-204.
Coffey, T. M. (1997). The use of the Diagnostic Drawing Series with an adolescent population.
Unpublished paper.
Cohen, B. M. (Ed.). (1985). The Diagnostic Drawing Series Handbook. Unpublished handbook.
Cohen, B. M. (Ed.). (1986/1994). The Diagnostic Drawing Series Rating Guide. Unpublished
guidebook.
Cohen, B. M., Hammer, J., & Singer, S. (1988). The Diagnostic Drawing Series (DDS): A
systematic approach to art therapy evaluation and research. Arts in Psychotherapy, 15(1),
11-21.
Cohen, B. M., & Heijtmajer, O. (1995). Identification of dissociative disorders: Comparing the
SCID-D and dissociative experie nces scale with the Diagnostic Drawing Series.
Unpublished paper.
Cohen, B., & Mills, A. (2000). Report on the Diagnostic Drawing Series. Unpublished paper.
Alexandria, VA: The DDS Project.
Cohen, B. M., Mills, A., & Kijak, A. K. (1994). An introduction to the Diagnostic Drawing
Series: A standardized tool for diagnostic and clinical use. Art Therapy: Journal of the
American Art Therapy Association,11(2), 105-110.
128
Cohen, F. W., & Phelps, R. E. (1985). Incest markers in children's artwork. Arts in
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Colaizzi, P. F. (1978). Psychological research as the phenomenologist views it. In R. S. Vale &
M. King (Eds.). Existential-phenomenological alternatives for psychology (pp. 48-71).
New York: Oxford University Press.
Conard, F. (1992). The arts in education and a meta-analysis. Unpublished Dissertation, Purdue
University.
Cooper, H. (1998). Synthesizing research: A guide for literature reviewers (3rd ed.). Thousand
Oaks, CA: Sage Publications.
Cooper, H., & Ribble, R. G. (1989). Influences on the outcome of literature searches for
integrative research reviews. Knowledge: Creation, Diffusion, Utilization, 10, 179-201.
Couch, J. B. (1994). Diagnostic Drawing Series: Research with older people diagnosed with
organic mental syndromes and disorders. Art Therapy: Journal of the American Art
Therapy Association, 11(2), 111-115.
Cox, C. T. (Moderator), Agell, G., Cohen, B., & Gantt, L. (1998, November). Are you assessing
what Im assessing? Lets take a look! Panel presented at the meeting of the American
Art Therapy Association, Portland, OR.
Cox, C. T. (Moderator), Agell, G., Cohen, B., & Gantt, L. (1999, November). Are you assessing
what Im assessing? Lets take a look! Round two. Panel presented at the meeting of the
American Art Therapy Association, Orlando, FL.
Cuadra, C. A., & Katter, R. V. (1967). Opening the black box of relevance. Journal of
Documentation, 23, 291-303.
Davidson, D. (1977). The effects of individual differences of cognitive style on judgments of

document relevance. Journal of the American Society for Information Science, 8, 273-
284.
Dawes, R. M. (1999). Two methods for studying the incremental validity of a Rorschach
variable. Psychological Assessment, 11(3), 297-302.
Dawson, C. F. S. (1984). A study of selected style and content variables in the drawings of
depressed and nondepressed adults. Unpublished dissertation, University of North
Dakota, Grand Forks, ND.
Deaver, S. P. (2002). What constitutes art therapy research? Art therapy: Journal of the
129
Easterling, C. E. (2000). Art therapy with elementary school students with symptoms of
depression: The effects of storytelling, fantasy, daydreams, and art reproductions.
Unpublished master's thesis, Florida State University, Tallahassee, FL.
Eitel, K., Szkura, L., & Wietersheim, J. v. (2004). Do you see what I see? A study about the
interrater-reliability in art therapy. Unpublished paper.
Elkins, D. E., Stovall, K., Malchiodi, C. A. (2003). American Art Therapy Association, Inc.:
2001-2002 membership survey report. Art therapy: Journal of the American Art Therapy
Association, 20(1), 28-34.
Evidence Based Emergency Medicine, New York Academy of Medicine. (n.d.) Odds ratio.
Retrieved February 3, 2004, from http://www.ebem.org/definitions.html#sectO.
Feder, B., & Feder, E. (1998). The art and science of evaluation in the arts therapies: How do
you know whats working? Springfield, IL: Charles C Thomas.
Fenner, P. (1996). Heuristic research study: Self- therapy using the brief image- making
experience. Arts in Psychotherapy, 23(1), 37-51.
Fowler, J. P., & Ardon, A. M. (2002). Diagnostic Drawing Series and dissociative disorders: A
Dutch study. Arts in Psychotherapy, 29(4), 221-230.
Fraenkel, J. R., & Wallen, N. E. (2003). How to design and evaluate research in education (5th
ed.). New York, NY: McGraw-Hill.
Francis, D., Kaiser, D, & Deaver, S. P. (2003). Representations of attachment security in the
bird's nest drawings of clients with substance abuse disorders. Art Therapy: Journal of
the American Art Therapy Association, 20(3), 125-137.
Furth, G. (1988). The secret word of drawings: Healing through art. Boston, MA: Sigo Press.
Gantt, L. (1986). Systematic investigation of art works: Some research models drawn from
neighboring fields. American Journal of Art Therapy, 24(4), 111-118.
Gantt, L. (1990). A validity study of the Formal Elements Art Therapy Scale (FEATS) for
diagnostic information in patients drawings. Unpublished doctoral dissertation,
University of Pittsburgh, Pittsburgh, PA.
Gantt, L. (1992). A description and history of art therapy assessment research. In H. Wadeson
(Ed.), A Guide to Conducting Art Therapy Research, 119-139. Mundelein, IL: The
American Art Therapy Association.
Gantt, L. (2004). The case for formal art therapy assessments. Art Therapy: Journal of the
130
Gantt, L., & Tabone, C. (1998). The Formal Elements Art Therapy Scale: The Rating Manual.
Morgantown, WV: Gargoyle Press.
Gantt, L., & Tabone, C. (2001, November). Measuring clinical changes using art. Paper
presented at the meeting of the American Art Therapy Association, Albuquerque, NM.
Garb, H. N. (2000). Projective techniques and the detection of child sexual abuse. Child
Maltreatment: Journal of the American Professional Society on the Abuse of Children,
5(2), 161-168.
Garb, H. N., Florio, C. M., & Grove, W. M. (1998). The validity of the Rorschach and the
Minnesota Multiphasic Personality Inventory: Results from meta-analyses. Psychological
Science, 9(5), 402-404.
Garb, H. N., Wood, J. M., Nezworski, M. T., Grove, W. M., & Stejskal, W. J. (2001). Toward a
resolution of the Rorschach controversy. Psychological Assessment, 13(4), 423-448.
Glass, G., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills,
CA: Sage Publications.
Golomb, C. (1992). The Childs Creation of a Pictorial World. Berkeley: University of

California Press.
Google (n.d.). Criterion validity. Retrieved March 6, 2005, from

http://www.google.com/search?hl=en&lr=&ie=UTF8&oi=defmore&q=define:Criterion+
Validity.
Groth-Marnat, G. (1990). Handbook of psychological assessment (2nd ed.). New York, NY:
Wiley.
Gulbro Leavitt, C. (1988). A validity study of the Diagnostic Drawing Series as used for
assessing depression in children and adolescents. Unpublished doctoral dissertation,
California School of Professional Psychology, Los Angeles, CA.
Gulbro-Leavitt, C., & Schimmel, B. (1991). Assessing depression in children and adolescents
using the Diagnostic Drawing Series modified for children (DDS-C). Arts in
Gulliver, P. (circa 1970s). Art therapy in an assessment center. INSCAPE, 12, pp. unknown.
Gussak, D. (2004). Art therapy with prison inmates: A pilot study. Arts in Psychotherapy, 31(4),
245-259.
Hacking, S. (1999). The psychopathology of everyday art: A quantitative study. Dissertation,
131
University of Keele, Sheffield, UK. Published online at
http://www.musictherapyworld.de/modules/archive/stuff/papers/Hacking.pdf
Hacking, S. (2001). Psychopathology in paintings: A meta-analysis of studies using paintings by

psychiatric patients. British Journal of Medical Psychology, 74(Pt1), 35-45.
Hacking, S., Foreman, D., & Belcher, J. (1996). The Descriptive Assessment for Psychiatric Art:
A new way of quantifying paintings by psychiatric patients. Journal of Nervous and
Hacking, S., & Foreman, D. (2000). The Descriptive Assessment for Psychiatric Art (DAPA):
Update and further research. Journal of Nervous and Mental Disease, 188(8), 525-529.
Hadley, R. G., & Mitchell, L. K. (1995). Counseling research and program evaluation. Pacific
Grove, CA: Brooks/Cole Publishing Company.
Hagood, M. M. (1990). Art therapy research in England: Impressions of an American art

therapist. The Arts in Psychotherapy, 17, 75-79.
Hagood, M. M. (1992). Diagnosis or dilemma: Drawings of sexually abused children. British

Journal of Projective Psychology. 37(1), 22-33.
Hagood, M. M. (2002). A correlational study of art-based measures of cognitive development:

Clinical and research implications for art the rapists working with children. Art therapy:
Journal of the American Art Therapy Association, 19(2), 63-68.
Hagood, M. M. (2004). Commentary. Art Therapy: Journal of the American Art Therapy
Association, 21(1), 3.
Hall, J. A., Tickle-Degnen, L. T., Rosenthal, R., & Mosteller, F. (1994). Hypotheses and
problems in research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of
research synthesis (pp. 17-28). New York: Russell Sage Foundation.
Hammer, E. F. (1958). The clinical application of projective drawings. Springfield, IL: Charles
C Thomas.
Hardiman, G. W., Liu, F. J., & Zernich, T. (1992). Assessing knowledge in the visual arts. In G.
Cupchik & J. Laszlo (Eds), Emerging visions of the aesthetic process: Psychology,
semiology, and philosophy (pp. 171-182). New York, NY: Cambridge University Press.
Harris, D. B. (1963). Childrens drawings as measures of intellectual maturity. New York, NY:
Harcourt, Brace, & World.
Harris, J. B. (1996). Children's drawings as psychological assessment tools. Retrieved April 19,
2003, from http://www.iste.org/jrte/28/5/harris/article/introduction.cfm.
132
Harris, D. B., & Roberts, J. (1972). Intellectual maturity of children: Demographic and
socioeconomic factors. Vital & Health Statistics, Series 2 (pp. 1-74).
Hays, R. E., & Lyons, S. J. (1981). The bridge drawing: A projective technique for assessment in
art therapy. Arts in Psychotherapy, 8(3-sup-4), 207-217.
Hedges, L. V. (1994). Statistical considerations. In H. Cooper & L. V. Hedges (Eds.), The

handbook of research synthesis (pp. 29-38). New York: Russell Sage Foundation.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for weighted analysis. Orlando, FL:
Academic Press.
Hiller, J. B., Rosenthal, R., Bornstein, R. F., Berry, D. T. R., & Brunell-Neuleib, S. (1999). A
comparative meta-analysis of Rorschach and MMPI validity. Psychological Assessment,
11(3), 278-296.
Horovitz, E. G. (Moderator), Agell, G., Gantt, L., Jones, D., & Wadeson, H. (2001, November).
Upholding beliefs: Art therapy assessment, training, and practice. Panel presented at the
meeting of the American Art Therapy Association, Albuquerque, NM.
Hrdlicka, A. (1899). Art and literature in the mentally abnormal. American Journal of Insanity,
55, 385-404.
Hunsley, J. & Bailey, J. M. (1999). The clinical utility of the Rorschach: Unfulfilled promises
and an uncertain future. Psychological Assessment, 11(3), 266-277.
Hyler, C. (2002). Children's drawings as representations of attachment. Unpublished master's

thesis, Eastern Virginia Medical School, Norfolk, VA.
Johnson, K. M. (2004). The use of the Diagnostic Drawing Series in the diagnosis of bipolar
disorder. Unpublished Dissertation, Seattle Pacific University, Seattle, WA.
Julliard, K. N., & Van Den Heuvel, G. (1999). Susanne K. Langer and the foundations of art
therapy. Art Therapy: Journal of the American Art Therapy Association, 16(3), 112-120.
Kahill, S. (1984). Human figure drawing in adults: An update of the empirical evidence, 1967-
1982. Canadian Psychology, 25(4), 269-292.
Kaiser, D. (1993). Attachment organization as manifested in a drawing task. Unpublished

master's thesis, Eastern Virginia Medical School, Norfolk, VA.
Kaplan, F. F. (1991). Drawing assessment and artistic skill. Arts in Psychotherapy, 18, 347-352.
Kaplan, F. (2001). Areas of inquiry for art therapy research. Art therapy: Journal of the
133
Kaplan, F. (2003). Art-based assessments. In C. A. Malchiodi (Ed.), Handbook of Art Therapy
(pp. 25-35). New York, NY: The Guilford Press.
Kay, S. R. (1978). Qualitative differences in human figure drawings according to schizophrenic

subtype. Perceptual and Motor Skills, 47, 923-932.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). New York, NY: Holt,
Rinehart & Winston.
Kessler, K. (1994). A study of the Diagnositic Drawing Series with eating disordered patients.
Art Therapy: Journal of the American Art Therapy Association, 11(2), 116-118.
Kinget, G. M. (1958). The Drawing Completion Test. In E. F. Hammer (Ed.), The clinical
application of projective drawings (pp. 344-364). Springfield, IL: Charles C Thomas.
Kirk, A., & Kertesz, A. (1989). Hemispheric contributions to drawing. Neuropsychologia, 27(6),
881-886.
Klepsch, M., & Logie, L. (1982). Children draw and tell: An introduction to the projective uses
of children's human figure drawings. New York, NY: Brunner/Mazel.
Klopfer, W. G., & Taulbee, E. S. (1976). Projective tests. Annual Review of Psychology, 27(54),
3-567.
Knapp, N. M. (1994). Research with diagnostic drawings for normal and Alzheimer's subjects,
Art Therapy, 11(2), 131-138.
Kress, T., & Mills, A. (1992). Multiple personality disorder and the Diagnostic Drawing Series:
Further investigations. Unpublished paper.
Kwiatkowska, H. Y. (1975). Family art therapy: Experiments with a new technique. In E. Ulman
& P. Dachinger (Eds.), Art therapy in theory and practice (pp. 113-125). New York, NY:
Schocken Books.
Kwiatkowska, H. Y. (1978). Family therapy and evaluation through art. Springfield, IL: Charles
C Thomas.
Langevin, R., & Hutchins, L. M. (1973). An experimental investigation of judges ratings of

Schizophrenics and non-schizophrenics paintings. Journal of Personality Assessment,
37(6), 537-543.
Langevin, R., Raine, M., Day, D., & Waxer, K. (1975a). Art experience, intelligence and formal
features in psychotics' paintings. Arts in Psychotherapy (study 1), 2(2), 149-158.
Langevin, R., Raine, M., Day, D., Waxer, K. (1975b), Art experience, intelligence and formal
features in psychotics' paintings. Arts in Psychotherapy (study 2), 2(2), 149-158.
134
Lehmann, H., & Risquez, F. (1953). The use of finger paintings in the clinical evaluation of
psychotic conditions: A quantitative and qualitative approach. Journal of Mental Science,
99, 763-777.
Levick, M. F. (2001). The Levick Emotional and Cognitive Art Therapy Assessment. (LECATA).
Boca Raton. The South Florida Art Psychotherapy Institute.
Loewy, J. V. (1995). A hermeneutic panel study of music therapy assessment with an emotionally
disturbed boy. Unpublished Dissertation, New York University, New York.
Lombroso, C., (1891). The man of genius. London: Walter Scott.
Lowenfeld, V. (1939). The nature of creative activity. New York, NY: Harcourt, Brace.
Lowenfeld, V. (1947). Creative and mental growth. New York, NY: Macmillan.
Luzzatto, P. (1987). The internal world of drug-abusers: Projective pictures of self-object

relationships: A pilot study. British Journal of Projective Psychology, 32(2), 22-33.
MacGregor, J. M. (1989). The discovery of the art of the insane. Princeton, NJ: Princeton
University Press.
Machover, K. (1949). Personality projection in the drawing of the human figure. Oxford,
England: Charles C Thomas.
Maclagan, D. (1989). The aesthetic dimension of art therapy: Luxury or necessity. INSCAPE,
Spring, 10-13.
Malchiodi, C. (1994). Introduction to special section of art-based assessments. Art Therapy:

Journal of the American Art Therapy Association 11, 2, 104.
Manning, T. M. (1987). Aggression depicted in abused children's drawings. The Arts in

Psychotherapy, l4, l5-24.
McGlashan, T. H., Wadeson, H. S., Carpenter, W. T., & Levy, S. T. (1977). Art and recovery
style from psychosis. Journal of Nervous and Mental Disease, 164(3), 182-190.
McHugh, C. M. (1997). A comparative study of structural aspects of drawings between

individuals diagnosed with major depressive disorder and bipolar disorder in the manic
phase. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
McNiff, S. (1998). Art-based research. Philadelphia, PA: Jessica Kingsley.
Meyer, G. J., & Archer, R. P. (2001). The hard science of Rorschach research: What do we know
and where do we go? Psychological Assessment, 13(4), 486-502.
135
Miljkovitch, M., & Irvine, G. M. (1982). Comparison of drawing performances of
schizophrenics, other psychiatric patients and normal schoolchildren on a draw-a-village
task. Arts in Psychotherapy, 9, 203-216.
Mills, A. (1989). A statistical study of the formal aspects of the Diagnostic Drawing Series of
borderline personality disordered patients, and its context in contemporary art therapy.
Unpublished master's thesis, Concordia University, Montreal, PQ.
Mills, A., & Cohen, B. (1993). Facilitating the idenfitication of multiple personality disorder
through art: The Diagnostic Drawing Series. In E. S. Kluft (Ed.), Expressive and
functional therapies in the treatment of multiple personality disorder. Springfield, IL:
Charles C Thomas.
Mills, A., Cohen, B. M., & Meneses, J. Z. (1993a). Reliability and validity tests of the
Diagnostic Drawing Series. Arts in Psychotherapy, 20, 83-88.
Mills, A., Cohen, B. M., & Meneses, J. Z. (1993b). Reliability and validity tests of the
Diagnostic Drawing Series: DDS study 77 naive raters, unpublished report. Arts in
Psychotherapy, 20, 83-88.
Mills, A., & Goodwin, R. (1991). An informal survey of assessment use in child art therapy. Art
Therapy: Journal of the American Art Therapy Association, 8(2), 10-13.
Murray, H. A. (1943). Thematic apperception test. Cambridge, MA: Harvard University Press.
National Art Education Association (1994). Advisory. Computers and art education. Author.
Neale, E. L. (1994). The Children's Diagnostic Drawing Series. Art Therapy: Journal of the
Neale, E. L., & Rosal, M. L. (1993). What can art therapists learn from the research on
projective drawing techniques for children? A review of the literature. The Arts in
Psychotherapy, 20 (37-49).
Nunnally, J. C. (1960). The place of statistics in psychology. Educational and Psychological

Measurement, 20, 641-650.
Orr, P. P. (2003). A hollow God: Technology's effects on paradigms and practices in secondary
art education. Unpublished Dissertation, Purdue University, West Lafayette, IN.
Orwin, R. G. (1994). Evaluating coding decisions. In H. Cooper & L. V. Hedges (Eds.),

The handbook of research synthesis (pp. 139-162). New York: Russell Sage Foundation.
Oster, G. D., & Gould Crone, P. (2004). Using drawings in assessment and therapy: A guide for
mental health professionals. New York, NY: Brunner-Routledge.
136
Oswald, F. L. (2003). Meta-analysis and the art of the average. Validity generalization: A critical
review, 311-338.
Overbeck, L. B. (2002). A pilot study of pregnant women's drawings. Unpublished master's

thesis, Eastern Virginia Medical School, Norfolk, VA.
Parker, K. C. H., Hanson, R. K., & Hunsley, J. (1988). MMPI, Rorschach, and WAIS: A meta-
analytic comparison of reliability, stability, and validity. Psychological Bulletin, 103(3),
367-373.
Phillips, J. (1994). Commentary on the assessment portion of the art therapy practice analysis
survey. Art therapy: Journal of the American Art Therapy Association, 11(3), 151-152.
Pigott, T. D. (1994). Methods for handling missing data in research synthesis. In H. Cooper & L.
V. Hedges (Eds.), The handbook of research synthesis (pp. 163-175). New York: Russell
Sage Foundation.
Price, D. (1965). Networks of scientific papers. Science, 149, 510-515.
Quail, J. M., & Peavy, R. V. (1994). A phenomenologic research study of a client's experience in
art therapy. Arts in Psychotherapy, 21(1), 45-57.
Rankin, A. (1994). Tree drawings and trauma indicators: a comparison of past research with
current findings from the DDS. Art Therapy: Journal of the American Art Therapy
Association, 11(2), 127-130.
Ricca, D. (1992). Utilizing the Diagnostic Drawing Series as a tool in differentiating a diagnosis
between multiple personality disorder and schizophrenia. Unpublished master's thesis,
Hahnemann University, Philadelphia, PA.
Ritter, M., & Low, K. G. (1996). Effects of dance/movement therapy: A meta-analysis. Arts in
Roback, H. B. (1968). Human figure drawings: Their utility in the clinical psychologists
armamentarium for personality assessment. Psychological Bulletin, 70(1), 1-19.
Rosal, M. L. (1992). Illustrations of art therapy research. In Wadeson, H. (Ed.). A Guide to

Conducting Art Therapy Research (pp. 57-65). Mundelein, IL: The American Art
Therapy Association.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological
Bulletin, 86, 638-641.
Rosenthal, R. (1984). Meta analytic procedures for social research. Beverley Hills, CA: Sage.
Rosenthal, R. (1991). Weighted analytic procedures for social research (rev. edition). Newbury
137
Park, CA: Sage.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.),
The handbook of research synthesis (pp. 231-244). New York: Russell Sage Foundation.
Rosenthal, R., Hiller, J. B., Bornstein, R. F., Berry, D. T. R., & Brunell-Neuleib, S. (2001).
Meta-analytic methods, the Rorschach, and the MMPI. Psychological Assessment, 13(4),
449-451.
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on
other peoples published data: General procedures for research consumers. Psychological
Methods, 1, 331-340.
Rubin, J. A. (1987). Approaches to Art Therapy: Theory and Technique. New York, NY:
Brunner/Mazel.
Rubin, J. A. (1999). Art therapy: An introduction. Philadelphia, PA: Brunner/Mazel.
Russell- Lacy, S., Robinson, V., Benson, J., & Cranage, J. (1979). An experimental study of
pictures produced by acute schizophrenic subjects. British Journal of Psychiatry, 134,
195-200.
Shlagman, H. (1996). The Diagnostic Drawing Series: A comparison of psychiatric inpatient

adolescents in crisis with non-hospitalized youth. Unpublished master's thesis, College of
Scope, E. E. (1999). A meta-analysis of research on creativity: The effects of instructional

variables. Unpublished Dissertation.
Sharples, G. D. (1992). Intrinsic motivation and social constraints: A meta-analysis of

experimental research utilizing creative activities. Unpublished Dissertation.
Shoemaker-Beal, R. (1977). The significance of the first picture in art therapy. Paper presented
at the Dynamics of Creativity: The Eighth Annual Conference of the American Art
Therapy Association, Baltimore, MD.
Silver, R. (1966). The role of art in the conceptual thinking, adjustment, and aptitudes of deaf
and aphasic children. Unpublished Doctoral Dissertation, Columbia University, New
York.
Silver, R. A. (1983). Silver Drawing Test of Cognitive and Creative Skills. Seattle, WA: Special
Child Publications.
Silver, R. A. (1988, 1993). Draw A Story, Screening for Depression and Emotional Needs. New
York, NY: Ablin Press.
138
Silver, R. A. (1990). Silver Drawing Test of Cognitive Skills and Adjustment. Drawing What You
Predict, What You See, and What You Imagine. New York, NY: Albin Press.
Silver, R. A. (1996). Silver Drawing Test of Cognition and Emotion (3rd ed.). New York, NY:
Albin Press.
Silver, R. A. (2002). Three art assessments: The Silver Drawing Test of cognition and emotion;
draw a story: Screening for depression; and stimulus drawings and techniques. New
York, NY: Brunner-Routledge.
Silver, R. A. (2003). The Silver Drawing Test of Cognition and Emotion. In C. A. Malchiodi
(Ed.), Handbook of Art Therapy (pp. 410-419). New York, NY: The Guilford Press.
Silver, R., & Ellison, J. (1995a). Identifying and assessing self- images in drawings by delinquent
adolescents. Arts in Psychotherapy, 22(4), 339-352.
Silver, R., & Ellison, J. (1995b). Identifying and assessing self- images in drawings by delinquent
Adolescents: Part 2. Arts in Psychotherapy, 22(4), 339-352.
Simon, P. M. (1888). Les crits et les dessins des alins. Archivio di Antropologia Criminelle,
Psichiatria e Medicina Legale, 3, 318-355.
Sims, J., Bolton, B., & Dana, R. H. (1983). Dimensionality & concurrent validity of the Handler
DAP anxiety index. Multivariate Experimental Clinical Research, 6(2), 69-79.
Spangler, W. D. (1992). Validity of questionnaire and TAT measures of need for achievement:
Two meta-analyses. Psychological Bulletin, 112(1), 140-154.
Srivastava, A. K. (2002). Somatic Inkblot Series-I: A meta analysis. Journal of Projective

Psychology & Mental Health, 9(1), 33-37.
Stock, W. A. (1994). Sytematic coding for research synthesis. In H. Cooper & L. V. Hedges
(Eds.), The handbook of research synthesis (pp. 125-138). New York: Russell Sage
Foundation.
Stricker, G. & Gold, J. R. (1999). The Rorschach: Toward a nomothetically based,

idiographically applicable configurational model. Psychological Assessment, 11(3), 240-
250.
Suinn, R. M., & Oskamp, S. (1969). The predictive validity of projective measures: A fifteen-
year evaluative review of research. Springfield, IL: Charles C Thomas.
Swensen, C. H. (1968). Empirical evaluations of human figure drawings: 1957-1966.

Psychological Bulletin, 70(1), 20-44.
Tardieu, A. (1886). Etudes mdico-lgales sur la folie. Paris: JB Baillire.
139
Tate, R. (1998). An introduction to modeling outcomes in the behavioral and social sciences
(2nd ed.). Edina, MN: Burgess Publishing.
Taveggia, T. C. (1974). Resolving research controversy through empirical cumulation.

Sociological Methods and Research, 2, 395-407.
Teneycke, T. (1988). Eating disorders and affective disorders: Is there a connection? A study
involving the Diagnostic Drawing Series. Unpublished bachelors thesis, University of
Regina, Regina, Sask.
Uhlin, D. M. (1978). Assessment of violent prone personality through art. British Journal of
Projective Psychology & Personality Study, 23(1), 15-22.
Ulman, E. (1965). A new use of art in psychiatric diagnosis. Bulletin of Art Therapy, 4, 91-116.
Ulman, E., (1975). A new use of art in psychiatric diagnosis. In E. Ulman & P. Dachinger (Eds.),
Art therapy in theory and practice (pp. 361-386). New York: Schocken.
Ulman, E. (1992). A new use of art in psychiatric diagnosis. American Journal of Art Therapy,
30, 78-88.
Ulman, E., & Levy, B. I. (1975). An experimental approach to the judgment of psychopathology
from paintings. In E. Ulman & P. Dachinger (Eds.), Art therapy in theory and practice
(pp. 393-402). New York: Schocken.
Ulman, E., & Levy, B. I. (1992). An experimental approach to the judgment of

psychopathology from paintings. American Journal of Art Therapy, 30, 107-112.
Viglione, D. J. (1999). A review of recent research addressing the utility of the Rorschach.
Psychological Assessment, 11(3), 251-265.
Wadeson, H. (2002). The anti-assessment devils advocate. Art Therapy: Journal of the
American Art Therapy Association,19(4), pp. 168-170.
Wadeson, H. (2003). About this issue. Art Therapy: Journal of the American Art Therapy
Association,20(2), p. 63.
Wadeson, H. (2004). Commentary. Art Therapy: Journal of the American Art Therapy
Association, 21(1), 3-4.
Wadeson, H., & Carpenter, W. (1976). A comparative study of art expression of schizophrenic,
unipolar depressive, and bipolar manic-depressive patients. Journal of Nervous and
Wadlington, W. L., & McWhinnie, H. J. (1973). The development of a rating scale for the study
140
of formal aesthetic qualities in the paintings of mental patients. Arts in Psychotherapy,
1(3-4), 201-220.
Walsh, B., & Betz, N. (2001). Tests and assessment (4th Ed.). Upper Saddle River, NJ:
Prentice Hall.
West, M. M. (1998). Meta-analysis of studies assessing the efficacy of projective techniques in

discriminating child sexual abuse. Child Abuse & Neglect, 22(11), 1151-1166.
Wiersma, W. (2000). Research methods in education: An introduction (7th ed.). Boston, MA:
Allyn and Bacon.
Wilson, K. (2004). Projective drawing: Alternative assessment of emotion in children who

stutter. Unpublished bachelor's thesis, Florida State University, Tallahassee, FL.
Wolf, F. M. (1986). Meta analysis: Quantitative methods for research synthesis. Beverley Hills,
CA: Sage.
Wright, J. H., & Macintyre, M. P. (1982). The family drawing depression scale. Journal of
Clinical Psychology, 38(4), 853-861.
Yahnke, L. (2000). Diagnostic Drawing Series as an assessment for children who have
witnessed marital violence. Unpublished doctoral dissertation, Minnesota School of
Professional Psychology, Minneapolis, MN.
Yazdani, S. (2002a). Reliability. Retrieved June 2, 2004, from http://216.239.39.104/search?q=

cache:HAOfbC5GwqwJ:www.atgci.org/medical%2520education/reliability.ppt+inter+rat
er+reliability+definition&hl=en.
Yazdani, S. (2002b). Validity. Retrieved June 2, 2004, from http://216.239.51.104/search?q=

cache:TyjgaTVEl7oJ:www.atgci.org/medical%2520education/validity.ppt+validity+defin
ition+Yazdani&hl=en&ie=UTF-8.
141
BIOGRAPHICAL SKETCH
Donna J. Betts, ATR-BC, hails from Toronto, Canada. She received a Bachelor of Fine
Arts from the Nova Scotia College of Art & Design in 1992, a Master of Arts in Art Therapy
from the George Washington University in 1999, and a PhD in Art Education with a
specialization in Art Therapy from the Florida State University in 2005. From 2002-2005, she
worked as a teaching assistant while completing her doctoral studies. In 2004, Ms. Betts was the
proud recipient of the Daisy Parker Flory Graduate Scholar Award, bestowed upon her by the
Honor Society of Phi Kappa Phi at Florida State.
A registered and board-certified art therapist, Ms. Betts began working with people who
have eating disorders in Tallahassee, Florida, in 2003, and as an information assistant, writer and
graphic designer for Florida State University Communications in 2004. From 1998-2002 Ms.
Betts worked as an art therapist with children and adolescents with multiple disabilities in
Washington, DC.
The book Creative Arts Therapies Approaches in Adoption and Foster Care:
Contemporary Strategies for Working With Individuals and Families, was conceived and edited
by Ms. Betts and published by Charles C Thomas in 2003. In addition to this 16-chapter volume,
Ms. Betts has published articles and has presented at various conferences and schools.
Ms. Betts served on the Board of Directors of the American Art Therapy Association
(AATA) from 2002-2004, and founded the Research Committee and Governmental Affairs
Committee websites for the AATA (www.arttherapy.org) in 2001. She has served as the
Recording Secretary for the National Coalition of Creative Arts Therapies Associations
(NCCATA) since 1999, and as their web administrator since 2001 (www.nccata.org). In 2000,
Ms. Betts received the Chapter Distinguished Service Award from the Potomac Art Therapy
Association in Washington, DC. She promotes art therapy on her website, www.art-therapy.us.
142

A Systematic Analysis of Art Therapy Assessment

Uploaded by

Copyright:

Available Formats

A Systematic Analysis of Art Therapy Assessment

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Systematic Analysis of Art Therapy Assessment

Uploaded by

Copyright:

Available Formats

Florida State University Libraries

Electronic Theses, Treatises and Dissertations The Graduate School

A Systematic Analysis of Art Therapy

SCHOOL OF VISUAL ARTS AND DANCE

A SYSTEMATIC ANALYSIS OF ART THERAPY ASSESSMENT

AND RATING INSTRUMENT LITERATURE

A Dissertation submitted to the

Susan Carol Losh

Marcia L. Rosal, Chair, Art Education Department

Sally McRorie, Dean, School of Visual Arts and Dance

Deep appreciation is extended to my doctoral committee members, Dr. David Gussak,

Historical Foundations and Development of Art Therapy Assessment Instruments 15

Approach for Conducting the Systematic Analysis 43

BIOGRAPHICAL SKETCH 142

1. Studies Reporting Percent Agreement (Hacking 1999) 28

2. Patient Group Categories 53

5. Kappa Effect Size 57

6. Percentage Effect Size 57

7. Potential Mediating Variables (Inter-Rater Reliability) 59

8. Concurrent Validity Frequencies 59

9. Concurrent Validity Effect Sizes 60

10. Potential Mediating Variables (Concurrent Validity) 61

11. Author/Coder Favor Tally 67

12. Author/Coder Favor Frequencies 68

13. Sampling Flaws 68

14. Rating Systems Needing Revision 71

1. Citation Date Histogram 51

3. Art Therapy Assessment/Rating Instrument Type 52

4. Author-Identified Study Weaknesses 65

Table 1. Studies Reporting Percent Agreement (Hacking 1999)

Approach for Conducting the Systematic Analysis

Figure 1. Citation Date Histogram

Figure 2. Citation Type

Art Therapy Assessment Type Frequency Total

Figure 3. Art Therapy Assessment/Rating Instrument Type

Table 2. Patient Group Categories

Major Mental Illnesses Batza 1995, Brudenell 1989, Coffey 1997, 27

Table 3. Number of Raters

Number of Raters Citations Frequency Total

1 Coffey 1997, Cohen 1988, Gussak 2004, 6

Table 4. Inter-Rater Reliability

Citation Kappa Percentage Correlation*

Bergland 1993 0.91

* Intra-class correlation, Pearsons r, Rho, or Alpha

Table 5. Kappa Effect Size

Citation Kappa N (Study Kappa*N

GLOBAL EFFECT SIZE 0.63985714 0.625945

Table 6. Percentage Effect Size

Citation Percentage N (Study Percentage*N

Table 7. Potential Mediating Variables (Inter-Rater Reliability)

Q-Between Probability Q-Within Compared to Probability

* Significant at the <.001 level

Citations Frequency Total

Table 9. Concurrent Validity Effect Sizes

Table 10. Potential Mediating Variables (Concurrent Validity)

Q-Between Probability Q-Within Compared to Probability

* Significant at the <.001 level

It is the consensus of most mental health professionals, agency administrators, and

0= conclusion that the studys findings did not

Author Favor Frequency Total Coder Favor Frequency Total

Reviewer/Coder: _____________________________________ Coder ID: