Toward A Common Procedure Using Likert and Likert-Type Scales in Small Groups Comparative Design Observations
1. Introduction
Creating, introducing, and assessing new computer supports is a common research subject in the design research community [Blessing 1994], [Maher et al. 1998], [Hinds and Kiesler 2002], [Tang et al. 2011]. A widespread procedure is based on comparative methods: "the selection and analysis of cases that are similar in known ways and differ in other ways, with a view to formulating or testing hypotheses" [Jupp 2006, p.33]. Buisine et al. label this procedure paradigm evaluation [Buisine et al. 2012]: it consists of comparing independent groups (very often small groups [Bales 1950]) performing the same activity on a computer support and under a given control condition (e.g. pen-and-paper). Different factors are measured in paradigm evaluation (e.g. concept generation or creativity [Gidel et al. 2011], distal or co-located work [Tang et al. 2011], collaboration [Buisine et al. 2012]). However, a common investigation concerns users' opinions toward the perceived quality of new computer supports (e.g. perceived usability, perceived effectiveness, perceived efficiency) [Blessing 1994], [Guerra et al. 2015].
Frequently, users' opinions are measured through questionnaires based on Likert and Likert-type items, forming rating scales. The resulting data are analyzed in order to find any difference, whether positive or negative, between the two conditions. A common but incorrect practice is to categorize results as "good" or "bad" according to their statistical significance (i.e. p-value). The statistical analysis of these data, and therefore the way statistical significance is calculated, depends on the type of rating scale employed. However, confusion persists about the correct classification of rating scales: are they Likert scales, Likert-type scales, or Discrete Visual Analog Scales (DVAS)? This may seem a trivial problem, but the uncertainty hinders the comparison of existing research, causes a lack of scientific rigor, and lowers the impact of the results for practical use. Section 2 clarifies this terminological debate with a deductive approach: from Likert scales to general rating scales.
How to use Likert and Likert-type scales is still an open question [Jamieson 2004], [Norman 2010]: in particular, should they be statistically treated as ordinal or as interval scales? Section 3 presents the different points of view on the matter with a numerical example.
Section 4 proposes a pragmatic solution and goes further, delving into whether it is correct to use the p-value as a divide between "good" and "bad" results. Finally, section 5 proposes guidelines to be applied when using Likert and Likert-type scales in comparative design observations of small groups. This proposition, open to debate, aims to provide a pragmatic common procedure that increases the scientific rigor and comparability of design research.
[Figure 1. Hierarchical relation between rating scales, Discrete Visual Analog scales, Likert-type scales, and Likert scales]
[Figure 4. Example of a Visual Analog scale (left) and a Discrete Visual Analog scale (right)]
Table 1. Nonparametric analogues to parametric statistical methods for group comparison [Kuzon et al. 1996], [Blessing and Chakrabarti 2009]

| Type of problem | Type of data | Nonparametric methods | Parametric methods |
| Comparison of groups | One group (compared to a reference value) | Chi-squared, Kolmogorov-Smirnov | z-test, t-test |
| Comparison of groups | Two independent groups | Mann-Whitney, Chi-squared, Wilcoxon signed-rank, Wald-Wolfowitz, Median test, Kruskal-Wallis, Kolmogorov-Smirnov two-sample | t-test for independent groups, z-test, ANOVA, MANOVA |
| Comparison of groups | Two paired or related groups | Wilcoxon rank-sum test, sign test, McNemar's test | Paired t-test, z-test |
| Comparison of groups | Three or more groups | Friedman's two-way ANOVA by ranks, Cochran Q test, Kruskal-Wallis | ANOVA with replication, z-test |
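To make the choice concrete, here is a minimal sketch (Python with SciPy, assumed available; all ratings are hypothetical) applying one nonparametric and one parametric test from Table 1 to two independent small groups:

```python
# A minimal sketch contrasting a nonparametric and a parametric test from
# Table 1 on two independent groups. The 5-point ratings are hypothetical.
from scipy import stats

group_a = [4, 5, 3, 4, 4, 5, 2, 4]  # e.g. computer-support condition
group_b = [3, 3, 2, 4, 3, 2, 3, 3]  # e.g. pen-and-paper control condition

# Nonparametric: Mann-Whitney U test, suitable for ordinal data.
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.3f}")

# Parametric: independent-samples t-test, assuming interval data and normality.
t_stat, p_t = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_t:.3f}")
```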
The main difference between a non-parametric and a parametric approach lies in the assumption of a normal distribution of the sample population. Knapp [1990] pictures quite well the "conservative" and "liberal" positions on the analysis of Likert scales; moreover, several insightful discussions and articles on the matter are available online through a simple search.
Table 2 summarizes the statistical analysis as if Likert-type items were linear (i.e. the parametric approach). Table 3 summarizes the statistical analysis as if Likert-type items were ordinal (i.e. the non-parametric approach).
If a set of item values hypothetically followed a normal distribution, its median, mean, and mode would be equal. This is rarely the case, so which approach is preferable, given that, according to Norman [2010], even ordinal data with N > 5 response categories can be treated as interval data (i.e. without requiring assumptions of normality)?
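A minimal numerical sketch (Python, hypothetical 5-point item data) illustrates the point: for a skewed Likert-type item the three measures of central tendency diverge, which is exactly why the ordinal-versus-interval choice matters:

```python
# Hypothetical, skewed responses to a single 5-point Likert-type item.
import statistics

responses = [1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5]

print("mean  :", statistics.mean(responses))    # 3.08 - interval-level summary
print("median:", statistics.median(responses))  # 3.0  - ordinal-level summary
print("mode  :", statistics.mode(responses))    # 2    - nominal-level summary
```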
4. Conclusion
According to Adams et al. [1965, p.100]: "Nothing is wrong per se in applying any statistical operation to measurements of given scale, but what may be wrong, depending on what is said about the results of these applications, is that the statement about them will not be empirically meaningful or else that it is not scientifically significant." Knapp [1990] asks for a truce granting the liberty of choice according to the position of Adams et al. [1965]. Hence, if both approaches are acceptable, what can we do to improve the rigor of design research? One possibility is to perform both approaches and compare the results, but this is a very time-demanding solution.
A pragmatic suggestion is to follow the proposition of Boone, Jr. and Boone [2012]: use a parametric approach with Likert and Likert-type scales (i.e. every time you have to aggregate a consistent number of Likert, Likert-type, and VAS items) and a non-parametric approach with Likert and Likert-type items (and DVAS). This is similar to what is suggested by Carifio and Perla [2008], which in turn refines the proposal of Clason and Dormody [1994]: "The weight of the empirical evidence, therefore, clearly supports the view and position that Likert scales (collections of Likert items) produce interval data, particularly if the scale meets the standard psychometric rule-of-thumb criterion of comprising at least eight reasonably related items." [Carifio and Perla 2008, p.1150]. I think this is a good compromise.
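As an illustration of this compromise, the sketch below (Python with SciPy; the respondents, items, and scores are hypothetical) analyzes an aggregated Likert scale parametrically and a single Likert-type item nonparametrically:

```python
from scipy import stats

# Likert scale: each score is the mean of eight reasonably related 5-point
# items per respondent, following the rule of thumb quoted above; treated
# as interval data, hence a parametric test.
scale_a = [3.9, 4.1, 3.5, 4.4, 3.8, 4.0]  # condition A (hypothetical)
scale_b = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0]  # condition B (hypothetical)
t, p = stats.ttest_ind(scale_a, scale_b)
print(f"scale (parametric):   t = {t:.2f}, p = {p:.3f}")

# Single Likert-type item: raw responses, treated as ordinal data, hence a
# nonparametric test.
item_a = [4, 5, 3, 4, 4, 5]
item_b = [3, 3, 2, 4, 3, 3]
u, p = stats.mannwhitneyu(item_a, item_b, alternative="two-sided")
print(f"item (nonparametric): U = {u:.1f}, p = {p:.3f}")
```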
Furthermore, a real advancement toward scientific rigor would be to improve the awareness of design researchers about using statistical significance as the discriminant between a "good" and a "bad" result. This is a common but incorrect habit, "the ritual of null hypothesis significance testing" [Cohen 1994]. Johnson's "The Insignificance of Statistical Significance Testing" [Johnson 1999] explains well the problems related to an incorrect understanding of the use of statistical significance. Anscombe [1956] observes that statistical hypothesis tests are totally irrelevant, and that what is needed are estimates of the magnitudes of effects. "The primary product of a research inquiry is one or more measures of effect size, not P values." [Cohen 1990] via [Sullivan and Feinn 2012]. "Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a factor affect people, but how much does it affect them." [Kline 2004] via [Sullivan and Feinn 2012]. The effect size is the magnitude of the difference between groups, and it is the main finding of a quantitative study. It can be calculated both for parametric approaches (e.g. Cohen's d) and non-parametric approaches (e.g. Cliff's delta).
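Both measures follow standard textbook formulas and are easy to compute directly, as in this sketch (Python, hypothetical data):

```python
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def cliffs_delta(a, b):
    """Proportion of (a, b) pairs with a > b minus pairs with a < b."""
    gt = sum(x > y for x in a for y in b)
    lt = sum(x < y for x in a for y in b)
    return (gt - lt) / (len(a) * len(b))

group_a = [4, 5, 3, 4, 4, 5]
group_b = [3, 3, 2, 4, 3, 3]
print(f"Cohen's d     = {cohens_d(group_a, group_b):.2f}")      # parametric
print(f"Cliff's delta = {cliffs_delta(group_a, group_b):.2f}")  # nonparametric
```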
Related to effect size is the sample size. According to Strasak et al. [2007]: "It is crucial for high quality
statistical work, to consider a-priori effect and sample size estimation and to appropriately conduct a
statistical power calculation in the planning stage, to make sure that a study is provided with sufficient
statistical power to detect treatment effects under observation." According to Sullivan and Feinn [2012]:
"An estimate of the effect size is often needed before starting the research endeavor, in order to calculate
the number of subjects likely to be required to avoid a Type II, or β, error, which is the probability of
concluding there is no effect when one actually exists. In other words, you must determine what number […]"
| Step | Parametric approach | Non-parametric approach |
| Estimate effect size | Risk difference, risk ratio, odds ratio, Cohen's d, Glass's delta, Hedges' g, the probability of superiority (a good guide to choose is: http://www.psychometrica.de/effect_size.html) | Cliff's delta |
| Estimate statistical power | Equal to 1-β. Use Cohen's power tables. A common belief is that the minimum acceptable value is 0.8 (80%); with small groups you will rarely reach this level of statistical power, even for very strong effects (Cohen's d = 1.3) and high alpha (α = 0.05 or even α = 0.1) | Monte Carlo simulations |
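As a worked illustration of the a priori power calculation discussed above, the sketch below (Python with statsmodels, assumed available; the effect size and group size are hypothetical) computes both the sample size needed to reach the conventional 80% power and the power actually achieved by a typical small group:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed for a very strong effect (Cohen's d = 1.3)
# at alpha = 0.05 and the conventional 80% power target.
n = analysis.solve_power(effect_size=1.3, alpha=0.05, power=0.8)
print(f"required n per group: {n:.1f}")

# Power actually reached with two small groups of six, even for the same
# very strong effect.
power = analysis.solve_power(effect_size=1.3, alpha=0.05, nobs1=6)
print(f"power with n = 6 per group: {power:.2f}")
```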
References
Adams, E., Fagot, R. F., Robinson, R. E., "A theory of appropriate statistics", Psychometrika, Vol.30, No.2, 1965,
pp. 99-127.
Anscombe, F. J., "Discussion on Dr. David's and Dr. Johnson's Paper", J. R. Stat. Soc., Vol.18, 1956, pp. 24-27.
Bales, R. F., "Interaction Process Analysis: A Method for the Study of Small Groups", Addison-Wesley Oxford England, 1950.
Berka, K., "Measurement: Its Concepts, Theories and Problems", Springer Science & Business Media, 2012.
Biddix, P., "Uncomplicated Reviews of Educational Research Methods", Available at:
<https://researchrundowns.wordpress.com/quantitative-methods/effect-size/>, 2009, [Accessed 12.01.2015].
Blessing, L. T. M., Chakrabarti, A., "DRM: Design Research Methodology", Springer-Verlag London, 2009.
Blessing, L. T. M., "A process-based approach to computer-supported engineering design", Univ. of Twente, 1994.
Boone Jr, H. N., Boone, D. A., "Analyzing Likert Data", Journal of Extension, Vol.50, 2012.
Buisine, S., Besacier, G., Aoussat, A., Vernier, F., "How do interactive tabletop systems influence collaboration?",
Computers in Human Behavior, Vol.28, No.1, 2012, pp. 49–59.
Carifio, J., Perla, R., "Resolving the 50-Year Debate around Using and Misusing Likert Scales", Medical
Education, Vol.42, No.12, 2008, pp. 1150–1152.
Clason, D. L., Dormody, T. J., "Analyzing Data Measured By Individual Likert-Type Items", Journal of
Agricultural Education, Vol.35, 1994, pp. 31–35.
Cohen, J., "The Earth Is Round (p < .05)", American Psychologist, Vol.49, No.12, 1994, pp. 997–1003.
Cohen, L., Manion, L., Morrison, K., "Research Methods in Education", 5th ed., Routledge Falmer London, 2000.
Gidel, T., Kendira, A., Jones, A., Lenne, D., Barthès, J.-P., Moulin, C., "Conducting Preliminary Design around
an Interactive Tabletop", DS 68-2: Proceedings of ICED 11, Vol.2, 2011, pp. 366-376.
Guerra, A. L., Gidel, T., Vezzetti, E., "Digital intermediary objects: the (currently) unique advantage of computer-
supported design tools", DS 80-5: Proceedings of ICED 15, Vol.5 - Part 1, 2015, pp. 265-274.
Guilford, J. P., "Psychometric Methods", Tata McGraw Hill Publishing CO Ltd. New Delhi, 1954.
Hinds, P. J., Kiesler, S. B., "Distributed Work", MIT Press, 2002.
Jamieson, S., "Likert scales: how to (ab)use them", Medical Education, Vol.38, No.12, 2004, pp. 1217–1218.
Johnson, D. H., "The insignificance of statistical significance testing", The Journal of Wildlife Management, Vol.63, No.3, 1999, pp. 763-772.
Jupp, V., "The SAGE Dictionary of Social Research Methods", SAGE, 2006.