
The Factor Structure of the System Usability Scale

2009, Human Centered Design (M. Kurosu, ed., HCII 2009, LNCS 5619, pp. 94-103, Springer)


James R. Lewis (1) and Jeff Sauro (2)

(1) IBM Software Group, 8051 Congress Ave, Suite 2227, Boca Raton, FL 33487
(2) Oracle, 1 Technology Way, Denver, CO 80237
[email protected], [email protected]

Abstract. Since its introduction in 1986, the 10-item System Usability Scale (SUS) has been assumed to be unidimensional. Factor analysis of two independent SUS data sets reveals that the SUS actually has two factors: Usable (8 items) and Learnable (2 items, specifically Items 4 and 10). These new scales have reasonable reliability (coefficient alpha of .91 and .70, respectively). They correlate highly with the overall SUS (r = .985 and .784, respectively) and correlate significantly with one another (r = .664), but at a low enough level to use as separate scales. A sensitivity analysis using data from 19 tests had a significant Test by Scale interaction, providing additional evidence of the differential utility of the new scales. Practitioners can continue to use the current SUS as is but, at no extra cost, can also take advantage of these new scales to extract additional information from their SUS data. The data support the use of "awkward" rather than "cumbersome" in Item 8.

Keywords: System Usability Scale, SUS, factor analysis, psychometric evaluation, subjective usability measurement, usability, learnability, usable, learnable.

1 Introduction

In 1986, John Brooke, then working at DEC, developed the System Usability Scale (SUS) [1]. The standard SUS consists of the following ten items (odd-numbered items worded positively, even-numbered items worded negatively):

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.

To use the SUS, present the items to participants as 5-point scales numbered from 1 (anchored with "Strongly disagree") to 5 (anchored with "Strongly agree"). If a participant fails to respond to an item, assign it a 3 (the center of the rating scale). After completion, determine each item's score contribution, which will range from 0 to 4. For positively-worded items (1, 3, 5, 7, and 9), the score contribution is the scale position minus 1. For negatively-worded items (2, 4, 6, 8, and 10), it is 5 minus the scale position. To get the overall SUS score, multiply the sum of the item score contributions by 2.5. Thus, SUS scores range from 0 to 100 in 2.5-point increments.

The ten SUS items were selected from a pool of 50 potential items, based on the responses of 20 people who used the full set of items to rate two software systems, one relatively easy to use and the other relatively difficult. The items selected for the SUS were those that provided the strongest discrimination between the systems.
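The scoring rules above translate directly into code. The following is a minimal Python sketch (the function name and list-based input convention are ours, not part of the SUS definition):

    def sus_score(responses):
        """Compute the Overall SUS score (0-100) from ten 5-point ratings.

        responses: ten ratings in 1..5, in item order; None marks a
        skipped item, which the SUS instructions score as 3."""
        if len(responses) != 10:
            raise ValueError("The SUS has exactly 10 items")
        contributions = []
        for i, r in enumerate(responses, start=1):
            r = 3 if r is None else r      # unanswered item -> center of scale
            if i % 2 == 1:                 # odd-numbered, positively worded
                contributions.append(r - 1)
            else:                          # even-numbered, negatively worded
                contributions.append(5 - r)
        return 2.5 * sum(contributions)

For example, sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]) returns 100.0, and a response set of all 3s returns 50.0.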
In the original paper [1], Brooke reported strong correlations among the selected items (absolute values of r ranging from .7 to .9), but he did not report any measures of reliability or validity, referring to the SUS as a "quick and dirty" usability scale. For these reasons, he cautioned against assuming that the SUS was any more than a unidimensional measure of usability (p. 193): "SUS yields a single number representing a composite measure of the overall usability of the system being studied. Note that scores for individual items are not meaningful on their own." Given data from only 20 participants, this caution was appropriate.

1.1 Psychometric Qualification of the SUS

Despite being a self-described "quick and dirty" usability scale, the SUS has become a popular questionnaire for end-of-test subjective assessments of usability [2]. The SUS accounted for 43% of post-test questionnaire usage in a recent study of a collection of unpublished usability studies [3]. Research conducted on the SUS has shown that although it is fairly quick, it is probably not all that dirty. The typical minimum reliability goal for questionnaires used in research and evaluation is .70 [4, 5]. An early assessment of the reliability of the SUS, based on 77 cases, indicated a coefficient alpha of .85 (coefficient alpha is a measure of internal consistency often used to estimate the reliability of multi-item scales) [6, 7]. More recently, Bangor, Kortum, and Miller [8], in a study of 2324 cases, found the coefficient alpha of the SUS to be .91. Bangor et al. also provided some evidence of the validity of the SUS, both in the form of sensitivity (detecting significant differences among types of interfaces and as a function of changes made to a product) and concurrent validity (a significant correlation of .806 between the SUS and a single 7-point adjective rating of "user friendliness").

Although not directly measuring reliability, Tullis and Stetson [9] provided additional evidence of the reliability of the SUS. They conducted a study with 123 participants in which each participant used one of five standard usability questionnaires to rate the usability of two websites. With the entire sample, all five questionnaires indicated superior usability for the same website. Because no practical usability test would have such a large number of participants, Tullis and Stetson conducted a Monte Carlo simulation to see, as the sample size increased from 6 to 14, which of the questionnaires would converge most quickly to the "correct" conclusion regarding the difference between the websites' usability, where "correct" meant a significant t-test consistent with the decision reached using the total sample. They found that two of the questionnaires, the SUS and the CSUQ [10, 11], met this goal most quickly, making the correct decision over 90% of the time when n ≥ 12. This result is implicit evidence of reliability, and it also suggests that comparative within-subject summative usability studies using the SUS should have sample sizes of at least 12 participants.
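Tullis and Stetson did not publish their simulation code, but the resampling logic they describe is easy to sketch. The following Python fragment is our reconstruction under stated assumptions (a paired t-test on per-participant scores for the two websites; the function and parameter names are illustrative, not theirs):

    import numpy as np
    from scipy import stats

    def convergence_rate(scores_a, scores_b, n, n_resamples=1000,
                         alpha=0.05, seed=None):
        """Estimate how often a random subsample of size n reproduces the
        full-sample conclusion that website A outscores website B."""
        rng = np.random.default_rng(seed)
        scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
        hits = 0
        for _ in range(n_resamples):
            idx = rng.choice(len(scores_a), size=n, replace=False)
            t, p = stats.ttest_rel(scores_a[idx], scores_b[idx])
            if p < alpha and t > 0:  # significant, in the full-sample direction
                hits += 1
        return hits / n_resamples

In these terms, Tullis and Stetson's finding was that the convergence rate exceeded .90 for the SUS and the CSUQ once n reached about 12.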
1.2 The Assumption of SUS Unidimensionality

As previously mentioned, there has been a long-standing assumption that the SUS assesses the single construct of usability. In the most ambitious investigation of the psychometric properties of the SUS to date, Bangor et al. [8] conducted a factor analysis of their 2324 SUS questionnaires and concluded, on the basis of examining the eigenvalues and factor loadings for a one-factor solution, that there was only one significant factor, consistent with prevailing practitioner belief and practice. The problem with this conclusion is that Bangor et al. did not examine the possibility of a multifactor solution, in particular a two-factor solution. The mechanics of factor analysis virtually guarantee high loadings for all items on the first unrotated factor, so although their finding supports the use of an overall SUS measure, it does not exclude the possibility of additional structure. Examination of the scree plot (see their Figure 5) shows the expected very high value for the first eigenvalue, but also a fairly high value for the second eigenvalue, just under 1.0. A rule-of-thumb used by some practitioners and computer programs sets the appropriate number of factors to the number of eigenvalues greater than 1, but this rule has been discredited because the appropriate number of factors often exceeds the number of eigenvalues greater than 1 [12, 13].

1.3 Goals of the Current Study

The primary purpose of the current study was to conduct factor analyses to explore the factor structure of the SUS, using data published by Bangor et al. [8] and an independent set of data we collected as part of a larger data collection and analysis program [3] that included 324 complete SUS questionnaires. Secondary goals were to use the new data to assess the reliability and, to as great an extent as possible, the validity of the SUS.

2 Factor Analysis of the SUS

At the time of this study, we had collected 324 completed SUS questionnaires from 19 usability studies, an adequate number for investigating the factor structure of the SUS [5]. Fortunately, Bangor et al. [8] published the correlation matrix of the SUS items from their studies (see their Table 5). Because an item correlation matrix can serve as the input for a factor analysis, data were available for two independent sets of solutions: one using the Bangor et al. correlation matrix, and another using the 324 cases from Sauro and Lewis [3].

Having two independent data sources for a factor analysis of the SUS afforded a unique method for assessing the factor structure. It takes at least two items to form a scale, which makes it very unlikely that the 10-item SUS would have a structure with more than four factors. Table 1 shows side-by-side solutions for both sets of data for four, three, and two factors. Our strategy was to start with the four-factor solution (using common factor analysis with varimax rotation), then work our way down until we obtained similar item-to-factor loadings for both data sets. The failure of this approach would have been evidence in favor of the unidimensionality of the SUS. As Table 1 shows, however, the results converged for the two-factor solution. Indeed, given the differences in the distributions and the differences in the four- and three-factor solutions, the extent of convergence at the two-factor solution was striking, with the solutions accounting for 56-58% of the total variance. For both two-factor solutions, Items 1, 2, 3, 5, 6, 7, 8, and 9 aligned with the first factor, and Items 4 and 10 aligned with the second factor.
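For readers who want to replicate this kind of analysis, the following self-contained Python sketch extracts factors directly from an item correlation matrix (such as Bangor et al.'s Table 5) and applies varimax rotation. Note that it uses principal-factor extraction via eigendecomposition, a simplification of the common factor analysis used here, which also iterates on communality estimates; the function names are ours:

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
        """Varimax rotation of an (items x factors) loading matrix (Kaiser, 1958)."""
        p, k = loadings.shape
        R = np.eye(k)
        d = 0.0
        for _ in range(max_iter):
            L = loadings @ R
            u, s, vt = np.linalg.svd(
                loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0))))
            R = u @ vt
            d_old, d = d, s.sum()
            if d_old != 0 and d / d_old < 1 + tol:  # rotation has converged
                break
        return loadings @ R

    def factor_solution(corr, n_factors=2):
        """Extract n_factors from a correlation matrix, then rotate."""
        eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
        idx = np.argsort(eigvals)[::-1][:n_factors]  # keep the largest eigenvalues
        loadings = eigvecs[:, idx] * np.sqrt(eigvals[idx])
        return varimax(loadings)

Applied to a 10 x 10 SUS correlation matrix with n_factors=2, the rotated loadings can then be inspected for the Usable/Learnable pattern reported in Table 1.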
Given the 8 items in common between the Overall SUS and the first factor, we named the first new scale Usable. Based on the content of Items 4 and 10 ("I think I would need the support of a technical person to be able to use this system" and "I needed to learn a lot of things before I could get going with this system"), we named the second new scale Learnable. It was surprising that Item 7 ("I would imagine that most people would learn to use this system very quickly") did not also align with this factor, but its non-alignment was consistent across both data sets, possibly because it asks raters to consider the skills of others rather than their own.

3 Additional Psychometric Analyses

3.1 Item Weighting

Rather than weighting each scale item the same (unit weighting), it can be tempting to use the factor loadings to weight items differentially. Such a practice is, however, rarely worth the effort and increased complexity of measurement. Nunnally [5] pointed out that such weighting schemes usually produce a measurement that is highly correlated with the unweighted measurement, so the weighting confers no statistical advantage. That was the case with the new Usable and Learnable scales, which had weighted-unweighted correlations of .993 and .997, respectively (both p < .0001), supporting the use of unit weighting for these scales.

3.2 Scale Correlations

The correlations between the new scales and the Overall SUS were .985 for Usable and .784 for Learnable (both p < .0001). Because each of the new scales has items in common with the Overall SUS, this high level of correlation was expected. The correlation between Usable and Learnable was .664 (p < .0001). The scales were not completely independent, but neither were they completely dependent, with a shared variance (R²) of about 44%. Consistent with the interpretation of the factor analyses, this finding supports both the use of an Overall SUS score and the decomposition of that score into Usable and Learnable components.

3.3 Reliability

For our 324 cases, coefficient alpha for the Overall SUS was .92, consistent with the value of .91 reported by Bangor et al. [8]. Coefficient alphas for Usable and Learnable were .91 and .70, respectively. Even though only two items contribute to Learnable, the scale had sufficient reliability to meet the typical minimum standard of .70 for this type of measurement [4, 5].
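Coefficient alpha is simple to compute from raw item scores using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). A minimal sketch (the function name is ours):

    import numpy as np

    def coefficient_alpha(items):
        """Cronbach's coefficient alpha for an (n_respondents x n_items) matrix."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)        # per-item sample variance
        total_variance = items.sum(axis=1).var(ddof=1)    # variance of summed scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

Running this separately over the eight Usable item columns and the two Learnable item columns yields the per-scale reliabilities reported above.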
Table 1. Four-, three-, and two-factor solutions for the two independent data sets (varimax-rotated factor loadings; Q1-Q10 are SUS Items 1-10)

Four-factor solutions:
              Bangor et al.                        Current
Item     1      2      3      4           1      2      3      4
Q1     0.64   0.19   0.31   0.04        0.65   0.17   0.19   0.29
Q2     0.38   0.30   0.53   0.25        0.59   0.43   0.20   0.25
Q3     0.66   0.42   0.31   0.22        0.50   0.39   0.18   0.47
Q4     0.22   0.67   0.22   0.03        0.25   0.64   0.07   0.14
Q5     0.61   0.20   0.38   0.00        0.32   0.16   0.18   0.64
Q6     0.37   0.32   0.58  -0.04        0.46   0.36   0.16   0.35
Q7     0.59   0.33   0.30  -0.01        0.49   0.28   0.58   0.31
Q8     0.41   0.35   0.52   0.03        0.67   0.34   0.22   0.37
Q9     0.61   0.52   0.20   0.10        0.46   0.45   0.14   0.47
Q10    0.25   0.66   0.25   0.05        0.18   0.68   0.45   0.24

Three-factor solutions:
              Bangor et al.                Current
Item     1      2      3           1      2      3
Q1     0.63   0.19   0.33        0.69   0.20   0.26
Q2     0.41   0.32   0.49        0.60   0.46   0.23
Q3     0.66   0.42   0.33        0.54   0.43   0.43
Q4     0.22   0.67   0.23        0.27   0.58   0.12
Q5     0.60   0.19   0.40        0.33   0.20   0.71
Q6     0.35   0.31   0.59        0.47   0.38   0.33
Q7     0.58   0.33   0.31        0.52   0.45   0.35
Q8     0.40   0.35   0.54        0.69   0.38   0.35
Q9     0.62   0.52   0.20        0.50   0.46   0.40
Q10    0.25   0.67   0.26        0.24   0.78   0.24

Two-factor solutions:
              Bangor et al.          Current
Item      1      2            1      2
Q1      0.70   0.22         0.71   0.21
Q2      0.59   0.38         0.62   0.46
Q3      0.71   0.45         0.69   0.43
Q4      0.27   0.69         0.28   0.58
Q5      0.71   0.23         0.60   0.26
Q6      0.58   0.39         0.58   0.39
Q7      0.64   0.36         0.62   0.46
Q8      0.60   0.41         0.77   0.38
Q9      0.60   0.52         0.64   0.47
Q10     0.31   0.69         0.32   0.79
Var     3.46   2.12         3.61   2.20
% Var  34.63  21.18        36.07  21.95
Total % Var:  55.81                58.01

3.4 Sensitivity

To assess scale sensitivity, we conducted an ANOVA with Test as an independent variable with 19 levels (the 19 tests from which the SUS scores came) and Scale as a dependent variable with 2 levels (Usable and Learnable). To make the Usable and Learnable scores comparable with the Overall SUS score (ranging from 0 to 100), we multiplied their summed score contributions by 3.125 and 12.5, respectively. The resulting scale score for Usable ranged from 0 to 100 in 32 increments of 3.125, and for Learnable ranged from 0 to 100 in eight increments of 12.5. The ANOVA had a significant main effect of Test (F(18, 305) = 7.73, p < .0001), a significant main effect of Scale (F(1, 305) = 47.6, p < .0001), and a significant Test by Scale interaction (F(18, 305) = 3.81, p < .0001). The significant Test by Scale interaction, in particular, provided evidence of the sensitivity of the Scale variable. Had there been no interaction, that would have been evidence that Usable and Learnable contribute the same information to the analysis. As expected from the factor and correlation analyses, however, the results confirmed the differential information provided by the two scales, as shown in Figure 1 (with the tests ordered by decreasing value of Usable). As expected from the moderate correlation between Usable and Learnable, when the value of Usable declined, the value of Learnable also tended to decline, but with a different pattern. In most of the studies (all but three), the value of Learnable tended to be greater than the value of Usable, but to varying degrees as a function of Test.

[Figure 1. The Test by Scale interaction: mean Usable and Learnable scale scores (0-100) for Studies A through S.]
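The rescaling described at the start of Section 3.4 (multiply the Usable and Learnable score-contribution sums by 3.125 and 12.5) is a one-line extension of the standard scoring. A minimal sketch, using the same input convention as the scoring sketch in Section 1 (the function name is ours):

    def usable_learnable_scores(responses):
        """Rescale the Usable (8-item) and Learnable (2-item) sums to 0-100.

        responses: ten ratings in 1..5, in SUS item order."""
        # Score contributions: odd items positively worded, even items negatively.
        contribs = {i: (r - 1) if i % 2 == 1 else (5 - r)
                    for i, r in enumerate(responses, start=1)}
        usable = 3.125 * sum(contribs[i] for i in (1, 2, 3, 5, 6, 7, 8, 9))  # 100/32
        learnable = 12.5 * sum(contribs[i] for i in (4, 10))                 # 100/8
        return usable, learnable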
4 The Distribution of SUS Scores

Bangor et al. [8] provided some information about the distribution of SUS scores in their data. Table 2 shows basic statistical information about their distribution and the distribution of our new data. Figure 2 shows a graph of the distribution of the Overall SUS scores from the current data set (for comparison with Figure 2 of Bangor et al.), and the distributions of the Usable and Learnable scores (all set to the same scale).

Of particular interest, the central tendencies of the Bangor et al. [8] and our Overall SUS distributions were not identical, with a mean difference of 8.0. The mean of the Bangor et al. distribution was 70.1, with a 99.9% confidence interval ranging from 68.7 to 71.5 [8]. The mean of our Overall SUS data was 62.1, with a 99.9% confidence interval ranging from 58.3 to 65.9. Because the confidence intervals did not overlap, this difference in central tendency was statistically significant (p < .001). There were similar differences (with the Bangor et al. scores higher) for the 1st quartile (10 points), median (10 points), and 3rd quartile (12.5 points). The distributions' measures of dispersion (variance, standard deviation, and interquartile range) were close in value.

As expected, the statistics and distributions of the Overall SUS and Usable scores from the current data set were very similar. In contrast, the distributions of the Usable and Learnable scores were distinct. The distribution of Usable, although somewhat skewed, had lower values at the tails than in the center. By contrast, Learnable was strongly skewed toward the high end of the scale (negative skew), with 29% of its scores at the maximum value of 100. Consistent with the results of the ANOVA, the 99.9% confidence intervals for the Usable and Learnable means did not overlap, indicating a statistically significant difference (p < .001).

Table 2. Basic statistical information about the SUS distributions

                          Bangor et al.           Current Data Set
Statistic                 Overall         Overall     Usable     Learnable
N                         2324            324         324        324
Minimum                   0.0             7.5         0.0        0.0
Maximum                   100.0           100.0       100.0      100.0
Mean                      70.14           62.10       59.4       72.7
Variance                  471.32          494.38      531.54     674.47
Standard Deviation        21.71           22.24       23.06      25.97
Standard Error            0.45            1.24        1.28       1.44
Skewness                  NA              -0.43       -0.38      -0.80
1st Quartile              55.0            45.0        40.6       50.0
Median                    75.0            65.0        62.5       75.0
3rd Quartile              87.5            75.0        78.1       100.0
Interquartile Range       32.5            30.0        37.5       50.0
Critical Z (99.9)         3.09            3.09        3.09       3.09
Critical d (99.9)         1.39            3.82        3.96       4.46
99.9% CI Upper Limit      71.53           65.92       63.40      77.18
99.9% CI Lower Limit      68.75           58.28       55.48      68.27

Table note: Add and subtract Critical d (computed by multiplying the Critical Z and the standard error) to and from the mean to get the upper and lower bounds of the 99.9% confidence interval.

[Figure 2. Distributions (number of scores by score) of the Overall SUS, Usable, and Learnable scores from the current data set.]
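The note to Table 2 describes an ordinary z-based confidence interval. As a worked check (the function name is ours):

    import math

    def ci_999(mean, sd, n, critical_z=3.09):
        """99.9% confidence interval for a mean, per the note to Table 2."""
        critical_d = critical_z * (sd / math.sqrt(n))  # critical z times standard error
        return mean - critical_d, mean + critical_d

For the current data set's Overall SUS scores, ci_999(62.10, 22.24, 324) returns approximately (58.28, 65.92), matching the limits in Table 2.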
5 Discussion

5.1 Benefit of an Improved Understanding of the Factor Structure of the SUS: A Cleaner and Possibly Quicker Usability Scale

In the 23 years since its introduction, the SUS has certainly stood the test of time. The results of the current research show that it would be possible to use the new Usable scale in place of the Overall SUS. The scales had an extremely high correlation (.985), and the reduction in reliability in moving from the 10-item Overall SUS to the 8-item Usable scale was negligible (.92 to .91). The time saved by dropping Items 4 and 10, however, would be of relatively little benefit compared to the advantage of getting an estimate of perceived learnability along with a cleaner estimate of perceived usability. For this reason, we encourage practitioners who use the SUS to continue doing so, but to recognize that, in addition to working with the standard Overall SUS score, they can easily decompose it into its Usable and Learnable components, extracting additional information from their SUS data with very little additional effort.

The difference in central tendency between the Bangor et al. [8] data and our data indicates that the two datasets may represent different types of users and products. For preliminary data on an attempt to connect SUS ratings to a 7-point adjective scale (Best Imaginable to Worst Imaginable), see Bangor et al. (pp. 586-588).

5.2 Implications for SUS Item Wording

Psychometric findings for one version of a questionnaire do not necessarily generalize to other versions. Research on the SUS and similar questionnaires has shown, however, that slight changes to item wording most often lead to no detectable differences in factor structure or reliability [10]. For example, in a study of the interpretation of the SUS by non-native English speakers, Finstad [14] found that for Item 8 ("I found the system very cumbersome to use"), all native English speakers claimed to understand the term, but half of the non-native speakers asked for clarification. When told that "cumbersome" meant "awkward," the non-native speakers indicated that this was sufficient clarification. Bangor et al. [8] also reported some confusion (for about 10% of participants) with the word "cumbersome," and replaced it with "awkward" early in their use of the SUS. They also replaced the word "system" with "product" in all items. Consequently, about 90% of their 2324 cases used the modified version of the SUS. Our 324 cases, however, used the original wording for Item 8, and used either the word "system" or the actual product name in place of "system." Despite these differences in item wording, the reliability estimates and the two-factor solutions for the two data sets were almost identical, which leads to the following two guidelines for practitioners:

• For Item 8, use "awkward" rather than "cumbersome."
• Use "system," "product," or the actual product name, depending on which seems more appropriate for a given test, but for consistency of presentation, use the same term in all items for any given test or across a related series of tests.

References

1. Brooke, J.: SUS: A "Quick and Dirty" Usability Scale. In: Jordan, P.W., Thomas, B., Weerdmeester, B.A., McClelland, I.L. (eds.) Usability Evaluation in Industry, pp. 189–194. Taylor & Francis, London (1996)
2. Lewis, J.R.: Usability Testing. In: Salvendy, G. (ed.) Handbook of Human Factors and Ergonomics, pp. 1275–1316. John Wiley, New York (2006)
3. Sauro, J., Lewis, J.R.: Correlations among Prototypical Usability Metrics: Evidence for the Construct of Usability. In: Proceedings of CHI 2009 (to appear, 2009)
4. Landauer, T.K.: Behavioral Research Methods in Human-Computer Interaction. In: Helander, M., Landauer, T., Prabhu, P. (eds.) Handbook of Human-Computer Interaction, pp. 203–227. Elsevier, Amsterdam (1997)
5. Nunnally, J.C.: Psychometric Theory. McGraw-Hill, New York (1978)
6. Lucey, N.M.: More than Meets the I: User-Satisfaction of Computer Systems.
   Unpublished thesis for Diploma in Applied Psychology, University College Cork, Cork, Ireland (1991)
7. Kirakowski, J.: The Use of Questionnaire Methods for Usability Assessment (1994), http://sumi.ucc.ie/sumipapp.html
8. Bangor, A., Kortum, P.T., Miller, J.T.: An Empirical Evaluation of the System Usability Scale. International Journal of Human-Computer Interaction 24, 574–594 (2008)
9. Tullis, T.S., Stetson, J.N.: A Comparison of Questionnaires for Assessing Website Usability. Unpublished presentation given at the UPA Annual Conference (2004), http://home.comcast.net/~tomtullis/publications/UPA2004TullisStetson.pdf
10. Lewis, J.R.: IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use. International Journal of Human-Computer Interaction 7, 57–78 (1995)
11. Lewis, J.R.: Psychometric Evaluation of the PSSUQ Using Data from Five Years of Usability Studies. International Journal of Human-Computer Interaction 14, 463–488 (2002)
12. Cliff, N.: Analyzing Multivariate Data. Harcourt Brace Jovanovich, San Diego (1987)
13. Coovert, M.D., McNelis, K.: Determining the Number of Common Factors in Factor Analysis: A Review and Program. Educational and Psychological Measurement 48, 687–693 (1988)
14. Finstad, K.: The System Usability Scale and Non-Native English Speakers. Journal of Usability Studies 1, 185–188 (2006)