The Factor Structure of the System Usability Scale
James R. Lewis¹ and Jeff Sauro²
¹ IBM Software Group, 8051 Congress Ave, Suite 2227, Boca Raton, FL 33487
² Oracle, 1 Technology Way, Denver, CO 80237
[email protected], [email protected]
Abstract. Since its introduction in 1986, the 10-item System Usability Scale
(SUS) has been assumed to be unidimensional. Factor analysis of two independent SUS data sets reveals that the SUS actually has two factors – Usable (8
items) and Learnable (2 items – specifically, Items 4 and 10). These new scales
have reasonable reliability (coefficient alpha of .91 and .70, respectively). They
correlate highly with the overall SUS (r = .985 and .784, respectively) and correlate significantly with one another (r = .664), but at a low enough level to use
as separate scales. A sensitivity analysis using data from 19 tests had a significant Test by Scale interaction, providing additional evidence of the differential
utility of the new scales. Practitioners can continue to use the current SUS as is,
but, at no extra cost, can also take advantage of these new scales to extract additional information from their SUS data. The data support the use of “awkward”
rather than “cumbersome” in Item 8.
Keywords: System Usability Scale, SUS, factor analysis, psychometric evaluation, subjective usability measurement, usability, learnability, usable, learnable.
1 Introduction
In 1986, John Brooke, then working at DEC, developed the System Usability Scale
(SUS) [1]. The standard SUS consists of the following ten items (odd-numbered
items worded positively; even-numbered items worded negatively).
1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.
To use the SUS, present the items to participants as 5-point scales numbered from
1 (anchored with “Strongly disagree”) to 5 (anchored with “Strongly agree”). If a
participant fails to respond to an item, assign it a 3 (the center of the rating scale).
After completion, determine each item’s score contribution, which will range from 0
to 4. For positively-worded items (1, 3, 5, 7 and 9), the score contribution is the scale
position minus 1. For negatively-worded items (2, 4, 6, 8 and 10), it is 5 minus the
scale position. To get the overall SUS score, multiply the sum of the item score contributions by 2.5. Thus, SUS scores range from 0 to 100 in 2.5-point increments.
The ten SUS items were selected from a pool of 50 potential items, based on the
responses of 20 people who used the full set of items to rate two software systems,
one of which was relatively easy to use, and the other relatively difficult. The items
selected for the SUS were those that provided the strongest discrimination between
the systems. In the original paper, Brooke [1] reported strong correlations
among the selected items (absolute values of r ranging from .7 to .9), but he did not
report any measures of reliability or validity, referring to the SUS as a quick and dirty
usability scale. For these reasons, he cautioned against assuming that the SUS was
any more than a unidimensional measure of usability (p. 193): “SUS yields a single
number representing a composite measure of the overall usability of the system being
studied. Note that scores for individual items are not meaningful on their own.”
Given data from only 20 participants, this caution was appropriate.
1.1 Psychometric Qualification of the SUS
Despite being a self-described “quick and dirty” usability scale, the SUS has become
a popular questionnaire for end-of-test subjective assessments of usability [2]. The
SUS accounted for 43% of post-test questionnaire usage in a recent study of a collection of unpublished usability studies [3]. Research conducted on the SUS has shown
that although it is fairly quick, it is probably not all that dirty. The typical minimum
reliability goal for questionnaires used in research and evaluation is .70 [4, 5]. An
early assessment of the reliability of the SUS based on 77 cases indicated a value of
.85 for coefficient alpha (a measure of internal consistency often used to estimate
reliability of multi-item scales) [6, 7]. More recently, Bangor, Kortum, and Miller
[8], in a study of 2324 cases, found the coefficient alpha of the SUS to be .91. Bangor
et al. also provided some evidence of the validity of the SUS, both in the form of
sensitivity (detecting significant differences among types of interfaces and as a function of changes made to a product) and concurrent validity (a significant correlation of
.806 between the SUS and a single 7-point adjective rating question for an overall
rating of “user friendliness”).
Although they did not measure reliability directly, Tullis and Stetson [9] provided indirect evidence of the reliability of the SUS. They conducted a study in which 123 participants used one of five standard usability questionnaires to rate
the usability of two websites. With the entire sample size, all five questionnaires indicated superior usability for the same website. Because no practical usability test would
have such a large number of participants, they conducted a Monte Carlo simulation to
see, as the sample size increased from 6 to 14, which of the questionnaires would converge most quickly to the “correct” conclusion regarding the difference between the
websites’ usability, where “correct” meant a significant t-test consistent with the decision reached using the total sample size. They found that two of the questionnaires, the
SUS and the CSUQ [10, 11] met this goal the most quickly, making the correct decision
96
J.R. Lewis and J. Sauro
over 90% of the time when n ≥ 12. This result is implicit evidence of reliability, and
also suggests that comparative within-subject summative usability studies using the
SUS should have sample sizes of at least 12 participants.
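As an illustration of this kind of simulation, a Monte Carlo subsampling check might look like the following Python sketch (ours; the data layout and variable names are assumptions, not Tullis and Stetson's actual procedure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def convergence_rate(scores_a, scores_b, n, trials=1000, alpha=0.05):
    """Estimate how often a paired t-test on a random subsample of n
    participants reproduces the full-sample conclusion (here assumed
    to be: site A rated significantly higher than site B)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    hits = 0
    for _ in range(trials):
        idx = rng.choice(len(scores_a), size=n, replace=False)
        t, p = stats.ttest_rel(scores_a[idx], scores_b[idx])
        hits += int(p < alpha and t > 0)
    return hits / trials

# e.g., convergence_rate(site_a_ratings, site_b_ratings, n=12), where
# the two arrays hold each participant's rating of each website.
```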
1.2 The Assumption of SUS Unidimensionality
As previously mentioned, there has been a long-standing assumption that the SUS
assesses the single construct of usability. In the most ambitious investigation of the
psychometric properties of the SUS to date, Bangor et al. [8] conducted a factor
analysis of their 2324 SUS questionnaires and concluded, on the basis of examining
the eigenvalues and factor loadings for a one-factor solution, that there was only one
significant factor, consistent with prevailing practitioner belief and practice.
The problem with this conclusion is that Bangor et al. [8] did not examine the possibility of a multifactor solution, in particular the possibility of a two-factor solution.
The mechanics of factor analysis virtually guarantee high loadings for all items on the
first unrotated factor, so although their finding supports the use of an overall SUS
measure, it does not exclude the possibility of additional structure. Examination of
the scree plot (see their Figure 5) shows the expected very high value for the first
eigenvalue, but also a fairly high value for the second eigenvalue – a value just under
1.0. There is a rule-of-thumb used by some practitioners and computer programs to
set the appropriate number of factors to the number of eigenvalues greater than 1, but
this rule-of-thumb has been discredited because it is often the case that the appropriate
number of factors exceeds the number of eigenvalues greater than 1 [12, 13].
1.3 Goals of the Current Study
The primary purpose of the current study was to conduct factor analyses to explore
the factor structure of the SUS, using data published by Bangor et al. [8] and an independent set of data we collected as part of a larger data collection and analysis program [3] that included 324 complete SUS questionnaires. Secondary goals were to
use the new data to assess the reliability and, to as great an extent as possible, the
validity of the SUS.
2 Factor Analysis of the SUS
At the time of this study, we had collected 324 completed SUS questionnaires from
19 usability studies, an adequate number for investigating the factor structure of the SUS [5]. Fortunately, Bangor et al. [8] published the
correlation matrix of the SUS items from their studies (see their Table 5). It is possible to use an item correlation matrix as the input for a factor analysis, which meant
that data were available for two independent sets of solutions – one using the Bangor
et al. correlation matrix, and another using the 324 cases from Sauro and Lewis [3].
Having two independent data sources for a factor analysis of the SUS afforded a
unique method for assessing the factor structure. It takes at least two items to form a
scale, which makes it very unlikely that the 10-item SUS would have a structure with
more than four factors. Table 1 shows side-by-side solutions for both sets of data for
four, three, and two factors. Our strategy was to start with the four-factor solution
(using common factor analysis with varimax rotation), then work our way down until
we obtained similar item-to-factor loadings for both data sets. The failure of this
approach would be evidence in favor of the unidimensionality of the SUS.
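For readers who want to reproduce this kind of analysis, the following Python sketch (ours; in practice a statistics package's factoring routine would be used instead) implements iterated principal-axis factoring from a correlation matrix followed by Kaiser's varimax rotation:

```python
import numpy as np

def principal_axis(R, n_factors, n_iter=50):
    """Iterated principal-axis (common) factor extraction from a
    correlation matrix R."""
    # Initial communality estimates: squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)        # reduced correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R_reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))
        h2 = (loadings ** 2).sum(axis=1)       # updated communalities
    return loadings

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Kaiser's varimax rotation (standard SVD-based algorithm)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        d_old = d
        L = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag(np.diag(L.T @ L))))
        rotation = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ rotation

# Usage: with R the 10 x 10 SUS item correlation matrix,
# rotated = varimax(principal_axis(R, n_factors=2))
```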
As Table 1 shows, however, the results converged for the two-factor solution. Indeed, given the differences in the distributions and the differences in the four- and three-factor solutions, the extent of convergence at the two-factor solution was striking, with
the solutions accounting for 56-58% of the total variance. For both two-factor solutions,
Items 1, 2, 3, 5, 6, 7, 8, and 9 aligned with the first factor, and Items 4 and 10 aligned
with the second factor. Given 8 items in common between the Overall SUS and the first
factor, we named the first new scale Usable. Based on the content of Items 4 and 10 (“I think that I would need the support of a technical person to be able to use this system” and “I
needed to learn a lot of things before I could get going with this system”), we named the
second new scale Learnable. It was surprising that Item 7 (“I would imagine that most
people would learn to use this system very quickly”) did not also align with this factor,
but its non-alignment was consistent for both data sets, possibly due to its focus on
considering the skills of others rather than the rater’s own skills.
3 Additional Psychometric Analyses
3.1 Item Weighting
Rather than weighting each scale item the same (unit weighting), it can be tempting to
use the factor loadings to weight items differentially. Such a practice is, however, rarely
worth the effort and increased complexity of measurement. Nunnally [5] pointed out
that such weighting schemes usually produce a measurement that is highly correlated
with the unweighted measurement, so there is no statistical advantage to the weighting.
That was the case with these new Usable and Learnable scales, which had, respectively,
weighted-unweighted correlations of .993 and .997 (both p < .0001), supporting the use
of unit weighting for these scales.
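As an illustration, the check is straightforward once the composites are formed; the following Python sketch (ours, with an assumed data layout) correlates a loading-weighted composite with its unit-weighted counterpart:

```python
import numpy as np

def weighting_correlation(item_scores, loadings):
    """Correlate a factor-weighted composite with the unit-weighted sum.

    item_scores: (n_respondents, n_items) array of item score contributions;
    loadings: per-item factor loadings for the scale in question.
    """
    weighted = item_scores @ np.asarray(loadings)
    unweighted = item_scores.sum(axis=1)
    return np.corrcoef(weighted, unweighted)[0, 1]
```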
3.2 Scale Correlations
The correlations between the new scales and the Overall SUS were .985 for Usable and
.784 for Learnable (both p < .0001). Because each of the new scales shares items with the Overall SUS, these high correlations were expected. The correlation between Usable and Learnable was .664 (p < .0001). The scales were not completely independent, but neither were they completely dependent, with shared variance (R²) of
about 44%. Consistent with the interpretation of the factor analyses, this finding supports both the use of an Overall SUS score and the decomposition of that score into
Usable and Learnable components.
3.3 Reliability
For our 324 cases, coefficient alpha for Overall SUS was .92, a finding consistent
with the value of .91 reported by Bangor et al. [8]. Coefficient alphas for Usable and
Learnable were, respectively, .91 and .70. Even though only two items contributed to
Learnable, the scale had sufficient reliability to meet the typical minimum standard of
.70 for this type of measurement [4, 5].
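For reference, coefficient alpha is simple to compute from raw item scores; the following Python function (our sketch) implements the standard formula:

```python
import numpy as np

def coefficient_alpha(items):
    """Cronbach's coefficient alpha for an (n_respondents, k_items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)
```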
Table 1. Four-, three-, and two-factor solutions for the two independent data sets

Bangor et al.

Four-factor solution
Item     1      2      3      4
Q1     0.64   0.19   0.31   0.04
Q2     0.38   0.30   0.53   0.25
Q3     0.66   0.42   0.31   0.22
Q4     0.22   0.67   0.22   0.03
Q5     0.61   0.20   0.38   0.00
Q6     0.37   0.32   0.58  -0.04
Q7     0.59   0.33   0.30  -0.01
Q8     0.41   0.35   0.52   0.03
Q9     0.61   0.52   0.20   0.10
Q10    0.25   0.66   0.25   0.05

Three-factor solution
Item     1      2      3
Q1     0.63   0.19   0.33
Q2     0.41   0.32   0.49
Q3     0.66   0.42   0.33
Q4     0.22   0.67   0.23
Q5     0.60   0.19   0.40
Q6     0.35   0.31   0.59
Q7     0.58   0.33   0.31
Q8     0.40   0.35   0.54
Q9     0.62   0.52   0.20
Q10    0.25   0.67   0.26

Two-factor solution
Item     1      2
Q1     0.70   0.22
Q2     0.59   0.38
Q3     0.71   0.45
Q4     0.27   0.69
Q5     0.71   0.23
Q6     0.58   0.39
Q7     0.64   0.36
Q8     0.60   0.41
Q9     0.60   0.52
Q10    0.31   0.69
Var    3.46   2.12
% Var 34.63  21.18   (Total: 55.81)

Current

Four-factor solution
Item     1      2      3      4
Q1     0.65   0.17   0.19   0.29
Q2     0.59   0.43   0.20   0.25
Q3     0.50   0.39   0.18   0.47
Q4     0.25   0.64   0.07   0.14
Q5     0.32   0.16   0.18   0.64
Q6     0.46   0.36   0.16   0.35
Q7     0.49   0.28   0.58   0.31
Q8     0.67   0.34   0.22   0.37
Q9     0.46   0.45   0.14   0.47
Q10    0.18   0.68   0.45   0.24

Three-factor solution
Item     1      2      3
Q1     0.69   0.20   0.26
Q2     0.60   0.46   0.23
Q3     0.54   0.43   0.43
Q4     0.27   0.58   0.12
Q5     0.33   0.20   0.71
Q6     0.47   0.38   0.33
Q7     0.52   0.45   0.35
Q8     0.69   0.38   0.35
Q9     0.50   0.46   0.40
Q10    0.24   0.78   0.24

Two-factor solution
Item     1      2
Q1     0.71   0.21
Q2     0.62   0.46
Q3     0.69   0.43
Q4     0.28   0.58
Q5     0.60   0.26
Q6     0.58   0.39
Q7     0.62   0.46
Q8     0.77   0.38
Q9     0.64   0.47
Q10    0.32   0.79
Var    3.61   2.20
% Var 36.07  21.95   (Total: 58.01)
3.4 Sensitivity
To assess scale sensitivity, we conducted an ANOVA with Test as a between-subjects variable with 19 levels (for the 19 tests from which the SUS scores came) and Scale as a within-subjects variable with 2 levels (Usable and Learnable). To make the Usable and
Learnable scores comparable with the Overall SUS score (ranging from 0 to 100), we
multiplied their summed score contributions by 3.125 and 12.5, respectively. The resulting scale score for Usable ranged from 0 to 100 in 32 increments of 3.125, and for
Learnable ranged from 0 to 100 in eight increments of 12.5. The ANOVA had a significant main effect of Test (F(18, 305) = 7.73, p < .0001), a significant main effect of
Scale (F(1, 305) = 47.6, p < .0001), and a significant Test by Scale interaction (F(18,
305) = 3.81, p < .0001). In particular, the significant Test by Scale interaction provided
evidence of the sensitivity of the Scale variable. If there had been no interaction, then
this would have been evidence that Usable and Learnable were contributing the same
information to the analysis. As expected from the factor and correlation analyses, however, the results confirmed the differential information provided by the two scales, as
shown in Figure 1 (with the tests ordered by decreasing value of Usable). As expected
due to the moderate correlation between Usable and Learnable, when the value of Usable declined, the value of Learnable also tended to decline, but with a different pattern.
In all but three of the studies, the value of Learnable was greater than the value of Usable, but to varying degrees as a function of Test.
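As an illustration of this rescaling, the following Python sketch (ours; the function name and input format are illustrative) converts the ten item score contributions into 0-100 Usable and Learnable scores:

```python
def usable_learnable(contributions):
    """Split ten SUS item score contributions (each 0-4) into 0-100
    Usable and Learnable scores using the multipliers above.

    contributions[0] corresponds to Item 1, and so on.
    """
    learnable_idx = {3, 9}  # zero-based positions of Items 4 and 10
    usable = sum(c for i, c in enumerate(contributions)
                 if i not in learnable_idx)
    learnable = sum(contributions[i] for i in learnable_idx)
    return 3.125 * usable, 12.5 * learnable

# Maximum contributions on every item yield 100 on both scales.
assert usable_learnable([4] * 10) == (100.0, 100.0)
```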
[Figure: mean Usable and Learnable scale scores (Mean Scale Score, 0-100) plotted for Studies A through S]
Fig. 1. The Test by Scale interaction
4 The Distribution of SUS Scores
Bangor et al. [8] provided some information about the distribution of SUS scores in
their data. Table 2 shows basic statistical information about their distribution and the
distribution of our new data. Figure 2 shows a graph of the distribution of the Overall
SUS scores from the current data set (for comparison with Figure 2 of Bangor et al.),
and the distributions of the Usable and Learnable scores (all set to the same scale).
Of particular interest is that the central tendencies of the Bangor et al. [8] and
our Overall SUS distributions were not identical, with a mean difference of 8.0. The
mean of the Bangor et al. distribution was 70.1, with a 99.9% confidence interval
ranging from 68.7 to 71.5 [8]. The mean of our Overall SUS data was 62.1, with a
99.9% confidence interval ranging from 58.3 to 65.9. Because the confidence intervals did not overlap, this difference in central tendency as measured by the mean was
statistically significant (p < .001). There were similar differences (with the Bangor et
al. scores higher) for the 1st quartile (10 points), median (10 points), and 3rd quartile
(12.5 points). The distributions’ measures of dispersion (variance, standard deviation,
and interquartile range) were close in value.
As expected, the statistics and distributions of the Overall SUS and Usable scores
from the current data set were very similar. In contrast, the distributions of the Usable
and Learnable scores were distinct. The distribution of Usable, although somewhat
skewed, had lower values at the tails than in the center. By contrast, Learnable was strongly negatively skewed, with its scores concentrated at the high end: 29% had the maximum value of
100. Consistent with the results of the ANOVA, their 99.9% confidence intervals did
not overlap, indicating a statistically significant difference (p < .001).
Table 2. Basic statistical information about the SUS distributions

                         Bangor et al.           Current Data Set
Statistic                Overall        Overall    Usable     Learnable
N                        2324           324        324        324
Minimum                  0.0            7.5        0.0        0.0
Maximum                  100.0          100.0      100.0      100.0
Mean                     70.14          62.10      59.4       72.7
Variance                 471.32         494.38     531.54     674.47
Standard Deviation       21.71          22.24      23.06      25.97
Standard Error           0.45           1.24       1.28       1.44
Skewness                 NA             -0.43      -0.38      -0.80
1st Quartile             55.0           45.0       40.6       50.0
Median                   75.0           65.0       62.5       75.0
3rd Quartile             87.5           75.0       78.1       100.0
Interquartile Range      32.5           30.0       37.5       50.0
Critical Z (99.9)        3.09           3.09       3.09       3.09
Critical d (99.9)        1.39           3.82       3.96       4.46
99.9% CI Upper Limit     71.53          65.92      63.40      77.18
99.9% CI Lower Limit     68.75          58.28      55.48      68.27
Table note: Add and subtract Critical d (computed by multiplying the Critical Z and the standard error)
from the mean to get the upper and lower bounds of the 99.9% confidence interval.
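As a worked example of the table note's computation, this Python sketch (ours) reproduces the 99.9% confidence interval for the current data set's Overall SUS score:

```python
import math

def ci_999(mean, sd, n, critical_z=3.09):
    """99.9% confidence interval per the table note: mean plus or minus
    Critical d, where Critical d = critical z times the standard error."""
    d = critical_z * sd / math.sqrt(n)
    return mean - d, mean + d

# Current data set, Overall SUS: mean 62.10, SD 22.24, N 324.
low, high = ci_999(62.10, 22.24, 324)   # approximately (58.28, 65.92)
```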
[Figure: histograms of score distributions (Number of Scores vs. Score, 0-100) for the Overall SUS, Usable, and Learnable scales]
Fig. 2. Distributions of the Overall SUS, Usable, and Learnable scores from the current data set
5 Discussion
5.1 Benefit of an Improved Understanding of the Factor Structure of the SUS –
A Cleaner and Possibly Quicker Usability Scale
In the 23 years since the introduction of the SUS, it has certainly stood the test of
time. The results of the current research show that it would be possible to use the new
Usable scale in place of the Overall SUS. The scales had an extremely high correlation (.985), and the reduction in reliability in moving from the 10-item Overall SUS to
the 8-item Usable scale was negligible (.92 to .91). The time saved by dropping Items
4 and 10, however, would be of relatively little benefit compared to the advantage of
getting an estimate of perceived learnability along with a cleaner estimate of perceived usability. For this reason, we encourage practitioners who use the SUS to
continue doing so, but to recognize that in addition to working with the standard
Overall SUS score, they can easily decompose the Overall SUS score into its Usable
and Learnable components, extracting additional information from their SUS data
with very little additional effort.
The difference in central tendency between the Bangor et al. [8] data and our data
indicates that the two data sets may represent different types of users and products. For
preliminary data on an attempt to connect SUS ratings to a 7-point adjective scale
(Best Imaginable to Worst Imaginable), see Bangor et al. (pp. 586-588).
5.2 Implications for SUS Item Wording
Psychometric findings for one version of a questionnaire do not necessarily generalize
to other versions. Research on the SUS and similar questionnaires has shown, however,
that slight changes to item wording most often lead to no detectable differences in factor
structure or reliability [10].
For example, in a study of the interpretation of the SUS by non-native English
speakers, Finstad [14] found that in Item 8 (“I found the system very cumbersome to
use”), all native English speakers claimed to understand the term, but half of the non-native English speakers asked for clarification. When told that “cumbersome” meant “awkward”, the non-native English speakers indicated that this was sufficient clarification.
Bangor et al. [8] also reported some confusion (about 10% of participants) with the
word “cumbersome”, and replaced it with “awkward” early in their use of the SUS.
They also replaced the word “system” with “product” in all items. Consequently,
about 90% of their 2324 cases used the modified version of the SUS. Our 324 cases,
however, used the original SUS item wording for Item 8, and used either the word
“system” or the actual product name in place of “system”. Despite these differences
in item wording, estimates of reliability and the two-factor solutions for the two data
sets were almost identical, which leads to the following two guidelines for
practitioners.
• For Item 8, use “awkward” rather than “cumbersome”.
• Use “system”, “product”, or the actual product name, depending on which seems more appropriate for a given test, but for consistency of presentation, use the same term in all items for any given test or across a related series of tests.
References
1. Brooke, J.: SUS: A “Quick and Dirty” Usability Scale. In: Jordan, P.W., Thomas, B., Weerdmeester, B.A., McClelland, I.L. (eds.) Usability Evaluation in Industry, pp. 189–194. Taylor & Francis, London (1996)
2. Lewis, J.R.: Usability Testing. In: Salvendy, G. (ed.) Handbook of Human Factors and Ergonomics, pp. 1275–1316. John Wiley, New York (2006)
3. Sauro, J., Lewis, J.R.: Correlations among Prototypical Usability Metrics: Evidence for the
Construct of Usability. In: The Proceedings of CHI 2009 (to appear, 2009)
4. Landauer, T.K.: Behavioral Research Methods in Human-Computer Interaction. In: Helander, M., Landauer, T., Prabhu, P. (eds.) Handbook of Human-Computer Interaction, pp.
203–227. Elsevier, Amsterdam (1997)
5. Nunnally, J.C.: Psychometric Theory. McGraw-Hill, New York (1978)
6. Lucey, N.M.: More than Meets the I: User-Satisfaction of Computer Systems. Unpublished
thesis for Diploma in Applied Psychology, University College Cork, Cork, Ireland (1991)
7. Kirakowski, J.: The Use of Questionnaire Methods for Usability Assessment (1994),
http://sumi.ucc.ie/sumipapp.html
8. Bangor, A., Kortum, P.T., Miller, J.T.: An Empirical Evaluation of the System Usability
Scale. International Journal of Human-Computer Interaction 24, 574–594 (2008)
9. Tullis, T.S., Stetson, J.N.: A Comparison of Questionnaires for Assessing Website Usability. Unpublished presentation given at the UPA Annual Conference (2004),
http://home.comcast.net/~tomtullis/publications/
UPA2004TullisStetson.pdf
10. Lewis, J.R.: IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use. International Journal of Human-Computer Interaction 7, 57–78 (1995)
11. Lewis, J.R.: Psychometric Evaluation of the PSSUQ Using Data from Five Years of Usability Studies. International Journal of Human-Computer Interaction 14, 463–488 (2002)
12. Cliff, N.: Analyzing Multivariate Data. Harcourt Brace Jovanovich, San Diego (1987)
13. Coovert, M.D., McNelis, K.: Determining the Number of Common Factors in Factor
Analysis: A Review and Program. Educational and Psychological Measurement 48, 687–
693 (1988)
14. Finstad, K.: The System Usability Scale and Non-Native English Speakers. Journal of Usability Studies 1, 185–188 (2006)