Correlations among Prototypical Usability Metrics:
Evidence for the Construct of Usability
Jeff Sauro
Oracle Corporation
1 Technology Way, Denver, CO 80237
[email protected]

James R. Lewis
IBM Software Group
8051 Congress Ave, Suite 2227
Boca Raton, FL 33487
[email protected]
ABSTRACT
Correlations between prototypical usability metrics from 90
distinct usability tests were strong when measured at the
task-level (r between .44 and .60). Using test-level
satisfaction ratings instead of task-level ratings attenuated
the correlations (r between .16 and .24). The method of
aggregating data from a usability test had a significant
effect on the magnitude of the resulting correlations. The
results of principal components and factor analyses on the
prototypical usability metrics provided evidence for an
underlying construct of general usability with objective and
subjective factors.
Author Keywords
usability measurement, usability metrics, principal
components analysis, correlation, PCA, factor analysis, FA
ACM Classification Keywords
H5.2. Information interfaces and presentation: User
Interfaces – Evaluation/Methodology; Benchmarking
INTRODUCTION
Determining how quantitative measures of usability relate is important in understanding the construct of usability. Using meta-analysis, Hornbæk and Law [7] recently reported weak correlations among efficiency, effectiveness, and satisfaction, with an average Pearson product-moment correlation (r) of about .2. The correlations were equally weak among the specific measures of time-on-task, binary completion rates, error rates, and user satisfaction (the measures that Hornbæk and Law defined as "prototypical" due to their common inclusion in usability studies to represent aspects of efficiency, effectiveness, and satisfaction). They concluded that although their research showed some dependence among various aspects of usability, the associations were too low to warrant aggregating metrics into a summary score. They hypothesized that Sauro and Kindlund's [17] earlier reports of higher correlations might be due to small sample sizes and simple task-level measures. They also suggested that the aggregation level of the data (task or user) could affect the magnitude of the correlations.

The purpose of this analysis is to extend the important work of Hornbæk and Law [7] by focusing on the prototypical usability measures found in summative usability evaluations. Their research provided a broad survey of published studies, including studies that were not traditional scenario-based usability tests. We deal instead with the type of data found in the typical usability test presented to product teams, executives, or for other internal benchmarking efforts [14]. In short, we wanted to see what the correlations were in actual usability tests, and how the level of aggregation affected the magnitude of the correlations. The data also afforded a unique opportunity to explore the construct validity of usability.

METHOD
We gathered the raw data from usability studies by
searching the archives of present and past usability reports
and contacting colleagues across many companies to get a
large and reasonably varied set of task-level usability data.
The data collection period lasted several months and
incorporated data from usability studies conducted from
1983 to 2008, including products such as printers,
accounting and human resources software, websites and
portals. In total we obtained 97 raw data-sets from 90
distinct usability tests, all of which contained some
combination of the prototypical usability metrics, with data
from over 2000 unique users and 1000 tasks (see Table 1).
Thirteen of the 90 distinct usability tests (14.4%) were
conducted by the authors.
Data Collected                  N
Data Sets                       97
Distinct Usability Studies      90
Donors                          12
Users                           2286
Tasks                           1034

Table 1. Dataset descriptions.
Description of Datasets
The type of data included in this analysis contained a
narrower range of measures and tasks than those considered
by Hornbæk and Law [7]. The bulk of the tasks in the
present study were closed-end productivity activities (e.g.,
create an expense report, install paper in a printer, review
employee performance reports, check status of a submitted
report) as opposed to the more varied tasks in Hornbæk and
Law (e.g., pointing and clicking, authoring privacy rules,
editing code, essays written with computer support).
All but four studies were standard single-product usability tests, wherein one set of users attempted a set of tasks on one
product. Three data sets were from between-subjects
comparisons in which independent groups of users
attempted the same tasks on different products (either 2 or
3). One study used a within-subjects design in which the
same users attempted the same set of tasks across three
products. The difference between the total number of
datasets and distinct usability tests reflects the inclusion of
these between- and within-subjects comparisons.
Our goal was to obtain raw datasets from as many
companies and products as possible. Part of such an
undertaking, however, required that we extend a degree of
anonymity to the donors, and many of the details (including
all confidential details) of the usability studies were
removed from the raw datasets before we received them
(and many thanks to those who selflessly donated their
data). For the datasets in which we also obtained the
reports, the majority tested users who were unfamiliar with
the product but had experience in the domain (e.g., Human
Resources Professionals adding new hire information in an
HR application). All but three of the 97 datasets came from
lab-based moderated usability tests; the other three were
automated remote usability tests.
Metric Representation across Studies
Table 2 shows the representation of the prototypical measures across the datasets.

Metric                    N    %
Task Time                 96   99
Completion Rate           95   98
Errors                    56   58
Post-Test Satisfaction    47   48
Post-Task Satisfaction    39   40

Table 2. Metric distribution across the 97 datasets. Almost every study collected task time and completion rates; only 39 collected post-task satisfaction.

Users
In total, there were 2286 unique users from the 97 datasets. The distribution of users across tests was highly skewed by one very large sample size (n = 296, one of the automated remote usability tests), making the mean number of users per test a misleading figure. The median number of users per test was 10, with a range from 4 to 296. Sixty-four percent of the tests had between 8 and 12 users, and 80% had fewer than 20.

Most information about the characteristics of the users was removed from the datasets, preventing a representative tabulation. There was sufficient evidence from the reports to conclude that users were predominantly from the US and usually familiar with the application. The distribution of gender appeared roughly representative, but there was no evidence of representation from children or the elderly.

Domains
The usability test data came from 12 sources, including software companies (e.g., PeopleSoft, Oracle, IBM, Intuit), IT organizations within other companies (e.g., Fidelity Investments, American Family Insurance), and individuals or organizations conducting research.

Tasks
In total, there were 1034 unique tasks from the 97 datasets. The distribution of tasks across tests was more normally distributed, with a mean of 10.6 and a range of 2 to 44. Fifty-one percent of tests had between 6 and 10 tasks.

Most information about the details of the task scenarios had been removed from the datasets before we received them. Much of the data came from productivity tasks. For example, two scenarios which exemplify this type of task were "Enter a social security number for a beneficiary" and "Create and submit an Expense Report for Mileage between Vancouver and San Francisco."

Task Duration
Task duration had a strong positive skew from a few very long tasks lasting over an hour. To address this skewness, the task time means were transformed using the natural logarithm. The mean task duration was 172 seconds, with a range from 10 seconds to 104 minutes. Fifty percent of tasks lasted between 90 and 270 seconds.

Coding
Most datasets were coded from reports or spreadsheets with little modification. Because some scales (such as the ASQ and PSSUQ [11,12,13]) code higher numbers as indicating worse usability, in contrast to the majority of other scales, the major source of re-coding came from ensuring that the sentiment of the satisfaction scales pointed in the same direction (with higher scores indicating greater satisfaction).

In addition to scale direction, satisfaction questions differed in the number of scale steps. For post-task ratings, 20 used 5-point scales (53%), 16 used 7-point scales (42%), one used the 150-point SMEQ scale, and three used magnitude estimation [5,15] (which does not have a defined number of scale steps). Most post-task ratings were averages of 2 to 3 questions using 5- or 7-point Likert-type items, resulting in a composite score often having between 10 and 15 response options. Even though 95% of the tests used 5- or 7-point scales for post-task satisfaction, we recoded the raw scale scores as proportions of the maximum score because a mean response of 4 on a 5-point scale represents a higher sentiment than a 4 (the mid-point) on a 7-point scale. For the same reason, we used the same technique to scale the post-test satisfaction ratings. For example, a raw SUS score of 75 became .75 because the maximum SUS score is 100.
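As a concrete illustration of this proportion-of-maximum recoding, the short sketch below (ours, not code from the paper) reproduces the scaled satisfaction values shown in Table 4 and the SUS example above.

```python
def scale_to_proportion(raw_score, max_score):
    """Rescale a satisfaction score to a proportion of its maximum so that
    5-point, 7-point, and 0-100 instruments share a common 0-1 range."""
    return raw_score / max_score

print(scale_to_proportion(4.00, 7))   # ~0.57, the "Scaled Sat" value for a raw 4 in Table 4
print(scale_to_proportion(75, 100))   # 0.75, the SUS example in the text
```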
LEVELS OF AGGREGATION FOR ANALYSIS
A key goal of this investigation was to understand how
different levels of aggregation affect the correlations among
prototypical usability metrics. Hornbæk and Law [7, p. 625]
identified different aggregation levels as a potential cause
for different correlation magnitudes. Table 3 shows the
seven different aggregation schemes used in the current
study.
                        Across Tests
Within Tests            Multiple Correlations per Test    One Correlation per Test
Tasks                   TM                                TO
Users                   UM                                UO
Observation             --                                OO
Task Means              --                                TAO
User Means              --                                UAO

Table 3. Aggregation schemes. Task means, user means and observation level data are only possible one time per test.
As Table 3 shows, we aggregated tasks along two
dimensions: (1) by the level of aggregation within a test and
(2) the level of aggregation across tests. All aggregation
methods ending with “O” generated only one correlation
per test for each pair of prototypical usability metrics
collected in the study. The aggregation methods ending
with “M” generated multiple correlations per test for each
pair of variables. To help explain the different methods, the
following definitions of the aggregation schemes will
include examples using the data in Table 4.
Task Level Aggregation (TO/TM)
Task level aggregation indicates the generation of
correlations from the pairs of measures by the users for
each task, so there are as many correlations for a test as
there are tasks. For example, the correlations between task
time and errors for the four tasks in the sample dataset
shown in Table 4 are (.58, --, .89, .36) respectively. There is
no correlation for task 2 because there were no errors.
When one measurement has no variation, its correlation
with other measurements is undefined due to division by 0.
One way to estimate the overall correlation between time
and errors is to use the TO scheme, averaging the three
valid correlations. When averaging correlations, it is
standard practice to convert the correlations to standard (z)
scores, do the math, then convert the mean standard score
back to a correlation. To convert r to z, use:

z = .5*ln((1+r)/(1-r)).

To convert z back to r, use:

r = (exp(2*z)-1)/(exp(2*z)+1).
These formulas use Excel notation for easy pasting into a
spreadsheet, replacing r and z in the bodies of the equations
with cell designations as appropriate.
To continue the example, converting the three correlations
to standard scores produces .66, 1.4, and .38, which have a
mean of .81. Converting this standard score back to a
correlation gives r = .67 as the one correlation for this test
when using the TO aggregation scheme.
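A minimal Python sketch of this averaging step (ours, not the authors' code); it reproduces the TO estimate for the sample dataset, with small differences from the text possible because the text rounds the intermediate z-scores.

```python
import numpy as np

def average_correlations(rs):
    """Average Pearson correlations via the Fisher r-to-z transformation:
    z = arctanh(r), take the mean, then transform back with tanh."""
    zs = np.arctanh(np.asarray(rs, dtype=float))
    return float(np.tanh(zs.mean()))

# The three valid task-level time/error correlations from Table 4.
print(average_correlations([0.58, 0.89, 0.36]))  # ~0.675; the text reports .67 after rounding
```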
An alternative aggregation scheme using this data is to
include all three correlations with similar task level
correlations from the other datasets (the TM scheme, with
multiple correlations per test). Using the TM scheme, the
test data in Table 4 provided three estimates of the
correlation between task time and errors (.58, .89, and .36).
Task  User  Raw Sat  Scaled Sat  Time  Comp  Errors
1     1     4.00     0.57        72    1     0
1     2     4.00     0.57        60    1     0
1     3     3.33     0.48        72    1     0
1     4     3.00     0.43        66    1     0
1     5     1.00     0.14        144   0     1
1     6     4.00     0.57        72    1     0
1     7     2.33     0.33        78    1     1
1     8     2.33     0.33        72    1     1
2     1     4.00     0.57        60    1     0
2     2     4.00     0.57        54    1     0
2     3     4.00     0.57        54    1     0
2     4     3.00     0.43        66    1     0
2     5     3.00     0.43        72    1     0
2     6     4.00     0.57        72    1     0
2     7     3.00     0.43        72    1     0
2     8     3.00     0.43        54    1     0
3     1     4.00     0.57        72    1     0
3     2     4.00     0.57        72    1     0
3     3     4.00     0.57        78    1     0
3     4     3.00     0.43        84    1     0
3     5     3.33     0.48        90    1     0
3     6     4.00     0.57        90    1     0
3     7     2.33     0.33        114   1     1
3     8     2.00     0.29        150   1     1
4     1     4.00     0.57        96    1     0
4     2     4.00     0.57        72    1     0
4     3     4.00     0.57        60    1     0
4     4     1.67     0.24        114   0     1
4     5     2.33     0.33        78    0     1
4     6     4.00     0.57        66    0     1
4     7     2.33     0.33        78    1     0
4     8     3.00     0.43        96    0     1

Table 4. A dataset used in the analysis. The satisfaction scores were task-level, with a maximum score of 7.
User Level Aggregation (UO/UM)
User level aggregation indicates the generation of
correlations from the pairs of measures by the tasks for each
user, so for this scheme there are as many correlations for a
test as there are users. For example, to generate the
correlations between task time and scaled task-level
satisfaction for the sample data, the correlations are (--, --, -.37, -.93, -.84, --, -.47, -.66). There are no correlations for users 1, 2, and 6 because their satisfaction ratings did not vary. To estimate the overall correlation between time and satisfaction, we can either average the five valid correlations (after transforming) to get r = -.73 (one correlation per test: the UO scheme) or use all five correlations along with the user-level correlations from the other datasets (multiple correlations per test: the UM scheme).
Observation Level Aggregation (OO)
Aggregation by observation involves creating one matrix of
tasks and users within a dataset. For example, when
correlating errors with completion rates in the sample data
in Table 4, one correlation is generated from 32 pairs of
errors and completion rates to get an r of -.68, which is then
averaged with all other datasets (the OO scheme).
Task Average Level Aggregation (TAO)
Task average level aggregation indicates correlation taken
on the mean task performance. For the sample data in Table
4, the correlation between post-task satisfaction and errors
would use the mean satisfaction rating and mean number of
errors by task (for this sample data, r = -.83). With this
scheme, there is only one correlation per test for each pair
of variables.
User Average Level Aggregation (UAO)
User average level aggregation indicates that the correlation
is taken on the mean user performance across tasks. For
example, the correlation between task time and completion
in the sample dataset for the 8 users is -.73. With this
scheme, there is only one correlation per test for each pair
of variables.
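The sketch below (ours, not the authors' code, with illustrative column names and toy data) shows how these aggregation schemes can be computed from a long-format dataset shaped like Table 4.

```python
import numpy as np
import pandas as pd

# One row per user x task observation, as in Table 4 (toy values shown).
df = pd.DataFrame({
    "task":   [1, 1, 1, 1, 2, 2, 2, 2],
    "user":   [1, 2, 3, 4, 1, 2, 3, 4],
    "time":   [72, 60, 72, 66, 60, 54, 54, 66],
    "errors": [0, 0, 0, 1, 0, 0, 1, 0],
})

# TM/TO: one time-errors correlation per task (NaN when a measure is constant).
per_task = df.groupby("task").apply(lambda g: g["time"].corr(g["errors"]))

# UM/UO: one correlation per user, computed across that user's tasks.
per_user = df.groupby("user").apply(lambda g: g["time"].corr(g["errors"]))

# TO/UO: average the valid per-task or per-user correlations via Fisher z.
to_estimate = np.tanh(np.arctanh(per_task.dropna()).mean())

# OO: a single correlation over all user x task observations in the test.
oo = df["time"].corr(df["errors"])

# TAO / UAO: correlate the task means or the user means.
task_means = df.groupby("task")[["time", "errors"]].mean()
user_means = df.groupby("user")[["time", "errors"]].mean()
tao = task_means["time"].corr(task_means["errors"])
uao = user_means["time"].corr(user_means["errors"])
```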
Exploring the Construct of Usability
The data also provide an opportunity to use principal
components analysis (PCA) and factor analysis (FA) to
explore the construct of usability. Organizing the data as
described above for UM casts the data in a form suitable for
PCA and FA (one set of prototypical usability scores per
participant, averaged over tasks to get a set of independent
scores, restricting the final data set to those participants
who have scores for all prototypical usability metrics). The
three key questions to address with these analyses are:
1. Do all prototypical usability metrics significantly correlate?
2. Do all prototypical usability metrics heavily load on the first unrotated component of a PCA (indicative of an underlying usability construct 'u', analogous to Spearman's 'g' for intelligence [9])?
3. Does an exploratory FA indicate a reasonable underlying factor structure for the construct of usability?
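For the second question, the loadings can be obtained directly from the eigendecomposition of the metrics' correlation matrix; the sketch below is ours (not the authors' code) and assumes X holds one row per participant with the five user-level metric means.

```python
import numpy as np

def unrotated_pca_loadings(X):
    """Unrotated principal-component loadings for an (n_participants x
    n_metrics) array. Working from the correlation matrix puts the
    differently scaled metrics on a common footing; loadings are the
    eigenvectors scaled by the square roots of their eigenvalues."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]              # largest component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)          # column signs are arbitrary
    pct_variance = 100 * eigvals / eigvals.sum()
    return loadings, eigvals, pct_variance
```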
RESULTS
All correlations in the subsequent tables were calculated by averaging Fisher r-to-z transformed correlations and then transforming the mean z-scores back for reporting as correlations (Pearson's r). All
reported correlations were significantly different from 0 (p
< .05). For each of the following seven tables, the
calculation of the overall mean used the standard
conversion to z-scores, then conversion back to r, so the
overall means will not match a simple average of the tabled
correlations, but will probably provide a better estimate of
the true correlation between the two metrics than any
individual correlation from the aggregation levels. Using a
similar procedure, the overall median is the mean of the
transformed medians. The 95% confidence intervals were
calculated on the z-scores and transformed back to Pearson
r’s. The intervals are asymmetrical because the distribution
of r is positively skewed, especially for values above .5.
The “% Neg.” and “% Pos.” columns show the percentage
of correlations that were either negative or positive based
on the overall tendency of the metric pairs. Higher values
in this column show higher agreement and less variability
in the datasets for that aggregation level and correlation
pair.
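The paper does not spell out the exact interval formula, so the sketch below is only one reasonable reading of this procedure (ours, not the authors' code): a t-based confidence interval computed on the Fisher z-scores and then transformed back to r, which is what makes the reported intervals asymmetric.

```python
import numpy as np
from scipy import stats

def ci_for_average_correlation(rs, confidence=0.95):
    """Average a set of correlations and return a confidence interval,
    computed on the Fisher z-scale and transformed back to r."""
    zs = np.arctanh(np.asarray(rs, dtype=float))
    mean_z, sem_z = zs.mean(), stats.sem(zs)
    lo_z, hi_z = stats.t.interval(confidence, len(zs) - 1, loc=mean_z, scale=sem_z)
    return np.tanh(mean_z), (np.tanh(lo_z), np.tanh(hi_z))
```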
Correlations among Task Completion, Task Time, and Errors
Tables 5-7 show the correlations between the prototypical
measures for effectiveness and efficiency: task time, errors
and completion rates. The tables show both the mean and
the median as measures of central tendency for each
correlation.
Level    Mean   Median  N     95% CI Low  95% CI High  % Neg.
TM       -0.41  -0.36   809   -0.44       -0.38        81
UM       -0.36  -0.32   1921  -0.38       -0.34        83
OO       -0.39  -0.38   92    -0.43       -0.34        97
TO       -0.44  -0.40   92    -0.49       -0.38        96
UO       -0.51  -0.47   92    -0.56       -0.46        99
TAO      -0.61  -0.60   92    -0.67       -0.54        91
UAO      -0.51  -0.45   92    -0.58       -0.43        90
Overall  -0.46  -0.43   7     -0.51       -0.41        91

Table 5. Correlations between completion rate and time by aggregation level.
Level    Mean   Median  N    95% CI Low  95% CI High  % Neg.
TM       -0.59  -0.48   518  -0.63       -0.54        90
UM       -0.51  -0.43   675  -0.55       -0.48        88
OO       -0.40  -0.39   56   -0.45       -0.33        96
TO       -0.51  -0.48   55   -0.60       -0.41        91
UO       -0.56  -0.42   56   -0.66       -0.43        91
TAO      -0.60  -0.58   56   -0.68       -0.52        95
UAO      -0.58  -0.57   56   -0.67       -0.47        95
Overall  -0.54  -0.48   7    -0.61       -0.46        92

Table 6. Correlations between completion rate and errors by aggregation level.
Level    Mean  Median  N    95% CI Low  95% CI High  % Pos.
TM       0.47  0.47    624  0.44        0.50         86
UM       0.62  0.59    812  0.59        0.66         92
OO       0.54  0.57    56   0.48        0.59         100
TO       0.47  0.48    56   0.41        0.53         96
UO       0.66  0.59    56   0.57        0.74         98
TAO      0.80  0.77    56   0.73        0.85         96
UAO      0.53  0.50    56   0.44        0.61         91
Overall  0.60  0.59    7    0.53        0.66         94

Table 7. Correlations between errors and task time by aggregation level.
Correlation of Task-Level Satisfaction with Task Completion, Task Time, and Errors
Tables 8-10 show the correlations between task-level
satisfaction and the prototypical measures for effectiveness
and efficiency: completion, errors and time.
Level    Mean  Median  N     95% CI Low  95% CI High  % Pos.
TM       0.41  0.33    455   0.32        0.50         79
UM       0.56  0.50    1518  0.51        0.62         90
OO       0.42  0.42    39    0.36        0.48         97
TO       0.42  0.36    39    0.34        0.49         97
UO       0.63  0.51    39    0.38        0.79         95
TAO      0.68  0.74    39    0.59        0.74         95
UAO      0.42  0.48    39    0.31        0.52         90
Overall  0.51  0.48    7     0.41        0.61         92

Table 8. Correlations between task-level satisfaction and completion rate by aggregation level.
Level    Mean   Median  N     95% CI Low  95% CI High  % Neg.
TM       -0.39  -0.38   575   -0.42       -0.36        84
UM       -0.54  -0.51   1676  -0.57       -0.52        90
OO       -0.41  -0.41   38    -0.46       -0.36        97
TO       -0.39  -0.37   38    -0.44       -0.33        97
UO       -0.52  -0.54   38    -0.61       -0.41        95
TAO      -0.56  -0.59   38    -0.65       -0.45        89
UAO      -0.43  -0.42   38    -0.55       -0.30        89
Overall  -0.47  -0.46   7     -0.53       -0.39        92

Table 9. Correlations between task-level satisfaction and task time by aggregation level.

Level    Mean   Median  N    95% CI Low  95% CI High  % Neg.
TM       -0.37  -0.25   398  -0.49       -0.24        78
UM       -0.42  -0.37   554  -0.49       -0.35        83
OO       -0.34  -0.38   26   -0.41       -0.27        100
TO       -0.33  -0.29   26   -0.43       -0.23        96
UO       -0.52  -0.43   26   -0.74       -0.20        88
TAO      -0.61  -0.63   26   -0.72       -0.48        92
UAO      -0.45  -0.49   26   -0.58       -0.31        92
Overall  -0.44  -0.41   7    -0.57       -0.30        90

Table 10. Correlations between task-level satisfaction and errors by aggregation level.

Task-level satisfaction measurement (e.g., the ASQ [11,12]) takes place after the completion of each task (or scenario), in contrast to satisfaction measures taken at the completion of a test (post-test satisfaction), such as the SUS [2], SUMI [8], and PSSUQ [12,13], which appear in Table 11.

Correlation of Test-Level Satisfaction with Task-Level Metrics
Forty-seven of the datasets included test-level satisfaction measurement along with some combination of task-level measures. Correlation of post-test satisfaction ratings with task-level measures is only possible with the UAO aggregation scheme because users complete post-test satisfaction measures once at the end of the test. Table 11 shows the correlations between test-level satisfaction and the other usability metrics.

Measure   Mean   Median  N   95% CI Low  95% CI High  % -/+
Comp       0.24   0.29   46   0.12        0.36        72 +
Time      -0.25  -0.28   47  -0.37       -0.11        68 -
Task Sat   0.64   0.62   15   0.39        0.80        93 +
Errors    -0.16  -0.16   29  -0.30       -0.02        62 -

Table 11. Correlations with post-test satisfaction. Correlation at the UAO aggregation level is the only way to correlate post-test satisfaction with task-level measures.

Overall Correlations
Table 12 shows the average correlations from Tables 5-11 above. The correlations range from low correlations for test-level satisfaction (e.g., -.16) to strong correlations (e.g., .60) for task time and errors.

           Comp   Time   Errors  Task-Sat
Time      -0.46
Errors    -0.54   0.60
Task-Sat   0.51  -0.47  -0.44
Test-Sat   0.24  -0.25  -0.16    0.64

Table 12. Correlation matrix using the average of all aggregation levels (except Test-Sat, which necessarily used only the UAO aggregation level).
Levels of Aggregation and Variable Pairs
One of our key questions was the extent to which the level
of aggregation affects the magnitude of the measured
correlation. We used ANOVA to assess the main effects of
Level of Aggregation and Variable Pair (correlated pairs of
variables) and their interaction. Out of the 97 data sets in
the database, there were 26 for which we could compute all
of the following prototypical usability metrics: task time,
task completion, errors per task, and task-level satisfaction.
For each study, we used each of the five levels of
aggregation to obtain correlations for each of the six
variable pairs, for a total of 30 correlations per study. Next,
we converted correlations to z-scores (ensuring that
correlations in the expected direction were coded as
positive z-scores), then conducted the ANOVA on the z-scores, treating studies as subjects in a within-subjects
design with two independent variables (Level of
Aggregation and Variable Pair).
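A sketch of this repeated-measures ANOVA (ours, not the authors' code); the file and column names are illustrative, the data frame must be fully crossed (26 studies x 5 levels x 6 pairs), and a hypothetical expected_sign column (+1 or -1 per variable pair) stands in for the direction coding described above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

long = pd.read_csv("correlations_long.csv")          # hypothetical long-format file
# Code correlations so the expected direction is positive, then Fisher-transform.
long["z"] = np.arctanh(long["r"] * long["expected_sign"])

# Two within-subjects factors, with studies treated as subjects.
result = AnovaRM(data=long, depvar="z", subject="study",
                 within=["level", "pair"]).fit()
print(result)
```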
The main effect of Level of Aggregation was statistically
significant (F(4, 100) = 6.2, p < .0001), as was the main
effect of Variable Pair (F(5, 125) = 8.0, p < .0001) and their
interaction (F(20, 500) = 2.2, p = .003). Figure 1 shows the
interaction (with z-scores converted back to r).
[Figure 1: line graph of the absolute value of the mean correlation (y-axis) for each variable pair (x-axis: TE, CE, TS, CS, ES, CT), with one line per level of aggregation (TAO, UO, UAO, TO, OO).]

Figure 1. Interaction between Level of Aggregation and Variable Pair. The codes for variables are T = Time, C = Completions, E = Errors, and S = Satisfaction (Task-Level).
Tables 13 and 14 show the results of Bonferroni multiple
comparisons on the main effects.
For the multiple
comparisons, we used all of the study-level data available
for each level of aggregation (n = 93) and for each variable
pair (n ranging from 26 to 56, depending on the variable
pair).
Level  Mean r
TAO    0.67
UO     0.58
UAO    0.48
TO     0.43
OO     0.42

Table 13. Bonferroni comparisons of Levels of Aggregation. With five levels, there are 10 possible comparisons, so to maintain a significance level of .05 across the set of comparisons, the critical significance level for each individual comparison was .005 (.05/10).
          TE        CE        CS        CT        TS        ES
          r = 0.62  r = 0.53  r = 0.52  r = 0.49  r = 0.46  r = 0.46
Subset 1  X         X         X
Subset 2            X         X         X         X         X

Table 14. Bonferroni comparisons of Variable Pair. With six pairs, there are 15 possible comparisons, so to maintain a significance level of .05 across the set of comparisons, the critical significance level for each individual comparison was .0033 (.05/15). Pairs that have an "X" on the same row were not significantly different.
Construct Validity of Usability
The database contained 325 cases (from 13 studies) in
which participants provided all five prototypical usability
metrics: task completions, task times, error counts, task-based satisfaction, and test-based (overall) satisfaction. The
correlation matrix for the metrics for this subset of the data
appears in Table 15.
           Comp   Time   Errors  Task-Sat
Time      -0.50
Errors    -0.66   0.59
Task-Sat   0.43  -0.24  -0.34
Test-Sat   0.35  -0.23  -0.23    0.64

Table 15. Correlation matrix for the 325 complete cases by participant.
All correlations were statistically significant (p < .0001)
and in the expected direction, a finding consistent with the
hypothesis of an underlying construct of usability. The
magnitudes were similar to those of the whole dataset (see
Table 12 above) with the exception of time and task-level
satisfaction which had a greater attenuation.
Table 16 shows the unrotated loadings for a PCA conducted
on this subset of the data. All variables loaded highly on
the first component, with the absolute value of the loadings
ranging from .63 to .82. Thus, this finding is consistent
with the hypothesis of an underlying construct of usability.
Measures     1       2       3       4       5
Comp         0.82   -0.20    0.38    0.26    0.27
Time        -0.70    0.45    0.53    0.06   -0.16
Errors      -0.80    0.41   -0.17    0.11    0.40
Task-Sat     0.71    0.56    0.09   -0.41    0.13
Test-Sat     0.63    0.65   -0.22    0.32   -0.17
Eigenvalue   2.70    1.14    0.51    0.35    0.30
% Variance   53.97   22.78   10.14   7.05    6.06

Table 16. Unrotated PCA loadings.
Note that the mechanics of PCA maximize the assignment
of variance to the first unrotated component, leading to
some controversy regarding its interpretability. Despite
this, some psychometricians do hold that this first unrotated
principal component is interpretable “as a general index of
a construct represented by shared variance among related
variables. For example, if one had administered five tests
of specific cognitive abilities, the first unrotated principal
component … could be viewed as a measure of general
ability" [9, p. 251]. This first unrotated component is also a
potential source for weightings to use in the computation of
a composite score. This is not evidence for a latent factor
structure with only one factor; rather, it is evidence for an
overall usability construct that might or might not have an
additional latent factor structure.
To explore the possibility of a latent factor structure, we
conducted a common factor analysis on the 325 cases. A
parallel analysis [4] of the eigenvalues from the FA (2.364,
0.805, 0.094, 0.024, -0.003) indicated a two-factor solution
(with those two factors accounting for about 63% of the
total variance). The final varimax-rotated loadings for the
two-factor solution appear in Table 17, with objective
measures loading strongly on the first factor, and subjective
measures loading strongly on the second factor.
Measures   1      2
Comp       0.70   0.33
Time      -0.65  -0.14
Errors    -0.88  -0.15
Task-Sat   0.24   0.79
Test-Sat   0.15   0.76

Table 17. Rotated factor loadings.
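A sketch of this exploratory analysis (ours, not the authors' code): sklearn's maximum-likelihood factor analysis with varimax rotation stands in for the common factor analysis reported above, and the parallel analysis shown is a simple PCA-eigenvalue variant rather than the FA-eigenvalue version the paper used, so the numbers will differ somewhat. X is assumed to be the 325 x 5 matrix of per-participant metric means, loaded from a hypothetical file.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.loadtxt("usability_metrics.csv", delimiter=",")    # hypothetical 325 x 5 file
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # standardize the metrics

# Horn-style parallel analysis: keep factors whose observed eigenvalues
# exceed the mean eigenvalues obtained from random data of the same shape.
rng = np.random.default_rng(0)
rand_eigs = np.mean(
    [np.sort(np.linalg.eigvalsh(np.corrcoef(rng.standard_normal(X.shape),
                                            rowvar=False)))[::-1]
     for _ in range(100)], axis=0)
obs_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
n_factors = int(np.sum(obs_eigs > rand_eigs))

fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(Z)
print(fa.components_.T)   # 5 x n_factors loading matrix, comparable to Table 17
```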
Internal Reliability of Post-Test Questionnaires
There were 9 different post-test satisfaction questionnaires used across 47 datasets. Seven datasets provided only summary-level data, but we had the raw data from the other 40 datasets, allowing us to examine the reliability of the questionnaires using a procedure similar to that of Hornbæk and Law [7]. We computed the internal reliability using coefficient alpha, with results in Table 18.

Questionnaire  N   Mean alpha  Min   Max
SUS            17  0.83        0.52  0.98
Homegrown      11  0.78        0.63  0.92
SUMI            6  0.92        0.86  0.98
PSSUQ           6  0.92        0.80  0.98
Overall        40  0.85

Table 18. Internal reliability of post-test satisfaction questionnaires.

To test the relative reliability of homegrown versus standardized questionnaires, we combined the questionnaires into two groups and conducted a t-test, with the result that the standardized instruments were more reliable than the homegrown ones (.87 vs. .78, t(16) = 2.24, p < .05), confirming the finding of Hornbæk and Law. There were seven questionnaires that had reliability below .70; however, they were about equally split between standardized and homegrown (four homegrown; three standardized). All homegrown questionnaires asked questions about ease of use and at least one additional construct. For example, one questionnaire asked whether the product met the user's business needs and another asked about the perceived attractiveness of the interface. The inclusion of these items reduced the internal reliability, suggesting that they were getting at a construct other than usability. Three instances of the SUS questionnaire had reliability between .52 and .68. Likely causes for this lower reliability include small sample sizes and failure to orient the questions in the same direction (coding errors).
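Coefficient alpha has a simple closed form, so reliability values like those in Table 18 can be computed in a few lines; this sketch is ours, not the authors' code.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient (Cronbach's) alpha for an (n_respondents x n_items)
    array of responses, all items already scored in the same direction:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```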
DISCUSSION
Although the values of the correlations fluctuated
depending on the aggregation level, the magnitudes of the
correlations among the prototypical usability metrics tended
to be medium to large. The lower bounds of the 95%
confidence intervals around the correlations for the overall
averages never dipped below .30. This conservative lower
bound suggests task-level correlations that have at least a
medium-sized effect [3].
Comparison with Correlations of Hornbæk & Law (2007)
Table 19 shows the average correlations across aggregation
levels from this study, the correlations obtained using the
UAO scheme and post-test rather than post-task satisfaction
(closest to the scheme used by Hornbæk & Law [7]), and
the correlations reported by Hornbæk and Law.
Measures     Overall  UAO    H&L
Comp/Time    -0.46    -0.50
Comp/Errors  -0.54    -0.56
Errors/Time   0.60     0.51  0.32 / 0.44*
Sat/Comp      0.51     0.26
Sat/Time     -0.47    -0.25  -0.15
Sat/Errors   -0.44    -0.22  -0.20

Table 19. Comparison of correlations at the UAO aggregation level with the prototypical measures from Hornbæk and Law (H&L). *The correlation of .44 is for their category of errors-along-the-way, which is more similar to the error types in the current analysis than their category of task-completion-errors (errors in a task's outcome).
In the current study, the UAO level of aggregation comes
closest to the correlations reported by Hornbæk and Law
[7]. In their Table 5, Hornbæk and Law (p. 623) reported
correlations of .316 (with a 95% confidence interval from
.246 to .386) for time and errors, .196 (95% CI from .012 to
.380) for errors and satisfaction, and .145 (95% CI from
.016 to .274) for time and satisfaction.
It appears that
many of their satisfaction measures were post-test, and our
UAO correlation between errors and post-test satisfaction
(see Table 11) was very similar, -.16 (95% CI from -.02 to -.30), as was the UAO correlation between time and post-test satisfaction of -.25 (95% CI from -.11 to -.37). Our
UAO estimate of the correlation between time and errors
(.51, with a 95% CI from .44 to .61) was significantly
higher than Hornbæk and Law’s estimate (95% CI from
.246 to .386).
We agree with the hypothesis put forth by Hornbæk and
Law [7] that a likely cause of higher correlations in Sauro
and Kindlund [17] and in the current analysis is the restriction of task types and the use of task-level measures. The variety
of studies used in Hornbæk and Law most likely provide a
better picture of the broader area of human computer
interaction (HCI), whereas the data analyzed here present a
more focused picture of summative usability tests. In other
words, the results of Hornbæk and Law are more
generalizable to the entire field of HCI, whereas our results
are more generalizable to the types of usability tests
typically conducted by usability professionals – a type of
test often performed, but rarely published. For example, an
indicator of the difference in the types of studies examined
in the current study and Hornbæk and Law is the percentage
of studies that included task completion rates as a metric.
In Hornbæk and Law, 15 of 72 studies (21%) included this
metric; in our database, 95 of 97 studies (98%) included it.
Error Types
Fifty-three of our datasets contained error data. Hornbæk
and Law [7] defined a distinction between task-completion-errors (errors in task outcomes) and what they dubbed
errors-along-the-way (e.g., slips, mistakes). The data sets
we have contain total error counts at the task level,
combining these two types of errors. The correlation found
between errors and task time in this analysis (r = .60) was
closer to the Hornbæk and Law correlation of task time
with errors-along-the-way (r = .44) than their correlation of
task time with task-completion-errors (r = .16).
This is consistent with our observation that in standard
usability testing, task-completion errors are a much smaller
class of errors than errors-along-the-way. In many cases,
participants may not even be aware of task-completion-errors, which would restrict the correlation between those types of errors and satisfaction measurements. Also, errors-along-the-way necessarily have an effect on task time (all
other things being equal, more errors lead to longer task
times), but there is no similar logical relationship between
task-completion-errors and task time. Whether usability
practitioners should routinely discriminate between these
two classes of errors is an open question because, although
this distinction is of interest to some researchers, it might be
of little practical significance in guiding product redesign.
Levels of Aggregation and Variable Pairs
As suggested by Hornbæk and Law [7], the level of
aggregation significantly affected the magnitude of the
correlations, with the highest correlations generally
associated with the TAO level of aggregation. The lowest
correlations generally occurred with the OO level of
aggregation, but even those correlations were of substantial
magnitude. The lowest correlation from the ANOVA was
for the association of completions and time using the OO
level of aggregation, with r = .30. Because this is a
correlation between two different variables collected at the
same time, it is a measure of concurrent validity. In
classical psychometrics, validity coefficients of .30 are
respectable, large enough to justify the use of the associated
psychometric instruments for personnel decisions [16].
There were also significant differences among the
magnitudes of the correlations for the variable pairs. The
strongest correlation was for time and errors (r = .62), but
this correlation was not significantly different from those
for completions and errors (r = .53) or completions and task-based satisfaction (r = .52). With correlations ranging from
r = .46 to .62 in the Bonferroni comparisons for all the pairs
of variables, only the correlation between time and errors
was significantly higher than any of the other correlations
(specifically, higher than the correlations for time and
completions, time and satisfaction, and errors and
satisfaction).
These analyses (ANOVA and associated Bonferroni
multiple comparisons) show that for different levels of
aggregation, prototypical usability metrics from standard
usability tests correlate significantly, which is consistent
with the hypothesis that they are measuring different
aspects of a common underlying construct of usability.
The Construct of Usability
The results of the PCA and FA on the 325 complete cases
in the database were consistent with an underlying construct
of usability containing two components, one objective and
one subjective. Not only did the prototypical metrics of
usability correlate significantly with one another, the
pattern of their correlations was also consistent with an
easily interpreted factor structure. The magnitudes of
loadings on the first component of the PCA (ranging from
.63 to .82) were close enough in value that it is reasonable
to use unweighted combinations to create composite
usability scores, which is usually the case with combined
measurements [16]. For these 325 cases, the correlation
between the weighted and unweighted combinations was .99999,
showing no statistical advantage to using a weighted
combination instead of a simpler unweighted combination.
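A sketch of that comparison (ours, not the authors' code): with the five metrics standardized per participant, a composite weighted by the first unrotated component's loadings is compared with a simple equal-weight composite that only uses the loadings' signs to flip time and errors.

```python
import numpy as np

def weighted_vs_unweighted(Z, loadings):
    """Correlation between a loadings-weighted composite and an unweighted
    (equal-weight) composite. Z: (n x 5) standardized metrics; loadings:
    the first-component loadings (e.g., column 1 of Table 16), whose signs
    flip time and errors so that higher always means better."""
    loadings = np.asarray(loadings, dtype=float)
    weighted = Z @ loadings
    unweighted = Z @ np.sign(loadings)
    return np.corrcoef(weighted, unweighted)[0, 1]

# Example call with the Table 16 first-component loadings (Z is hypothetical):
# weighted_vs_unweighted(Z, [0.82, -0.70, -0.80, 0.71, 0.63])
```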
This evidence for the construct validity of usability is
especially compelling given the wide variety of the sources
of data in the analyses. These data did not come from one
large study with homogenous participants, products, and
tasks. Instead, they came from a disparate collection of
studies, with values averaged across a disparate collection
of tasks (for example, for one task a completion time of five
minutes might be fast, but for a different task, it might be
slow). Even with this inherent variability in the data, the
analyses consistently supported the existence of the
construct of usability.
Why do we care if the prototypical usability metrics
correlate? From psychometric theory [16], an advantage of
a composite score (created either by summing or averaging
components of the score) is increased reliability of
measurement, but that increase depends on correlations
among the component scores. If the component scores do
not correlate, the reliability of the composite score will not
increase relative to the component scores. Even without an
increase in reliability, it might still be advantageous to
combine the scores [1], but the results of the PCA and FA
lend statistical support to the practice of combining
component usability metrics into a single score [17].
Hornbæk and Law [7, p. 625] argued that attempts to reduce usability to one measure are bound to lose important information because there is no strong correlation among usability aspects. There are, however, real-world situations in which practitioners must choose only one product from a summative competitive usability test of multiple products and, in so doing, must either rely on a single measurement (a very short-sighted approach) or must use a composite score [10,17].

Our PCA suggests that a single composite score of five usability measures (including post-test satisfaction) would likely contain about 54% of the variation of the raw scores (see Table 16), accounting for a substantial proportion of the variance, but certainly not 100%. Any summary score (median, mean, or other composite scores) must lose important information (just as an abstract does not contain all of the information in a full paper); it is the price paid for summarizing data. It is certainly not appropriate to rely exclusively on summary data, but this analysis indicates the retention of a reasonable amount of the original variables' information. Also, it is important to keep in mind that the data that contribute to a summary score remain available as component scores for analyses and decisions that require more detailed information.

Differences in Task- and Test-Level Satisfaction
There was a noticeable difference in satisfaction correlations when using test-level satisfaction instead of task-level satisfaction. For example, Table 12 shows that the correlation between errors and task-level satisfaction was -.44, but errors and test-level satisfaction only correlated at r = -.16.

The correlation between task- and test-level satisfaction was .64 (see Table 12). Thus, post-task satisfaction accounted for around 40% of the variation in post-test satisfaction. Hornbæk and Law [7] found correlations of between .38 and .70 between the two, consistent with our findings. This relationship is among the strongest between pairs of measures, but it is not high enough to indicate complete redundancy.

The relatively high coefficient alphas of the post-test satisfaction questionnaires (see Table 18) also suggest that reliability is probably not a major cause of the attenuation in the correlations for post-test satisfaction. It is reasonable to speculate that responses to post-test satisfaction questions elicit reactions to aspects beyond the immediate usability test (past usage, brand perception, customer support). The nature of the questions supports this hypothesis; e.g., the SUS item, "I think that I would like to use this system frequently." In contrast, responses to post-task questions are probably highly influenced by the just-completed activity. The direct nature of the post-task questions supports this idea; e.g., the ASQ item, "Overall, I am satisfied with the amount of time it took to complete the tasks in this scenario."

There are other factors that might influence a participant's rating of items in a post-test satisfaction questionnaire. For example, there could be a primacy effect if the participant's experience with the product in the first task was unusually good or bad. Hassenzahl and Sandweg [6] reported evidence for recency effects from the last task, and Xie and Salvendy [19] found similar effects in the measurement of workload. For all these reasons, it should not be surprising that post-task satisfaction measures correlate more highly than post-test satisfaction with other task-level usability measures. It is possible to assess post-task subjective usability with a single item [18], so this need not add much time to a usability test. Overall, these findings strongly support the practice of collecting both post-task and post-test satisfaction measurements in usability tests.
Task Level Independence and Range Restriction
Although there are many likely causes for the differences
among aggregation levels, one notable difference occurs
when correlating the data within users or tasks. At this level
there was often little variation. Many users completed all
tasks successfully and many tasks had 90 to 100%
successful completion rates. Error rates were also often
homogenous at this level, with many users committing no
errors and many tasks being error-free. Under these
circumstances, it is impossible to compute a correlation,
which excludes the task from the types of analysis
conducted in the present study (as illustrated with the
sample data in Table 4).
At different levels of aggregation though (e.g., OO), a task
with a 100% completion rate gets combined with other
tasks, allowing it to contribute to the computed correlations.
What’s more, at very high or low levels of magnitude there
is also a more limited opportunity to detect correlations (as
only 1 or 2 values may differ). This problem is most
noticeable when correlating the discrete measures
(completion rates and errors) when there are a limited
number of values. It is also a potential problem for post-task satisfaction scales if there are few scale steps.
This factor affected 5 out of 6 of the correlation pairs, with
the greatest range restriction expected for the correlations
between completion and errors and between completions
and satisfaction. To a slightly lesser extent, it will restrict
the correlations between completion and time, errors and
time, and errors and satisfaction. There should be little
restriction of the correlation between time and satisfaction.
CONCLUSION
Recent investigations of the magnitude of correlations
among prototypical usability metrics have had mixed
results, with some indicating substantial correlation [17]
and others less substantial [7]. In this paper, we report the
correlations computed from a database with prototypical
usability metrics (task times, completion rates, errors, post-task satisfaction, and post-test satisfaction) from 90 distinct
summative usability studies. For these types of studies and
measurements, the data indicated that prototypical usability
metrics correlate substantially. Additional analyses
provided evidence of their association with an underlying
general construct of usability made up of an objective factor
and a subjective factor, supporting the practice of
combining component usability metrics into a single score.
The results of this study help to clarify the factors that
affect the correlation structure of usability studies, such as a
focus on summative usability studies (as opposed to more
general studies of human-computer interaction),
distinguishing between post-task and post-test satisfaction
measurement, and the effect of various data-aggregation
schemes.
REFERENCES
1. Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation of the System Usability Scale. International Journal of Human–Computer Interaction, 24(6), 574-594.
2. Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. W. Jordan, B. Thomas, B. A. Weerdmeester & I. L. McClelland (Eds.), Usability Evaluation in Industry (pp. 189-194). London: Taylor & Francis.
3. Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
4. Coovert, M. D., & McNelis, K. (1988). Determining the number of common factors in factor analysis: A review and program. Educational and Psychological Measurement, 48, 687-693.
5. Cordes, R. (1984). Software ease-of-use rating using magnitude estimation (Tech. Report 82-0156). Tucson, AZ: IBM.
6. Hassenzahl, M., & Sandweg, N. (2004). From mental effort to perceived usability: Transforming experiences into summary assessments. In Proceedings of the CHI 04 Conference on Human Factors in Computing Systems, Extended Abstracts (pp. 1283-1286). New York, NY: ACM.
7. Hornbæk, K., & Law, E. (2007). Meta-analysis of correlations among usability measures. In Proceedings of CHI 2007 (pp. 617-626). San Jose, CA: ACM.
8. Kirakowski, J., & Corbett, M. (1993). SUMI: The Software Usability Measurement Inventory. British Journal of Educational Technology, 24, 210-212.
9. Leong, F. T. L., & Austin, J. T. (2005). The psychology research handbook: A guide for graduate students and research assistants. Thousand Oaks, CA: Sage Publications.
10. Lewis, J. R. (1991). A rank-based method for the usability comparison of competing products. In Proceedings of the Human Factors Society 35th Annual Meeting (pp. 1312-1316). San Francisco, CA: Human Factors Society.
11. Lewis, J. R. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23(1), 78-81.
12. Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human–Computer Interaction, 7, 57-78.
13. Lewis, J. R. (2002). Psychometric evaluation of the PSSUQ using data from five years of usability studies. International Journal of Human–Computer Interaction, 14, 463-488.
14. Lewis, J. R. (2006). Usability testing. In G. Salvendy (Ed.), Handbook of Human Factors and Ergonomics (3rd ed.) (pp. 1275-1316). New York, NY: John Wiley.
15. McGee, M. (2004). Master usability scaling: Magnitude estimation and master scaling applied to usability measurement. In Proceedings of CHI 2004 (pp. 335-342). Vienna, Austria: ACM.
16. Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
17. Sauro, J., & Kindlund, E. (2005). A method to standardize usability metrics into a single score. In Proceedings of CHI 2005 (pp. 401-409). Portland, OR: ACM.
18. Tedesco, D. P., & Tullis, T. S. (2006). A comparison of methods for eliciting post-task subjective ratings in usability testing. UPA 2006, unpublished presentation. (www.upassoc.org/usability_resources/conference/2006/post_task_ratings.pdf)
19. Xie, B., & Salvendy, G. (2000). Prediction of mental workload in single and multiple tasks environments. International Journal of Cognitive Ergonomics, 4, 213-242.