1 INTRODUCTION
Millions of people track their physical activity. Eighteen percent of US adult consumers own a wearable
fitness tracker, with the majority using their device often [1]. Over 22 million wearable devices were
shipped in the second quarter of 2016 alone [21]. Beyond this, smartphones and smartwatches are increasingly making it possible for more people to track their physical activity.
This means that people are building up vast collections of data about their physical activity. Such
data should be valuable for the individual who wants to understand their own long term activity. The
aggregated data can provide a low cost way to collect data about various populations. Table 1 shows
examples of the types of questions that such data have the potential to answer. The first row illustrates
questions about activity level. For example, an individual may want to know how many steps a day they
average; an example answer is 10,500 steps. Answering such questions can give benefits for reflection, self-monitoring and planning, as documented in personal informatics research [9, 15, 18, 19, 25, 28, 33]. Aggregate analyses can provide a corresponding average daily step count, such as 7,500 for that population. A large body of health literature has examined such questions, for example, to inform recommendations about levels of physical activity for good health [20, 41]. The second type of question asks whether a target goal has been met [38, 41]. For example, did I meet my goal of more than 30 active minutes a day?
A key challenge for interpreting physical activity data is that it is typically incomplete [13, 26, 30, 34].
Many factors can contribute to gaps in the data, such as forgetting to wear the device, device loss or
changes in motivation to track [3, 9, 14, 26, 30]. To answer questions like those in Table 1, it is critical
to account for this incompleteness. To see why this is so, consider the first question in the table −
determining a person’s average step count. Consider the case of Alice, who wears her tracker all day, every day; a reliable answer can be calculated as a simple average over each day’s step counts. But consider another person, Bob, who wears his tracker on only 60% of days, but on the days he does wear it, he wears it all day. His average daily step count should be based on just those days where he has data (the total step count divided by the days with data). We also need to consider the impact of the wear-time within a day. Consider another person, Carol, who wears her tracker every day, but only in the morning on weekends and all day on weekdays. Now a meaningful answer is more complex to determine: it needs to account for the incompleteness of her weekend data.
This paper aims to provide foundations for a systematic process to take account of the incompleteness
of personal sensor data for physical activity when answering questions like those in Table 1. We introduce
adherence, a notion that reflects the fact that an activity tracker should give accurate answers to questions
about activity for people like Alice, who has 100% adherence, wearing her tracker all day, every day. We aim to establish adherence measures to account for people with less than 100% adherence, whether like Bob, Carol, or any of the myriad other possible wearing patterns.
While previous work has studied wearing behaviour [3, 9, 13, 14, 26, 30, 34], there has been no research
on how to systematically tackle the analysis and reporting of that data to account for its incompleteness.
Table 1. Examples of important questions long term physical activity data can answer, at the individual level or at the aggregate, population level.
This is important if people are to trust the information that applications report about physical activity, since such applications claim, or appear to claim, that ubicomp sensor data gives an objective account of one’s activity. Fogg [17] warns that when systems produce questionable data, people are less likely to trust them, and so they are less useful as a behaviour change tool. Bentley et al. [3] found that incomplete data led to a loss of trust in their tool. Consolvo et al. ([9], page 234) also reported this when missing data affected self-monitoring feedback.
To address these challenges, our work aims to provide systematic foundations for analysis and reporting
physical activity data, accounting for data incompleteness. To do this we tackle the following research
questions:
RQ1 What is the impact of different adherence measures on data ignored?
RQ2 How can we account for adherence for Activity-level questions?
RQ3 How can we account for adherence for Goal-met questions?
We explored these questions by analysing the impact of different adherence definitions on 12 datasets,
with a total of 753 physical activity tracker users, who had more than 77,000 days with data, interspersed
with over 73,000 days without data. We analysed them with 4 adherence measures that have been reported in previous literature; a code sketch of these measures follows the list:
∙ >0 steps: the least stringent, answering activity questions using data from any day that has any data [14, 26, 29, 30, 34];
∙ >500 steps: uses only days with more than 500 steps [29, 30];
∙ >10 hours: uses only days with at least 10 different hours with data [6, 26, 31];
∙ 3-a-day: requires data within 3 time periods of the day [29, 30].
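To make these definitions concrete, the following minimal sketch (Python is our choice for illustration; the layout of a day as 24 hourly step counts and the 3-a-day period boundaries are assumptions, since [29, 30] do not prescribe exact hours here) classifies a single day as valid under each measure:

    # Minimal sketch: a day is assumed to be a list of 24 hourly step counts.
    def valid_day(hourly_steps, measure):
        total = sum(hourly_steps)
        hours_with_data = sum(1 for s in hourly_steps if s > 0)
        if measure == ">0 steps":
            return total > 0
        if measure == ">500 steps":
            return total > 500
        if measure == ">10 hours":
            return hours_with_data >= 10
        if measure == "3-a-day":
            # Assumed morning/afternoon/evening boundaries for the 3 periods.
            periods = [(6, 12), (12, 18), (18, 24)]
            return all(any(hourly_steps[h] > 0 for h in range(start, end))
                       for start, end in periods)
        raise ValueError("unknown measure: " + measure)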
As this is the first work to establish a systematic way to account for adherence, we chose this core set of
research questions, 12 very diverse datasets and these 4 adherence measures from the literature.
The next two sections introduce key terminology and related work. They are followed by the core of the paper: the study design, results and their discussion.
2 DEFINITIONS OF ADHERENCE
This section introduces key terms for defining adherence. It begins with ways to describe tracker wear-time.
It then defines a valid day, one with data of sufficient quality that it is meaningful to include in analyses,
along with criteria for assessing the validity of a day and ways to report adherence. It concludes with a
review of adherence that goes beyond a single day, to describe adherence for one week and for longer
periods.
tracker as instructed. We want to be able to meaningfully analyse the vast collections of activity data that people are building up. For this, we need to refine the notion of adherence to provide a conceptual framework and a systematic process for ubicomp researchers, and others, to meaningfully interpret such data.
The last row, adherence to physical activity recommendations (also called compliance), describes how
well a person or sample population meets a recommended level of physical activity. Adherence to a
recommendation is of enormous importance [20] and the use of trackers to obtain objective measures of
activity levels is of intense interest in health literature [11, 16, 31, 38, 41].
A day is then invalid only if there is insufficient data to make it valid: neither the wear-time threshold nor the goal is met (e.g., a day that reached 3,000 steps with 4 hours of wear-time).
The last two rows introduce terms to use when reporting adherence. Daily adherence refers to the
percentage of valid days in a dataset, as a description of completeness in terms of valid days. Daily
adherence can be calculated for an individual or a population.
We now consider adherence beyond a single day. Weekly adherence measures the average number of
valid days per week (only calculated during weeks where there is at least one valid day). For example,
suppose a person has 50% daily adherence (i.e., having 50% of days valid) and 7 days-per-week weekly
adherence. This corresponds to a person who had only 50% of their days valid but, in every week with any data, all 7 days were valid. Table 4 summarises other terms that have been used to describe patterns of
adherence. A streak is an unbroken sequence of valid days. In contrast, a break, lapse or gap describes
sequences of days that are not valid days. Phases or trials describe a series of streaks separated by short
breaks and ending with a long break.
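To illustrate how these reporting terms can be computed, a minimal sketch follows; it assumes validity has already been determined per day (e.g., with a threshold like those above) and stored in a mapping from date to a boolean:

    # Minimal sketch: `valid` maps each datetime.date in the observation
    # period (valid and invalid days alike) to True/False.
    from collections import defaultdict

    def daily_adherence(valid):
        # Percentage of valid days over all days in the dataset.
        return 100.0 * sum(valid.values()) / len(valid)

    def weekly_adherence(valid):
        # Average number of valid days per week, counted only over weeks
        # with at least one valid day.
        valid_per_week = defaultdict(int)
        for day, is_valid in valid.items():
            if is_valid:
                year, week, _ = day.isocalendar()
                valid_per_week[(year, week)] += 1
        return (sum(valid_per_week.values()) / len(valid_per_week)
                if valid_per_week else 0.0)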
In the next section, we review the large body of work reporting physical activity tracker adherence and
existing methods to analyse and address incompleteness in physical activity tracker data.
3 RELATED WORK
In this section, we first review research on activity tracker wearing behaviour and how this impacts data completeness. We then review work that highlights why this is important for ensuring users’ trust and their ability to reflect on their data. Finally, we review reported methods used to deal with data incompleteness. We describe this work using the term adherence, as just defined, although the original authors did not use these terms.
Participants reported that uncertainty information helped them make better decisions and alleviate
anxiety when the app information did not match their knowledge. In a study of an interface that embeds
daily and hourly adherence (wear-time) information in a calendar visualisation [34], this adherence information helped users reflect on their long term activity and link it with their knowledge about the context. Some participants also reflected on their wearing behaviour, as well as the factors affecting it.
These results indicate the importance of accounting for and presenting information about incompleteness
or uncertainty.
To summarize, adherence can impact the perceived accuracy of activity tracker data, and there is evidence that presenting this information may improve confidence and trust in the application, as well as support reflection.
3.4 Summary
Existing literature highlights that people have varying wearing behaviours, providing diverse levels and
patterns of completeness. This poses important problems for ensuring the trust needed for people to effectively
reflect on their data as a foundation for behaviour change. There has been limited work on how to address
the incompleteness that can be expected in many datasets. This is the challenge that our work aims to
address.
4 STUDY DESIGN
Our study design has three elements: a suitable collection of datasets; a set of adherence definitions to
explore; and a sequence of experiments to perform to gain insights into our research questions. The design
process began when the Sydney University authors were analysing several of their datasets and began
to appreciate the need for a systematic approach to taking account of tracker adherence. The team
initiated the collaboration with the other authors, Meyer and Epstein, to discuss their insights, based on
their work on wearing behaviours and patterns, including streaks, breaks, lapses, phases and trials as just
described. The new team then established the design for this study.
4.1 Datasets
The team collected the 12 datasets presented in Table 5. This collection is diverse on many dimensions, as we now describe.
Table 5. The 12 datasets from 9 studies of various lengths and population sizes. The first column is the identifier we use to describe the dataset. Next are the sample size and average duration in days, the average step count (using only days with >0 steps) and then the recruitment methods. The data source column distinguishes the volunteers datasets (the first block) from the remainder, being other study-generated datasets.
The first column shows the name we use to refer to the dataset. The second is the number of people in the dataset, and the third is the average duration in days, from the first to the last day with data. The fourth column is the median steps per day, calculated using the >0 steps threshold. The next two columns (col 5 and 6) indicate the sources of the data, in terms of the means used to recruit participants and whether the data was volunteered or other study-generated. The last column overviews the ways the datasets have been used in previously published work, or notes that they have not been published. The table groups the volunteers datasets first. The remaining datasets are all other study-generated, because participants were recruited as part of another study and, in these, participants were provided with a tracker.
The first 5 datasets are volunteers. In these, people who were already tracking were recruited to
volunteer their data. Volunteer1 consists of 113 Fitbit users who tracked from 18 to 731 days (average
344), with 73% having tracked for >=6 months. The Volunteer2 and Volunteer3 Fitbit datasets were
used to study gaps and lapses in activity tracker use [13]. These had different recruitment methods: of the
141 users, the 67 of Volunteer2 were recruited in a similar way to Volunteer1 and the other 74 (Volunteer3) via Amazon Mechanical Turk (a popular crowd-sourcing website). The two groups had different Fitbit use patterns, with participants from the snowball recruitment wearing their Fitbits more and walking more each day than those recruited via Amazon Mechanical Turk. Volunteer4 recruited long term Fitbit users via forums and email, for a study that sought to understand how they already used their long term activity
data and then to study their use of a calendar-based interface showing their full record of activity and
wearing behaviour [34]. All but 2 had >6 months of data. The last in this group, Volunteer5, also a
volunteered dataset, involves a different device, the VitaDock [30].
All the remaining datasets are other study-generated, meaning that the data was collected as part of
another study for which participants were recruited to answer a question unrelated to studying wearing
behaviour. The middle block of three datasets and these, along with Volunteer5 were analysed by Meyer
et al [30] to gain understanding of wearing patterns and how to describe them. These three studies used
various trackers, including various versions of Fitbit and the Medisana ViFit tracker. The first, Elder, is
distinctive in that it is the only dataset in our collection where tracker use was mandatory, as is typical in
medical studies. This means that this dataset only has participants who had recorded tracker data. This
dataset is also distinctive as it recruited an older population, aged 65-75. Cardiac users were recruited
within 30 days of a myocardial infarction as part of a 12-month rehabilitation program. Lotus is a very
small longitudinal observational study with no explicit intervention. This study’s population is closer to
the self-motivated Fitbit users from Volunteer1, Volunteer2, Volunteer3 and Volunteer4.
The next 4 data sets came from studies of university students who were lent a Fitbit Zip. Student2
and Student1 involved Medical Science students, recruited in a tutorial class and split into a control group (Student2) and an intervention group (Student1) to assess the impact on wear-time of a weekly SMS (text message) on Fridays reminding the experimental group to wear their tracker [4]. Student3 and
Student4 datasets were from an observational study to learn about physical activity levels of undergraduate
students [26]. These students were also recruited in a tutorial class.
To summarize, our datasets are diverse in terms of all the dimensions summarised in Table 5 as well as
the details above. This makes it a rich collection for exploring our research questions.
4.2 Thresholds
Table 3 introduced several definitions and background for defining the thresholds for a valid day. As
this is the first systematic analysis of adherence over a diverse collection of datasets, we restricted our
analyses to just four carefully chosen valid day thresholds that have been used in previous research on
wearing behaviours and in analysing activity tracking data:
∙ >0 steps as this is the simplest and least restrictive – used in [13, 26]
∙ >= 500 steps as another simple step criterion – used by [29, 30]
∙ 3-a-day, a measure of wear through the day [29, 30]
∙ >10 hours, the most stringent measure, requiring at least 10 hours, each with at least 1 step [26, 31, 34].
The 10-hours adherence threshold has been used in the health literature [27, 31, 37, 42], although less stringent thresholds have also been used (e.g., 8 hours [24]).
In our analysis, wear-time (hours per day) is calculated as the count of hours where at least 1 step
was recorded. This approximation is needed because the trackers for our datasets do not distinguish a
sedentary person wearing the tracker from a tracker that is not worn.
While our method is not exact in accounting for non-wear, it serves as an estimate suitable for comparison at a population level.
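A minimal sketch of this wear-time estimate, under the same assumed layout of a day as 24 hourly step counts:

    # Minimal sketch: wear-time approximated as the count of hours with
    # at least 1 recorded step (sedentary wear and non-wear are conflated).
    def wear_time_hours(hourly_steps):
        return sum(1 for steps in hourly_steps if steps >= 1)

For example, a day with steps recorded only in the 8am, 1pm and 7pm hours yields an estimated wear-time of 3 hours, regardless of how many steps those hours contain.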
5 RESULTS
We now present the results of our analysis. The presentation is organised around our three research questions. First, we consider the impact of different adherence definitions on the data ignored (RQ1). Then we show how these definitions impact results. In the discussion, we consider how these results point to ways to account for adherence (RQ2 and RQ3).
Fig. 1. Comparing the %-age of days discarded under the valid day thresholds >=500 steps, >=10 hours and 3-a-day, against the >0 steps baseline (days with no data).
Figure 1 compares the %-age of days of data that are ignored for each dataset. It takes the >0 steps measure
as a baseline, since it is the least stringent measure. It shows how the other three measures compare
against it. This and subsequent graphs order the datasets as in Table 5. This groups them according to
the broad characterisation of the datasets in terms of whether they were volunteered or part of a study
that influenced participants to track. We include the sample size in the labels so the reader can see where
the small size of a dataset may help explain the results.
First consider the >500 minimum step criterion (yellow circles). Overall, the graph shows that this
criterion discards between 5% and 10% of days that are valid on the >0 steps criterion (mean: 5.1%, SD:
2.8%, 95% CI: ±1.8%). These are days with exceedingly modest use of the tracker (i.e., between 1 and 500 steps); if a dataset has many of these, the reasons may deserve exploration, to understand why people would often make so little use of a tracker, or whether it indicates problems with the device (such as calibration problems for people with specific mobility issues, for example those using a walking frame).
We now consider the through-the-day thresholds (green diamonds for >=10 hours and blue squares for 3-a-day). There are three striking trends here. First, the levels of data loss now have a far wider range, from 6% (Volunteer1) to 47% (Student4). Secondly, the data loss is always higher than for the >=500 steps threshold; these differences are significant (one-way ANOVA F(2,33)=5.64, p<0.001). Thirdly, both through-the-day thresholds are strikingly similar to each other for most datasets, also reflected in the mean %-age of days of data discarded showing no significant differences:
∙ mean: 21.5%, SD: 11.7%, 95% CI: ±7.4% - 3-a-day threshold
∙ mean: 23.4%, SD: 12.7%, 95% CI: ±8.1% - 10-hour threshold
One clear outlier is Volunteer4, the only case where the two differ markedly: 3-a-day discards around 5% of days, much closer to the >=500 steps threshold, while the 10 hours threshold discards almost 20%, as shown in Figure 1. This may be due to the small sample size (N=23) or to variations at the individual level. As we would expect, these results show that the through-the-day thresholds, being stricter, may discard far more data. However, the differences varied widely across the datasets. The figure indicates much smaller differences for the Volunteers datasets, compared with the other datasets (Students plus Others: Elder, Lotus, Cardiac), and the difference is significant (2-sample t-test, t10 =3, p=0.01, 95% CI: ±12.6%):
∙ mean data loss: 13.5%, SD: 4.8%, 95% CI: ±5.9% for the Volunteers datasets;
∙ mean data loss: 30.4%, SD: 11.8%, 95% CI: ±10.9% for the other datasets (Students plus Others: Elder, Lotus, Cardiac).
The example in Figure 2 illustrates that this issue is also relevant when answering Goal-met questions.
In this example, using a valid-day threshold of >10 hours, we show the percent of days that would
be excluded (or considered insufficient data days) for each of our datasets. The green bars represent the percentage of days where participants met the 30 active minutes goal (i.e., defined as goal-met in Table 3). The yellow bars represent the percentage of days where participants did not meet the goal but recorded 10 or more hours of data (i.e., goal-not-met). The grey bars represent the percentage of days where participants did not meet the goal and also did not record 10 hours of data (i.e., insufficient data). The figure shows very wide differences in the percentage of insufficient data days across datasets, from 6% for Volunteer1 to over 52% for Student4.
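A minimal sketch of this three-way classification, using the 30 active minutes goal and 10 hours threshold from the text (the per-day inputs are assumed to be precomputed, e.g., with the wear-time estimate above):

    # Minimal sketch: classify one day for a Goal-met report like Figure 2.
    def classify_goal_day(active_minutes, wear_hours,
                          goal_minutes=30, valid_hours=10):
        if active_minutes >= goal_minutes:
            return "goal-met"           # goal reached, whatever the wear-time
        if wear_hours >= valid_hours:
            return "goal-not-met"       # enough data to trust the negative
        return "insufficient-data"      # too little wear to judge the day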
Fig. 2. An example of a Goal-met report showing the percentage of days that met the 30 active minutes goal, did not meet it, or had insufficient data (using the 10 hours valid day threshold to determine insufficient data days). Datasets Volunteer2, Volunteer3, Volunteer4, Cardiac and Volunteer5 are not included due to lack of per-minute data. N of each sample included in X-axis label.
the participants have wear-time above the threshold of 10 hours, it may be worthwhile exploring less
stringent thresholds, and so include more of the population.
Continuing our focus on people whose data is discarded, Figure 4 shows an analysis of weekly adherence,
the number of days a week that were valid (using the 10 hours valid day threshold). The Volunteers
datasets were dominated by people with high weekly adherence, with almost half of them averaging
7 valid days a week. The Others datasets (Elder, Cardiac and Lotus) have a flatter distribution but
still have 30% of people with 7 valid days a week. By contrast, the Students (red squares) have a very
different profile of weekly adherence, with very few reaching 7 valid days a week. This is reflected in the
weekly adherence across the groups:
We repeated this analysis for the least restrictive >0 steps threshold. We found that overall, this more
relaxed threshold gave 1.5 (21%) more valid days per week.
Fig. 3. Comparison of %-age of users with median wear-time >=10 hours, >= 6 and < 10 hours and < 6 hours. N
included in X-axis label.
Fig. 4. Comparison of users’ median weekly adherence (number of valid days per week). Grouped as follows: Volunteers (Volunteer1, Volunteer2, Volunteer3, Volunteer4, Volunteer5; N=310), Students (Student1, Student2, Student3, Student4; N=305), Others (Elder, Cardiac, Lotus; N=138).
5.2 Exposing Uncertainty: The Impact of Threshold Methods on Activity Level Reporting (RQ2
and 3)
In this section, we show how the adherence measure can affect interpretation of the activity data. We do this for two activity-level measures: average step count per day and average active minutes per day.
Fig. 5. Comparison of median steps across populations, showing impact of different valid day thresholds. N included in
dataset labels.
Figure 5 shows the median steps per day when calculated against the 4 adherence thresholds. For each
dataset, it shows, from the top: 10 hours (top end of line), 3-a-day (top of box), >=500 steps (bottom of box) and >0 steps per day (lower end of line) 1. This figure indicates how results might differ depending
on the adherence definition used. The clear picture that emerges is that the impact varies considerably
across the datasets.
For example, across all datasets, the mean steps for each adherence definition are:
∙ 7,133 steps, >0 steps (SD: 1,197, 95% CI: ±761);
∙ 7,682 steps, >=500 steps (SD: 926, 95% CI: ±588);
∙ 8,415 steps, 3-a-day (SD: 1,060, 95% CI: ±673);
∙ 8,779 steps, >10 hours (SD: 1,090, 95% CI: ±692).
1 We used a candlestick-like visualisation (or box plot) to convey the spread between the datasets. While it is theoretically possible for the step count under the 3-a-day threshold to be higher than under 10 hours, or lower than under 500 steps, this is not the case in any of our datasets. So, this format gives a compact summary of our analyses.
There was no significant difference between the two step-count thresholds (>0, >=500), nor between the through-the-day thresholds (3-a-day and >10 hours). However, the step count using the least stringent threshold (>0 steps) is 1,646 steps (23%) lower than using the most stringent (>=10 hours). This significant difference is quite large in terms of absolute activity level (paired t-test, t11 =6.9, p<0.0001, 95% CI: ±527).
Fig. 6. Comparison of active minutes results across populations, showing the impact of different valid day thresholds.
Datasets Volunteer2, Volunteer3, Volunteer4, Cardiac and Volunteer5 are not included due to lack of per minute data.
N included in dataset labels.
We also compared the >0 steps and 10 hours thresholds for the Volunteers datasets and found no significant differences. However, when we examined this for the other datasets (i.e., Students plus Others: Elder, Cardiac, Lotus), there was an average difference of 2,020 steps between the step counts using the two thresholds (paired t-test, t6 =6.9, p<0.001, 95% CI: ±718).
Figure 6 shows a similar analysis, now for active minutes per day 2. Similar to our analysis of step counts, the Students and Others (Elder, Lotus) datasets had significantly higher active minutes (1.9 minutes more) using the 10 hours threshold compared to the >0 steps threshold (paired t-test, t5 =5.9, p=0.002, 95% CI: ±0.81).
2 To calculate active minutes, we used a cadence of 120 steps per minute, commonly used to identify moderate-to-vigorous physical activity (MVPA) [41]. Some datasets were excluded from this analysis due to lack of per-minute activity tracker data (i.e., Volunteer2, Volunteer3, Volunteer4, Volunteer5 and Cardiac).
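The active-minutes calculation in footnote 2 amounts to a simple cadence filter; a minimal sketch, assuming per-minute step counts are available:

    # Minimal sketch: a minute counts as active at a cadence of
    # >= 120 steps/minute, the MVPA convention cited above [41].
    def active_minutes(minute_steps, cadence=120):
        return sum(1 for steps in minute_steps if steps >= cadence)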
6 DISCUSSION
In the introduction, we presented examples of core Activity-level and Goal-met questions, both for
individuals and in aggregate. As a foundation for our discussion, we now introduce the following questions, which refer to two hypothetical long-term datasets, Dataset1 and Dataset2.
(1) In Dataset1, what were people’s average daily step counts in 2014 and 2017?
(2) Were people in Dataset1 more active than those in Dataset2 in 2017?
(3) For the goal of 120 minutes a week of moderate activity, what percentage of Dataset1 people met
the goal in 2014 and 2017?
The first is an Activity-level question, to compare activity level within a dataset. The second question is
similar to the first, but involves comparisons between datasets. The third is similar to the first, but it is
for a Goal-met question. These go beyond the analyses we have reported but are useful for broadening
the scope of our discussion, building on the reported work. We refer to these questions in the discussion,
which starts with the key insights for our three research questions. Building from this, we present a set
of recommendations for analysing physical activity data. We then briefly discuss the diverse goals for
activity tracking. Finally, we discuss the limitations of our work, along with future directions for research
to support systematic and effective accounting for adherence in physical activity data and other personal
sensor data.
This impacts calculations for Activity-level questions (RQ2). For example, in the combined student datasets, the >0 steps adherence measure gave a daily step count of 6,952, where the >=10 hours measure gave 9,423, 35% higher (see Figure 5). From a health perspective, this is an important difference. Corresponding to this, the different adherence measures had very different impacts on the data ignored (RQ1). The %-age of days ignored moved from 32% to 60% (see Figures 1 and 5). In terms of people ignored, the >10 hours per day measure excludes more than a third of students (34%), compared with days with >0 steps (see Figure 3). For weekly adherence, less than a third of students (32%) averaged 4 or more days per week with >10 hours (see Figure 4). This highlights the challenge in interpreting such data to draw conclusions about Activity-level and Goal-met questions. It indicates the potential for introducing bias by using an adherence measure that excludes many low adherence users. There is a similar picture for Goal-met questions (RQ3). For example, for the combined students dataset, the >10 hours threshold would ignore 43% of days. Our work suggests that there are no easy answers to interpreting such questions, but it does point to the importance of reporting adherence measures and their impact along with inferred answers to questions.
2: Explore the impact of the adherence measure(s). For each adherence measure considered, replicate our analyses (a sketch follows the list below) to assess its impact in terms of:
∙ the %-age of days ignored
∙ the %-age of people ignored.
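A minimal sketch of this impact analysis, reusing the valid_day function sketched in the introduction; the dataset layout (a mapping from each user to their list of days of hourly step counts) and the person-level criterion (a person is ignored if none of their days is valid) are our assumptions:

    # Minimal sketch: per-measure impact as %-days and %-people ignored.
    def measure_impact(dataset, measure):
        all_days = [day for days in dataset.values() for day in days]
        n_valid = sum(valid_day(day, measure) for day in all_days)
        pct_days_ignored = 100.0 * (1 - n_valid / len(all_days))
        ignored_people = sum(
            1 for days in dataset.values()
            if not any(valid_day(day, measure) for day in days))
        pct_people_ignored = 100.0 * ignored_people / len(dataset)
        return pct_days_ignored, pct_people_ignored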
Then follow the actual analyses of the data, based on the chosen set of adherence measures, to determine the answers to the core questions, such as the examples at the beginning of this section.
3: Report adherence along with results of data analysis. For many contexts, analysis of tracker data needs
to provide a single answer to a question. Even in these cases, we recommend that this result is reported
along with:
∙ the adherence measure used;
∙ the results of the analyses for this measure from Step 2.
These recommendations provide a way to enhance trust and confidence in the information by presenting information about its accuracy. In the case of research reports, where more information can be provided, we
also recommend providing an explanation for the choice of the adherence measure as well as details of
the fuller analyses from Step 2. Over time, this would make it easier for researchers to compare reported
results across the literature.
6.3 Reasons for Collecting and Analysing Activity Data and Implications for Accounting for Adherence
Motivations for Collecting Tracking Data
There are many possible reasons for a person to track physical activity, producing a dataset. Tracking may
be initiated by the individual. Increasingly, smart-phones automatically collect activity data, often without
the owner being aware. But much tracker data is purposefully collected. This form of tracking has been the
focus of Ubicomp’s Personal Informatics, as well as the Quantified Self communities [7–9, 12, 25, 32, 33].
That research reports various motivations for such tracking, such as self-monitoring, self-reflection and conducting n-of-1 experiments [7, 10], and, with suitable interfaces, to support behaviour change [9].
(Our volunteer data is most similar to this work.) Beyond this, tracking may be initiated as part of an
intervention (Elder, Cardiac and Lotus) perhaps on advice of a medical professional or in a study of
a population (like our student datasets). People may use trackers for short periods, for example just
for a week to establish a baseline and then at later points, to assess the effects of the intervention, as
in [35]. Others track consistently over long periods. In all these cases, the answers to questions such
as the examples at the beginning of this section demand a solid adherence analysis such as we have
recommended. This can be the basis for assessing the accuracy of the results and, if multiple adherence measures are used and reported, this will support comparisons in the literature.
Using Adherence Measures to Change Wearing Behaviour Versus Using Adherence Measures to
Make Sense of Available Data
If adherence measures are reported along with results, as in the examples above, this may give individuals
information to help them re-consider their adherence levels. For example, Tang et al. [34] reported that
some of their participants said that they planned to be more adherent. But a core goal of our work is
to provide foundations to harness available activity data, regardless of whether the adherence is low,
intermittently high, consistently high or any other pattern. Even with lower adherence, if the adherence
is reported, people may well gain valuable answers to their questions, along with information to assess
the reliability.
Questions explored. Our three research questions explored the impact of adherence measures and how
to account for adherence when answering questions about physical activity. We began with questions
about pure Activity-level and Goal-met questions. At the beginning of this section, we introduced similar
questions for comparisons within and between datasets. These are basic questions, representing just a
starting point in analysing physical activity data.
Adherence definitions and analyses explored. To manage the complexity of the results to present, we carefully selected just 4 valid day definitions: two minimum-step-count and two through-the-day. In the case of minimum-hours-in-the-day, we focused on 10 hours a day because it is common in the literature, but we did explore the impact of considering <6 hours and 6-9 hours. Similarly, the set of experiments we report
was carefully chosen to explore our research questions and enable a reader to see the key results showing
the importance of definitions of adherence.
Datasets: participants. While our datasets are large and diverse, they all come from authors of this paper. It will be valuable to see this work replicated on other datasets. Our results show that our volunteer and student categorisation does capture some commonalities within these categories. There are many dimensions that may be important for describing key characteristics of the people in a dataset,
such as gender, age, health status and importantly motivation for collecting the data. Associated with
these are characteristics such as the duration of data collection.
Datasets: activity tracking devices. Our datasets all come from similar trackers, although people had
different variants on these trackers. None of our devices can distinguish inactivity from non-wear. Many
current tracker devices can do this (for example, [22]). Even in our datasets, some participants may have changed devices over time. We ignored these issues in our analyses. To answer the three questions
at the beginning of this discussion, device differences need to be considered. There is also a broader
discussion around accuracy of trackers for reliability and comparability [16, 31, 37, 39].
Using Smart-phone physical activity data. Another important wearable (or carried) device is the mobile
phone. Our work provides a foundation for identifying meaningful measures of adherence for tracking
physical activity as captured by a mobile phone. For people who always wear/carry their phones when
awake, and so having high adherence, the phone could provide a reliable way to track activity. However,
many people do not do this. Yet a recent large study assumed it was reliable to compare populations, based on assessing daily wear-time from the first to the last phone use in the day [2]. On this basis, it reported a mean step count of 5,039 across 111 countries and 717,527 users and compared step counts by country.
There is no indication of incompleteness in their data. Are comparisons between different countries based
on comparable daily adherence (e.g., 80% US versus 70% Australia or 80% US versus 20% China)? Our
work suggests that it would be valuable to also calculate the step counts for other adherence measures
that are well suited to a mobile phone. An adaptation of our >10 hours measure seems a good starting
point.
Our work has implications for other inferences from wearable devices that measure many other things, such as heart rate, stress, sleep quality and air quality. We see an important role for adherence measures
in the ubicomp field when applications seek to combine and provide self monitoring using data from
different classes of devices and over time. As tracking capabilities, devices and sources of data change, so
too would the wearing and adherence patterns of the data people have collected. This is true for both
aggregate data and personal informatics.
7 CONCLUSION
We reviewed both health and computing literature to establish definitions of adherence, a measure of
incompleteness, to help answer two important classes of health questions: Activity-level and Goal-met.
We also introduced a new adherence measure, Valid-Goal-Day, needed when answering Goal-met based health questions. Our analysis of 12 large and diverse physical activity tracker datasets showed that previous threshold-based methods of addressing incompleteness are not appropriate for large scale volunteers datasets, such as those collected through personal use, e.g., by Fitbit users. When dealing with volunteers datasets, we recommend analysing and also reporting adherence measures for individual applications. Further research is needed on how this information can best be delivered. For aggregate reports, it is also important to report the adherence criteria used, along with the adherence measures and their impact on the activity level results.
REFERENCES
[1] Laura Albert. 2017. The Surprising Potential Fitness Tracker Buyer. (Aug 2017). https://civicscience.com/surprising-potential-fitness-tracker-consumer/ [Online; posted 21-July-2016, CivicScience].
[2] Tim Althoff, Rok Sosič, Jennifer L. Hicks, Abby C. King, Scott L. Delp, and Jure Leskovec. 2017. Large-scale physical
activity data reveal worldwide activity inequality. Nature (2017). https://doi.org/10.1038/nature23018
[3] F Bentley, K Tollmar, and P Stephenson. 2013. Health Mashups: Presenting Statistical Patterns between Wellbeing Data and Context in Natural Language to Promote Behavior Change. ACM Transactions on Computer-Human Interaction (TOCHI) 20, 5 (2013), 1–27. https://doi.org/10.1145/2503823
[4] Kevin Alexander Bragg. 2015. Does The Quantified Self Equal Quantified Health? Ph.D. Dissertation. University of
Sydney.
[5] Lisa Cadmus-Bertram, Bess H Marcus, Ruth E Patterson, Barbara A Parker, and Brittany L Morey. 2015. Use of
the Fitbit to Measure Adherence to a Physical Activity Intervention Among Overweight or Obese, Postmenopausal
Women: Self-Monitoring Trajectory During 16 Weeks. JMIR mHealth and uHealth 3, 4 (2015), e96. https://doi.org/10.2196/mhealth.4229
[6] Lisa A. Cadmus-Bertram, Bess H. Marcus, Ruth E. Patterson, Barbara A. Parker, and Brittany L. Morey. 2015.
Randomized Trial of a Fitbit-Based Physical Activity Intervention for Women. American Journal of Preventive
Medicine (2015). https://doi.org/10.1016/j.amepre.2015.01.020
[7] Eun Kyoung Choe, Bongshin Lee, Haining Zhu, Nathalie Henry Riche, and Dominikus Baur. 2017. Understanding
Self-Reflection: How People Reflect on Personal Data through Visual Data Exploration. In Proceedings of the 11th EAI
International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth’17). ACM, New York,
NY, USA, Vol. 10.
[8] Eun Kyoung Choe, Nicole B Lee, Bongshin Lee, Wanda Pratt, and Julie A Kientz. 2014. Understanding quantified-
selfers’ practices in collecting and exploring personal data. In Proceedings of the 32nd annual ACM conference on
Human factors in computing systems. ACM, 1143–1152.
[9] Sunny Consolvo, Predrag Klasnja, David W McDonald, James A Landay, et al. 2014. Designing for healthy lifestyles:
Design considerations for mobile technologies to encourage consumer health and wellness. Foundations and Trends® in
Human–Computer Interaction 6, 3–4 (2014), 167–315.
[10] Nediyana Daskalova, Karthik Desingh, Alexandra Papoutsaki, Diane Schulze, Han Sha, and Jeff Huang. 2017. Lessons
Learned from Two Cohorts of Personal Informatics Self-Experiments. Proceedings of the ACM on Interactive, Mobile,
Wearable and Ubiquitous Technologies 1, 3 (2017), 46.
[11] Aiden Doherty, Dan Jackson, Nils Hammerla, Thomas Plötz, Patrick Olivier, Malcolm H Granat, Tom White, Vincent T
van Hees, Michael I Trenell, Christoper G Owen, et al. 2017. Large scale population assessment of physical activity
using wrist worn accelerometers: The UK Biobank Study. PloS one 12, 2 (2017), e0169649.
[12] Chris Elsden, David S Kirk, and Abigail C Durrant. 2015. A Quantified Past: Toward Design for Remembering With
Personal Informatics. Human–Computer Interaction (2015), 1–40.
[13] Daniel A Epstein, Monica Caraway, Chuck Johnston, An Ping, James Fogarty, and Sean A Munson. 2016. Beyond
Abandonment to Next Steps: Understanding and Designing for Life After Personal Informatics Tool Use. Proceedings
of the 2016 CHI Conference on Human Factors in Computing Systems (2016), 1109–1113. https://doi.org/10.1145/2858036.2858045
[14] Daniel A. Epstein, Jennifer Kang, Laura R. Pina, James Fogarty, and Sean A. Munson. 2016. Reconsidering the Device
in the Drawer: Lapses as a Design Opportunity in Personal Informatics. In Proceedings of the 2016 ACM International
Joint Conference on Pervasive and Ubiquitous Computing. ACM, 829–840.
[15] Daniel A Epstein, An Ping, James Fogarty, and Sean A Munson. 2015. A lived informatics model of personal informatics.
In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM,
731–742.
[16] Kelly R Evenson, Michelle M Goto, and Robert D Furberg. 2015. Systematic review of the validity and reliability
of consumer-wearable activity trackers. The international journal of behavioral nutrition and physical activity 12, 1
(2015), 159. https://doi.org/10.1186/s12966-015-0314-1
[17] B J Fogg. 2003. Persuasive Technology: Using Computers to Change What We Think and Do. The Morgan Kaufmann
series in interactive technologies, Vol. 5. Morgan Kaufmann. 283 pages. https://doi.org/10.4017/gt.2006.05.01.009.00
[18] Thomas Fritz, Elaine M Huang, Gail C Murphy, and Thomas Zimmermann. 2014. Persuasive technology in the real
world: a study of long-term use of activity sensing devices for fitness. In Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems. ACM, 487–496.
[19] Rúben Gouveia, Evangelos Karapanos, and Marc Hassenzahl. 2015. How do we engage with activity trackers?: a
longitudinal study of Habito. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and
Ubiquitous Computing. ACM, 1305–1316.
[20] William L Haskell, I Lee, Russell R Pate, Kenneth E Powell, Steven N Blair, Barry A Franklin, Caroline A Macera,
Gregory W Heath, Paul D Thompson, Adrian Bauman, and others. 2007. Physical activity and public health: updated
recommendation for adults from the American College of Sports Medicine and the American Heart Association.
Medicine and science in sports and exercise 39, 8 (2007), 1423.
[21] IDC. 2017. Worldwide Quarterly Wearable Device Tracker. (Aug 2017). https://www.idc.com/tracker/showproductinfo.jsp?prod_id=962
[22] Hayeon Jeong, Heepyung Kim, Rihun Kim, Uichin Lee, and Yong Jeong. 2017. Smartwatch Wearing Behavior Analysis: A Longitudinal Study. ACM Ubicomp 2017 / Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 1, 3 (2017). https://doi.org/10.1145/3131892
[23] Matthew Kay, Tara Kola, Jessica R Hullman, and Sean A Munson. 2016. When (ish) is My Bus? User-centered
Visualizations of Uncertainty in Everyday, Mobile Predictive Systems. In Proceedings of the 2016 CHI Conference on
Human Factors in Computing Systems. ACM, 5092–5103.
[24] K Konstabel, T Veidebaum, V Verbestel, L A Moreno, K Bammann, M Tornaritis, G Eiben, D Molnar, A Siani, O
Sprengeler, N Wirsik, W Ahrens, and Y Pitsiladis. 2014. Objectively measured physical activity in European children:
the IDEFICS study. Int J Obes (Lond) 38 Suppl 2, S2 (2014), S135–43. https://doi.org/10.1038/ijo.2014.144
[25] Ian Li, Anind K. Dey, and Jodi Forlizzi. 2012. Using context to reveal factors that affect physical activity. ACM
Transactions on Computer-Human Interaction 19, 1 (2012), 1–21. https://doi.org/10.1145/2147783.2147790
[26] Lie Ming Tang and Judy Kay. 2016. Daily & hourly adherence: towards understanding activity tracker accuracy. CHI ’16 Extended Abstracts on Human Factors in Computing Systems (2016).
[27] Charles E. Matthews, Kong Y. Chen, Patty S. Freedson, Maciej S. Buchowski, Bettina M. Beech, Russell R. Pate, and Richard P. Troiano. 2008. Amount of time spent in sedentary behaviors in the United States, 2003-2004. American Journal of Epidemiology 167, 7 (2008), 875–881. https://doi.org/10.1093/aje/kwm390
[28] Jochen Meyer, Wilko Heuten, and Susanne Boll. 2016. No Effects But Useful? Long Term Use of Smart Health Devices. Ubicomp/ISWC’16 Adjunct (2016), 516–521. https://doi.org/10.1145/2968219.2968314
[29] Jochen Meyer, Jochen Schnauber, Wilko Heuten, Harm Wienbergen, Rainer Hambrecht, Hans-Jürgen Appelrath,
and Susanne Boll. 2016. Exploring Longitudinal Use of Activity Trackers. Proceedings of IEEE ICHI - International Conference on Healthcare Informatics (2016), 198–206. https://doi.org/10.1109/ICHI.2016.29
[30] Jochen Meyer, Merlin Wasmann, Wilko Heuten, Abdallah El Ali, and Susanne Boll. 2017. Identification and Classification
of Usage Patterns in Long-Term Activity Tracking. CHI ’17 Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems (2017). https://doi.org/10.1145/3025453.3025690
[31] Jairo H. Migueles, Cristina Cadenas-Sanchez, Ulf Ekelund, Christine Delisle Nystrom, Jose Mora-Gonzalez, Marie
Lof, Idoia Labayen, Jonatan R. Ruiz, and Francisco B. Ortega. 2017. Accelerometer Data Collection and Processing
Criteria to Assess Physical Activity and Other Outcomes: A Systematic Review and Practical Considerations. Sports
Medicine (2017), 1–25. https://doi.org/10.1007/s40279-017-0716-0
[32] Amon Rapp and Federica Cena. 2016. Personal Informatics for Everyday Life: How Users without Prior Self-
Tracking Experience Engage with Personal Data. International Journal of Human-Computer Studies 94 (2016), 1–17.
https://doi.org/10.1016/j.ijhcs.2016.05.006
[33] John Rooksby, Mattias Rost, Alistair Morrison, and Matthew Chalmers. 2014. Personal tracking as lived
informatics. Proceedings of the 32nd annual ACM conference on Human factors in computing systems - CHI ’14
(2014), 1163–1172. https://doi.org/10.1145/2556288.2557039
[34] Lie Ming Tang and Judy Kay. 2017. Harnessing Long Term Physical Activity Data: How Long-term Trackers Use
Data and How an Adherence-based Interface Supports New Insights. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 2 (June 2017), 26:1–26:28. https://doi.org/10.1145/3090091
[35] Anne Tiedemann, Leanne Hassett, and Catherine Sherrington. 2015. A novel approach to the issue of physical inactivity
in older age. Preventive medicine reports 2 (2015), 595–597.
[36] Fumiharu Togo, Eiji Watanabe, Hyuntae Park, Akitomo Yasunaga, Sungjin Park, Roy J. Shephard, and Yukitoshi
Aoyagi. 2008. How many days of pedometer use predict the annual activity of the elderly reliably? Medicine and
Science in Sports and Exercise 40, 6 (2008), 1058–1064. https://doi.org/10.1249/MSS.0b013e318167469a
[37] Stewart G. Trost, Kerry L. Mciver, and Russell R. Pate. 2005. Conducting accelerometer-based activity assessments in field-based research. Medicine and Science in Sports and Exercise 37, 11 SUPPL. (2005), 531–543. https://doi.org/10.1249/01.mss.0000185657.86065.98
[38] Jared M. Tucker, Gregory J. Welk, and Nicholas K. Beyler. 2011. Physical activity in U.S. adults: Compliance with
the physical activity guidelines for Americans. American Journal of Preventive Medicine 40, 4 (2011), 454–461.
https://doi.org/10.1016/j.amepre.2010.12.016
[39] Catrine Tudor-Locke. 2016. The Objective Monitoring of Physical Activity: Contributions of Accelerometry to Epidemiology, Exercise Science and Rehabilitation. (2016). https://doi.org/10.1007/978-3-319-29577-0
[40] C Tudor-Locke, D R Bassett, A M Swartz, S J Strath, B B Parr, J P Reis, K D Dubose, and B E Ainsworth. 2004. A
preliminary study of one year of pedometer self-monitoring. Ann Behav Med 28 (2004). https://doi.org/10.1207/s15324796abm2803_3
[41] Catrine Tudor-Locke, Cora L Craig, Wendy J Brown, Stacy A Clemes, Katrien De Cocker, Billie Giles-Corti, Yoshiro
Hatano, Shigeru Inoue, Sandra M Matsudo, Nanette Mutrie, Jean-Michel Oppert, David A Rowe, Michael D Schmidt,
Grant M Schofield, John C Spence, Pedro J Teixeira, Mark A Tully, and Steven N Blair. 2011. How many steps/day are enough? For adults. International Journal of Behavioral Nutrition and Physical Activity 8, 1 (July 2011), 79. https://doi.org/10.1186/1479-5868-8-79
[42] Catrine Tudor-Locke, Yoshiro Hatano, Robert P. Pangrazi, and Minsoo Kang. 2008. Revisiting "how many steps
are enough?". Medicine and Science in Sports and Exercise 40, 7 SUPPL.1 (2008). https://doi.org/10.1249/MSS.
0b013e31817c7133
Received August 2017; revised November 2017; revised January 2018; accepted January 2018