
Does Access to Human Coaches Lead to More Weight

Loss Than With AI Coaches Alone?∗


Anuj Kapoor¹, Sridhar Narayanan², and Puneet Manchanda³

¹ Indian Institute of Management, Ahmedabad
² Graduate School of Business, Stanford University
³ Ross School of Business, University of Michigan

January 2023

Abstract
Obesity and excess weight are major global health challenges. A number of tech-
nological solutions, including mobile apps, have been developed to help people lose
weight. Many such applications provide access to human coaches who help consumers
set goals, motivate them, answer questions and help them in their weight loss jour-
neys. Alternatively, similar services could be provided using AI coaches, which would
be cheaper and more scalable than human coaches. In this study, we ask if access
to human coaches incrementally affects weight loss outcomes for consumers relative
to having AI coaches alone. Our empirical context is a mobile app with two types
of subscription plans, those with AI coaches only and those with additional access to
human coaches. We compare adopters of the two types of plans on their weight loss
achievements. We address potential self-selection into these plans using a matching-
based approach that leverages rich behavioral data to find matching consumers on the
two types of plans. Our empirical analysis of about 65,000 consumers reveals that access to
human coaches leads to higher weight loss than with AI coaches alone. We document
heterogeneity in these differences based on age, gender, and starting BMI of the con-
sumers. We also explore potential mechanisms for the human coach impact on weight
loss.


Anuj Kapoor ([email protected]) is an Assistant Professor of Marketing at the Indian Institute of Man-
agement, Ahmedabad; Sridhar Narayanan ([email protected]) is Professor of Marketing and
the Younger Family Faculty Fellow for 2022-23 at the Graduate School of Business, Stanford University;
Puneet Manchanda ([email protected]) is the Isadore and Leon Winkelman Professor and Professor
of Marketing at the Ross School of Business, University of Michigan. The authors are grateful to Nathan
Yang, Unnati Narang and Jessica Fong, as well as participants at the 2022 Marketing Science Conference
and seminar participants at the Stanford GSB for their comments on the paper. The authors would also like
to thank Shrey Bansal, Vaibhav Mishra and Hemank Bajaj for their excellent research assistance. They are
also grateful to Manan Chandan, Tushar Vashisht, Sachin Shenoy, Tharun Reddy and others at HealthifyMe
for their help with getting the data for the project and for their insights.

1 Introduction

One of the biggest health issues worldwide in recent years has been the extent of obesity in
a large and growing proportion of the world population. As pointed out by Blüher (2019),
obesity has reached pandemic proportions worldwide and is associated with an increased risk
of a variety of non-communicable health conditions, including type-2 diabetes, fatty liver
disease, hypertension, myocardial infarction, stroke, dementia, osteoarthritis, obstructive
sleep apnea, and several cancers. These are associated with reduced life expectancy, lowered
quality of life, expensive treatments, loss of employment, and various other negative factors.
By extension, many, if not all, of these issues apply to varying degrees to weight above the
normal range, even if it does not approach the level of obesity.
The rapid diffusion of smartphones in recent years has led to the development of numerous
mobile phone-based applications (apps) to help consumers achieve their weight loss goals.
Prominent examples include MyFitnessPal, Lose It!, Noom and HealthifyMe. The main idea
behind fitness tracking applications such as these is that through quantification of otherwise
hard-to-track behaviors, they generate awareness and mindfulness about consumers’ food
and exercise-related choices, build accountability for their own actions, and motivate them
towards achieving their weight loss goals. Several of these applications have also developed
AI-based tools to motivate consumers, answer their questions, and help them in their weight-
loss journeys. However, despite their increasing sophistication, there is still a question of
whether they can be as effective as human coaches. While this question has not been directly
examined in the context of weight loss, evidence is mixed about the impact of AI-based tools
vs. humans in healthcare contexts (Mitchell et al. 2021).
Apps such as HealthifyMe and Noom have built varying degrees of human coaching into
their weight loss offerings. One of the arguments made in favor of human coaches is that they
can show greater empathy and that this is crucial for consumers struggling to achieve their
goals. Another argument is that consumers may feel more accountable toward human coaches
than AI coaches. On the flip side, human coaching is not scalable like AI-based coaching.

Firms often take a hybrid approach, where consumers can choose to use AI coaches alone
or to augment the AI coaches with human coaching. Understanding the difference, if any,
human coaching can make is crucial to designing effective interventions and solutions for
helping consumers lose weight. In this study, we aim to answer whether human coaches
enable consumers to lose more weight than if they could only access AI coaches.
Causal comparisons of plans that have human coaches vs. those that do not are hard
because of the potential for self-selection into the different types of plans (Cadario et al.
2021, Longoni et al. 2019). If the set of individuals selecting into the two kinds of plans is
systematically different, it would be hard to causally compare the outcomes of users across
these plans (Eckles et al. 2017, Mehrotra et al. 2018). For instance, it is plausible that
younger, more technologically sophisticated consumers self-select into AI plans. If their
usage of the apps and achievement of weight loss goals are systematically different at the
baseline relative to less technologically sophisticated consumers, it would be difficult to
separate the effect of AI vs. human coaches from these baseline differences.
A typical solution to problems of self-selection and confounding factors is the use of
controlled experiments, where randomization allows us to find the causal effects of treatment -
in our case, plans that include human coaches vs. those that do not. But this is challenging in
this context due to a number of reasons. First, weight loss is typically a slow, time-consuming
process extending over months (Ashtary-Larky et al. 2017). Second, only a relatively small
proportion of consumers on typical fitness tracking apps lose weight (Gordon et al. 2019)1 ,
and the differences between the different kinds of plans are likely to be quite small. These two
factors would suggest the need for a long duration of the experiment and very large sample
sizes2 . Additionally, there are ethical and potential legal challenges to a research design that
randomizes consumers into conditions that permit or deny them access to human coaches due
to the potential health consequences of the different types of plans. Encouragement designs
that provide incentives to consumers to join one or the other plan would be challenging
as well due to the large sample size needed - for instance, if 5% of consumers respond to the
incentive, the incentive would need to be offered to over 140,000 consumers in the example
above. Thus, running field experiments is challenging and perhaps even infeasible in this
context.

Footnote 1: A randomized encouragement design, i.e., the allocation of AI and AI+Human coach plans via incentives, can still lead to selection bias.

Footnote 2: For instance, to detect a 1 kg difference, assuming a 15 kg standard deviation in weight, which is approximately what we see in our data, would require a sample size of 7,066 subjects for a significance level of 0.05 and power of 0.8. This number would go up depending on the level of compliance in the experiment.
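To make the sample size calculation in Footnote 2 concrete, the following minimal Python sketch (our illustration, with assumed function and variable names, not the authors' code) applies the standard normal-approximation formula for a two-sided, two-sample comparison of means:

```python
from math import ceil
from scipy.stats import norm

def total_sample_size(delta, sigma, alpha=0.05, power=0.8):
    """Approximate total N for a two-sided, two-sample comparison of means
    with equal allocation:
    n_per_group = 2 * (z_{1-alpha/2} + z_power)^2 * sigma^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    n_per_group = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2
    return 2 * ceil(n_per_group)

# Detecting a 1 kg difference given a 15 kg standard deviation in weight
print(total_sample_size(delta=1.0, sigma=15.0))  # about 7,064, close to the 7,066 cited above
```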
The alternative of running studies in the laboratory is also not a good option in this
context. First, it is challenging to extrapolate laboratory studies to the real world due to
the absence of real consequences for the experimental subjects. Second, the long duration
and likely small effect sizes mentioned earlier in the context of the challenges of running field
experiments apply here as well.
In this study, we use a rich dataset containing observational data but attempt to obtain
estimates that are close to causal by conducting a valid comparison between consumers on
plans that have human coaches and those that do not. We conduct our empirical analysis
for consumers of the HealthifyMe app in India, which has subscription plans that provide
only AI-based coaching to consumers (henceforth referred to as AI coach plans), and other
plans that include human coaches in addition to an AI coach (henceforth AI+Human coach
plans). We observe activity logs, weight logs, food logs, and goal-setting activity for a large
sample of users of this platform who adopted one of the two types of plans over a three-year
period between 2018 and 2020. We compare them on the achievement of their weight loss
goals.
Our context has plans that are similar other than the presence or absence of human
coaches - this eliminates concerns about confounding factors between AI coach and AI+Human
coach plans. To account for the self-selection in assignment to AI+Human coach plans vs.
AI coach plans of HealthifyMe, we develop a matching approach to make a valid compar-
ison between adopters of the two types of plans. Specifically, we pair each adopter of an
AI+Human coach plan with a comparable consumer that adopted an AI coach plan during

our observation period. Our adopters were all on a free plan of HealthifyMe before they
adopted either an AI+Human coach plan or an AI coach plan, and we use the data from the
time these consumers were on the free plan to match them on their observed behaviors, in
addition to demographic variables and their tenure on the app. This, under a standard set
of assumptions, obtains valid apples-to-apples comparisons of consumers on the two kinds
of plans.
The main outcome we focus on is weight loss, as we are interested in finding the weight
loss differences between consumers who choose the two different kinds of plans. However,
we add several other outcome measures to get deeper insights into the mechanisms and
processes by which these differences manifest themselves. We examine two mechanisms by
which AI+Human coaches might differ from AI coaches and affect weight loss outcomes.
These relate to the two different ways in which weight loss apps might help consumers lose
weight. The first is the process of setting a goal. In weight loss apps such as HealthifyMe,
consumers have to set up specific (as opposed to general) goals - how much weight they would
like to lose and in what period of time. Goal-setting is in and of itself an important step
towards goal achievement (Phillips and Gully 1997, Kozlowski and Bell 2006). If AI+Human
coaches systematically differ from AI coaches in this process, it could lead to differential
weight loss across the two types of plans. The second mechanism by which weight loss apps
help consumers achieve weight loss goals is by tracking their weight and their consumption
of food. This builds self-awareness amongst the consumers (Goukens et al. 2009, Jami 2016,
Pham et al. 2010, Sentyrz and Bushman 1998), and also fosters accountability towards their
own choices and actions. It is plausible that consumers feel differently accountable towards
diligently tracking their weight and food intake when they have AI+Human vs. AI coaches.
This difference would manifest itself in different levels of tracking across the two kinds of
plans, in turn potentially leading to different weight loss outcomes.
Our analysis of the daily tracking data of 64,688 consumers over three years showed that,
on average, they lost more weight with AI+Human coaches than with AI coaches alone.

On average, users on AI coach plans lose 1.22 kg over three months, while those on the
AI+Human coach plans lose about 2.12 kg in the same time frame, a difference of about
74%. Given the average starting weight of about 79 kg for the users in our sample, this
translates into a weight loss on AI coach plans of about 1.5% of starting body weight on
average and a weight loss of 2.7% of body weight for users on AI+Human coach plans.
In terms of the mechanisms by which consumers lose weight, we find that, on average
AI+Human coach plan users set higher weight loss goals than users on AI coach plans.
While users on AI coach plans set a weight loss goal of about 15.2 kg on average, those on
AI+Human coach plans set a weight loss goal of 17.6 kg on average (a 16% increase). They
also log their weight and food intake more frequently than AI-coach plan users. On average,
AI coach users log their weight 1.1 times a week vs. 1.35 times a week for AI+Human coach
users. There is a much more dramatic difference in food logging - whereas AI coach users log
their food intake an average of 49.7 times a week, those on AI+Human coach plans log their
food intake 102.0 times a week, an increase of over 105%. While our analysis cannot make a
causal connection between these increases and weight loss or determine the extent to which
increased weight loss with the addition of human coaches is a result of these factors, we can
make causal claims for the impact of human coaches on these measures of user activity (goal
setting, weight logging, food logging).
We also document considerable heterogeneity in this AI+Human vs. AI plan difference,
with the variation being meaningful based on consumers’ gender, age, and the degree to
which they are distant from their ideal weight (as measured by their body mass indices -
BMI). We find that women, older users, and those with lower starting BMI have a greater
impact on the addition of human coaches relative to having only AI coaches. We also find
considerable heterogeneity in our mechanism measures.
Next, we discuss the contributions of our work. First, this is the first study to carefully docu-
ment the difference that human assistance makes in the weight loss context in a field setting.
We find that adding human coaches affects weight loss outcomes for consumers meaning-

fully. Second, we document that the weight loss difference between AI+Human and AI coach
plans varies across different types of consumers. These estimates could be useful to firms
and policymakers as they design interventions for weight loss. Specifically, they can inform
discussions about the benefits of including human coaches and targeting them to the right
set of consumers.
The rest of the paper is organized as follows. First, we discuss our data and context
and describe our empirical strategy, including specifics on the matching approach. Then,
we present the results and subsequently discuss the implications for platform providers and
users. Finally, we discuss the limitations of our study and conclude by discussing future
directions for inquiry in this context.

2 Data and Context

As mentioned earlier, the data for our empirical analysis comes from a mobile application
focused on fitness, HealthifyMe. Our data are for the years 2018 through 2020 and are
primarily from India, where the firm that developed the application is based. The app
launched in 2012 and has seen tremendous growth in its user base, with over 28 million total
users and 16 million active users in 2021. It has also expanded its footprint beyond the
Indian market, with users in South East Asia and the Middle East in addition to India and
new countries being added over time.
HealthifyMe is a health and fitness app that provides smart meal plans personalized
by expert nutritionists and customized workout plans with certified fitness coaches. The
app enables users to track their food intake, workouts, steps walked, sleep, etc. It allows
users to set weight goals and comes up with customized plans to help them achieve them.
The app uses a "freemium" model, with users typically starting on a free tier. This free tier
allows users to access the basic tracking functionality of the platform and generic educational
information, but without customized coaching. The paid subscription tiers of the plan add

customized coaching and other features, including the ability to set up customized meal
plans, enhanced nutrition and calorie calculators, and tools that allow users to view their
information more easily. At the time of the study, the HealthifyMe app had 537,692 paying
subscribers and over 27 million users on the free tier of the platform.
There were a number of paid plans, but they were broadly divided into two buckets. The
first set of plans, with a typically lower subscription amount, allowed access to a set of AI
tools, including a conversational AI coach called Ria. The second set of plans, the premium
tier of the app with a higher subscription fee, added human coaches to these plans, where
users were paired with specific coaches. The coaches helped consumers develop a weight loss
plan focused on nutrition and exercise, motivated the customers to continue their weight
loss journeys, answered questions, etc. Figures 8, 9, and 10 in Web Appendix A.1 show
the interface of the app in the free tier, an AI coach plan, and an AI+Human coach plan,
respectively. There were 254,200 users on AI coach plans and 283,492 users on AI+Human
coach plans at the time of our study.

3 Empirical Strategy and Data Summary

The main goal of the empirical analysis is to compare weight loss outcomes for users on the
AI coach plans versus users on the AI+Human coach plans. The challenge in this analysis is
that the assignment of users to the two types of plans is not random. Consumers self-select
into the two types of plans. As a result, consumers who chose the AI+Human coach plans
are likely systematically different from those who chose the AI coach plans. As mentioned
in the previous section, the AI+Human coach plans are more expensive than those with
AI coaches alone. Therefore, it is plausible that consumers with greater motivation to lose
weight may choose plans with AI+Human coaches - this could cause the AI+Human coach
plan users to have higher weight loss achievements, but this cannot be causally attributed
to the type of plan. On the flip side, it is plausible that consumers who have a more difficult

weight loss journey are the ones that choose the AI+Human coach plans. Their potentially
lower weight loss achievement again cannot be attributed to the type of plan.
In the absence of an experiment and the difficulty of directly comparing users on the
two types of plans due to selection concerns, our empirical strategy is to construct matched
samples of users on the two types of plans. We restrict our sample to users that started their
journeys on HealthifyMe on its free tier and switched to a paid subscription. These users
have usage history during their free journeys and some demographic variables - age, height,
weight, gender, and BMI. We use features (summary variables) created from the users’ free
journey and the demographic variables to match the set of users. Thus, for every user on an
AI+Human coach plan, we find a corresponding matched user on a plan with only AI coaches.
The idea behind this matching procedure is that users are comparable in demographics and
usage behavior across the two conditions. As long as the standard assumptions underlying
matching procedures are satisfied, we have an apples-to-apples comparison between the two
groups, allowing us to make causal conclusions on the impact of AI+Human coaches. Such
matching approaches have been widely employed in the literature, with Datta et al. (2018)
being a recent example in the marketing literature.
Next, we describe the matching procedure. Our main empirical analysis is based on a
Propensity Score Matching procedure (Rosenbaum and Rubin 1983). The approach uses a
set of variables to set up a predictive model for assignment into treatment. In our case,
this model predicts whether a consumer selects an AI coach plan or an AI+Human coach
plan. Each consumer is assigned a propensity score by this predictive model, with the score
indicative of the likelihood of choosing the AI+Human coach plan. The matching proce-
dure then finds for each AI+Human consumer a corresponding AI consumer with a closely
matched propensity score. In other words, for every consumer choosing an AI+Human coach
plan, the procedure finds a consumer with the same likelihood of choosing an AI+Human
coach plan but who actually chose an AI coach plan for unobserved reasons. Under two
identifying conditions of unconfoundedness (i.e., the unobservables being uncorrelated with

the variables used for constructing the propensity score) and full coverage (the distributions
of propensity scores for consumers in both treatment and control conditions having the same
support), this procedure allows us to draw causal interpretations from the comparisons of
outcome variables for the treatment (AI+Human coach) and matched control (AI coach)
consumers (Abadie and Imbens 2006).
There are two main steps in this matching procedure. The first step is to build the
predictive model described above. We use a set of demographic and usage variables as
predictors. The demographic variables are snapshots at the start of the consumers’ free
journeys on HealthifyMe. Thus, a consumer’s age, weight, or BMI, for instance, are the
values reported by the consumer when they first sign up on the platform (and since our data
is restricted to consumers who start on a free plan, these are values reported at the start of
their free journeys). The usage variables are summaries constructed from their goal, weight,
and food logs during the free parts of their journeys, i.e. when they were on the free tier of
the platform. We filtered the data to only include consumers who had at least one weight log
during the free journey so that we had their starting weights. We also filtered out consumers
who never recorded their weights after starting on one of the two types of paid subscription
plans. Table 1 lists the variables we finally use for our predictive model.
The predictive model we used was a Random Forest classifier (Wager and Athey 2018,
Zhao et al. 2016). We used 60% data for training, 20% for validation, and the final 20%
for testing. We also attempted the predictive model using other approaches, including a
binomial logit model but selected the Random Forest classifier for its superior predictive
performance.4
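The paper reports only the 60/20/20 split and the choice of a Random Forest classifier, so the following is a minimal scikit-learn style sketch with assumed column names, file name, and hyperparameters, illustrating how such a propensity model and its impurity-based feature importances might be obtained:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical frame: one row per user, free-journey features as in Table 1,
# and a 'treated' flag (1 = adopted an AI+Human coach plan, 0 = AI coach plan).
df = pd.read_csv("free_journey_features.csv")  # assumed file
features = ["avg_spells", "avg_goal_loss", "avg_actual_loss",
            "avg_goal_achievement", "avg_weight_logs", "avg_food_calories",
            "avg_food_logs", "height", "weight", "gender", "bmi",
            "age", "months"]

# 60% training, 20% validation, 20% test
train, rest = train_test_split(df, train_size=0.6, random_state=0)
valid, test = train_test_split(rest, train_size=0.5, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(train[features], train["treated"])

# Propensity score: predicted probability of choosing the AI+Human coach plan
df["propensity"] = clf.predict_proba(df[features])[:, 1]

# Impurity-based (Gini) feature importances, as reported in Table 2
print(pd.Series(clf.feature_importances_, index=features)
        .sort_values(ascending=False))
```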
Table 2 presents the impurity-based feature importance of different variables used in the
Random Forest Classifier model. The length of the user’s free journey in months, their age,
the number of weight loss spells they had in their free journey, their weight logging frequency
and food logging frequency are variables that rank high in importance for the classification
task.

Footnote 4: More details on the predictive model and comparisons with other approaches are available from the authors on request.
Table 1: Variable Descriptions

Variable                 Description


avg spells The average number of weight loss spells3 of the user
avg goal loss The average weight loss goal per day per spell
avg actual loss The average weight lost per day per spell by user
avg goal achievement The average % goal achieved
avg weight logs The average number of weight logs per day of user
avg food calories The average number of food calories of user per day
avg food logs The average number of food logs of user per day
height The starting height of user
weight The starting weight of user
gender The gender of user
bmi The starting BMI of the user
age The age of the user at the start of the free journey
months The number of months of the user’s free journey


3.1 Matching Procedure

We now describe the matching procedure in detail. The aim of this procedure
is to find pairs of matched treatment (AI+Human coach) and control (AI coach) users so
that we can find valid estimates of the incremental effect of human coaches. These pairs
are constructed without replacement, i.e., one user can only be in one paired comparison.
Our distance metric is the Mahalanobis distance (De Maesschalck et al. 2000) for each pair.
The typical way the algorithm proceeds is to find these distance measures between every
possible treatment/control pair, sort them in increasing order of distance and pick the pairs
in increasing order of distance (making sure to remove all observations corresponding to
users that have already been picked). This is done until a pre-determined threshold of the
distance between treatment and control users is reached, and the remaining observations are
discarded. This is the procedure followed in Datta et al. (2018). However, in our case, the
number of pairs this would entail would be very large because of the large number of users

Table 2: Importance of variables in Random Forest Classifier

Importance
months 0.231
age 0.213
avg spells 0.172
avg weight logs 0.139
avg goal achievement 0.052
avg food calories 0.041
avg food logs 0.040
bmi 0.037
avg goal loss 0.021
height 0.018
avg actual loss 0.017
weight 0.016
gender 0.002

Notes: The table presents the impurity-based feature importance of the different variables used in the
Random Forest Classifier model. The method calculates the mean decrease in the impurity measure (as
measured by the Gini impurity), with the average taken across all the trees in the forest. The table presents
the importance of the variables in descending order, implying that the variable "months" has the highest
importance and "gender" the least.

in our sample5 . Therefore, we use the ‘ball tree’ data structure for constructing our nearest
neighbor algorithm (Dolatshah et al. 2015).
The ball-tree algorithm (Fukunaga and Narendra 1975, Moore 2013, Omohundro 1990)
works by segregating the data points represented in a multi-dimensional space into balls rel-
ative to the order of distances between these data points. Each such ball is an n-dimensional
sphere that encloses a certain number of data points. If the number of data points in the
ball is greater than a certain threshold, the ball is further subdivided into smaller balls. This
algorithm is similar to a KD tree in terms of the partitioning of a higher dimensional space
into disjoint sections but is much more computationally efficient when the data points are
relatively concentrated in a smaller region of space. The algorithm efficiently computes the
closest 1000 nearest neighbors to each data point in this space. From these neighbors, we
pick the closest neighbor (which has not already been matched to some other user) to each
user in the treatment group without replacement.

Footnote 5: Datta et al. (2018) have fewer than 2,000 users in their sample, whereas we have close to 65,000 users, requiring us to make over 4 billion paired comparisons, over a thousand times as many as their ∼4 million paired comparisons.
The details of the procedure are as follows.

1. A pre-defined threshold x (in our case 0.02) is chosen to partition the data into ‘balls’6 .

2. We construct the ‘ball tree’ from the users in the control group.

3. For each user t in the treatment group, we find the 1000 nearest control group users within the
threshold of 0.02 and store them.

4. Next, we sort all (c, t) pairs (where c is a control group user and t is a treatment group user)
by their Mahalanobis distance (Datta et al. 2018).

5. We match (c, t) pairs, i.e., each user in the control group is matched to a user in the
treatment group, without replacement, iterating in sorted order.
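The steps above can be illustrated with scikit-learn's ball-tree based nearest-neighbor search. The sketch below uses synthetic data and assumed variable names; the exact feature construction, scaling, and bookkeeping around the 0.02 caliper in the actual analysis may differ:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_control = rng.normal(size=(5000, 13))  # stand-ins for the 13 matching variables
X_treated = rng.normal(size=(3000, 13))

# The Mahalanobis metric needs the inverse covariance of the pooled feature matrix
VI = np.linalg.inv(np.cov(np.vstack([X_control, X_treated]), rowvar=False))

# Step 2: ball tree over the control (AI coach) users
nn = NearestNeighbors(n_neighbors=1000, algorithm="ball_tree",
                      metric="mahalanobis", metric_params={"VI": VI})
nn.fit(X_control)

# Step 3: 1000 candidate control neighbors for each treated user
dist, idx = nn.kneighbors(X_treated)

# Steps 4-5: sort all (control, treated) candidate pairs by distance and
# match greedily without replacement, subject to the distance threshold
THRESHOLD = 0.02  # the paper's caliper; synthetic data will rarely fall within it
pairs = sorted((dist[t, k], idx[t, k], t)
               for t in range(dist.shape[0]) for k in range(dist.shape[1]))
matched, used_c, used_t = [], set(), set()
for d, c, t in pairs:
    if d > THRESHOLD:
        break
    if c not in used_c and t not in used_t:
        matched.append((t, c))
        used_c.add(c)
        used_t.add(t)
```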

The matching procedure yields 32,344 treated (AI+Human coach) consumers and cor-
responding 32,344 matched control (AI coach) consumers7 . The set of observations in the
matched sample is comparable across a set of observables to the set of observations in the
full sample before matching.
After matching, both consumer groups (control=AI coach users and treatment=AI+Human
coach users) are indistinguishable in terms of their adoption propensities. This can be seen
in Figures 1 and 2. Figure 1 shows that one of the assumptions of the matching procedure -
the full coverage assumption - is satisfied. The distributions of both treatment and control
group users have similar support, with the range of observations lying between a propensity
score of -2 and +3. However, these two distributions are also different, indicating the need
for the matching procedure in the first place. If we merely compared all AI+Human coach
plan users to all users on the AI coach plans, we would not get a valid apples-to-apples com-
parison. Figure 2 shows the distribution of propensity scores after the matching procedure
was completed. As can be seen, the procedure yields a virtually perfectly matched sample
of consumers in the treatment and control groups in terms of their propensity scores.

Footnote 6: We tested the robustness of our findings to the choice of threshold - these are reported in Table 15 in Web Appendix A.2 and show that the results do not change meaningfully across a range of thresholds used for this procedure - the signs of the results are identical, and the magnitudes are very similar.

Footnote 7: Note that this number is lower than the smaller of the two groups before matching (the treatment group, with 45,932 users). This reflects the loss that comes from the ball tree matching algorithm that we used, over and above the typical loss of observations that happens in most matching exercises. There is a tradeoff between computational efficiency and the inclusion of all data points in our sample. It is plausible that if we had first computed the Mahalanobis distance between every pair of observations, sorted them, and then run our matching procedure, we could have retained more observations. However, this would have come at a considerable computational cost. We chose this procedure for efficiency, showing later that this does not lead to systematic selection in any meaningful way.

Figure 1: Propensity Scores Before Matching


Notes: The figure above shows the propensity scores before matching. The blue line represents the
propensity scores for the AI coach plan users, and the green line for the AI+Human coach plan users.

Figure 2: Propensity Scores After Matching


Notes: The figure above shows the propensity scores after matching. The blue line represents the
propensity scores for the AI coach plan users, and the green line for the AI+Human coach plan users. We only
see a single (blue) line because the green and blue lines overlap almost exactly.

Next, we compare the summary statistics across a number of demographic and behavioral
variables during the free journey for treated and control users before and after the match.
These are the variables we use for our matching procedure as well. These are reported in

Table 3. The top panel reports the summary statistics of these variables for treatment and
control users, as well as the p-values of tests of differences between them. It is important
to note that apart from gender, for which the p-value is high, these p-values are low for all
other variables, suggesting that the two groups are systematically different. Comparing the columns
and meaningful. The lower panel shows that the differences are statistically insignificant
for a number of variables after matching. Even in those instances where the differences are
statistically significant after matching, the mean values for the two are not very far apart - in
fact, in most of these instances, the difference between mean values is lower after matching
than before matching.8 Another point from this table is that while the matched set is a
subset of the full set of observations, it is comparable in terms of these summary variables
to the set of observations before matching.
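For concreteness, a balance table of the kind shown in Table 3 could be produced along the following lines (a minimal sketch with assumed DataFrame and column names, not the authors' code):

```python
import pandas as pd
from scipy.stats import ttest_ind

def balance_table(df, variables, group_col="treated"):
    """Mean of each variable by group plus the p-value of a two-sample t-test,
    mirroring the before/after panels of Table 3."""
    rows = []
    for var in variables:
        treat = df.loc[df[group_col] == 1, var].dropna()
        ctrl = df.loc[df[group_col] == 0, var].dropna()
        _, p = ttest_ind(treat, ctrl, equal_var=False)
        rows.append({"variable": var, "control_mean": ctrl.mean(),
                     "treatment_mean": treat.mean(), "p_value": p})
    return pd.DataFrame(rows)

# balance_table(unmatched_df, variables)   # "Before Matching" panel
# balance_table(matched_df, variables)     # "After Matching" panel
```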
Table 3 shows that various variables used for matching were significantly different for
the treatment (AI+human coach plan) and control (AI coach plan) users but were largely
not significantly different after matching. But it is plausible to have no meaningful differ-
ences between the matched treatment and control observations in terms of means, and yet
systematic differences in their distributions. To rule these out, we plot the densities of all
the major variables before and after the match in Figures 3 through 6. In all these figures,
the distributions for the respective variable are plotted on the same graph for treatment
(AI+human coach) and control (AI coach) users. The upper panel shows the distributions
before the matching procedure is implemented, and the lower panel shows the corresponding
distributions for the same variables after matching. A common point across these figures is
that where differences exist in the distributions of the corresponding variable for the treat-
ment and control groups before matching, the matching procedure reduces or even virtually
eliminates any differences in these distributions. In the case of multiple variables, e.g., age
(Figure 3), BMI (Figure 4), the number of tracking spells for the customer (Figure 4), the aver-
age number of weight logs (Figure 5), the number of months the customer has been on the
free journey (Figure 5), the month in which the customer started the free journey (Figure
7), and the month in which they ended it (Figure 6), it is quite visible that there were significant
differences before matching, but either no visually discernible difference after matching or
drastically reduced differences after matching. Thus, as illustrated by these figures, we find
pairs of users on the AI+Human and AI plans that are well-matched not just on their overall
propensity scores but also on the individual variables used for matching.

Footnote 8: One point to note is that the variable "weight" refers to the starting weight of the user when they first signed up on the platform on the free plan. This is not their weight at the time of the switch to the AI plan or AI+Human plan. We are unable to compare weights at the time of the switch since consumers do not necessarily record their weights at that time. In contrast, everybody records their weight at the start of the free journey. The same applies to the BMI variable since it is derived from the weight variable.
Further, we look at the distribution of weight loss immediately before the switch to ei-
ther an AI-only or AI+Human paid plan. This will help alleviate concerns that our matching
procedure cannot match on (unobserved) intent to lose weight. The concern is that there
may be a systematic difference in this intent for AI+Human vs. other plans. We can-
not directly test for that. But we could look at weight loss in short periods before they
switched. If AI+Human coach users were more motivated to lose weight, and that’s why
they chose those higher-priced plans, it might show up in their attempts to lose weight just
before they switched. We compare the two groups for weight logs and weight loss for the
matched AI+Human and AI users in those two periods before the switch. We find no sys-
tematic difference between weight logs in the immediate period before the switch for AI vs.
AI+Human consumers, with p-values at 0.52 for the 30-day period and 0.2 for the 14-day
period. For weight loss, the p-values are 0.08 and 0.54, respectively, for these periods. The
weight loss difference in the 30-day period may seem marginally significant, but given that
the test is conducted with over 65,000 observations and the multiple comparisons we
have conducted, this level of significance is not a source of concern. Applying the Bonferroni
correction for multiple comparisons with the same data gives us a corrected p-value of 0.16.

Figure 3: Distribution Before (upper panel) and After (lower panel) Matching (A)

Figure 4: Distribution Before (upper panel) and After (lower panel) Matching (B)

Figure 5: Distribution Before (upper panel) and After (lower panel) Matching (C)

Figure 6: Distribution Before (upper panel) and After (lower panel) Matching (D)

Figure 7: Distribution Before (upper panel) and After (lower panel) Matching (E)

3.2 Main Empirical Specification

Our main research objective is to find if human coaches make a significant difference over
and above the AI coach in weight loss. We also wish to examine the potential mechanisms by
which any such difference manifests itself and examine heterogeneity in these effects. For this
purpose, we conduct a series of regressions, measuring the differences between the treatment
and control group for a set of outcome variables measured after the date they switched to a
paid plan.
We conduct our analysis at the level of a weight loss spell. We define a spell as the period
of time between the setting of a weight loss goal by a user and the next time it is changed.
This is because when a user enters a new weight loss goal, s/he also enters her/his weight

18
at that time. This provides us with at least two weight measurements to be able to conduct
an analysis of our main outcome of interest, which is weight loss. We have a total of 67,095
observations across our sample of 32,344 users in each of the treatment and control groups
(i.e., 64,688 users in total). Thus, for most of the users in our sample, we have only one spell
in the data. A drawback of analysis at the level of a spell is that the length of a spell might
itself be systematically different between the two types of plans. In our data, the users on
AI coach plans had an average spell length of 90.4 days (std. deviation = 110.6 days), while
those on AI+Human coach plans had an average spell length of 100.2 days (std. deviation
= 102.9 days). Although this difference is small, we control for it in the various
regressions we employ for our empirical analysis.
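One plausible reading of this spell construction, sketched below in pandas with hypothetical column names: each goal-setting event opens a spell, and the next change of the goal (or the end of the observation window) closes it.

```python
import pandas as pd

# Hypothetical goal logs: one row per goal-setting event (our column names)
goal_logs = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "log_date": pd.to_datetime(["2019-01-05", "2019-04-20", "2019-02-01"]),
    "goal_kg":  [10.0, 6.0, 15.0],
    "start_weight_kg": [82.0, 78.5, 95.0],  # weight entered along with the goal
})

spells = goal_logs.sort_values(["user_id", "log_date"]).copy()
# Each spell runs from one goal-setting event to the next change of the goal;
# for a user's last goal, the spell ends at the end of the observation window.
spells["spell_end"] = spells.groupby("user_id")["log_date"].shift(-1)
spells["spell_days"] = (spells["spell_end"] - spells["log_date"]).dt.days

# Weight change within a spell would then be measured from the weight logs
# (and the weight entered at the next goal change) falling inside the spell.
```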
Alternatively, we could take a fixed period of time in which to measure weight loss (e.g.,
three months or six months). The advantage of this is that the time period of the analysis
is not endogenous. The disadvantage, coming from the self-reported nature of weight, with
consumers deciding whether and when to report their weight, is that we are not guaranteed
to have weight measurement for a given user at the end of that period. In other words, we
have weight measurements at the beginning of that time period but not at the end unless
the consumer entered their weight on the last date of the analysis time period. One option
is to interpolate the weights on the last date of the analysis time period, conditional on
having weight measurements before and after that date. This is not a perfect solution either
because this measure is subject to measurement error and can be sensitive to the choice
of interpolation method. Also, there is some data loss because we have to condition the
consumer to be active on the platform and record at least one weight measurement after the
end of the time period for analysis.
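As an illustration of this interpolation approach for a single user, here is a minimal sketch using linear interpolation in time (the paper does not specify the interpolation method; the dates and weights below are made up):

```python
import pandas as pd

# Sparse, self-reported weight logs for one user after switching plans
logs = pd.Series(
    [80.0, 78.9, 77.6],
    index=pd.to_datetime(["2019-03-01", "2019-04-10", "2019-06-15"]),
)

# Weight at the 3-month mark (2019-06-01) is not observed directly, but the user
# has logs both before and after that date, so we can interpolate in time.
horizon = pd.Timestamp("2019-06-01")
daily = logs.reindex(pd.date_range(logs.index.min(), logs.index.max(), freq="D"))
weight_at_horizon = daily.interpolate(method="time").loc[horizon]
weight_loss_3m = logs.iloc[0] - weight_at_horizon
```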
Since both approaches have their own advantages and disadvantages and neither is unam-
biguously better, we take both approaches. Relying on the fact that there is no systematic
difference in the time period of spells between the treatment and control group, we report
the spell-level analysis output as our main results, noting the fact that a typical spell lasts

about three months. We then conduct robustness checks, using the interpolation approach
detailed above to obtain weight measures on the last date of the analysis period to find
effects for the same set of dependent variables. We do this for three different fixed periods -
three months, six months, and twelve months.
As discussed in the introduction, we study the mechanisms in addition to the main
outcome of interest, which is weight loss. The mechanisms by which weight loss apps might
help consumers lose weight are helping them set meaningful weight loss goals that they can
strive towards and then helping them track their weight and food intake. To examine differences
in tracking and accountability across the two types of plans, we use daily averages of weight
logging and food logging activity by the users.
Across the different outcomes, the main empirical specification looks like the following

$$y_{it} = \alpha + \beta \cdot \mathbb{1}(\text{AI+Human}) + \gamma X_{it} + \varepsilon_{it} \qquad (1)$$

Here, i indexes the individual, and t indexes the count of spells for the individual. $y_{it}$
refers to the outcome of interest, which might be weight loss (both absolute and relative
to starting weight), weight loss goal (again absolute and relative), and the average number
of weekly logging events for food and weight. The variable $\mathbb{1}(\text{AI+Human})$ is an indicator
variable for whether the consumer chose a plan with AI+Human coaches. Its coefficient $\beta$ is
the main quantity of interest - it indicates the systematic difference between the AI+Human
coach plan users and AI coach plan users in the respective outcome variable. We add
time (month-year level) fixed effects to account for any systematic differences over time in
the main outcomes of interest. We also include a few additional controls, including the
age, gender, and BMI of the customers, to account for systematic differences in consumers’
outcomes across these variables. Finally, we add a control for the duration of the weight
loss spell in months. Note that these controls are not strictly needed due to our matching
procedure. There are no systematic differences in these variables between our treatment and

control samples. But using them accounts for some of the residual variations in the outcome
variable, allowing us to obtain more efficient estimates. Finally, we cluster our standard
errors at the user level to account for correlated unobservables across multiple spells within
users.
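A minimal sketch of estimating equation (1) in statsmodels, with month-year fixed effects and user-clustered standard errors; the DataFrame below is synthetic and the column names are our assumptions, not the authors' code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical spell-level frame (in practice, one row per weight-loss spell)
rng = np.random.default_rng(0)
n = 500
spells = pd.DataFrame({
    "user_id": rng.integers(0, 250, n),
    "ai_plus_human": rng.integers(0, 2, n),
    "age": rng.integers(20, 55, n),
    "female": rng.integers(0, 2, n),
    "bmi": rng.normal(28, 4, n),
    "spell_months": rng.integers(1, 7, n),
    "month_year": rng.choice(["2019-01", "2019-02", "2019-03"], n),
})
spells["weight_loss_kg"] = (1.2 + 0.9 * spells["ai_plus_human"]
                            + rng.normal(0, 3, n))  # synthetic outcome

# Equation (1): outcome on the AI+Human indicator, controls, and time FEs,
# with standard errors clustered at the user level
model = smf.ols("weight_loss_kg ~ ai_plus_human + age + female + bmi"
                " + spell_months + C(month_year)", data=spells)
result = model.fit(cov_type="cluster",
                   cov_kwds={"groups": spells["user_id"]})
print(result.summary())
```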

3.3 Heterogeneity in Treatment Effects

Next, we examine heterogeneous treatment effects. Principally, we examine how user-level


characteristics moderate the AI+Human vs. AI coach differences documented in the previous
sections. These variables are BMI, age, gender, and the time of adoption of the paid plan.
We choose these variables for various reasons. User characteristics like age and gender
might relate to comfort with interacting with an AI coach, as opposed to a human coach.
Consumers of different ages/gender might be heterogeneous in the degree to which they
experience embarrassment or discomfort with sharing personal information. They may also
differ in the degree to which empathy and a ‘personal touch’ are important (Longoni et al.
2019). BMI (measured at the time of adoption of the plan) is indicative of the degree to
which the consumer is overweight and, thereby, the difficulty of the weight loss process.
The idea here is that human and AI coaches might be better suited to serving the needs of
consumers with different difficulty levels. We conduct this analysis by splitting the data at
the median for BMI - separately examining consumers with higher and lower than median
BMI at the start of the spell.
The choice of gender follows from prior research indicating that women desire human
characteristics like empathy and experience emotions like embarrassment to a greater degree
than men, potentially implying better weight loss achievement with AI+Human coaches.
An important aspect of the AI+Human coach plan involves conveying feedback to the user
along with providing reminders to the app consumers. Responses to feedback differ by
gender and by the type of feedback (Beyer 1990, Motro and Ellis 2017). Further, males are more
self-oriented, while females are more other-oriented; females are more responsive to negative
data and process data more comprehensively (Meyers-Levy and Loken 2015). Thus, there
might be significant differences in how human coaching affects these consumers’ weight loss
outcomes. To conduct an analysis of heterogeneity based on gender, we split the sample into
those for men and women and conducted our analysis separately.
The choice of age as a moderating variable is based on the idea that younger consumers
might be more comfortable with the use of technological solutions such as AI coaches (Egan
1988, Stypińska 2021). A major role of fitness apps is to provide feedback to the con-
sumers on their goals and weight loss progress, and feedback has been documented to be a
major determinant of a user’s success in goal achievement (Eskreis-Winkler and Fishbach
2019, Finkelstein and Fishbach 2012, Fishbach et al. 2010). Prior research suggests that
older consumers are more receptive to feedback than relatively younger consumers (Amorose
and Weiss 1998, Ferdinand and Kray 2013, Fishbach et al. 2010, Kast and Connor 1988,
Belkin and Niyogi 2002). These two factors suggest that older consumers might be more
receptive to human coaches and lose more weight with human coaches than with AI coaches.
We examine heterogeneity in our treatment effects based on age by splitting the sample at
the median age. We conduct separate regressions for the two samples.
Finally, the time of adoption of the plans might indicate whether the consumer is an early
vs. late adopter of the technology - once again, a proxy for comfort with using technology
to assist with weight loss goals. Earlier adopters might be more comfortable with the use
of technological tools, including AI tools, and this may result in better health outcomes for
these consumers using these tools than for later adopters. Once again, we split the data on
the median start date and conducted separate analyses for consumers of the two sub-samples.
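These subsample analyses amount to re-estimating equation (1) on median splits of the matched sample. A brief sketch, reusing the hypothetical spell-level frame from the previous sketch:

```python
import statsmodels.formula.api as smf

formula = ("weight_loss_kg ~ ai_plus_human + age + female + bmi"
           " + spell_months + C(month_year)")

# Median split on BMI; the age and adoption-date splits work the same way
median_bmi = spells["bmi"].median()
subsamples = {"below_median_bmi": spells[spells["bmi"] <= median_bmi],
              "above_median_bmi": spells[spells["bmi"] > median_bmi]}

for label, sub in subsamples.items():
    res = smf.ols(formula, data=sub).fit(
        cov_type="cluster", cov_kwds={"groups": sub["user_id"]})
    print(label, round(res.params["ai_plus_human"], 3))
```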

Table 3: Comparison of Variables During Free Journey for Treatment (AI+Human) and
Control (AI) Users

Control (N)    Control (Mean)    Treatment (N)    Treatment (Mean)    p-value of t-test
Before Matching
age 71848 30.813 45932 33.398 0.000
height 71848 166.161 45932 165.549 0.000
weight 71848 78.566 45932 79.451 0.000
gender 71848 0.574 45932 0.577 0.305
bmi 71848 28.369 45932 28.889 0.000
avg spells 71848 0.388 45932 0.585 0.000
avg weights 71848 0.651 45932 0.815 0.000
months 71848 17.393 45932 10.740 0.000
avg goal achievement 71848 -5.349 45932 -3.589 0.000
avg goal loss 71848 0.075 45932 0.074 0.000
avg actual loss 71848 -0.025 45932 -0.008 0.015
avg food logs per day 71848 1.383 45932 1.951 0.000
avg food calories per day 71848 172.510 45932 257.807 0.000
After Matching
age 32344 33.042 32344 32.500 0.000
height 32344 165.704 32344 165.721 0.810
weight 32344 79.641 32344 78.965 0.000
gender 32344 0.582 32344 0.579 0.503
bmi 32344 28.915 32344 28.649 0.000
avg spells 32344 0.476 32344 0.480 0.489
avg weights 32344 0.720 32344 0.722 0.684
months 32344 13.798 32344 13.223 0.000
avg goal achievement 32344 -4.533 32344 -4.499 0.916
avg goal loss 32344 0.074 32344 0.074 0.143
avg actual loss 32344 -0.016 32344 -0.011 0.625
avg food logs per day 32344 1.675 32344 1.635 0.129
avg food calories per day 32344 211.246 32344 210.060 0.712

Notes: This table presents the means and count of users in treatment and control before and after the
Mahalanobis matching without replacement with 0.02 as the threshold for considering residual imbalance.
The unit of analysis is an individual user in (a) the unmatched and (b) the matched sample. Table 1 provides
the descriptors for the various variables.

4 Results

4.1 Main Results

The results of our main analysis are reported in Table 4. The first column here lists our
main outcome of interest - weight loss. The baseline weight loss for consumers on the plans
that only include AI coaches is 1.22 Kilograms. As we have seen earlier, an average spell is
about three months in length. Thus, consumers, on average, lose about 1.22 kilograms over
a 3-month period when they can only access an AI coach. When they have access to human
coaches in addition to the AI coach, on the other hand, they lose an additional 0.91 kilograms,
making a total of 2.13 kilograms. This difference is highly statistically significant (standard
error of 0.033, p < 0.01). It represents a very large (74.2%) increase in weight loss relative to
the baseline. This is a novel and important result based on the data for a large number of
real customers in a field setting. Further, these weight loss numbers compare quite favorably
with those reported in other studies on weight loss using mobile apps (e.g., (Laing et al.
2014, Patel et al. 2019)). Tables 16 in the Web appendix A.3 lists prior studies in the weight
loss domain.
The results of this regression also point to some other important conclusions. There
appear to be some significant differences in the baseline rate of weight loss across some of
the major variables we control for. On average, women’s weight loss is lower than that of
men by about 0.5 kilograms. Consumers with higher BMI to begin with lose less weight,
though the difference is small at under 0.1 Kilograms. The greater the duration of the spell,
the greater the weight loss, with each month adding just under 0.2 kilograms in weight loss.
We further examine the impact of AI+Human coach on weight loss fraction. We define
weight loss fraction as

$$\text{WeightLossFraction} = \frac{\text{ActualWeightLoss} + 1}{\text{WeightLossGoal} + 1}$$

On average, consumers in the AI coach condition have a weight loss fraction of 0.178.
Thus, their actual weight loss is about 17.8% of the goal they set for themselves in that
spell. This increases by 0.042 to 0.22 for AI+Human coach users, i.e., they lose about 22%
of their weight loss goal. This is an increase of 24%. Column (4) of Table 4 reports the results
for the weight loss goal fraction.
Next, we examine the weight loss goal fraction, defined as

$$\text{WeightLossGoalFraction} = \frac{\text{WeightLossGoal}}{\text{StartWeight}}$$

On average, consumers in the baseline AI coach condition have a weight loss goal fraction
of 0.184, i.e., their goal is to lose 18.4% of their weight. This increases by 0.016 to 0.2 for
AI+Human users. This is an increase of 8.7% in the weight loss goal relative to starting
weight.
We also examine the impact of the type of coaches on the logging activity of the con-
sumers. The impact on food logs and weight logs is reported in the last two columns of
Table 4 respectively. On average, consumers in the baseline (AI coach) condition log their
food intake 49.65 times in a week. This increases by 52.38 to 102.03, more than doubling
in the process. This points to the possibility that AI+Human coaches are more effective at
getting their clients to enter their food intake more regularly. Since this is one of the main
ways in which weight loss apps help consumers lose weight, this is an important finding.
Consumers on plans with only AI coaches log their weight 1.12 times on average per week.
This increases to 1.35 times a week for consumers with AI+Human coaches, an increase of
20.65%. While this is a more modest increase, this increase is still quite meaningful and
may help explain the higher weight loss for consumers in AI+Human coach plans because
self-awareness is an important tool by which weight loss apps purportedly work.
To sum up, we find that AI+Human coaches help consumers lose more weight on average.
The difference is small in absolute terms when averaged across all customers, only a few of

whom lose weight, to begin with. But it is very large in relative terms, with the weight
loss being over 74% higher for consumers with AI+Human coaches. This higher weight loss
is associated with higher weight loss goals set by these customers and greater frequency of
both weight and food logging by these customers. We would also like to point out that
the statistical significance of our estimates is preserved even after accounting for multiple
comparisons (due to the very high levels of statistical significance of our estimates).

Table 4: Main Results - Impact of AI+Human Coaches

Weight Loss (Kgs)   Weight Loss Fraction   Weight Loss Goal (Kgs)   Weight Loss Goal Fraction   Weekly No. of Food Logs   Weekly No. of Weight Logs
AI + Human Coach 0.906*** 0.042*** 1.456*** 0.016*** 52.375*** 0.231***
(0.033) (0.002) (0.089) (0.001) (0.956) (0.014)
Age -0.001 0.002*** -0.119*** -0.001*** 1.81*** 0.009***
(0.003) (0.000) (0.008) (0.000) (0.078) (0.001)
Female -0.496*** -0.04*** -0.241*** 0.03*** -1.099 -0.046***
(0.033) (0.002) (0.087) (0.001) (1.002) (0.014)
BMI -0.087*** -0.024*** 1.875*** 0.016*** -3.46*** -0.013***
(0.007) (0.000) (0.016) (0.000) (0.127) (0.002)
Spell Duration in Months 0.18*** 0.005*** 0.337*** 0.004*** -2.168*** -0.114***
(0.005) (0.000) (0.011) (0.000) (0.094) (0.001)
Baseline (Mean) 1.221 0.178 15.176 0.184 49.655 1.119
Time Fixed Effects Yes Yes Yes Yes Yes Yes
Observations 67095 67095 67095 67095 67095 67095
R-squared Adj. 0.083 0.215 0.537 0.444 0.118 0.105

Notes: The table shows regression results with robust standard errors clustered at the user level in paren-
theses. *** p <0.01, ** p <0.05, * p <0.1

4.2 Heterogeneous Treatment Effects

First, we examine if there is a difference between male and female consumers in the impact
of AI+Human coaches. For this purpose, we run separate regressions for men and women
consumers in our sample. These are reported in Tables 5 and 6, respectively. A comparison
of the estimates for the AI+Human coach impacts in Tables 5 and 6 is instructive. On

average, men have higher baseline weight loss than women. This is to be expected since
they weigh much more, to begin with. However, the difference made by AI+Human coaches
is almost the same in absolute terms for men and women at 0.90 kilograms each. Given
the lower baseline for women, the relative impact of AI+Human coaches is much greater for
women, at 87.5% vs. 61.4% for men. The patterns persist for the mechanism measures too.
Men set slightly higher weight loss goals in the baseline case than women, but the impact of
AI+Human coaches is slightly smaller in absolute terms (1.34 kilograms for men vs. 1.525
kilograms for women) and in relative terms (8.1% for men and 9.9% for women). The big
difference that AI+Human coaches seem to make for women, and much more so than for
men, is in their tracking activity. Women with AI+Human coaches see an increase of 125.9%
in food logging vs. 76.5% for men. The increase in weight logging activity by women due
to AI+Human coaches is 24.1% vs. 17.9% for men. Overall, AI+Human coaches make a
greater difference, in relative terms, for women than for men.
Table 5: Impact of AI+Human Coaches for Male Consumers

Weight Loss (Kgs)   Weight Loss Fraction   Weight Loss Goal (Kgs)   Weight Loss Goal Fraction   Weekly No. of Food Logs   Weekly No. of Weight Logs
AI + Human Coach 0.902*** 0.047*** 1.337*** 0.014*** 43.502*** 0.201***
(0.051) (0.003) (0.143) (0.001) (1.624) (0.023)
Age -0.004 0.001*** -0.093*** -0.001*** 2.202*** 0.014***
(0.003) (0.000) (0.009) (0.000) (0.136) (0.002)
BMI -0.117*** -0.028*** 2.060*** 0.016*** -4.727*** -0.024***
(0.007) (0.000) (0.022) (0.000) (0.229) (0.003)
Spell Duration in Months 0.196*** 0.005*** 0.345*** 0.003*** -2.422*** -0.112***
(0.007) (0.000) (0.016) (0.000) (0.148) (0.002)
Baseline (Mean) 1.468 0.182 16.472 0.178 56.849 1.125
Time Fixed Effects Yes Yes Yes Yes Yes Yes
Observations 26748 26748 26748 26748 26748 26748
R-squared Adj. 0.072 0.227 0.538 0.425 0.113 0.101

Notes: The table shows a regression with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

In Tables 7 and 8, we investigate differences between younger and older customers in

Table 6: Impact of AI+Human Coaches for Female Consumers

Weight Loss (Kgs)   Weight Loss Fraction   Weight Loss Goal (Kgs)   Weight Loss Goal Fraction   Weekly No. of Food Logs   Weekly No. of Weight Logs
AI + Human Coach 0.901*** 0.038*** 1.525*** 0.018*** 58.563*** 0.255***
(0.042) (0.002) (0.112) (0.001) (1.161) (0.018)
Age 0.00 0.002*** -0.132*** -0.002*** 1.466*** 0.005***
(0.005) (0.000) (0.012) (0.000) (0.090) (0.001)
BMI -0.0835*** -0.0207*** 1.7522*** 0.0159*** -2.6182*** -0.0049**
(0.011) (0.000) (0.020) (0.000) (0.148) (0.002)
Spell Duration in Months 0.1673*** 0.0055*** 0.3270*** 0.0038*** -1.8557*** -0.1148***
(0.006) (0.0003) (0.0142) (0.0002) (0.112) (0.002)
Baseline(Mean) 1.030 0.155 15.471 0.199 46.528 1.057
Time Fixed Effects Yes Yes Yes Yes Yes Yes
Observations 39386 39386 39386 39386 39386 39386
R-squared Adj. 0.085 0.205 0.541 0.447 0.130 0.111

Notes: The table shows regressions with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

the difference that human coaches make. The two sets of customers are not very different
in either the baseline weight loss (1.19 kilograms for younger customers vs. 1.22 kilograms
for older consumers) or in the impact of AI+Human coaches in absolute terms (0.88 kg
and 0.94 kg for younger and older consumers respectively). In relative terms, the difference
that AI+Human coaches make in weight loss is 73.8% and 76.9% for younger and older
customers, respectively. So while the difference made by AI+Human coaches is greater for
older consumers, as we had predicted, the difference is quite small. On the mechanism
measures, however, there are some more interesting patterns. Older consumers set less
ambitious weight loss goals at baseline than younger consumers (15.51 kg vs. 16.25 kg),
but the impact of AI+Human coaches is 1.69 kg (10.9% of baseline) for older consumers vs.
1.23 kg (7.5%) for younger consumers. The baseline (i.e., AI coach plan) average for food
logging for older consumers is higher at 58.83 weekly logs vs. 41.85 for younger consumers.
The absolute impact of AI+Human coaches is also higher at 53.59 logs for older consumers
vs. 51.45 logs for younger consumers. However, in relative terms, the impact is greater
for younger consumers (122.9%) than for older consumers (91.1%). Finally, the impact
of AI+Human coaches on weight logging is somewhat higher for older consumers both in
absolute terms (0.28 vs. 0.19 for younger consumers) and in relative terms (25.35% vs.
17.6%). In summary, the difference between younger and older consumers in the overall impact of AI+Human coaches is small. AI+Human coaches have a greater impact on goal setting and weight logging for older consumers, but a greater relative impact on food logging for younger consumers.
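For concreteness, the relative impacts quoted in this section are the estimated AI+Human coach coefficient divided by the corresponding baseline (AI-plan) mean. For example, using the weight loss estimates in Tables 7 and 8:

\[
\frac{0.877}{1.189} \approx 73.8\% \quad \text{(younger consumers)}, \qquad
\frac{0.937}{1.218} \approx 76.9\% \quad \text{(older consumers)}.
\]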

Table 7: Impact of AI+Human Coaches for Younger Consumers

Columns (left to right): Weight Loss (Kgs), Weight Loss Fraction, Weight Loss Goal (Kgs), Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
AI + Human Coach 0.877*** 0.044*** 1.226*** 0.014*** 51.450*** 0.188***
(0.044) (0.003) (0.115) (0.001) (1.162) (0.019)
Age -0.015** 0.001** -0.159*** -0.002*** 1.133*** 0.007**
(0.007) (0.000) (0.020) (0.000) (0.193) (0.003)
BMI -0.101*** -0.023*** 1.885*** 0.016*** -2.942*** -0.009***
(0.011) (0.000) (0.020) (0.000) (0.148) (0.002)
Female -0.477*** -0.041*** -0.016 0.033*** 4.008*** 0.019
(0.043) (0.003) (0.113) (0.001) (1.203) (0.018)
Spell Duration in Months 0.179*** 0.006*** 0.330*** 0.004*** -2.133*** -0.115***
(0.007) (0.000) (0.014) (0.000) (0.117) (0.002)
Baseline(Mean) 1.189 0.158 16.251 0.197 41.853 1.071
Time Fixed Effect Yes Yes Yes Yes Yes Yes
Observations 33084 33084 33084 33084 33084 33084
R-squared Adj. 0.063 0.215 0.565 0.461 0.115 0.111

Notes: The table shows a regression with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

In Tables 9 and 10, we investigate heterogeneous treatment effects for consumers based
on BMI. Specifically, we split our sample at the median BMI and separately examine the
two sub-samples of consumers below and above the median BMI. These results are reported in Tables 9 and 10, respectively. We find that consumers who have above-median BMI
to start with set higher goals than those with below-median BMI (21.321 kg vs. 10.176
Table 8: Impact of AI+Human Coaches for Older Consumers

Columns (left to right): Weight Loss (Kgs), Weight Loss Fraction, Weight Loss Goal (Kgs), Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
AI + Human Coach 0.937*** 0.039*** 1.686*** 0.018*** 53.593*** 0.278***
(0.050) (0.003) (0.136) (0.002) (1.516) (0.021)
Age 0.001 0.002*** -0.099*** -0.001*** 1.781*** 0.004*
(0.006) (0.000) (0.015) (0.000) (0.146) (0.002)
BMI -0.091*** -0.024*** 1.8486*** 0.016*** -3.927*** -0.016***
(0.011) (0.001) (0.024) (0.000) (0.207) (0.003)
Female -0.513*** -0.040*** -0.456*** 0.026*** -5.671*** -0.106***
(0.048) (0.003) (0.130) (0.001) (1.586) (0.021)
Spell Duration in Months 0.182*** 0.005*** 0.346*** 0.004*** -2.158*** -0.113***
(0.007) (0.000) (0.016) (0.000) (0.141) (0.002)
Baseline(Mean) 1.218 0.173 15.513 0.185 58.830 1.097
Time Fixed Effect Yes Yes Yes Yes Yes Yes
Observations 33047 33047 33047 33047 33047 33047
R-squared Adj. 0.099 0.215 0.511 0.426 0.110 0.102

Notes: The table shows regressions with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

kg) but have lower weight loss in the baseline (1.060 kg vs. 1.355 kg). The impact of AI+Human coaches in absolute terms is also lower for these consumers than for those with below-median BMI. These patterns are perhaps reflective of the greater challenge faced by consumers who are overweight. However, what is interesting is that even in relative terms, the impact of AI+Human coaches is higher for consumers with below-median BMI (an 80.1% increase in weight loss due to AI+Human coaches vs. 69.7% for consumers with above-median BMI). Also interesting is the fact that the baseline rate of food logging for above-median consumers is much lower than for below-median consumers (43.31 vs. 58.28 weekly logs), but the difference made by AI+Human coaches in food logging is almost the same in relative terms for these two sets of consumers (about 103%). Where AI+Human coaches make a bigger difference for below-median BMI consumers than for above-median consumers is in weight logging (a relative impact of 23.8% vs. 18.7%), and what is more, this comes
on a somewhat higher baseline rate of weight logging (1.137 weekly weight logs vs. 1.033
logs). These results are contrary to our expectations: we expected human coaches to make a bigger difference for higher-BMI consumers, who typically face a greater challenge in their weight loss goals. While we cannot directly examine the reasons for the results we find, we conjecture that perhaps the greater accountability, empathy, and on-the-spot customization that human coaches provide are counterbalanced by the greater degree of embarrassment these consumers might face when interacting with a human coach. Consumers with a lower starting BMI might not face embarrassment to the same extent.

Table 9: Impact of AI+Human Coaches for Consumers with BMI below median

Columns (left to right): Weight Loss (Kgs), Weight Loss Fraction, Weight Loss Goal (Kgs), Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
AI + Human Coach 1.085*** 0.068*** 1.139*** 0.015*** 60.412*** 0.271***
(0.046) (0.003) (0.083) (0.001) (1.496) (0.019)
Age -0.003 0.002*** -0.097*** -0.001*** 2.142*** 0.011***
(0.004) (0.000) (0.006) (0.000) (0.129) (0.002)
BMI -0.231*** -0.054*** 1.483*** 0.017*** -5.592*** -0.025***
(0.054) (0.001) (0.051) (0.000) (0.452) (0.005)
Female -0.713*** -0.082*** 0.421*** 0.031*** -8.772*** -0.106***
(0.049) (0.003) (0.084) (0.001) (1.605) (0.021)
Spell Duration in Months 0.213*** 0.008*** 0.317*** 0.004*** -2.199*** -0.118***
(0.008) (0.000) (0.011) (0.000) (0.159) (0.002)
Baseline(Mean) 1.355 0.227 10.176 0.140 58.276 1.137
Time Fixed Effects Yes Yes Yes Yes Yes Yes
Observations 33075 33075 33075 33075 33075 33075
R-squared Adj. 0.114 0.213 0.263 0.278 0.119 0.091

Notes: The table shows a regression with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

Finally, we examine how earlier vs. later subscribers to the paid plans differ in the impact of human coaches. We split our sample at the median start date of the paid journey and separately analyze consumers whose start dates fall before and after this split date. We refer to them as earlier and later subscribers, respectively. The conjecture
Table 10: Impact of AI+Human Coaches for Consumers with BMI above median

Columns (left to right): Weight Loss (Kgs), Weight Loss Fraction, Weight Loss Goal (Kgs), Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
AI + Human Coach 0.739*** 0.020*** 1.783*** 0.018*** 44.896*** 0.193***
(0.045) (0.002) (0.152) (0.002) (1.136) (0.012)
Age 0.002 0.001*** -0.132*** -0.001*** 1.512*** 0.007***
(0.004) (0.000) (0.014) (0.000) (0.088) (0.001)
BMI -0.054*** -0.010*** 2.050*** 0.014*** -1.853*** 0.001
(0.008) (0.000) (0.032) (0.000) (0.193) (0.003)
Female -0.342*** -0.014*** -1.024*** 0.030*** 5.258*** 0.004
(0.045) (0.002) (0.144) (0.002) (1.168) (0.018)
Spell Duration in Months 0.158*** 0.004*** 0.347*** 0.003*** -2.083*** -0.110***
(0.007) (0.000) (0.016) (0.000) (0.112) (0.002)
Baseline(Mean) 1.060 0.107 21.321 0.239 43.312 1.033
Time Fixed Effects Yes Yes Yes Yes Yes Yes
Observations 33056 33056 33056 33056 33056 33056
R-squared Adj. 0.050 0.065 0.355 0.241 0.097 0.117

Notes: The table shows regressions with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

is that later adopters are less tech-savvy than earlier adopters, and human coaches might therefore make a bigger difference for them. We report these results in Tables 11 and 12, respectively.
We find that later subscribers have a greater baseline rate of weight loss with the AI coach
plans than earlier subscribers (1.284 kg vs. 1.120 kg). They also have a greater difference in
weight loss due to AI+Human coaches (0.98 kg vs. 0.80 kg) and a greater relative impact as
well (76.5% vs. 71.3%). The big point of difference between these two sets of consumers is
in goal-setting behavior. While the baseline weight loss goals for later subscribers are lower
than for earlier subscribers (15.66 kg vs. 16.09 kg), the impact of human coaches on goal setting is higher for later subscribers both in absolute terms (1.60 kg vs. 1.28 kg) and in relative terms (10.2% vs. 8.0%). By contrast, the relative impacts of AI+Human coaches on weight logging and food logging are both higher for earlier subscribers than for later subscribers, though the corresponding baselines are smaller.
Table 11: Impact of AI+Human Coaches for Earlier Subscribers

Columns (left to right): Weight Loss (Kgs), Weight Loss Fraction, Weight Loss Goal (Kgs), Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
Below Start Date Split
AI + Human Coach 0.798*** 0.043*** 1.282*** 0.014*** 49.189*** 0.286***
(0.047) (0.002) (0.116) (0.001) (1.246) (0.018)
Age -0.000 0.001*** -0.115*** -0.001*** 1.882*** 0.010***
(0.004) (0.000) (0.009) (0.000) (0.104) (0.001)
BMI -0.115*** -0.024*** 1.882*** 0.016*** -3.2402*** -0.012***
(0.010) (0.000) (0.012) (0.000) (0.165) (0.002)
Female -0.479*** -0.039*** -0.388*** 0.028*** -3.555*** -0.036**
(0.045) (0.003) (0.112) (0.0012) (1.292) (0.018)
Spell Duration in Months 0.210*** 0.006*** 0.386*** 0.004*** -1.085*** -0.112***
(0.008) (0.000) (0.016) (0.000) (0.151) (0.002)
Baseline(Mean) 1.120 0.155 16.091 0.193 45.231 1.040
Time Fixed Effect Yes Yes Yes Yes Yes Yes
Observations 33149 33149 33149 33149 33149 33149
R-squared Adj. 0.062 0.217 0.543 0.449 0.108 0.094

Notes: The table shows a regression with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

4.3 Treatment Effects for Fixed Time Periods

So far, we have analyzed the impact of AI+Human coaches using a weight loss spell as
the unit of analysis, with an average spell lasting about 3 months. However, it is plausible
that the length of the spell itself is endogenous; in other words, the type of coach may itself affect the duration of the spell. To conduct a more like-to-like analysis, we
consider a fixed duration of time rather than spells as our unit of analysis. There are two
drawbacks to this analysis, which is why we do not present these as our primary results. First,
we do not have weight measurements at the end of that duration unless a user happened to
measure their weight on the exact day that this duration of analysis ended. For instance, if
we wish to compare users on the two types of plans in terms of their weight loss 3 months
after they started their journey on one of the two types of paid plans, we have the starting
Table 12: Impact of AI+Human Coaches for Later Subscribers

Columns (left to right): Weight Loss (Kgs), Weight Loss Fraction, Weight Loss Goal (Kgs), Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
AI + Human Coach 0.982*** 0.040*** 1.595*** 0.018*** 55.092*** 0.181***
(0.045) (0.003) (0.122) (0.001) (1.306) (0.020)
Age -0.001 0.002*** -0.123*** -0.001*** 1.721*** 0.008***
(0.004) (0.000) (0.011) (0.000) (0.099) (0.002)
BMI -0.078*** -0.023*** 1.849*** 0.016*** -3.708*** -0.013***
(0.011) (0.000) (0.022) (0.000) (0.171) (0.003)
Female -0.511*** -0.042*** -0.095 0.032*** 1.178 -0.056***
(0.047) (0.003) (0.121) (0.001) (1.381) (0.021)
Spell Duration in Months 0.161*** 0.005*** 0.306*** 0.003*** -2.871*** -0.115***
(0.007) (0.000) (0.013) (0.000) (0.115) (0.002)
Baseline(Mean) 1.284 0.177 15.658 0.188 55.786 1.125
Time Fixed Effects Yes Yes Yes Yes Yes Yes
Observations 32984 32984 32984 32984 32984 32984
R-squared Adj. 0.103 0.219 0.532 0.438 0.123 0.114

Notes: The table shows regressions with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1

weight for all users, but the ending weight is only available to the subset of users who
happened to enter their weight exactly 3 months after they started their paid subscriptions.
The need for interpolation to find the ending weight leads to the potential for measurement
error and loss of observations. Second, if we were to compare users after longer periods,
e.g., 6 months vs. 3 months, we would have fewer users who have continued to be active
and logged their weights after that long time period. Nevertheless, we conduct an analysis
of fixed time periods to check for the robustness of our main conclusions and to assess how
using the app for longer periods of time impacts weight loss, specifically the difference that
human coaches make.
To conduct this analysis, we first sub-sample the data to get consumers who stay on
the app for at least T months, where T has the values 3, 6, 9, and 12. Then, we fix the
time period for analysis, i.e. the time period after which we compare weight loss for the
consumers. For the sub-sample that stays on the app for at least 3 months, this period
is also 3 months. For the 6-month sub-sample, this period is either 3 or 6 months. For the 9-month sub-sample, this period is 3, 6, or 9 months. And finally, for the 12-month sub-sample, this analysis period is 3, 6, 9, or 12 months. Next, we interpolate the weight
at the end of each analysis time period, for each sub-sample described above. If there is a
weight log on that date, then we use that weight. If there is no weight log on that day, we
impute the weight on that day using linear interpolation. Thus, if the user had a weight
log 5 days before the focal date, and again 3 days after, we draw a line between the weights
logged on these two dates, and impute the weight on the focal date as the point on the line
corresponding to the focal date. We then use these imputed weights to conduct the analysis
as we did before, with the same set of controls including fixed effects. First, we report some
summary statistics for the different sub-samples in Table 13.
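The imputation step described above can be sketched as follows; the function and data layout are illustrative assumptions rather than the exact implementation used for the analysis.

```python
from datetime import date

def impute_weight(logs: dict[date, float], focal: date) -> float | None:
    """Linearly interpolate a user's weight (in kg) on the focal date from the
    nearest weight logs before and after it; return None if either is missing."""
    if focal in logs:
        return logs[focal]
    before = [d for d in logs if d < focal]
    after = [d for d in logs if d > focal]
    if not before or not after:
        return None
    d0, d1 = max(before), min(after)
    w0, w1 = logs[d0], logs[d1]
    frac = (focal - d0).days / (d1 - d0).days
    return w0 + frac * (w1 - w0)

# Example mirroring the text: a log 5 days before and 3 days after the focal date.
logs = {date(2021, 1, 1): 80.0, date(2021, 1, 9): 78.4}
print(impute_weight(logs, date(2021, 1, 6)))  # 79.0
```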

Table 13: Summary Statistics for the Different Sub-Samples of the Data

Columns (left to right): Months, Age Mean, Age Std. Dev., BMI Mean, BMI Std. Dev., Female Mean, Female Std. Dev.
3 32.820 7.640 28.718 3.904 0.609 0.488
6 33.213 7.647 29.078 3.866 0.584 0.493
9 34.609 8.126 29.452 4.028 0.542 0.498
12 35.140 8.632 29.416 4.081 0.539 0.499

Notes: The table reports summary statistics for the different sub-samples of the data for 3, 6, 9, and 12
months duration.

Next, we report the regression results using the imputed weights in Table 14. First,
considering the 3-month sub-sample, for which we conduct a regression on imputed weights at the end of 3 months, we find that the impact of AI+Human coaches is 1.092 kg. This is comparable to the 0.906 kg difference we reported in our spell-level regressions in Table 4. This is not surprising since the average spell duration is about 3 months, and the sample sizes across these two regressions are quite similar.
Next, we consider the 6-month sub-sample, i.e., users who stayed on the app and recorded their weights for at least 6 months, for whom we impute weights at the 3- and 6-month time
periods after they started. The impact of AI+Human coaches at the end of 3 months for
this sub-sample is almost identical to that for the 3-month sub-sample, at 1.096 kg for the former vs. 1.092 kg for the latter. This is again not surprising given the comparable sample sizes. What is interesting is that for this sub-sample, the difference that AI+Human coaches make after 6 months increases to 1.908 kg, which is about 74.1% higher than that at the 3-month mark. Note that the sub-samples remain the same in these two regressions conducted at the 3- and 6-month marks; thus this is a valid like-to-like comparison. There is, however, some deceleration in weight loss in the second block of 3 months relative to the first block of 3 months.
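To see the deceleration explicitly, note that the incremental coach effect in the second 3-month block is the 6-month estimate minus the 3-month estimate from Table 14:

\[
1.908 - 1.096 = 0.812 \text{ kg} \;<\; 1.096 \text{ kg},
\]

so the additional difference attributable to AI+Human coaches in months 4 to 6 (about 0.81 kg) is smaller than that accumulated in months 1 to 3 (about 1.10 kg), even though the cumulative 6-month effect is about 74.1% larger than the 3-month effect (1.908/1.096 ≈ 1.74).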
Considering the 9-month sub-sample, i.e., users who stayed on the app for at least 9 months, we find that the sample size drops considerably because of this restriction, relative to the previously discussed 3-month and 6-month sub-samples. Also, these seem to be more committed users, with higher weight loss over comparable periods of time than users in the 3-month and 6-month sub-samples. For this sample of users, the AI+Human coach difference is about 1.6 kg at the end of 3 months, about 2.8 kg at the end of 6 months, and 3.1 kg at the end of 9 months. The 6-month difference is about 71% higher than the 3-month difference for this sample, broadly mirroring what we found for the 6-month sub-sample. However, there is a sharp deceleration after this period: the 9-month difference is only 13% higher than the 6-month difference for this group.
Finally, looking at the 12-month sub-sample, we find that these consumers have even
higher levels of weight loss. The difference that AI+Human coaches make at the 3-month
mark is 2.112 kg, at the 6-month mark is 3.538 kg, at the 9-month mark is 3.870 kg and at
the 12-month mark is 3.833 kg.
The previous analysis, reported in Table 14, examined the differential impact of AI+Human coaches for sub-samples of consumers who stayed on the app for different durations. For instance, to examine the impact on weight loss at 3, 6, 9, and 12 months, we considered a sub-sample of consumers who were active for at least 12 months. We similarly looked at
Table 14: Regression results for Weight Loss (Dependent Variable) at a Fixed Time Duration

Sub-Sample Duration Age Female BMI Human Coach Baseline(Mean) No. Of Observations
3 months 3 months 0.002 -0.873*** 0.334*** 1.092*** 2.128 9565
(0.007) (0.102) (0.036) (0.100)
6 months 3 months -0.008 -0.867*** 0.282*** 1.096*** 1.885 4589
(0.008) (0.129) (0.046) (0.128)
6 months 6 months 0.003 -0.83*** 0.394*** 1.908*** 2.819 4589
(0.010) (0.16) (0.046) (0.161)
9 months 3 months 0.002 -0.840*** 0.296*** 1.621*** 2.265 1625
(0.013) (0.221) (0.112) (0.221)
9 months 6 months 0.021 -0.918*** 0.398*** 2.770*** 3.319 1625
(0.017) (0.268) (0.110) (0.269)
9 months 9 months 0.018 -0.886*** 0.513*** 3.136*** 3.895 1625
(0.018) (0.291) (0.110) (0.302)
12 months 3 months 0.019 -0.655* 0.35 2.112*** 2.472 713
(0.019) (0.335) (0.232) (0.290)
12 months 6 months 0.024 -0.673* 0.418* 3.538*** 3.531 713
(0.021) (0.379) (0.223) (0.352)
12 months 9 months 0.02 -0.557 0.499** 3.870*** 4.013 713
(0.024) (0.418) (0.221) (0.402)
12 months 12 months 0.024 -0.498 0.625*** 3.833*** 4.371 713
(0.026) (0.454) (0.211) (0.441)

Notes: Here, sub-sample(S) denotes the group of users who are active on the app for at least S months, and
Duration(T) represents the number of months after which weight loss is considered. The table reports the
regression results for Weight Loss at a Fixed Time Duration of T months for the sub-sample S of users as
the dependent variable. The coefficients for independent variables are reported in the table.

sub-samples that had stayed at least 3, 6, and 9 months. However, the 9-month sub-sample, for instance, includes consumers in the 12-month sub-sample, making comparisons across samples a bit harder. We therefore conduct another set of analyses that examines mutually
exclusive samples. We consider consumers who stayed on the app between 3 and 6 months,
between 6 and 9 months, between 9 and 12 months, and greater than 12 months. The rest of
the procedure of the analysis remains the same. These results are reported in Web Appendix
A.4 and show that the results remain substantively the same.

4.4 Robustness to Alternative Matching Methods

Our main analysis relies on the commonly used matching algorithm with replacement. In addition, we use Mahalanobis matching without replacement with 0.1 as the threshold for considering residual imbalance, nearest-neighbor matching with replacement, and Mahalanobis matching without replacement after removing outlying observations in the pre-period. Table 15 in Web Appendix A.2 reports the coefficients of the main variable of interest (i.e., the impact of AI+Human coaches) for weight loss and the other outcome measures. We find that the results remain robust to these different approaches to matching.
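As an illustration of one of these alternatives, the sketch below shows nearest-neighbor matching (with replacement) on the Mahalanobis distance. It is a sketch under stated assumptions: the covariate names and dataframe layout are hypothetical, and the actual matching uses the richer pre-period behavioral variables described earlier in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def mahalanobis_match(treated: pd.DataFrame, control: pd.DataFrame,
                      covariates: list[str]) -> pd.DataFrame:
    """Match each AI+Human (treated) spell to its nearest AI-only (control)
    spell using the Mahalanobis distance on the listed covariates."""
    X_t = treated[covariates].to_numpy(dtype=float)
    X_c = control[covariates].to_numpy(dtype=float)
    # Inverse covariance matrix computed on the pooled covariates.
    vi = np.linalg.inv(np.cov(np.vstack([X_t, X_c]), rowvar=False))
    nn = NearestNeighbors(n_neighbors=1, metric="mahalanobis",
                          metric_params={"VI": vi}).fit(X_c)
    dist, idx = nn.kneighbors(X_t)
    matched = control.iloc[idx.ravel()].reset_index(drop=True)
    matched["match_distance"] = dist.ravel()
    return matched

# Illustrative usage with hypothetical column names:
# matched = mahalanobis_match(coach_spells, ai_spells,
#                             ["age", "bmi", "pre_food_logs", "pre_weight_logs"])
```

Variants such as matching without replacement would additionally remove each matched control from the pool so that it cannot be reused.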

5 Conclusion

In this study, we conduct the first causal evaluation of the incremental value of human coaches over and above AI tools in helping consumers achieve their health outcomes in a real-world setting. We document that human coaches do better than AI coaches in helping consumers achieve their weight loss goals. Importantly, there are significant differences in this effect across consumer groups, which suggests that a one-size-fits-all approach might not be most effective. Specifically, the incremental benefit of human coaches over AI coaches is larger for consumers below the median BMI than for those above it, for consumers below the median age than for those above it, for consumers who spend below-median time in a spell than for those who spend above-median time, and for female consumers relative to male consumers.
Finally, we discuss some limitations of our study and the potential for future research.
First, we conduct our analysis using observational data rather than a randomized controlled
trial. While we employ a rigorous matching approach to get as close to causal estimates as
possible using these data, it is not a substitute for an experiment. In particular, while we
attempt to rule out selection into AI vs. AI+Human plans on observables, our approach
cannot fully eliminate concerns about selection on unobservables. Second, our measures
for weight loss and consumers' logging activities are all self-reported. Thus, there might be reporting biases between the different types of plans that our estimates would not be able to uncover. Future work could also look into the temporal patterns associated with food logging, i.e., the weight loss behavior of individuals who log their meals one at a time versus in batches. In sum, while challenging to set up in this context, future work should aim to obtain experimental estimates of the incremental impact human coaches can make, and to use objective rather than self-reported measures.

References
Abadie, A. and Imbens, G. W. (2006), 'Large sample properties of matching estimators for average treatment effects', Econometrica 74(1), 235–267.
Amorose, A. J. and Weiss, M. R. (1998), 'Coaching feedback as a source of information about perceptions of ability: A developmental examination', Journal of Sport and Exercise Psychology 20(4), 395–420.
Ashtary-Larky, D., Ghanavati, M., Lamuchi-Deli, N., Payami, S. A., Alavi-Rad, S., Boustaninejad, M., Afrisham, R., Abbasnezhad, A. and Alipour, M. (2017), 'Rapid weight loss vs. slow weight loss: which is more effective on body composition and metabolic risk factors?', International Journal of Endocrinology and Metabolism 15(3).
Belkin, M. and Niyogi, P. (2002), Using manifold structure for partially labeled classification, in 'Advances in Neural Information Processing Systems', pp. 929–936.
Beyer, S. (1990), 'Gender differences in the accuracy of self-evaluations of performance', Journal of Personality and Social Psychology 59(5), 960.
Blüher, M. (2019), 'Obesity: global epidemiology and pathogenesis', Nature Reviews Endocrinology 15(5), 288–298.
Cadario, R., Longoni, C. and Morewedge, C. (2021), 'Understanding, explaining, and utilizing medical artificial intelligence', Nature Human Behaviour.
Chin, S. O., Keum, C., Woo, J., Park, J., Choi, H. J., Woo, J.-t. and Rhee, S. Y. (2016), 'Successful weight reduction and maintenance by using a smartphone application in those with overweight and obesity', Scientific Reports 6(1), 1–8.
Datta, H., Knox, G. and Bronnenberg, B. J. (2018), 'Changing their tune: How consumers' adoption of online streaming affects music consumption and discovery', Marketing Science 37(1), 5–21.
De Maesschalck, R., Jouan-Rimbaud, D. and Massart, D. L. (2000), 'The Mahalanobis distance', Chemometrics and Intelligent Laboratory Systems 50(1), 1–18.
Dolatshah, M., Hadian, A. and Minaei-Bidgoli, B. (2015), 'Ball*-tree: Efficient spatial indexing for constrained nearest-neighbor search in metric spaces', arXiv preprint arXiv:1511.00628.
Eckles, D., Karrer, B. and Ugander, J. (2017), 'Design and analysis of experiments in networks: Reducing bias from interference', Journal of Causal Inference 5(1).
Egan, D. E. (1988), Individual differences in human-computer interaction, in 'Handbook of Human-Computer Interaction', Elsevier, pp. 543–568.
Eskreis-Winkler, L. and Fishbach, A. (2019), 'Not learning from failure—the greatest failure of all', Psychological Science 30(12), 1733–1744.
Ferdinand, N. K. and Kray, J. (2013), 'Age-related changes in processing positive and negative feedback: Is there a positivity effect for older adults?', Biological Psychology 94(2), 235–241.
Finkelstein, S. R. and Fishbach, A. (2012), 'Tell me what I did wrong: Experts seek and respond to negative feedback', Journal of Consumer Research 39(1), 22–38.
Fishbach, A., Eyal, T. and Finkelstein, S. R. (2010), 'How positive and negative feedback motivate goal pursuit', Social and Personality Psychology Compass 4(8), 517–530.
Fukunaga, K. and Narendra, P. M. (1975), 'A branch and bound algorithm for computing k-nearest neighbors', IEEE Transactions on Computers 100(7), 750–753.
Gordon, M., Althoff, T. and Leskovec, J. (2019), Goal-setting and achievement in activity tracking apps: a case study of MyFitnessPal, in 'The World Wide Web Conference', pp. 571–582.
Goukens, C., Dewitte, S. and Warlop, L. (2009), 'Me, myself, and my choices: The influence of private self-awareness on choice', Journal of Marketing Research 46(5), 682–692.
Ingels, J. S., Misra, R., Stewart, J., Lucke-Wold, B. and Shawley-Brzoska, S. (2017), 'The effect of adherence to dietary tracking on weight loss: using HLM to model weight loss over time', Journal of Diabetes Research 2017.
Jami, A. (2016), 'Healthy reflections: The influence of mirror-induced self-awareness on taste perceptions', Journal of the Association for Consumer Research 1(1), 57–70.
Kast, A. and Connor, K. (1988), 'Sex and age differences in response to informational and controlling feedback', Personality and Social Psychology Bulletin 14(3), 514–523.
Knäuper, B., Carriere, K., Frayn, M., Ivanova, E., Xu, Z., Ames-Bull, A., Islam, F., Lowensteyn, I., Sadikaj, G., Luszczynska, A. et al. (2018), 'The effects of if-then plans on weight loss: Results of the McGill CHIP Healthy Weight Program randomized controlled trial', Obesity 26(8), 1285–1295.
Kozlowski, S. W. and Bell, B. S. (2006), 'Disentangling achievement orientation and goal setting: Effects on self-regulatory processes', Journal of Applied Psychology 91(4), 900.
Laing, B. Y., Mangione, C. M., Tseng, C.-H., Leng, M., Vaisberg, E., Mahida, M., Bholat, M., Glazier, E., Morisky, D. E. and Bell, D. S. (2014), 'Effectiveness of a smartphone application for weight loss compared with usual care in overweight primary care patients: a randomized, controlled trial', Annals of Internal Medicine 161(10 Supplement), S5–S12.
Liu, F., Kong, X., Cao, J., Chen, S., Li, C., Huang, J., Gu, D. and Kelly, T. N. (2015), 'Mobile phone intervention and weight loss among overweight and obese adults: a meta-analysis of randomized controlled trials', American Journal of Epidemiology 181(5), 337–348.
Longoni, C., Bonezzi, A. and Morewedge, C. K. (2019), 'Resistance to medical artificial intelligence', Journal of Consumer Research 46(4), 629–650.
Mehrotra, R., McInerney, J., Bouchard, H., Lalmas, M. and Diaz, F. (2018), Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems, in 'Proceedings of the 27th ACM International Conference on Information and Knowledge Management', pp. 2243–2251.
Meyers-Levy, J. and Loken, B. (2015), 'Revisiting gender differences: What we know and what lies ahead', Journal of Consumer Psychology 25(1), 129–149.
Mitchell, E. G., Maimone, R., Cassells, A., Tobin, J. N., Davidson, P., Smaldone, A. M. and Mamykina, L. (2021), 'Automated vs. human health coaching: Exploring participant and practitioner experiences', Proceedings of the ACM on Human-Computer Interaction 5(CSCW1). URL: https://doi-org.stanford.idm.oclc.org/10.1145/3449173
Moore, A. (2013), 'The anchors hierarchy: Using the triangle inequality to survive high dimensional data', arXiv preprint arXiv:1301.3877.
Motro, D. and Ellis, A. P. (2017), 'Boys, don't cry: Gender and reactions to negative performance feedback', Journal of Applied Psychology 102(2), 227.
Omohundro, S. (1990), 'Bumptrees for efficient function, constraint and classification learning', Advances in Neural Information Processing Systems 3.
Patel, M. L., Hopkins, C. M., Brooks, T. L. and Bennett, G. G. (2019), 'Comparing self-monitoring strategies for weight loss in a smartphone app: randomized controlled trial', JMIR mHealth and uHealth 7(2), e12209.
Pham, M. T., Goukens, C., Lehmann, D. R. and Stuart, J. A. (2010), 'Shaping customer satisfaction through self-awareness cues', Journal of Marketing Research 47(5), 920–932.
Phillips, J. M. and Gully, S. M. (1997), 'Role of goal orientation, ability, need for achievement, and locus of control in the self-efficacy and goal-setting process', Journal of Applied Psychology 82(5), 792.
Rosenbaum, P. R. and Rubin, D. B. (1983), 'The central role of the propensity score in observational studies for causal effects', Biometrika 70(1), 41–55.
Sentyrz, S. M. and Bushman, B. J. (1998), 'Mirror, mirror on the wall, who's the thinnest one of all? Effects of self-awareness on consumption of full-fat, reduced-fat, and no-fat products', Journal of Applied Psychology 83(6), 944.
Stypińska, J. (2021), Ageism in AI: new forms of age discrimination in the era of algorithms and artificial intelligence, in 'CAIP 2021: Proceedings of the 1st International Conference on AI for People: Towards Sustainable AI, 20-24 November 2021, Bologna, Italy', European Alliance for Innovation, p. 39.
Uetake, K. and Yang, N. (2020), 'Inspiration from the "biggest loser": social interactions in a weight loss program', Marketing Science 39(3), 487–499.
Volpp, K. G., John, L. K., Troxel, A. B., Norton, L., Fassbender, J. and Loewenstein, G. (2008), 'Financial incentive–based approaches for weight loss: a randomized trial', JAMA 300(22), 2631–2637.
Wager, S. and Athey, S. (2018), 'Estimation and inference of heterogeneous treatment effects using random forests', Journal of the American Statistical Association 113(523), 1228–1242.
Wharton, C. M., Johnston, C. S., Cunningham, B. K. and Sterner, D. (2014), 'Dietary self-monitoring, but not dietary quality, improves with use of smartphone app technology in an 8-week weight loss trial', Journal of Nutrition Education and Behavior 46(5), 440–444.
Yancy, W. S., Shaw, P. A., Reale, C., Hilbert, V., Yan, J., Zhu, J., Troxel, A. B., Foster, G. D. and Volpp, K. G. (2019), 'Effect of escalating financial incentive rewards on maintenance of weight loss: a randomized clinical trial', JAMA Network Open 2(11), e1914393.
Zhao, P., Su, X., Ge, T. and Fan, J. (2016), 'Propensity score and proximity matching using random forest', Contemporary Clinical Trials 47, 85–92.
A Web Appendices

A.1 Web Appendix: App Screens

Figure 8: FREE Plan app screen

Notes: The figure above shows the interface of the application under the ‘Free’ plan. The plan has an AI
coach (with a limited set of features compared to a paid AI plan) and is free.

Figure 9: Smart Plan app screen

Notes: The Figure shows a snapshot of the interface of a user enrolled for the Smart Plan on the HealthifyMe
app. This is a paid plan with an AI coach.

Figure 10: Coach Plan app screen

Notes: The Figure shows a snapshot of the interface of a user enrolled for the Paid Plan involving a Human
coach on the HealthifyMe app.

A.2 Web Appendix: Comparing AI+Human Effects across Different Matching Procedures

Table 15: Comparing Different Matching Procedures


Columns (left to right): Weight Loss, Weight Loss Fraction, Weight Loss Goal, Weight Loss Goal Fraction, Weekly No. of Food Logs, Weekly No. of Weight Logs
Nearest Neighbor
(NNR) 0.948*** 0.041*** 1.567*** 0.018*** 51.910*** 0.222***
(0.035) (0.002) (0.098) (0.001) (0.977) (0.015)
Mahalanobis
without replacement 0.894*** 0.042*** 1.445*** 0.016*** 52.386*** 0.231***
(0.034) (0.002) (0.089) (0.001) (0.957) (0.014)
Mahalanobis
(outlier removal) 1.049*** 0.050*** 1.600*** 0.018*** 53.217*** 0.259***
(0.039) (0.002) (0.105) (0.001) (1.144) (0.017)
Mahalanobis (0.1) 0.901*** 0.045*** 1.237*** 0.014*** 51.175*** 0.205***
(0.047) (0.002) (0.108) (0.001) (1.088) (0.017)

Notes: The table shows a regression with robust standard errors clustered at the user level in parentheses.
*** p <0.01, ** p <0.05, * p <0.1. Estimates are calculated on the matched sample of adopters of the Coach
plan and the AI plan using different matching methods. Month Fixed effects are used, and the unit of analysis
is the user spell. The dependent variable in Column-1 is Weight Loss, Column-2 is Weight Loss Fraction, Column-3 is Weight Loss Goal, Column-4 is Weight Loss Goal Fraction, Column-5 is the Weekly Number of Food Logs, and Column-6 is the Weekly Number of Weight Logs. The independent variables are indicators for the user's adoption of a paid plan, age of the user, BMI of the user, and the number of months since the user joined the corresponding plan. Mahalanobis (0.02) - Mahalanobis matching without replacement with a 0.02 threshold. Mahalanobis (0.1) - Mahalanobis matching without replacement with a 0.1 threshold.
NNR - nearest neighbor matching with replacement. Mahalanobis (outlier removal) - Mahalanobis matching
without replacement with 0.02 threshold after outlier removal in the pre-period.

A.3 Web Appendix: Prior Studies in the Weight Loss Domain

Table 16: Weight Loss Papers

Chin et al. (2016). Nature of comparison: Male-app versus Female-app. Type of estimate: Retrospective cohort study. Estimate of weight loss: Baseline BMI was 30.2±0.1 kg/m2 for males and 28.0±0.0 kg/m2 for females. Among the participants (35,921 Noom users), 77.9% reported a decrease in body weight while they were using the app, with 22.7% experiencing more than 10% weight loss compared with baseline and a higher weight loss success rate in males (83.9 vs. 76.1%), resulting in final BMIs of 28.1±0.1 kg/m2 for males and 26.5±0.0 kg/m2 for females.

Uetake and Yang (2020). Nature of comparison: Weight loss program with in-person group meetings as an important component. Type of estimate: Observational study. Estimate of weight loss: The weight loss for the average peer leads to individual weight gain, as a 1 kg increase in the average peer performer's weight loss is associated with an individual's decrease in weight loss by about 0.02 kg. In contrast, weight loss by the top performer leads to increased individual weight loss, as a 1 kg increase in the top performer's weight loss is associated with an individual's increase in weight loss by about 0.01 kg. A large number of observations (about 41%) involve participants gaining weight.

Ingels et al. (2017). Nature of comparison: DPM program (West Virginia) (2015-16). Type of estimate: Observational study. Estimate of weight loss: The study divided participants into three tracking groups: rare (<33% of days tracked, n = 25), inconsistent (33-66%, n = 5), and consistent (>66%, n = 15). The average total weight loss for consistent trackers was 5.6 pounds (SD = 12.0) for a 49-week period.
Table 17: Weight Loss Papers (Continued ..)

Knäuper et al. (2018). Nature of comparison: The NIH-developed Diabetes Prevention Program (DPP) across two groups, i.e., standard or enhanced DPP (enhanced by integrating habit formation tools, i.e., if-then plans). Type of estimate: Field experiment. Estimate of weight loss: The McGill Healthy Weight Program was administered in 22 sessions over 12 months. Participants were randomly assigned by computer-generated 1:1 sequence either to the standard or the enhanced DPP. The active control group received the standard group-based DPP delivered over 1 year (12 weekly core sessions, 4 transitional sessions over 3 months, and 6 monthly support sessions). The enhanced DPP group followed the same program as the standard DPP group, but instructions for if-then planning were integrated into it. On average, participants lost 9.98% of their initial body weight in the program. At baseline, the standard DPP group had a slightly higher mean weight than the enhanced DPP group. Controlling for this difference, weight loss did not differ between the groups over the course of the intervention. Both groups displayed significant reductions in weight from baseline to 3 months and 12 months, losing, on average, 20.36 pounds over the course of the program.

Liu et al. (2015). Nature of comparison: Mobile phone app interventions compared with control group. Type of estimate: Meta-analysis. Estimate of weight loss: Compared with the control group, mobile phone app interventions resulted in significant decreases in body weight, with the pooled estimate of the net change in body weight being -1.04 kg (95% CI -1.75 to -0.34; I2 = 41%).

Volpp et al. (2008). Nature of comparison: Fifty-seven healthy participants aged 30-70 years with a body mass index of 30-40 were randomized to 3 weight loss plans: monthly weigh-ins, a lottery incentive program, or a deposit contract that allowed for participant matching, with a weight loss goal of 1 lb (0.45 kg) a week for 16 weeks. Type of estimate: Randomized trial. Estimate of weight loss: The incentive groups lost significantly more weight than the control group (mean, 3.9 lb). Compared with the control group, the lottery group lost a mean of 13.1 lb (95% confidence interval [CI] of the difference in means, 1.95-16.40; P=.02) and the deposit contract group lost a mean of 14.0 lb (95% CI of the difference in means, 3.69-16.43; P=.006).
Table 18: Weight Loss Papers (Continued..)

Wharton et al. (2014). Nature of comparison: Diet tracking and weight loss were compared across participants during an 8-week weight loss trial. Participants tracked intake using 1 of 3 methods: the mobile app "Lose It!", the memo feature on a smartphone, or a traditional paper-and-pencil method. Type of estimate: RCT. Estimate of weight loss: No difference in weight loss was noted between groups.

Yancy et al. (2019). Nature of comparison: All participants were advised to weigh themselves daily, with a goal of 6 or more days per week, and received text messaging feedback on their performance. Incentive group participants were eligible for a lottery-based incentive worth an expected value of $3.98 in week 1 that escalated by $0.43 each week they achieved their self-weighing goal during months 1 to 6 (phase 1), followed by no incentives during months 7 to 12 (phase 2). Type of estimate: RCT. Estimate of weight loss: Mean weight changes at the end of phase 1 were 1.1 (95% CI, 2.1 to 0.1) kg in the incentive group and 1.9 (95% CI, 2.9 to 0.8) kg in the control group, with a mean difference of 0.7 (95% CI, 0.7 to 2.2) kg (P = .30 for comparison). At the end of phase 2, mean weight changes were 0.2 (95% CI, 1.2 to 1.7) kg in the incentive group and 0.6 (95% CI, 2.0 to 0.8) kg in the control group, with a mean difference of 0.8 (95% CI, 1.2 to 2.8) kg (P = .41 for comparison).

Laing et al. (2014). Nature of comparison: Evaluate the effect of introducing primary care patients to a free smartphone app for weight loss; 2 academic primary care clinics; 6 months of usual care without (n = 107) or with (n = 105) assistance in downloading the MyFitnessPal app (MyFitnessPal). Type of estimate: RCT. Estimate of weight loss: After 6 months, weight change was minimal, with no difference between groups (mean between-group difference, -0.30 kg [95% CI, -1.50 to 0.95 kg]; P = 0.63).

Patel et al. (2019). Nature of comparison: Participants were randomized to a 12-week stand-alone weight loss intervention using the MyFitnessPal smartphone app for daily self-monitoring of either (1) both weight and diet, with weekly lessons, action plans, and feedback (Simultaneous); (2) weight through week 4, then added diet, with the same behavioral components (Sequential); or (3) only diet (App-Only). Type of estimate: RCT. Estimate of weight loss: There was no difference in weight change at 3 months between the Sequential arm (mean -2.7 kg, 95% CI -3.9 to -1.5) and either the App-Only arm (-2.4 kg, -3.7 to -1.2; P=.78) or the Simultaneous arm (-2.8 kg, -4.0 to -1.5; P=.72).
A.4 Web Appendix: Regression results for Weight Loss (Dependent Variable) at a Fixed Time Duration: Non-Overlapping Sub-Samples

The main results remain the same as in Section 4.3. The impact of AI+Human coaches
increases with time, but at diminishing rates. On average, there is an approximately 67-79%
(with variation based on which sub-sample we are looking at) increase in the weight loss
impact of AI+Human coaches at the 6-month mark relative to the 3-month mark. This is
quite comparable to the approximately 74% increase we saw in Section 4.3.

Table 19: Regression results for Weight Loss (Dependent Variable) at a Fixed Time Duration:
Non-Overlapping Sub-Samples

T2 - T1 Y const Age Female BMI Human Coach


6-3 months 3 months -9.192*** 0.01 -0.889*** 0.382*** 1.154***
(1.531) (0.010) (0.155) (0.055) (0.157)
9-6 months 3 months -5.162*** -0.02* -0.884*** 0.264*** 0.78***
(0.780) (0.011) (0.163) (0.027) (0.159)
9-6 months 6 months -8.028*** -0.021* -0.799*** 0.375*** 1.397***
(0.882) (0.013) (0.200) (0.031) (0.202)
12-9 months 3 months -4.569*** -0.016 -1.003*** 0.242*** 1.206***
(1.276) (0.019) (0.297) (0.050) (0.329)
12-9 months 6 months -8.613*** 0.017 -1.069*** 0.367*** 2.145***
(1.778) (0.027) (0.372) (0.067) (0.395)
12-9 months 9 months -12.329*** 0.012 -1.114*** 0.51*** 2.553***
(1.995) (0.027) (0.404) (0.078) (0.438)
12+ months 3 months -8.966 0.019 -0.655* 0.350 2.112***
(6.460) (0.019) (0.335) (0.232) (0.290)
12+ months 6 months -10.989* 0.024 -0.673* 0.418* 3.538***
(6.231) (0.021) (0.379) (0.223) (0.352)
12+ months 9 months -12.942** 0.02 -0.557 0.499** 3.87***
(6.180) (0.024) (0.418) (0.221) (0.402)
12+ months 12 months -16.149*** 0.024 -0.498 0.625*** 3.833***
(5.922) (0.026) (0.454) (0.211) (0.441)

Notes: The table reports the regression results for Weight Loss at a Fixed Time Duration (i.e., Weight Loss after Y months for the sample of users who stay on the app for between T1 and T2 months) as the dependent variable. The
coefficients for independent variables are reported in the table.

52
