RWJF Reality Mining Whitepaper 0309
Executive Summary
We live our lives in digital networks. We wake up in the morning, check our e-mail,
make a quick phone call, commute to work, buy lunch. Many of these transactions
leave digital breadcrumbs, tiny records of our daily experiences, as illustrated in
Figure 1. Reality mining, which pulls together these crumbs using statistical
analysis and machine learning methods, offers an increasingly comprehensive
picture of our lives, both individually and collectively, with the potential of
transforming our understanding of ourselves, our organizations, and our society in a
fashion that was barely conceivable just a few years ago. It is for this reason that
reality mining was recently identified by Technology Review as one of 10
emerging technologies that could change the world (Technology Review, April
2008).
services use, surveillance of disease and risk factors, and public health investigation
and disease control. The goal of this paper is to survey the potential of reality
mining to improve public health and medicine, and to make recommendations for
action.
Currently, the single most important source of reality mining data is the ubiquitous
mobile phone. Every time a person uses a mobile phone, a few bits of information
are left behind. The phone pings the nearest mobile-phone towers, revealing its
location. The mobile phone service provider records the duration of the call and the
number dialed.
In the near future, mobile phones and other technologies will collect even more
information about their users, recording everything from their physical activity to
their conversational cadences. While such data pose a potential threat to individual
privacy, they also offer great potential value both to individuals and communities.
With the aid of data-mining algorithms, these data could shed light on individual
patterns of behavior and even on the well-being of communities, creating new ways
of improving public health and medicine.
To illustrate, consider two examples of how reality mining may benefit individual
health care. By taking advantage of special sensors in mobile phones, such as the
microphone or the accelerometers built into newer devices like Apple's iPhone,
important diagnostic data can be captured. Clinical pilot data demonstrate that it
may be possible to diagnose depression from the way a person talks -- depressed
people tend to speak more slowly, a change that speech analysis software on a
phone might recognize more readily than friends or family do. Similarly,
monitoring a phone's motion sensors can also reveal small changes in gait, which
could be an early indicator of ailments such as Parkinson's disease.
Within the next few years reality mining will become more common, thanks in part
to the proliferation and increasing sophistication of mobile phones. Many handheld
devices now have the processing power of low-end desktop computers, and they
can also collect more varied data, due to components such as GPS chips that track
location. The Chief Technology Officer of EMC, a large digital storage company,
estimates that this sort of personal sensor data will balloon from 10% of all stored
information to 90% within the next decade.
While the promise of reality mining is great, the idea of collecting so much
personal information naturally raises many questions about privacy. It is crucial
that behavior-logging technology not be forced on anyone. But legal statutes are
lagging behind data collection capabilities, making it particularly important to
begin discussing how the technology will and should be used. Therefore, an
additional focus of this paper will be on developing a legal and ethical framework
for using reality mining techniques in research, medical care, other health service
delivery, and public health surveillance and disease control.
[Figure 2 labels, pairing each brain system with a behavioral measure: autonomic
(activity change); thalamic attention (influence on timing); mirror neurons
(mimicry); cerebellar motor (consistency of movement).]
Figure 2: Reality mining has shown that statistical analysis of behavior can be
related to the function of some major brain systems, providing capabilities that can
be thought of as a sort of low-resolution brain scanning technology.
These qualitative measurements of brain function have been shown to be powerful,
predictive measures of human behavior (Pentland, 2008). They play an important
role in human social interactions, serving as "honest signals" that provide social
cues to dominance, empathy, attention, and trust, and may offer new methods of
diagnosis, treatment monitoring, and population health assessments.
Self-report data can also be collected to complement the unobtrusive,
automatically-generated and -collected reality mining data streams. The widespread
use of portable digital devices such as cell phones and personal digital assistants
(PDAs) enables this marriage of subjective and objective data types. For over a
decade, these devices have been used to gather reported data from individuals
during the course of their daily lives on such phenomena as symptoms, substance
use, and mood.
The technologies for collecting self-reported data in this way are rapidly evolving,
but even existing approaches are flexible, automated, and deployable on a large
scale. Scheduling of self-reports can be fixed (e.g., on a daily or some more or less
frequent timetable), event-based (in response to an event experienced by the
respondent or a pattern in one of the data streams objectively recorded by the
digital device), or randomly determined. Depending on the device, questions or
prompts can be delivered aurally or visually, and responses can be given by voice
or touch (e.g., keypad presses or interaction with a display).
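The three scheduling modes described above can be sketched in a few lines of Python. This is only an illustration, not the scheduling logic of any particular study: the function names, the threshold-crossing rule used for event-based prompts, and the sample values are all assumptions.

```python
import random
from datetime import datetime, timedelta

def fixed_schedule(start, end, every):
    """Prompt times at a fixed interval, from start up to (not including) end."""
    times = []
    t = start
    while t < end:
        times.append(t)
        t += every
    return times

def random_schedule(start, end, n, seed=None):
    """n prompt times drawn uniformly at random between start and end, sorted."""
    rng = random.Random(seed)
    span = (end - start).total_seconds()
    return sorted(start + timedelta(seconds=rng.uniform(0, span)) for _ in range(n))

def event_schedule(readings, threshold):
    """Prompt each time a recorded value rises above threshold (assumed trigger rule).

    readings is a time-ordered list of (timestamp, value) pairs from one data stream.
    """
    times = []
    above = False
    for ts, value in readings:
        if value > threshold and not above:
            times.append(ts)
        above = value > threshold
    return times

# Hypothetical waking day: 09:00 to 21:00.
day_start = datetime(2009, 3, 1, 9, 0)
day_end = datetime(2009, 3, 1, 21, 0)
fixed = fixed_schedule(day_start, day_end, timedelta(hours=4))
randomized = random_schedule(day_start, day_end, n=5, seed=1)
```

A real deployment would also need rules for missed prompts and for limiting how often a respondent is interrupted, which this sketch leaves out.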
Self-reported data offer direct assessment of individuals' cognitive and emotional
states, perceptions of events, and information on their behaviors and the contexts in
which they are involved that cannot be captured through other reality mining data
streams. In many cases, the outcomes of interest in medicine and public health,
such as some kinds of symptoms, can be measured only through self-report. By
gathering self-reported data in tandem with other reality mining data streams,
memory errors are reduced and dynamic aspects of health phenomena are more
fully revealed.
1.2 Mapping Social Networks
One of the most important applications of reality mining may be the automatic
mapping of social networks (Eagle and Pentland, 2006). In Figure 3 (a), you see a
smart phone that is programmed to sense and continuously report on its user's
location, who else is nearby, the user's call and SMS patterns, and (with phones
that have accelerometers) how the user is moving. One hundred of these phones
were deployed to students at MIT during the 2004-2005 academic year. Figure 3
(b) shows the pattern of proximity among the participants during one day; even
casual examination shows that the students were part of two separate groups: the
Sloan School and the Media Lab.
Figure 3: Mapping social networks from mobile phone location / proximity data.
3(a) shows a smart phone programmed to sense other people using Bluetooth,
3(b) shows the pattern of proximity between people during one day, and 3(c) shows
that different social relationships are associated with different patterns of
proximity.
Careful analysis of these data shows different patterns of behavior depending upon
the social relationship between people. Figure 3(c) shows the pattern of proximity
during one week, and it can be seen that self-reported reciprocal friends (both
persons report the other as a friend), non-reciprocal friends (only one of a pair
reports the other as a friend), and reciprocal non-friends (neither of a pair reports
the other as a friend) exhibit very different patterns (Eagle, Lazer and Pentland,
2007). By using more sophisticated statistical analysis we can map each
participant's social network of friends and co-workers with an average accuracy of
96% (Dong and Pentland, 2007).
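The three self-report categories defined above follow directly from who names whom. The sketch below covers only that labeling step, with hypothetical participant names and a hypothetical `reports` structure; it does not reproduce the statistical analysis that infers relationships from proximity data.

```python
def classify_dyad(reports, a, b):
    """Label a pair of participants from their self-reported friendships.

    reports maps each participant to the set of people they name as friends.
    """
    a_names_b = b in reports.get(a, set())
    b_names_a = a in reports.get(b, set())
    if a_names_b and b_names_a:
        return "reciprocal friends"
    if a_names_b or b_names_a:
        return "non-reciprocal friends"
    return "reciprocal non-friends"

# Hypothetical survey answers.
reports = {
    "ann": {"bob", "cy"},
    "bob": {"ann"},
    "cy": set(),
}
labels = {
    ("ann", "bob"): classify_dyad(reports, "ann", "bob"),
    ("ann", "cy"): classify_dyad(reports, "ann", "cy"),
    ("bob", "cy"): classify_dyad(reports, "bob", "cy"),
}
```

These labels would serve as the ground truth against which proximity-based predictions are scored.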
Reality mining's capability for automatic social network mapping is now being
used in a variety of research applications. As an example, a current research project
underway at MIT is aimed at understanding health-related behaviors and infectious
disease propagation. At this time, we have more than 80% participation of students in an
MIT dormitory that includes freshmen and upperclassmen, and are beginning to
compare the behavior and health changes that freshmen normally experience with
the changes in their various social networks. This experiment should help to
disentangle causal pathways about how social networks influence obesity and other
health-related behaviors, as well as provide unprecedented detail for modeling the
spread of infectious disease (see http://hd.media.mit.edu/socially_aware.html).
1.3 Beyond Demographics to Behavior Patterns
Most government health services rely on demographic data to guide service
delivery. Demographic characteristics, however, are a relatively poor predictor of
individual behavior, and it is behavior, not wealth, age, or place of residence,
that is the major determinant of health outcomes. Reality mining provides a way to
characterize behavior, and thus provides a classification framework that is more
directly relevant to health outcomes (Pentland, 2008).
The pattern of movement between the places a person lives, eats, works, and hangs
out is known as a behavior pattern. Reality mining research has shown that most
people have only a small repertoire of these behavior patterns, and that this small
set of behavior patterns accounts for the vast majority of an individual's activity
(Pentland, 2007).
The fact that all mobile phones constantly measure their position (either through
GPS or by finding the nearest cell tower) means that we can use reality mining of
mobile phone location data to directly characterize an individual's set of behavior
patterns. We can also cluster together people with similar behavior patterns in order
to discover the independent subgroups within a city.
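One simple way to picture this clustering step is to summarize each person by the share of observations spent at each kind of place, then group people whose profiles are similar. The sketch below is an assumption-laden illustration and not the method used in the cited work: the place labels, the cosine similarity measure, the greedy grouping rule, and the 0.8 threshold are all chosen here for brevity.

```python
from collections import Counter
from math import sqrt

def profile(visits):
    """Turn a list of observed place labels into a share-of-observations dictionary."""
    counts = Counter(visits)
    total = sum(counts.values())
    return {place: n / total for place, n in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse profiles."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def group_by_similarity(profiles, threshold=0.8):
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []
    for name, prof in profiles.items():
        for group in groups:
            if cosine(prof, profiles[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical location observations (one label per sampled time slot).
profiles = {
    "ann": profile(["home"] * 10 + ["lab"] * 8 + ["cafe"] * 2),
    "bob": profile(["home"] * 9 + ["lab"] * 9 + ["cafe"] * 2),
    "cy": profile(["home"] * 10 + ["gym"] * 6 + ["bar"] * 4),
    "dee": profile(["home"] * 9 + ["gym"] * 7 + ["bar"] * 4),
}
groups = group_by_similarity(profiles)
```

With real data, a standard clustering algorithm and profiles that include time of day would likely be more appropriate than this greedy pass.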
Figure 4(a) shows movement patterns with popular hangouts color-coded by the
different subpopulations that populate these destinations, where the subpopulations
are defined by both their demographics and, more importantly, by their behaviors.
Figure 4(b) shows that the mixing between these different behavior subpopulations
is surprisingly small.
Understanding the behavior patterns of different subpopulations and the mixing
between them is critical to the delivery of public health services, because different
subpopulations have different risk profiles and different attitudes about health-related
choices. The use of reality mining to discover these behavior patterns can
potentially provide great improvements in health education efforts and behavioral
interventions.
exposures. Location tracking data generated by cell phones, when coupled with
measurements of ambient air pollution at numerous places in a community
(gathered from existing air quality monitoring stations and/or inferred from vehicle
traffic patterns and locations of industrial facilities), may offer just the kind of
exposure measurement needed. This inexpensive approach would yield dynamic
and temporally and spatially more precise measures of exposure suitable for
studying large samples of individuals. The location tracking data might even permit
differentiating time spent indoors vs. outdoors, through momentary observations in
which an individual's (cell phone's) location can be detected from cell tower data
but not through GPS data. (GPS readings require a line of sight to satellites
overhead and are thus generally not available when indoors.)
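The two ideas in this paragraph, a time-weighted exposure estimate and an indoor/outdoor guess based on which location fixes are available, can be sketched as follows. The zone names, pollution values, field names, and the decision rule itself are illustrative assumptions; a missing GPS fix can have other causes, so the rule is a heuristic at best.

```python
def indoor_outdoor(cell_fix, gps_fix):
    """Guess the setting from which location fixes are available (heuristic)."""
    if gps_fix:
        return "outdoors"
    if cell_fix:
        return "indoors"
    return "unknown"

def time_weighted_exposure(track, pollution):
    """Average pollutant concentration weighted by hours spent in each zone.

    track is a list of (zone, hours) pairs; pollution maps zone to concentration.
    """
    total_hours = sum(hours for _, hours in track)
    if total_hours == 0:
        raise ValueError("track contains no time")
    weighted = sum(pollution[zone] * hours for zone, hours in track)
    return weighted / total_hours

# Hypothetical monitoring-station values (arbitrary concentration units).
pollution = {"downtown": 40.0, "suburb": 10.0}
# Hypothetical track derived from cell phone location data.
track = [("downtown", 2), ("suburb", 6)]
exposure = time_weighted_exposure(track, pollution)
```

Here the person spends 2 of 8 hours in the more polluted zone, so the estimate is (40 x 2 + 10 x 6) / 8 = 17.5.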
2.4 Mental Health
Even though they are quite treatable, mental diseases rank among the top health
problems worldwide in terms of cost to society. Major depression, for instance, is
the leading cause of disability in established market economies (RAND
Corporation, 2004). Reality mining technology might assist in the early detection
of psychiatric disorders such as depression, attention deficit hyperactivity disorder
(ADHD), bipolar disorder, and agoraphobia.
Diagnoses of psychiatric disorders often are based on both subjective states and
observable behaviors. In clinical settings, measurement of these states and
behaviors for diagnostic purposes is based overwhelmingly on patient self-report,
proxy/informant report (e.g., by a teacher or family member), laboratory
performance tasks, and/or clinician assessment of patient behavior in clinical
settings. Each of these approaches provides useful information, but the agreement
between these methods tends to be moderate, which may produce somewhat
unreliable diagnoses. Electroencephalograph (EEG) assessments are quite useful in
detecting the possible presence of some disorders (e.g., ADHD), although currently
they cannot always differentiate diagnoses reliably (i.e., similar EEG patterns can
reflect different disorders).
Many signs and symptoms of these kinds of psychiatric disorders explicitly or
implicitly relate to an individual's physical movement and activity patterns and
communicative behavior, usually with reference to particular temporal periods or
cycles. Data streams from reality mining approaches allow direct, continuous, and
long-term assessment of these patterns and behaviors. Accelerometers in mobile
phones, if carried in a pocket, for instance, might reveal fidgeting, pacing, abrupt or
frenetic motions, and other small physical movements. Location tracking functions
reveal individuals' spatial and geographic ranges, variation in locations visited and
routes taken, and overall extent of physical mobility. The frequency and pattern of
individuals' communications with others and the content and manner of speech
might also reflect key signs of several psychiatric disorders.
The value of these data streams would be multiplied when combined with data
reported in real-time by individuals about their psychological states and the
contexts in which they are involved. Such reports could be through a variety of
response modes on a cell phone and might be triggered by patterns in the location,
movement, or communication data or collected on a fixed (hourly, daily, weekly,
etc.) or random schedule.
Reality mining methods for diagnosis of medical and psychiatric disorders might be
particularly valuable with children and adolescents, who may report their past
emotional, mental, and physical states less reliably and articulately than adults.
Linguistic and cultural differences between patients, their families, and clinicians
also can make diagnosis more difficult, and reality mining approaches might yield
diagnostic data that circumvent these challenges to a large degree.
Figure 6: (a) Voice analysis to extract activity, influence, mimicry, and consistency
measures. (b) As estimates of depression level, there is a correlation of r = 0.79
between these telephone-based measures and the Hamilton Depression Index.
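For readers unfamiliar with the statistic, the sketch below shows how a correlation coefficient of this kind is computed. It assumes the reported r is a Pearson coefficient, which the caption does not state, and it uses made-up numbers rather than the study data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Made-up example values: a phone-derived score and a clinician-rated score.
phone_score = [2.0, 3.5, 5.0, 6.5, 8.0]
clinic_score = [4.0, 9.0, 11.0, 12.0, 19.0]
r = pearson_r(phone_score, clinic_score)
```

Values near 1 indicate that the two measures rise together; values near 0 indicate no linear relationship.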
For a more specific example of the potential power of reality mining technology in
aiding diagnosis, consider the data presented in Figure 6. Researchers have long
known that speech activity can be affected in pathological states such as depression
or mania. Thus, they have used audio features such as fundamental frequency,
amplitude modulation, formant structure, and power distribution to distinguish
between the speech of normal, depressed, and schizophrenic subjects (France et al.,
2000; Stoltzman, 2006). Similarly, movement velocity, range, and frequency have
been shown to correlate with depressed mood (Teicher, 1995). (These results can be
understood in terms of the qualitative brain system assessment illustrated in Figure
2.)
In the past, performing such measurements outside the laboratory was difficult
given the required equipment's size and ambient noise. Today, however, even
common cell phones have the computational power needed to monitor these
correlates of mental state, as illustrated in Figure 6a. We also can use the same
methodology for more sophisticated inferences, such as the quantitative
characterization of social interactions. The ability to use inexpensive, pervasive
computational platforms such as cell phones to monitor these sensitive indicators of
psychological state offers the dramatic possibility of early detection of mental
problems.
2.5 Treatment Monitoring
Once a course of treatment (whether behavioral, pharmaceutical, or otherwise) has
been chosen, it is important for a clinician to monitor the patient's response to
treatment. The same types of reality mining data used for diagnosis would also be
relevant for monitoring patient response to treatment, especially when such data on
the patient are available for a period before diagnosis and can serve as a baseline
for comparison. Even when these data streams are not relevant for diagnosis, they
might be useful in assessing side effects of treatment, such as reduced mobility,
activity, and communicative behavior. Because these data are collected in real-time,
a clinician would be able to adjust treatment according to the patient's response,
perhaps leading to more effective treatment and preventing more costly office visits.
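The baseline comparison described above can be illustrated with a simple standardized difference. This is a sketch under stated assumptions, not a clinical method: the daily activity values are invented, and the z-score rule and threshold of 2.0 are arbitrary choices made for the example.

```python
from statistics import mean, stdev

def deviation_from_baseline(baseline, recent):
    """Difference between recent and baseline means, in baseline standard deviations."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two observations")
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation")
    return (mean(recent) - mean(baseline)) / sigma

def flag_change(baseline, recent, z_threshold=2.0):
    """True when recent behavior departs from baseline by at least z_threshold."""
    return abs(deviation_from_baseline(baseline, recent)) >= z_threshold

# Hypothetical daily activity counts (thousands of steps) before treatment.
baseline = [8, 10, 12, 10, 10]
# Hypothetical counts after treatment begins.
reduced = [5, 6, 4]
steady = [10, 11, 9]
```

In this example `flag_change(baseline, reduced)` is true, which might prompt a clinician to check for reduced mobility as a side effect, while `flag_change(baseline, steady)` is false.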
Continuous monitoring of motor activity, metabolism, and so on can be extremely
effective in tailoring medications to the individual. Currently, doctors prescribe
medications based on population averages rather than individual characteristics,
and they assess patients for the appropriateness of the medication levels only
occasionally and expensively. With such a data-poor system, it is not surprising
that medication doses are frequently over- or underestimated and that unforeseen
drug interactions occur. Going further, correlating a continuous, rich source of
behavioral data to prescription medication use for millions of people could make
drug therapies more effective and help medical professionals detect new drug
interactions more quickly.
sharing of data about individuals. Robust models of collaboration and data sharing
between government, industry, and the academy need to be developed; guarding
both the privacy of consumers and corporations' legitimate competitive
interests is vital here. The use of anonymous data should be enforced and analysis
at the group level should be preferred over that at the individual level.
Thus, we need to adopt policies that encourage the combination of massive
amounts of anonymous data. Aggregate and anonymous location data can produce
enormous benefits for society. Patterns of how people move around can be used for
early identification of infectious disease outbreaks, protection of the environment,
and public safety. They can also help us measure the effectiveness of various
government programs, and improve the transparency and accountability of
government and non-profit organizations. Thus, advances in analysis of network
data must be approached in tandem with understanding how to create value for the
producers and owners of the data, while at the same time protecting the public
good. Clearly, our notions of privacy and ownership of data need to evolve in order
to adapt to these new challenges.
This raises another important question: how do we design institutions to manage
the new types of privacy issues that will emerge? It seems likely that new types of
institutions are required to deal with this information, but what form should they
take? Private companies will also have a key role in this "new deal" for privacy and
ownership. Perhaps market mechanisms can be put in place that allow people to
give up their data for monetary or service rewards.
5. RECOMMENDATIONS
In summary, a computational medical and public health science based on reality
mining is emerging, a science that leverages the capacity to collect and analyze
data with a breadth and depth that was previously inconceivable. The capacity to
collect and analyze massive amounts of data has unambiguously transformed such
fields as biology and physics. The emergence of such a data-driven computational
public health science (CPH science) has been much slower, largely driven by a
few intrepid computer scientists, physicists, and social scientists. If one were to
look at the leading disciplinary journals in health science and related disciplines,
there would be minimal evidence of an emerging CPH science engaged in reality
mining of these new kinds of digital traces. What then are the obstacles that stand
in the way of a computational public health science?
First of all, there are significant infrastructural barriers to the forward movement of
computational health science. The leap from today's static, snapshot-based public
health science to a proactive, dynamic CPH science is considerably larger than,
for example, from biology to a computational biology, in large part due to the scale
and complex ownership of the infrastructure that makes reality mining possible.
The resources available to exploratory research in public health are significantly
smaller, the computational abilities of public health scientists likely generally lag
behind those in the other sciences, and even the physical (and administrative)
distance between medicine and public health departments and engineering or
computer science departments tends to be greater than for the other sciences.
On the tool side, the availability of easy-to-use tools would greatly magnify the
presence of a CPH science. Just as mass-market computer-aided design software
revolutionized the engineering world decades ago, common CPH analysis tools and
the sharing of data will lead to significant advances. The development of these
tools can, in part, piggyback on tools developed in biology and physics, but also
requires substantial investments in applications customized to public health needs.
In addition, many important challenges exist on the data side, primarily around
access and privacy. Properly managing the issues around privacy is essential. As
the recent NRC report on GIS data highlights, it is often possible to pull individual
profiles out of even carefully anonymized data. A single dramatic incident
involving such a breach of privacy could produce a set of statutes, rules, and
prohibitions that could strangle this nascent field in its crib.
What is necessary now is to produce a self-regulatory regime of procedures,
technologies, and rules that reduces this risk but preserves most of the research
potential. As part of this, it is necessary for IRBs to vastly increase their technical
knowledge to understand the potential for intrusions and harm to individuals,
because the possibilities do not fit their current paradigms for harm. In the longer
run, it may be necessary to rethink how IRBs are organized, possibly involving, for
example, audits of the safeguards researchers have instituted. These safeguards, in
turn, may prove a useful model for industry in enabling their internal research.
To be avoided, however, is either the retreat of an emerging computational public
health science into the exclusive domain of private companies, or the development
of a "Dead Sea Scrolls" model, with academic researchers sitting on private data
from which they produce papers that cannot be critiqued or replicated. Neither
scenario will serve the long-term public interest in the agglomeration of knowledge.
Finally, the academic community needs to figure out how to train computational
social scientists. A key requirement for CPH analysis to be successful is the
development of complementary and synergistic explanations spanning different
fields (e.g., epidemiology and network science) and scales (from individuals to
small groups and organizations to entire nations). Certainly, in the short run, CPH
science needs to be the work of teams of public health and computer scientists. In
the longer run, the question will be: should academia be building computational
public health scientists, or teams of computationally literate public health scientists
and public health literate computer scientists?
The emergence of cognitive science in the 1960s and 1970s offers a powerful
model for the development of a computational health science. Cognitive science
emerged out of the power of the computational metaphor of the human mind. It has
involved fields ranging from neurobiology to philosophy to computer science. It
attracted the investment of substantial resources to create a common field, and it
has created enormous progress for the good in the last generation. We would argue
that a computational public health science has a similar potential, and is worthy of
similar investments.
References
Bailenson, J., and Yee, N. 2005. Digital chameleons: Automatic assimilation of
nonverbal gestures in immersive virtual environments. Psychological Science,
16(10): 814-819.
Chartrand, T., and Bargh, J. 1999. The chameleon effect: The perception-behavior
link and social interaction. J. Personality and Social Psychology, 76(6): 893-910.
Christakis, N., and Fowler, J. 2007. The spread of obesity in a large social network
over 32 years. New England Journal of Medicine, 357: 370-379.
Christakis, N., and Fowler, J. 2008. The collective dynamics of smoking in a large
social network. New England Journal of Medicine, 358: 2249-2258.
Dong, W., and Pentland, A. 2007. Modeling influence between experts. Lecture
Notes on AI: Special Volume on Human Computing, 4451: 170-189.
Eagle, N. Aug. 2008. Behavioral inference across cultures: Using telephones as a
cultural lens. IEEE Intelligent Systems.
Eagle, N., and Pentland, A. 2006. Reality mining: Sensing complex social systems.
Personal and Ubiquitous Computing, 10(4): 255-268.
http://hd.media.mit.edu
Eagle, N., Lazer, D., and Pentland, A. 2007. Inferring friendship from proximity,
unpublished manuscript.
France, D., et al. July 2000. Acoustical properties of speech as indicators of
depression and suicidal risk. IEEE Trans. Biomedical Eng., 829-837.
Franks, P., Campbell, T.L., and Shields, C.G. Apr. 1992. Social relationships and
health: The relative roles of family functioning and social support. Social
Science & Medicine, 779-788.
Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., and
Brilliant, L. (in press). Detecting influenza epidemics using search engine query
data. Nature.
Gonzalez, M., Hidalgo, C., and Barabasi, A.L. June 5, 2008. Understanding
individual human mobility patterns. Nature, 453, 779-782.
Jaffee, J., Beebe, B., Feldstein, S., Crown, C., and Jasnow, M. 2001. Rhythms of
dialogue in early infancy. Monographs of the Society for Research in Child
Development, 66(2): 264.
Klapper, D. 2003. Use of a wearable ambulatory monitor in the classification of
movement states in Parkinson's Disease. Master's thesis, Harvard-MIT Health
Sciences and Technology Program.
Kumar, V., and Pentland, A. 2002. DiaBetNet: Learning and predicting blood
glucose results to optimize glycemic control, 4th Ann. Diabetes Technology
Meeting, Atlanta, GA.
www.diabetestechnology.org.
Olguin, D., Gloor, P., and Pentland, A. 2009. Capturing individual and group
behavior with wearable sensors, AAAI Spring Symposium, Stanford, CA.
Pentland, A. 2005. Socially aware computation and communication, IEEE
Computer, 38(3): 33-40.
Pentland, A. 2007. Automatic mapping and modeling of human networks. Physica
A: Statistical Mechanics and Its Applications, 378(1): 59-67.
Pentland, A. 2008. Honest signals: How they shape your world. MIT Press,
Cambridge, MA.
RAND Corporation. 2004.
http://www.rand.org/pubs/research_briefs/RB9055/index1.html
Sense Networks 2008.
http://www.sensenetworks.com
Stoltzman, W. 2006. Toward a social signaling framework: Activity and emphasis
in speech. Master's thesis, MIT EECS.
http://hd.media.mit.edu
Sung, A., Marci, C., and Pentland, A. 2005. Objective physiological and behavioral
measures for tracking depression. Technical Report 595, MIT Media Lab.
http://hd.media.mit.edu.
Teicher, M.H. 1995. Actigraphy and motion analysis: New tools for psychiatry.
Harvard Rev. Psychiatry, (3): 18-35.
Van Noort, S.P., Muehlen, M., Rebelo de Andrade, H., Koppeschaar, C., Lima
Lourenço, J.M., and Gomes, M.G. 2007. Gripenet: An internet-based system to
monitor influenza-like illness uniformly across Europe. Eurosurveillance,
12(7): pii=722.
http://www.eurosurveillance.org/ViewArticle.aspx?ArticleId=722.