Paper 2
Ting-Fang Yen1∗, Yinglian Xie2, Fang Yu2, Roger Peng Yu3, Martín Abadi2†
1 RSA Laboratories
2 Microsoft Research Silicon Valley
3 Microsoft Corporation
[email protected], {yxie, fangyu, rogeryu, [email protected]}
Abstract

Many web services aim to track clients as a basis for analyzing their behavior and providing personalized services. Despite much debate regarding the collection of client information, there have been few quantitative studies that analyze the effectiveness of host-tracking and the associated privacy risks.

In this paper, we perform a large-scale study to quantify the amount of information revealed by common host identifiers. We analyze month-long anonymized datasets collected by the Hotmail web-mail service and the Bing search engine, which include millions of hosts across the global IP address space. In this setting, we compare the use of multiple identifiers, including browser information, IP addresses, cookies, and user login IDs.

We further demonstrate the privacy and security implications of host-tracking in two contexts. In the first, we study the causes of cookie churn in web services, and show that many returning users can still be tracked even if they clear cookies or utilize private browsing. In the second, we show that host-tracking can be leveraged to improve security. Specifically, by aggregating information across hosts, we uncover a stealthy malicious attack associated with over 75,000 bot accounts that forward cookies to distributed locations.

∗ This work was done while Ting-Fang was an intern at Microsoft Research.
† Martín Abadi is also affiliated with the University of California, Santa Cruz.

1 Introduction

It is in the interest of web services and ISPs to track the mobility and usage patterns of client hosts. This tracking allows them to understand user behavior for supporting applications such as product suggestions, targeted advertising, and online fraud detection. However, clients may not wish that their activities be tracked, and can intentionally remove stored browser cookies or choose not to perform user logins. The growing awareness of privacy concerns is exemplified by the recent “do-not-track” initiative from the Federal Trade Commission [15], which outlines guidelines to which service providers must adhere in the collection and distribution of client information.

Several works aim to improve the accuracy of host-tracking by collecting detailed host information, such as installed browser plug-ins and system fonts [20, 31] or packet-level information that reveals subtle hardware differences [28]. By comparison, few studies exist on the effectiveness and privacy implications of host-tracking. Previous work tends to be qualitative in nature [29, 30] or limited to a single identifier [20].

In this paper, we attempt to facilitate the debate regarding host-tracking by performing a large-scale study to quantify the amount of identifying information revealed by common identifiers. Such analysis is critical to both service providers and end users. For example, service providers can determine where existing identifiers are insufficient and more sophisticated methods may be preferred. Users who do not wish to be tracked can learn the circumstances in which they can be identified accurately, so that they can take effective measures to protect privacy. Our analysis is based on month-long anonymized datasets from the Hotmail web-mail service and the Bing search engine, including hundreds of millions of users across the global Internet IP address space. By characterizing hosts’ activities across time using “binding windows”, we show that common identifiers allow us to track hosts with high accuracy.

We further consider cases where users take initiatives to preserve privacy, e.g., by clearing cookies or switching to private browsing mode. Specifically, we analyze “one-time” cookies that do not return again in subsequent web requests, a phenomenon known as cookie churn. These cookies appear to be anonymous. However, by applying our host-tracking results, we show that a surprisingly large fraction can be recognized as belonging to returning users.

In addition to its privacy implications, we demonstrate that host-tracking can also be applied to improve security. We examine the mobility patterns of hosts traveling across multiple IP ranges, and establish normal user mobility profiles from aggregate host activities. In doing so, we are able to analyze unusual activities, e.g., the use of anonymous routing networks, and develop methods to detect attacks. In particular, our study uncovers previously unknown suspicious cookie-forwarding activities, which may have been adopted by attackers to evade spamming detection.

The key findings of this paper include:

• We show that 60%-70% of HTTP user-agent strings can accurately identify hosts in our datasets. When augmented with coarse-grained IP prefix information, the accuracy can be improved to 80%, similar to that obtained with cookies. User-agent strings combined with IP addresses have an entropy of 20.29 bits, higher than that of browser plug-ins, screen resolution, timezone, and system fonts combined [20].

• Applying our results to study cookie churn, we find that a service provider can recognize and track 88% of the “one-time” cookies as corresponding to users who later returned to the service. Among these users, 33% made an effort to preserve their privacy, either by clearing cookies through browser options or utilizing private browsing mode.

• Employing general mobility patterns derived by tracking hosts across network domains, we uncover malicious behaviors where cookies are forwarded from one IP address to distributed locations. In total, we identify over 75,000 bot Hotmail accounts in this relatively stealthy attack that has not been detected before.

Although our research relies on anonymized datasets from Hotmail and Bing, the analyses that we describe are a research effort only. Our goal is not to identify or study specific individual activities, but rather to understand the patterns of the aggregated activities and to explore their implications.

In the following, we first describe the identifiers that we study and our host-tracking methodology in Section 2, and present the evaluation of those identifiers in Section 3. We investigate the privacy and security implications of host-tracking in the context of cookie churn in Section 4 and of host mobility in Section 5. Finally, we describe related work in Section 6 and conclude in Section 7.

2 Exploring Common Identifiers

Given a log of application-level events collected over time, such as requests directed to a web server or user logins to a service, our goal is to quantify the amount of host-identifying information that is captured in identifiers within the log. Specifically, for an identifier I, which may take on a finite set F_I = {f_1, f_2, ..., f_n} of possible values (called fingerprints), we are interested in whether a fingerprint f_i uniquely corresponds to a single host, among all hosts involved in the log. As we consider only client hosts in our scenario, we use clients or hosts interchangeably throughout the paper.

We assume the perspective of a passive observer of identifiers within application-level events. The common identifiers explored in this work include 1) user-agent string (UA), 2) IP address, 3) browser cookie, and 4) user login ID. We choose these identifiers because they are not particular to our datasets, and are available in a wide variety of service logs.

2.1 Host-tracking Graph

Our host-tracking approach attempts to infer the presence of a host at an IP address during a certain time interval. Upon observing a fingerprint f (and only f) that appears at an IP address A over a time interval ∆t, we can infer a “binding window” for f. Events occurring within ∆t at A can then be attributed to the host corresponding to f. (Hosts behind NATs/proxies can complicate matters; we quantify the occurrence of such hosts in our data in Section 3.3.)

Figure 1 illustrates how we infer the binding windows. In this example, user-agent strings (UA) are the identifiers, and the events are queries to a web search engine. A fingerprint UA1 appears in two consecutive
Figure 1. Binding windows identified on one IP.
Figure 2. Example of a host-tracking graph. Bars with different patterns denote binding windows corresponding to different fingerprints.
search queries at time t1 and t2, followed by queries at time t3, t4, and t5 with a different fingerprint UA2. Thus we can identify binding windows corresponding to two different “hosts” on this IP: one spanning the time range [t1, t2], and another spanning [t3, t5]. Having examined all search query events, we can construct a host-tracking graph as in Figure 2. Note that a fingerprint may be associated with multiple binding windows (since the host may not be up all the time) and across different IP addresses (e.g., because of DHCP). We refer to the host-tracking graph that represents hosts by identifier I as G_I.

A similar concept of host-tracking graph was also used by HostTracker [38] to support Internet accountability. HostTracker groups together user login IDs that are likely to be associated with the same host, e.g., family members that share a computer at home. It also filters events related to bots and large proxies. In contrast to this previous work, we make a broader use of the host-tracking graph (with a variety of common identifiers), and we apply host-tracking to the cookie-churn study (in Section 4) and the host-mobility analysis (in Section 5).

2.2 Datasets

The data for our study includes a month-long user login trace collected by the Hotmail web-mail service in August 2010. The trace contains coarse-grained information about the OS and browser type (e.g., Windows, Mozilla), the IP address from which the login was made, the time of the login event, and the anonymized user ID. In the following, we refer to this as the Webmail dataset.

We also obtained a month-long dataset consisting of search query events directed to the Bing search engine in August 2010. This data includes the fine-grained user-agent string from the HTTP header (anonymized via hashing), the IP address from which the query was issued, the time of the query, the anonymized cookie ID assigned by the search engine, and the date that the cookie ID was created. Specifically, the anonymized cookie ID is a persistent identifier that does not change over time, if users do not clear cookies or use private browsing. We refer to this as the Search dataset. As part of the processing performed by the Bing search engine, events generated by known bots are filtered in advance.

To validate our client-tracking approach, we leveraged a month-long sampled log of Windows Update events, also from August 2010. This data contains the time at which the update was performed, the IP address, and the anonymized hardware ID that is unique to the host. This is the Validation dataset.

Table 1 shows the fields and the total number of unique IPs observed in each dataset. All three datasets include tens to hundreds of millions of IP addresses, spanning a large IP address space.

The published privacy policies for Hotmail, Bing, and Windows Update address the storage, use, sharing, and retention of data collected in the course of the operation of these services. In particular, they indicate that Microsoft may employ this data for analyzing trends and for operating and improving its products and services, as we aim to do with this work. Since the datasets are sensitive, they are not publicly available for further research.

2.3 Validation and Metrics

Without ground truth for the host-IP mappings, we evaluate a host-tracking graph G_I by overlapping it with the Validation dataset. If a fingerprint is able to correctly
Dataset      User-agent information    IP address   Timestamp   ID            Unique IP addresses
Webmail      OS and browser type       Yes          Yes         User ID       308 million
Search       User-agent string (UA)    Yes          Yes         Cookie ID     131 million
Validation   N/A                       Yes          Yes         Hardware ID   74 million
track a host, its bindings should overlap only with Windows Update events associated with a single hardware ID. Conversely, a hardware ID is also expected to overlap with bindings associated with only one fingerprint.

We quantify the accuracy of an identifier using precision and recall. Let hidcount(f) denote the number of hardware IDs to which a fingerprint f corresponds, and fpcount(m) the number of fingerprints to which a hardware ID m corresponds. Precision is defined as the percentage of fingerprints that correspond to one host (i.e., one hardware ID), while recall is the percentage of hosts that correspond to one fingerprint.

\[ \mathrm{Precision}_I = \frac{|\{f : \mathrm{hidcount}(f) = 1,\ f \in F_I\}|}{|F_I|} \]

\[ \mathrm{Recall}_I = \frac{|\{m : \mathrm{fpcount}(m) = 1,\ m \in M_I\}|}{|M_I|} \]

F_I is the finite set of values that identifier I takes in our dataset, i.e., the fingerprints (after some initial filtering, as described below). M_I is the set of hardware IDs that overlap with the host-tracking graph G_I. Roughly speaking, precision quantifies how accurate an identifier is at representing a host. Recall quantifies how well an identifier is able to track the events associated with the corresponding host in a log.

We also measure the entropy of an identifier, H_I, which is the amount of information identifier I contains that can distinguish hosts. The entropy is defined as

\[ H_I = -\sum_{f \in F_I} \Pr(f) \log_2 \Pr(f) \]

where Pr(f) is the probability of observing fingerprint f in the application log. A higher entropy indicates a smaller probability that any two clients are associated with the same fingerprint.

In our validation, we consider only those fingerprints that overlap with more than one Windows Update event, and only those hardware IDs that overlap with more than one application-level event pertaining to our identifiers. These restrictions allow us to focus on the portion of data that we can validate, though they can be biased to those clients that access the services consistently (i.e., multiple times and with the same identifiers). Similarly, because of the datasets available to us, our study is based on clients of Microsoft services. We acknowledge that any dataset will be incomplete and possibly biased.

3 Client-Tracking Results

In this section, we construct host-tracking graphs using the common identifiers user-agent string (UA), IP address, cookie ID, and user login ID, and evaluate their precision and recall. In particular, we explore the distinguishing power of UA by examining the browser anonymity sets. We also measure the impact of proxies and NATs in our study in Section 3.3, and describe the increased accuracy and confidence of tracking stable hosts in Appendix A.

Our analysis focuses on host-tracking within each network domain, derived using the BGP prefix entries obtained from RouteViews [9]. We investigate the occurrences of identifiers at multiple network locations in Section 5, in which we also study the security implications of host-tracking.

3.1 Precision and Recall

Table 2 presents our results on host-tracking. After overlapping the Validation dataset with the host-tracking graphs, the number of unique fingerprints and hardware IDs included in our evaluation is still large, on the order of millions.

Several observations are evident from Table 2. First, browser information (UA) alone can identify hosts quite well. Its 62.01% precision is perhaps surprising, as UA strings are commonly regarded as providing insufficient information to reveal host identities. Second, a combination of UA with the IP address (i.e., fingerprinting hosts by distinct (UA, IP) pairs) can boost the precision up to 80.62%. In fact, combining UA with only the IP prefix is sufficient to achieve approximately the same result as with UA+IP. This suggests that anonymization techniques that store the IP prefix may still retain distinguishing information. Third, cookie IDs offer only slightly better precision than UA+IP (82.35% versus 80.62%), with essentially the same recall (68.64% versus 68.84%). The inaccuracies of cookie IDs can be partly attributed to cookie churn, a phenomenon we study in more detail in Section 4.
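The binding-window inference of Section 2.1, on which the host-tracking graphs evaluated here are built, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the event tuple layout and the name `binding_windows` are hypothetical, and a window is taken to be a maximal run of consecutive events on one IP address that carry the same fingerprint. Any gap thresholds or NAT/proxy filtering applied in the study are not modeled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layouts, chosen only for this illustration.
Event = Tuple[str, float, str]     # (ip_address, timestamp, fingerprint)
Window = Tuple[str, float, float]  # (fingerprint, window_start, window_end)


def binding_windows(events: Iterable[Event]) -> Dict[str, List[Window]]:
    """Group events by IP and emit one window per run of identical fingerprints."""
    by_ip = defaultdict(list)
    for ip, ts, fp in events:
        by_ip[ip].append((ts, fp))

    windows: Dict[str, List[Window]] = {}
    for ip, seq in by_ip.items():
        seq.sort()  # order the events on this IP by timestamp
        runs: List[Window] = []
        first_ts, cur_fp = seq[0]
        start = end = first_ts
        for ts, fp in seq[1:]:
            if fp == cur_fp:
                end = ts  # same fingerprint: extend the current window
            else:
                runs.append((cur_fp, start, end))  # fingerprint changed: close it
                cur_fp, start, end = fp, ts, ts
        runs.append((cur_fp, start, end))
        windows[ip] = runs
    return windows


# The Figure 1 scenario: UA1 at t1, t2 and UA2 at t3, t4, t5 on a single IP.
figure1 = [("A", 1, "UA1"), ("A", 2, "UA1"),
           ("A", 3, "UA2"), ("A", 4, "UA2"), ("A", 5, "UA2")]
print(binding_windows(figure1))
```

On the Figure 1 scenario the sketch yields two windows on the IP, one spanning [t1, t2] for UA1 and one spanning [t3, t5] for UA2. Interleaved fingerprints on one IP (as behind a NAT) would produce many short windows here, which is one reason a real pipeline would need additional filtering.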
Identifier I                    Precision (%)   Recall (%)   Fingerprint count   Hardware ID count
UA                              62.01%          72.11%       254,762             3,073,690
UA, IP address                  80.62%          68.84%       1,685,416           1,771,907
UA, /24 IP prefix               79.33%          69.43%       1,652,546           1,772,104
Cookie ID                       82.35%          68.64%       1,340,635           1,375,074
Cookie ID (with HostTracker)    79.74%          99.13%       713,110             1,001,450
User ID (with HostTracker)      92.82%          93.51%       4,608,980           4,820,116
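The precision, recall, and entropy definitions of Section 2.3, which produce the kind of figures tabulated above, can be illustrated with a short Python sketch. The input format is an assumption made for illustration: `overlaps` is a list of (fingerprint, hardware ID) pairs, one per observed overlap between a binding window and a Windows Update event, and entropy is estimated from the empirical frequency of each fingerprint in a log. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def precision_recall(overlaps):
    """Return (precision, recall) as fractions from (fingerprint, hardware_id) pairs."""
    hids_of = defaultdict(set)  # fingerprint -> hardware IDs it overlaps with
    fps_of = defaultdict(set)   # hardware ID -> fingerprints it overlaps with
    for fingerprint, hardware_id in overlaps:
        hids_of[fingerprint].add(hardware_id)
        fps_of[hardware_id].add(fingerprint)
    # Precision: share of fingerprints with hidcount(f) = 1.
    precision = sum(1 for s in hids_of.values() if len(s) == 1) / len(hids_of)
    # Recall: share of hardware IDs with fpcount(m) = 1.
    recall = sum(1 for s in fps_of.values() if len(s) == 1) / len(fps_of)
    return precision, recall


def entropy_bits(observed_fingerprints):
    """Estimate H_I in bits, using empirical frequencies as Pr(f)."""
    counts = Counter(observed_fingerprints)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


pairs = [("f1", "m1"), ("f2", "m2"), ("f2", "m3"), ("f3", "m3")]
print(precision_recall(pairs))
print(entropy_bits(["a", "a", "b", "c"]))
```

In the toy input, fingerprint f2 maps to two hardware IDs and hardware ID m3 maps to two fingerprints, so two of three fingerprints and two of three hardware IDs count as unambiguous. Both functions assume non-empty input.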
[Figure 4 plot omitted. x-axis: Date of Return (8/1/2010 to 8/30/2010); y-axis: Percentage (%).]
Figure 4. For cookie IDs observed on the first day of the month, the cumulative distribution of the date that old and new cookies appear again in our dataset.
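The cumulative distribution summarized in Figure 4 can be computed along the following lines. This is a hedged sketch rather than the study's code: the mapping `seen_dates` (cookie ID to the set of dates on which it was observed) and the function name `cumulative_return` are hypothetical. To separate old from new cookies, one would call the function once per group, for example splitting on whether the cookie's creation date equals the first day, which is an assumption about how the two groups are defined.

```python
from datetime import date, timedelta


def cumulative_return(seen_dates, first_day, last_day):
    """For cookies seen on first_day, map each later day to the percentage
    of those cookies that have reappeared on or before that day."""
    cohort = {cid: days for cid, days in seen_dates.items() if first_day in days}
    first_returns = []
    for days in cohort.values():
        later = [d for d in days if d > first_day]
        if later:
            first_returns.append(min(later))  # date of the first reappearance

    cdf = {}
    day = first_day + timedelta(days=1)
    while day <= last_day:
        returned = sum(1 for d in first_returns if d <= day)
        cdf[day] = 100.0 * returned / len(cohort)
        day += timedelta(days=1)
    return cdf


d1, d2, d3 = date(2010, 8, 1), date(2010, 8, 2), date(2010, 8, 3)
toy = {"a": {d1, d3}, "b": {d1}, "c": {d2}}
print(cumulative_return(toy, d1, d3))
```

Cookies that never reappear keep the curve below 100%, which is how churn shows up in such a plot. The sketch assumes at least one cookie was seen on the first day.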
four of the most popular browsers in use today: Firefox (version 3.6.11), Safari (version 5.0.2), Chrome (version 7.0.517.41), and Internet Explorer (version 8.0). Table 4 shows, for the browsing mode under which a cookie is set (the first column), whether the same cookie can be accessed under another browsing mode (the second column). In all cases, a cookie set in private mode can be accessed repeatedly in the same private browsing session, but not across different private browsing sessions. No cookies set in private mode can be accessed in public mode. Safari is the only browser that allows private mode to access cookies set in public mode.

In the next subsection, we perform fine-grained classification to quantify the above possible causes of cookie churn and characterize the corresponding users.

4.3 Understanding Cookie Churn

Applying the host-tracking results, we analyze cookie churn by identifying cookies that are associated with the same client host. In Section 3.1, we show that the host-tracking graph G_UID derived from user login IDs (with HostTracker) achieved over 92% precision and recall in tracking clients, which are represented by hardware IDs from the Validation dataset. Thus we use the hosts defined in G_UID to serve as ground truth for studying cookie churn. By overlapping G_UID with the Search dataset, we consider cookies whose query events fall into binding windows associated with the same host as corresponding to the same user (since user activity roughly approximates host activity).

We focus on studying new cookie churn, as it is more significant than that of old cookies (see Figure 4). We refer to the set of “one-time” cookie IDs (CIDs) that are born on the first day but do not return again in our dataset as the churned new cookie IDs. In total, there are 437,914 users (or hosts) that overlap with 847,196 churned new CIDs in the Search data. The number of hosts is only about half of the number of churned cookie IDs. We investigate the four cases that result in new cookie churn, as illustrated in Figure 5, where the breakdown of users belonging to each category is shown in Table 5. We elaborate on each of these cases separately below.

4.3.1 Case 1: Non-Returning Users

If a CID overlaps with one of host h’s binding windows at time t, but no other CIDs overlap h’s bindings from time t onwards, we consider this as corresponding to a user who does not return to the service (Figure 5(a)).
                                      Case 1    Case 2    Case 3    Case 4
Number of users                       101,427   77,120    67,310    192,057
Percentage of users (%)               23.16%    17.61%    15.37%    43.86%
Number of churned new CIDs            101,427   77,147    123,757   544,865
Percentage of churned new CIDs (%)    11.97%    9.12%     14.60%    64.31%
Table 5. Breakdown of the churned new cookie IDs into four categories of users.
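The Case 1 rule (a churned CID overlaps host h's bindings at time t, and no other CID overlaps them from t onwards) can be expressed as a small predicate. The sketch below makes two assumptions that are not stated in the text: a host's cookie observations are given as a list of (timestamp, CID) pairs that already fall inside its binding windows, and t is taken as the first time the churned CID is seen. The function name is hypothetical.

```python
def is_non_returning(host_events, churned_cid):
    """Return True if no other CID is seen for this host at or after the
    first appearance of churned_cid (the Case 1 condition).

    host_events: list of (timestamp, cid) pairs attributed to one host.
    """
    times = [ts for ts, cid in host_events if cid == churned_cid]
    if not times:
        raise ValueError("churned_cid does not appear in host_events")
    t = min(times)  # assumption: t is the first observation of the churned CID
    return not any(cid != churned_cid and ts >= t for ts, cid in host_events)


# An older cookie seen before C1 does not count as a return.
print(is_non_returning([(1, "C0"), (5, "C1")], "C1"))
# A different cookie seen after C1 means the user came back.
print(is_non_returning([(5, "C1"), (9, "C2")], "C1"))
```

A churned CID that fails this predicate falls into one of the remaining cases, which need further evidence (such as the user-agent strings involved) to tell apart.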
Figure 5. Four cases of cookie churn. C1 is the churned new cookie ID. Horizontal bars denote binding windows for a “host” defined by user IDs. Recovered panel titles: (c) Case 3: Private browsing mode (one UA); Case 4: Multiple browsers (multiple UAs).
Figure 6. Distributions of query intervals. Legend: between same CIDs; across different CIDs.

We find that this case accounts for only 11.97% of the churned new CIDs. Thus, despite the high cookie churn rate, the majority (88.03%) of the churned new cookie IDs correspond to returning users who might still be tracked. The behaviors of the non-returning users are examined in detail in Appendix B.

4.3.2 Case 2: Users that Clear Cookies

with different CIDs. Figure 6 shows that the former is distinctly smaller, with 75% of them below 10 minutes and hence likely to belong to one session. By contrast, 90% of the query intervals between different CIDs are larger than 8 hours. This suggests that most users clear cookies per session, e.g., when they close the browser window.

We also find a small fraction (3.85%) of users whose cookies are cleared per query, i.e., each of their queries is associated with a different CID. These might be users who take extreme measures to clear cookies for each query to preserve privacy. However, such patterns can become a distinctive feature that makes tracking easier, despite the user’s intention of remaining anonymous.
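The interval comparison behind Figure 6 and the per-query clearing pattern can be sketched as follows. The opening of the Case 2 discussion is not available here, so the reading that the two interval groups are "between consecutive queries with the same CID" and "between consecutive queries with different CIDs" is an assumption based on the Figure 6 legend. The input layout and function names are likewise hypothetical.

```python
def split_intervals(queries):
    """Split gaps between consecutive queries of one host into two lists:
    gaps where the CID stayed the same and gaps where the CID changed.

    queries: list of (timestamp_seconds, cid) pairs for one host.
    """
    ordered = sorted(queries)
    same_cid, different_cid = [], []
    for (t0, c0), (t1, c1) in zip(ordered, ordered[1:]):
        (same_cid if c0 == c1 else different_cid).append(t1 - t0)
    return same_cid, different_cid


def clears_per_query(queries):
    """True if the host issued several queries and each used a distinct CID."""
    cids = [cid for _, cid in queries]
    return len(cids) > 1 and len(set(cids)) == len(cids)


session_user = [(0, "a"), (300, "a"), (40000, "b"), (40060, "b")]
print(split_intervals(session_user))
print(clears_per_query([(0, "a"), (10, "b"), (20, "c")]))
```

For the first toy user, the same-CID gaps are a few minutes while the single cross-CID gap is over eleven hours, the per-session pattern described above. The second user changes CID on every query, which is the distinctive per-query pattern.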
cality pattern can also be observed among cookies that travel across countries. Figure 7 shows the topology of
suspicious behavior.
5.2 Identifying Virtual Client Travel
Table 8. Top ASes that are exclusively sinks in the abnormal events.

AS Pair            # Cookies   Affiliations
AS 766, 34285      308         RedIRIS AS (EU), SANDETEL (ES)
AS 30736, 25761    235         Easyspeedy Net. (DK), Staminus Comm. (US)
AS 30736, 40430    201         Colo4jax (US)
AS 30736, 1421     198         WANSecurity (US)
AS 30736, 14141    192         WireSix (US)
AS 30736, 29761    192         OC3 Net. & Web Solutions (US)
AS 30736, 19318    188         New Jersey Intl. Internet Exchange (US)
Table 9. Top AS pairs related to abnormal events.

Combining these two observations, we find that the dominant sinks in Table 8 significantly overlap with the sink ASes in Table 9. They share the common source AS 30736, located in Denmark. Upon examination, we find that there is a single IP address generating login events for a large number of users, who then submit subsequent requests from multiple ASes in the U.S., violating the geo-locality travel pattern observed in Figure 7 as well.

We find that the user login IDs associated with this particular source IP address contain more suspicious patterns. In particular, they are groups of bot-user accounts all registered on the same day in November 2010, with the same user age, location information (country, state), and scripted naming patterns. Among the top five dominant sink ASes, four are used by these bot groups to submit requests.

By examining all the sink ASes with source AS 30736 in these events, we find a total of 9 bot-user groups, corresponding to 9 sink ASes geographically distributed over the U.S. The activities between some of these ASes are subtle, and would not have been detected without leveraging the normal host mobility patterns described in Section 5.1.

5.3.2 Cookie-Forwarding Bot Users

Table 10 lists the statistics for the 9 detected bot-user groups. Each of these groups includes around 190 users. A different /24 subnet is associated with each user group that submits requests without explicit login activities from the same subnet. For each /24, the sink IP rotates among 10 to 14 addresses.

From a more recent user login dataset collected by Hotmail in January 2011, we find over 75,000 email accounts associated with the suspicious source IP address in Denmark, all exhibiting similar patterns to the 9 groups we discovered. Manual investigation by Hotmail shows that these accounts were used by attackers for the purpose of receiving and testing spam. After these accounts are logged into from one machine (i.e., one IP address), their cookies are forwarded to multiple locations so that further requests can be submitted in a distributed fashion during the validity period of the cookies, which is 24 hours in our case.

There are at least two possible explanations for such malicious cookie-forwarding activities. First, some web-mail providers identify an account as suspicious if it performs logins from multiple geographic locations within a short time interval. By forwarding cookies to other locations through a private communication channel, attackers can successfully offload the requests to distributed hosts without them performing explicit user logins, hence reducing the likelihood of detection. Second, as a preparation step in launching session-hijacking attacks on real user accounts (e.g., [6]), attackers may be testing the effectiveness of forwarding cookies via stealthy communication channels.

Although the user accounts we identified were all newly created, it is possible that attackers can employ hijacked cookies stolen from actual users and forward them to botnet hosts in the future. Understanding normal host mobility patterns can help detect such stealthy attacks.

6 Related Work

Many efforts on tracking hosts focus on identifying specific hardware characteristics, such as radio frequency [23, 34, 18] or driver [21]. Identifiers such as network names or the IP addresses of frequently accessed services also enable host fingerprinting [32]. However, these approaches require the observer to be in close physical proximity to the target host.

Remote host fingerprinting can leverage packet-level information to identify the differences in software systems [2, 4, 5] or hardware devices [28]. Other works on tracking web clients require probing hosts’ system configurations [20] or the installation order of browser plug-ins [31]. Persistent browser cookies [3, 36] have also been proposed; these systems store several copies of a cookie in different locations and formats, so that they cannot be removed by standard methods.

Compared with these efforts, our work focuses on studying the effectiveness and implications of tracking hosts using existing identifiers, without requiring new information or probes. Although the issue of privacy leakage has been repeatedly discussed, e.g., personally identifiable information in online social networks [29, 30], there has been limited study using large-scale datasets. Our work uses month-long datasets from a large search engine and a popular email provider to quantify the amount of host-identifying information revealed by a variety of common identifiers. To the best of our knowledge, we are also the first to demonstrate applications of host tracking to analyze cookie churn in web services and to detect suspicious cookie-forwarding activities.

Apart from its privacy implications, understanding cookie churn is an important topic for estimating web user population and personalization. Previous studies mostly rely on user surveys or active user participation (e.g., by installing software on user machines) [12, 11, 16, 14]. Their findings show that 30% to 40% of users clear cookies monthly. A separate study by Yahoo! [13] finds that 40% to 60% of users have empty browser caches, so they probably have cleared cookies as well. While our results are consistent with previous findings, the approach we take requires neither user cooperation nor special content setup.

Host mobility studies have been performed in the context of wireless [17, 27, 22, 25], ad hoc [24, 26], and cellular networks [19] to obtain more accurate device moving models or to predict user locations. Simler et al. [35] studied user mobility in terms of session characteristics based on login events to a university email server in order to generate synthetic traces. Recent work [33] proposed a technique for classifying IP addresses into home and travel categories to study host travel and relocation patterns in the U.S. By studying cross-domain cookies, our work focuses on normal host mobility patterns that enable us to observe uncommon phenomena and detect malicious activities.

7 Discussion and Conclusion

In this paper, we perform a large-scale exploration of common identifiers and quantify the amount of host-identifying information that they reveal. Using month-long datasets from Hotmail and Bing, we show that common identifiers can help track hosts with high precision and recall.

Our study also informs service providers of the potential information leakage when they anonymize datasets (e.g., replacing IP addresses with IP prefixes) and release data to third-party collaborators or to the public. For example, we show that hashes of browser information (i.e., the anonymized UA strings) alone can be quite revealing when examined in one network domain. Furthermore, coarse-grained IP prefixes achieve similar host-tracking accuracy to that of precise IP address information when they are combined with hashed UA strings.

Our analysis suggests that users who do not wish to be tracked should do much more than clear cookies. Uncommon behaviors such as clearing cookies for each request may instead distinguish a host from others who do not do so. Users should take notice of their user-agent strings (e.g., modify the default setting [10]), consider the use of proxies, and possibly resort to sophisticated techniques such as anonymous routing [37]. In some cases, several of these techniques should be combined to be effective, e.g., clearing cookies in addition to the use of proxies or Tor.

Finally, despite its privacy implications, we demonstrate the security benefit of host-tracking. Given the
growing concerns over account hijacking and session hijacking, we expect host fingerprinting and tracking techniques can help defend against such attacks in the future.

Acknowledgments

We are grateful to Hotmail, Bing, and Windows Update for providing us with data access that makes this study possible. We thank Zijian Zheng for his guidance and insight on cookie-churn analysis. We thank Keiji Oenoki and Hersh Dangayach for providing us with data related to cookie-forwarding attacks and for the help in the subsequent investigation. We thank the reviewers, and in particular Paul Syverson, for their suggestions of improvements to this paper.

References

[1] CookieCooker. http://www.cookiecooker.de/.
[2] Nmap free security scanner. http://nmap.org.
[3] Project details for evercookie. http://samy.pl/evercookie/.
[4] Project details for p0f. http://lcamtuf.coredump.cx/p0f.shtml.
[5] Project details for xprobe. http://sourceforge.net/projects/xprobe/.
[6] Secure your PC and website from Firesheep session hijacking. http://www.pcworld.com/businesscenter/article/210028/secure_your_pc_and_website_from_firesheep_session_hijacking.html.
[7] Tor Project: Torbutton. https://www.torproject.org/torbutton/.
[8] Tor Proxy List. http://proxy.org/tor.shtml.
[9] U. Oregon Route Views Project. http://www.routeviews.org/.
[10] User-agent switcher. https://addons.mozilla.org/en-US/firefox/addon/59/?id=59.
[11] 40% of consumers zap cookies weekly. http://www.marketingsherpa.com/!newsletters/bestofweekly-4-22-04.htm#topic1, 2004.
[12] Measuring unique visitors: Addressing the dramatic decline in accuracy of cookie-based measurement. White paper, Jupiter Research, 2005.
[13] Yahoo! user interface blog: Performance research, part 2: Browser cache usage exposed! http://yuiblog.com/blog/2007/01/04/performance-research-part-2/, 2007.
[14] Cookie corrected audience data. White paper, Quantcast Corp., 2008.
[15] Protecting consumer privacy in an era of rapid change. Federal Trade Commission Staff Report, 2010.
[16] M. Abraham, C. Meierhoefer, and A. Lipsman. The impact of cookie deletion on the accuracy of site-server and ad-server metrics: an empirical comScore study. White paper, comScore, Inc., 2007.
[17] M. Balazinska and P. Castro. Characterizing mobility and network usage in a corporate wireless local-area network. In Intl. Conf. Mobile Systems, Applications, Services, 2003.
[18] V. Brik, S. Banerjee, M. Gruteser, and S. Oh. Wireless device identification with radiometric signatures. In Intl. Conf. Mobile Computing and Networking, 2006.
[19] I. Constandache, S. Gaonkar, M. Sayler, R. Choudhury, and L. Cox. Energy-efficient localization via personal mobility profiling. In Intl. Conf. Mobile Computing, Applications, and Services, 2009.
[20] P. Eckersley. How unique is your web browser? In Privacy Enhancing Technologies Symp., 2010.
[21] J. Franklin, D. McCoy, P. Tabriz, V. Neagoe, J. V. Randwyk, and D. Sicker. Passive data link layer 802.11 wireless device driver fingerprinting. In USENIX Security Symp., 2006.
[22] J. Ghosh, M. Beal, H. Ngo, and C. Qiao. On profiling mobility and predicting locations of wireless users. In Intl. Workshop on Multi-hop ad hoc networks, 2006.
[23] J. Hall, M. Barbeau, and E. Kranakis. Detection of transient in radio frequency fingerprinting using signal phase. In Intl. Conf. Wireless and Optical Communications, 2003.
[24] X. Hong, M. Gerla, G. Pei, and C. Chiang. A group mobility model for ad hoc wireless networks. In ACM Intl. Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 1999.
[25] N. Husted and S. Myers. Mobile location tracking in metro areas: Malnets and others. In ACM Conf. Computer and Communication Security, 2010.
[26] A. Jardosh, E. Belding-Royer, K. Almeroth, and S. Suri. Towards realistic mobility models for mobile ad hoc networks. In Intl. Conf. Mobile Computing and Networking, 2003.
[27] M. Kim, D. Kotz, and S. Kim. Extracting a mobility model from real user traces. In IEEE Infocom, 2006.
[28] T. Kohno, A. Broido, and K. Claffy. Remote physical device fingerprinting. In IEEE Symp. Security and Privacy, 2005.
[29] B. Krishnamurthy and C. E. Wills. Characterizing privacy in online social networks. In ACM Workshop on Online Social Networks, 2008.
[30] B. Krishnamurthy and C. E. Wills. Privacy leakage in mobile online social networks. In USENIX Conf. Online Social Networks, 2010.
[31] J. R. Mayer. “Any person... a pamphleteer”: Internet anonymity in the age of Web 2.0. Senior Thesis, Stanford University, 2009.
[32] J. Pang, B. Greenstein, R. Gummadi, S. Seshan, and D. Wetherall. 802.11 user fingerprinting. In Intl. Conf. Mobile Computing and Networking, 2007.
[33] A. Pitsillidis, Y. Xie, F. Yu, M. Abadi, G. Voelker, and S. Savage. How to tell an airport from a home: Techniques and applications. In ACM Workshop on Hot Top-
[Plot omitted, panel (a). Series: Precision, Recall; x-axis: Binding Window Length (Days); y-axis: Percentage (%).]
Table 11. The query and click behaviors of returning and non-returning users from the first day of the log.
[Plot omitted, panel (a). Series: Returning users, Not returning users; x-axis: Percentage of queries clicked (%).]