
Host Fingerprinting and Tracking on the Web: Privacy and Security Implications

Ting-Fang Yen (1)∗, Yinglian Xie (2), Fang Yu (2), Roger Peng Yu (3), Martín Abadi (2)†
(1) RSA Laboratories
(2) Microsoft Research Silicon Valley
(3) Microsoft Corporation
[email protected], {yxie, fangyu, rogeryu, [email protected]}

Abstract

Many web services aim to track clients as a basis for analyzing their behavior and providing personalized services. Despite much debate regarding the collection of client information, there have been few quantitative studies that analyze the effectiveness of host-tracking and the associated privacy risks.

In this paper, we perform a large-scale study to quantify the amount of information revealed by common host identifiers. We analyze month-long anonymized datasets collected by the Hotmail web-mail service and the Bing search engine, which include millions of hosts across the global IP address space. In this setting, we compare the use of multiple identifiers, including browser information, IP addresses, cookies, and user login IDs.

We further demonstrate the privacy and security implications of host-tracking in two contexts. In the first, we study the causes of cookie churn in web services, and show that many returning users can still be tracked even if they clear cookies or utilize private browsing. In the second, we show that host-tracking can be leveraged to improve security. Specifically, by aggregating information across hosts, we uncover a stealthy malicious attack associated with over 75,000 bot accounts that forward cookies to distributed locations.

∗ This work was done while Ting-Fang was an intern at Microsoft Research.
† Martín Abadi is also affiliated with the University of California, Santa Cruz.

1 Introduction

It is in the interest of web services and ISPs to track the mobility and usage patterns of client hosts. This tracking allows them to understand user behavior for supporting applications such as product suggestions, targeted advertising, and online fraud detection. However, clients may not wish that their activities be tracked, and can intentionally remove stored browser cookies or choose not to perform user logins. The growing awareness of privacy concerns is exemplified by the recent “do-not-track” initiative from the Federal Trade Commission [15], which outlines guidelines to which service providers must adhere in the collection and distribution of client information.

Several works aim to improve the accuracy of host-tracking by collecting detailed host information, such as installed browser plug-ins and system fonts [20, 31] or packet-level information that reveals subtle hardware differences [28]. By comparison, few studies exist on the effectiveness and privacy implications of host-tracking. Previous work tends to be qualitative in nature [29, 30] or limited to a single identifier [20].

In this paper, we attempt to facilitate the debate regarding host-tracking by performing a large-scale study to quantify the amount of identifying information revealed by common identifiers. Such analysis is critical to both service providers and end users. For example, service providers can determine where existing identifiers are insufficient and more sophisticated methods may be preferred. Users who do not wish to be tracked can learn the circumstances in which they can be identified accurately, so that they can take effective measures to protect privacy. Our analysis is based on
month-long anonymized datasets from the Hotmail web-mail service and the Bing search engine, including hundreds of millions of users across the global Internet IP address space. By characterizing hosts’ activities across time using “binding windows”, we show that common identifiers allow us to track hosts with high accuracy.

We further consider cases where users take initiatives to preserve privacy, e.g., by clearing cookies or switching to private browsing mode. Specifically, we analyze “one-time” cookies that do not return again in subsequent web requests, a phenomenon known as cookie churn. These cookies appear to be anonymous. However, by applying our host-tracking results, we show that a surprisingly large fraction can be recognized as belonging to returning users.

In addition to its privacy implications, we demonstrate that host-tracking can also be applied to improve security. We examine the mobility patterns of hosts traveling across multiple IP ranges, and establish normal user mobility profiles from aggregate host activities. In doing so, we are able to analyze unusual activities, e.g., the use of anonymous routing networks, and develop methods to detect attacks. In particular, our study uncovers previously unknown suspicious cookie-forwarding activities, which may have been adopted by attackers to evade spamming detection.

The key findings of this paper include:

• We show that 60%-70% of HTTP user-agent strings can accurately identify hosts in our datasets. When augmented with coarse-grained IP prefix information, the accuracy can be improved to 80%, similar to that obtained with cookies. User-agent strings combined with IP addresses have an entropy of 20.29 bits, higher than that of browser plug-ins, screen resolution, timezone, and system fonts combined [20].

• Applying our results to study cookie churn, we find that a service provider can recognize and track 88% of the “one-time” cookies as corresponding to users who later returned to the service. Among these users, 33% made an effort to preserve their privacy, either by clearing cookies through browser options or utilizing private browsing mode.

• Employing general mobility patterns derived by tracking hosts across network domains, we uncover malicious behaviors where cookies are forwarded from one IP address to distributed locations. In total, we identify over 75,000 bot Hotmail accounts in this relatively stealthy attack that has not been detected before.

Although our research relies on anonymized datasets from Hotmail and Bing, the analyses that we describe are a research effort only. Our goal is not to identify or study specific individual activities, but rather to understand the patterns of the aggregated activities and to explore their implications.

In the following, we first describe the identifiers that we study and our host-tracking methodology in Section 2, and present the evaluation of those identifiers in Section 3. We investigate the privacy and security implications of host-tracking in the context of cookie churn in Section 4 and of host mobility in Section 5. Finally, we describe related work in Section 6 and conclude in Section 7.

2 Exploring Common Identifiers

Given a log of application-level events collected over time, such as requests directed to a web server or user logins to a service, our goal is to quantify the amount of host-identifying information that is captured in identifiers within the log. Specifically, for an identifier $I$, which may take on a finite set $F_I = \{f_1, f_2, \ldots, f_n\}$ of possible values (called fingerprints), we are interested in whether a fingerprint $f_i$ uniquely corresponds to a single host, among all hosts involved in the log. As we consider only client hosts in our scenario, we use clients or hosts interchangeably throughout the paper.

We assume the perspective of a passive observer of identifiers within application-level events. The common identifiers explored in this work include 1) user-agent string (UA), 2) IP address, 3) browser cookie, and 4) user login ID. We choose these identifiers because they are not particular to our datasets, and are available in a wide variety of service logs.

2.1 Host-tracking Graph

Our host-tracking approach attempts to infer the presence of a host at an IP address during a certain time interval. Upon observing a fingerprint $f$ (and only $f$) that appears at an IP address $A$ over a time interval $\Delta t$, we can infer a “binding window” for $f$. Events occurring within $\Delta t$ at $A$ can then be attributed to the host corresponding to $f$. (Hosts behind NATs/proxies can complicate matters; we quantify the occurrence of such hosts in our data in Section 3.3.)

Figure 1 illustrates how we infer the binding windows. In this example, user-agent strings (UA) are the identifiers, and the events are queries to a web search engine. A fingerprint $\mathrm{UA}_1$ appears in two consecutive
search queries at time $t_1$ and $t_2$, followed by queries at time $t_3$, $t_4$, and $t_5$ with a different fingerprint $\mathrm{UA}_2$. Thus we can identify binding windows corresponding to two different “hosts” on this IP: one spanning the time range $[t_1, t_2]$, and another spanning $[t_3, t_5]$. Having examined all search query events, we can construct a host-tracking graph as in Figure 2. Note that a fingerprint may be associated with multiple binding windows (since the host may not be up all the time) and across different IP addresses (e.g., because of DHCP). We refer to the host-tracking graph that represents hosts by identifier $I$ as $G_I$.

Figure 1. Binding windows identified on one IP.

Figure 2. Example of a host-tracking graph. Bars with different patterns denote binding windows corresponding to different fingerprints.

A similar concept of a host-tracking graph was also used by HostTracker [38] to support Internet accountability. HostTracker groups together user login IDs that are likely to be associated with the same host, e.g., family members that share a computer at home. It also filters events related to bots and large proxies. In contrast to this previous work, we make broader use of the host-tracking graph (with a variety of common identifiers), and we apply host-tracking to the cookie-churn study (in Section 4) and the host-mobility analysis (in Section 5).

2.2 Datasets

The data for our study includes a month-long user login trace collected by the Hotmail web-mail service in August 2010. The trace contains coarse-grained information about the OS and browser type (e.g., Windows, Mozilla), the IP address from which the login was made, the time of the login event, and the anonymized user ID. In the following, we refer to this as the Webmail dataset.

We also obtained a month-long dataset consisting of search query events directed to the Bing search engine in August 2010. This data includes the fine-grained user-agent string from the HTTP header (anonymized via hashing), the IP address from which the query was issued, the time of the query, the anonymized cookie ID assigned by the search engine, and the date that the cookie ID was created. Specifically, the anonymized cookie ID is a persistent identifier that does not change over time, if users do not clear cookies or use private browsing. We refer to this as the Search dataset. As part of the processing performed by the Bing search engine, events generated by known bots are filtered in advance.

To validate our client-tracking approach, we leveraged a month-long sampled log of Windows Update events, also from August 2010. This data contains the time at which the update was performed, the IP address, and the anonymized hardware ID that is unique to the host. This is the Validation dataset.

Table 1 shows the fields and the total number of unique IPs observed in each dataset. All three datasets include tens to hundreds of millions of IP addresses, spanning a large IP address space.

The published privacy policies for Hotmail, Bing, and Windows Update address the storage, use, sharing, and retention of data collected in the course of the operation of these services. In particular, they indicate that Microsoft may employ this data for analyzing trends and for operating and improving its products and services, as we aim to do with this work. Since the datasets are sensitive, they are not publicly available for further research.

2.3 Validation and Metrics

Without ground truth for the host-IP mappings, we evaluate a host-tracking graph $G_I$ by overlapping it with the Validation dataset. If a fingerprint is able to correctly
track a host, its bindings should overlap only with Windows Update events associated with a single hardware ID. Conversely, a hardware ID is also expected to overlap with bindings associated with only one fingerprint.

Dataset     User-agent information   IP address  Timestamp  ID           Unique IP addresses
Webmail     OS and browser type      Yes         Yes        User ID      308 million
Search      User-agent string (UA)   Yes         Yes        Cookie ID    131 million
Validation  N/A                      Yes         Yes        Hardware ID  74 million

Table 1. Fields in each dataset.

We quantify the accuracy of an identifier using precision and recall. Let $\mathrm{hidcount}(f)$ denote the number of hardware IDs to which a fingerprint $f$ corresponds, and $\mathrm{fpcount}(m)$ the number of fingerprints to which a hardware ID $m$ corresponds. Precision is defined as the percentage of fingerprints that correspond to one host (i.e., one hardware ID), while recall is the percentage of hosts that correspond to one fingerprint.

\[ \mathrm{Precision}_I = \frac{|\{f : \mathrm{hidcount}(f) = 1,\ f \in F_I\}|}{|F_I|} \]

\[ \mathrm{Recall}_I = \frac{|\{m : \mathrm{fpcount}(m) = 1,\ m \in M_I\}|}{|M_I|} \]

$F_I$ is the finite set of values that identifier $I$ takes in our dataset, i.e., the fingerprints (after some initial filtering, as described below). $M_I$ is the set of hardware IDs that overlap with the host-tracking graph $G_I$. Roughly speaking, precision quantifies how accurate an identifier is at representing a host. Recall quantifies how well an identifier is able to track the events associated with the corresponding host in a log.

We also measure the entropy of an identifier, $H_I$, which is the amount of information identifier $I$ contains that can distinguish hosts. The entropy is defined as

\[ H_I = -\sum_{f \in F_I} \Pr(f) \log_2(\Pr(f)) \]

where $\Pr(f)$ is the probability of observing fingerprint $f$ in the application log. A higher entropy indicates a smaller probability that any two clients are associated with the same fingerprint.

In our validation, we consider only those fingerprints that overlap with more than one Windows Update event, and only those hardware IDs that overlap with more than one application-level event pertaining to our identifiers. These restrictions allow us to focus on the portion of data that we can validate, though they can be biased to those clients that access the services consistently (i.e., multiple times and with the same identifiers). Similarly, because of the datasets available to us, our study is based on clients of Microsoft services. We acknowledge that any dataset will be incomplete and possibly biased.

3 Client-Tracking Results

In this section, we construct host-tracking graphs using the common identifiers user-agent string (UA), IP address, cookie ID, and user login ID, and evaluate their precision and recall. In particular, we explore the distinguishing power of UA by examining the browser anonymity sets. We also measure the impact of proxies and NATs in our study in Section 3.3, and describe the increased accuracy and confidence of tracking stable hosts in Appendix A.

Our analysis focuses on host-tracking within each network domain, derived using the BGP prefix entries obtained from RouteViews [9]. We investigate the occurrences of identifiers at multiple network locations in Section 5, in which we also study the security implications of host-tracking.

3.1 Precision and Recall

Table 2 presents our results on host-tracking. After overlapping the Validation dataset with the host-tracking graphs, the number of unique fingerprints and hardware IDs included in our evaluation is still large, on the order of millions.

Several observations are evident from Table 2. First, browser information (UA) alone can identify hosts quite well. Its 62.01% precision is perhaps surprising, as UA strings are commonly regarded as providing insufficient information to reveal host identities. Second, a combination of UA with the IP address (i.e., fingerprinting hosts by distinct (UA, IP) pairs) can boost the precision up to 80.62%. In fact, combining UA with only the IP prefix is sufficient to achieve approximately the same result as with UA+IP. This suggests that anonymization techniques that store the IP prefix may still retain distinguishing information. Third, cookie IDs offer only slightly better precision and recall than UA+IP. The inaccuracies of cookie IDs can be partly attributed to cookie churn, a phenomenon we study in more detail in Section 4.
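The precision and recall definitions of Section 2.3 can be illustrated with a minimal Python sketch. It assumes the overlap between binding windows and Windows Update events has already been reduced to (fingerprint, hardware ID) pairs; that input format, the function name `precision_recall`, and the toy data are hypothetical and not taken from the paper, and the filtering of rarely seen fingerprints and hardware IDs is omitted.

```python
from collections import defaultdict


def precision_recall(overlaps):
    """Compute (precision, recall) from (fingerprint, hardware_id) overlap pairs."""
    hids_per_fp = defaultdict(set)  # fingerprint -> hardware IDs it overlaps with
    fps_per_hid = defaultdict(set)  # hardware ID -> fingerprints it overlaps with
    for fingerprint, hardware_id in overlaps:
        hids_per_fp[fingerprint].add(hardware_id)
        fps_per_hid[hardware_id].add(fingerprint)
    if not hids_per_fp:
        raise ValueError("at least one overlap is required")
    # Precision: share of fingerprints with hidcount(f) == 1.
    precision = sum(len(h) == 1 for h in hids_per_fp.values()) / len(hids_per_fp)
    # Recall: share of hardware IDs with fpcount(m) == 1.
    recall = sum(len(f) == 1 for f in fps_per_hid.values()) / len(fps_per_hid)
    return precision, recall


# Toy data: "ua2" maps to two hosts, and host "h3" is seen under two fingerprints.
toy_overlaps = [("ua1", "h1"), ("ua1", "h1"), ("ua2", "h2"), ("ua2", "h3"), ("ua3", "h3")]
toy_precision, toy_recall = precision_recall(toy_overlaps)
```

In the toy data, two of the three fingerprints map to a single hardware ID and two of the three hardware IDs map to a single fingerprint, so both metrics come out at two thirds.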
Identifier I                    Precision (%)  Recall (%)  Fingerprint count  Hardware ID count
UA                              62.01          72.11       254,762            3,073,690
UA, IP address                  80.62          68.84       1,685,416          1,771,907
UA, /24 IP prefix               79.33          69.43       1,652,546          1,772,104
Cookie ID                       82.35          68.64       1,340,635          1,375,074
Cookie ID (with HostTracker)    79.74          99.13       713,110            1,001,450
User ID (with HostTracker)      92.82          93.51       4,608,980          4,820,116

Table 2. Common identifiers in host-tracking, evaluated using the Validation dataset.
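Besides precision and recall, Section 2.3 defines the entropy of an identifier. The sketch below shows one way it could be computed from a list of observed fingerprints. Estimating Pr(f) as a relative frequency in the log is an assumption of this sketch, and the function name `fingerprint_entropy` and the sample values are illustrative only.

```python
import math
from collections import Counter


def fingerprint_entropy(observed_fingerprints):
    """Shannon entropy (in bits) of the empirical fingerprint distribution."""
    counts = Counter(observed_fingerprints)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    entropy = 0.0
    for count in counts.values():
        p = count / total  # Pr(f), estimated as a relative frequency (assumption)
        entropy -= p * math.log2(p)
    return entropy


# Four equally likely fingerprints versus one dominant fingerprint.
uniform_bits = fingerprint_entropy(["ua1", "ua2", "ua3", "ua4"])
skewed_bits = fingerprint_entropy(["ua1", "ua1", "ua1", "ua2"])
```

The uniform example yields 2 bits, while the skewed example yields less than 1 bit, which matches the intuition that a dominant fingerprint distinguishes hosts poorly.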

As another method to make use of the identifiers, we also apply HostTracker [38] to the cookie IDs and user IDs from our Search and Webmail datasets, respectively. In the former case, the clients are now tracked by a group of correlated cookies, e.g., those belonging to two browsers running on a machine in parallel. In the latter case, user login IDs that frequently appear together, e.g., family members that share a computer at home, are used to track clients. We find user IDs achieving high precision and recall (over 92%), demonstrating that they are strongly tied to individual hosts.

Since HostTracker yields relatively high precision and recall with user IDs, we have also evaluated the other identifiers against user IDs (instead of hardware IDs). Even though hardware IDs and user IDs overlap with different portions of the datasets, we obtain results consistent with those of Table 2.

To summarize, we show that common identifiers can track hosts reasonably well, particularly when they are used in combination.

3.2 Browser Anonymity Set

Our evaluation suggests that a large fraction of browsers provide enough information to fingerprint hosts within each network domain. In this section, we examine in detail the anonymity set of browser fingerprints, defined as the set of hardware IDs that share the same fingerprint. Even though 62% of UAs map to unique hosts, popular UA strings still have large anonymity sets: additional examination shows that the most common fingerprint, Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1), corresponds to 124,355 (4.05%) of the hardware IDs that overlap with the UA host-tracking graph.

Figure 3 compares the size of the anonymity sets for UA and UA+IP. We find 98.92% of the UA+IP fingerprints to be relatively rare, with fewer than five hardware IDs, while this holds for only 89.69% of the UA fingerprints.

Figure 3. The distribution of the size of the browser anonymity sets, plotted in log-scale. [Plot: x-axis Index of Fingerprints, y-axis Size of Anonymity Set; series UA, UA and IP.]

To quantify the amount of identifying information provided by browser fingerprints, we calculate their entropy. In our data, UA has an entropy of 11.59 bits, while the entropy of UA+IP is 20.29 bits. A study performed by Eckersley et al. [20] probed the remote client for installed plug-ins, screen resolution, timezone, system fonts, and user-agent strings, which altogether yielded an entropy of 18.1 bits. While this suggests that their detailed information provides more distinguishing power than UA alone, it is interesting to observe that such information may be less distinguishing than simply combining UA and IP address.

These results confirm our finding that UA strings augmented with IP addresses can identify hosts well. However, popular UA strings still have large anonymity sets. Changing the default UA string to one that corresponds to a popular browser version may hence allow a client to become less distinguishable.

3.3 Impacts of Proxies and NATs

Among the common identifiers we explored, none of them performs perfectly. Aside from their inherent ambiguity (e.g., some UAs are more common than others, cookies can be removed), proxies and NATs introduce
fundamental difficulties in tracking hosts. The ability to detect and measure them allows us to understand precisely where such practical limitations will apply.

We first quantify the prevalence of large proxies and NATs that are typically configured by ISPs or enterprises. To do so, we examine “hosts” that correspond to a large number of user login IDs or cookie IDs. A small fraction of IP addresses in our datasets (31,874 and 2,151 from the Webmail and Search datasets, respectively) are each associated with more than 5,000 unique login IDs and cookie IDs. These are likely large proxies and we filter them in our evaluation.

Next, we examine small NAT devices that are often used by home networks. In this case, since it is relatively rare for a client to be running multiple operating systems in parallel, we leverage the coarse-grained OS type and IP address recorded for each user login event in the Webmail dataset. The majority (80.31%) of our login ID fingerprints are associated with only one unique user ID. When we observe multiple OS types, all from the same IP address, it indicates that the “host” may actually be a NAT device that masks multiple clients.

From this experiment, we find 10.60% of hosts likely to be NATs. This number is a lower bound, since we cannot distinguish clients that are running the same OS behind a NAT device. Table 3 shows that while the large majority of NATed hosts include multiple Microsoft Windows OSes, hand-held devices also comprise a large fraction (about 16%). With the increasing popularity of multiple home devices and smart phones, we expect the percentage of NATs to grow further.

OS Types                                 NAT hosts (%)
Multiple Windows                         81.32
Windows and Hand-held device             15.62
Windows and Mac OS/Unix                  2.19
Hand-held and Mac OS/Unix                0.55
Windows, Hand-held, and Mac OS/Unix      0.31
Multiple Mac OS/Unix                     0.01
Multiple Hand-held devices               0.01

Table 3. Breakdown of the OS types found to be associated with hosts behind NATs.

4 Application: Cookie Churn Study

As the primary method for web sites to track returning users without requiring login-based authentication, browser cookies play an important role in customizing web services and maintaining user statistics. However, as shown in Section 3, using cookie IDs as client fingerprints can be unreliable. In particular, they have a relatively low recall rate: 32% of the hardware IDs in our evaluation cannot be completely tracked by cookies.

A main source of the low recall rate is cookie churn, which we define as the phenomenon of cookies appearing at least once but not appearing again in subsequent web requests received by a server (within some observation time window). For service providers, being able to track hosts will allow them to quantify the underlying causes behind the cookie-churn phenomenon. In this section, we measure and analyze cookie churn in the Search dataset. (Among the datasets available to us, it is the only one that contains cookie IDs.) By applying our host-tracking methodology, we show that some client users may still be identified despite cookie churn.

4.1 Cookie Churn Measurement

Among cookie IDs that appear on the first day of our Search dataset, the rate of cookie churn, i.e., the fraction of cookie IDs that never returned again within our month-long observation, is 47.86%. On average, the daily cookie churn rate is around 45% across month-long sliding windows.

Furthermore, 81.98% of the new cookie IDs that are born on the first day of the Search dataset never returned within the month. For all cookie IDs observed on the first day of the month, Figure 4 shows cumulative distributions of the date that old and new cookies appear a second time. The churn rate of new cookies is significantly higher than that of old cookies, a difference of more than 40%.

4.2 Possible Reasons for Churn

Clearly, cookie churn can result from users quitting the service. As shown in Figure 4, engaged users that access the service multiple times (with old cookies) are more likely to return than new users.

Another reason for cookie churn is the removal of cookies from the client browser. This removal can happen in several cases, including when users manually clear cookies, when they set their browsers to automatically clear cookies on exit, or when users switch into or out of private browsing mode. Supported by all major web browsers today, private browsing takes a user’s activities off the record by removing caches, history, and in particular, cookies that are set during private mode.

To study how private browsing mode affects the cookie events observed by web services, we examine
four of the most popular browsers in use today: Firefox (version 3.6.11), Safari (version 5.0.2), Chrome (version 7.0.517.41), and Internet Explorer (version 8.0). Table 4 shows, for the browsing mode under which a cookie is set (the first column), whether the same cookie can be accessed under another browsing mode (the second column). In all cases, a cookie set in private mode can be accessed repeatedly in the same private browsing session, but not across different private browsing sessions. No cookies set in private mode can be accessed in public mode. Safari is the only browser that allows private mode to access cookies set in public mode.

Cookie Set  Cookie Accessed            Firefox  Safari  Chrome  IE
Public      Private                    No       Yes     No      No
Private     Same private session       Yes      Yes     Yes     Yes
Private     Different private session  No       No      No      No
Private     Public                     No       No      No      No

Table 4. Accessibility of cookies in different browsing modes.

Figure 4. For cookie IDs observed on the first day of the month, the cumulative distribution of the date that old and new cookies appear again in our dataset. [Plot: x-axis Date of Return (8/1/2010 to 8/30/2010), y-axis Percentage (%); series New Cookies, Old Cookies.]

In the next subsection, we perform fine-grained classification to quantify the above possible causes of cookie churn and characterize the corresponding users.

4.3 Understanding Cookie Churn

Applying the host-tracking results, we analyze cookie churn by identifying cookies that are associated with the same client host. In Section 3.1, we show that the host-tracking graph $G_{UID}$ derived from user login IDs (with HostTracker) achieved over 92% precision and recall in tracking clients, which are represented by hardware IDs from the Validation dataset. Thus we use the hosts defined in $G_{UID}$ to serve as ground truth for studying cookie churn. By overlapping $G_{UID}$ with the Search dataset, we consider cookies whose query events fall into binding windows associated with the same host as corresponding to the same user (since user activity roughly approximates host activity).

We focus on studying new cookie churn, as it is more significant than that of old cookies (see Figure 4). We refer to the set of “one-time” cookie IDs (CIDs) that are born on the first day but do not return again in our dataset as the churned new cookie IDs. In total, there are 437,914 users (or hosts) that overlap with 847,196 churned new CIDs in the Search data. The number of hosts is only about half of the number of churned cookie IDs. We investigate the four cases that result in new cookie churn, as illustrated in Figure 5, where the breakdown of users belonging to each category is shown in Table 5. We elaborate on each of these cases separately below.

4.3.1 Case 1: Non-Returning Users

If a CID overlaps with one of host $h$’s binding windows at time $t$, but no other CIDs overlap $h$’s bindings from time $t$ onwards, we consider this as corresponding to a user who does not return to the service (Figure 5(a)).
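The Case 1 rule can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes each host's queries are available as (timestamp, cookie ID) pairs already restricted to that host's binding windows, and it takes t to be the first time the churned CID is seen. The function name `is_non_returning` and the sample events are hypothetical.

```python
def is_non_returning(host_events, churned_cid):
    """Return True if no other cookie ID appears on the host from time t onwards.

    host_events: (timestamp, cookie_id) pairs for queries that fall inside one
    host's binding windows (hypothetical input format).
    """
    times = [ts for ts, cid in host_events if cid == churned_cid]
    if not times:
        raise ValueError("churned_cid does not appear in host_events")
    t = min(times)  # assumption: t is the first time the churned CID is seen
    return all(cid == churned_cid for ts, cid in host_events if ts >= t)


left_service = [(1, "c0"), (5, "c1"), (6, "c1")]  # nothing but c1 from t=5 on
came_back = [(5, "c1"), (9, "c2")]                # c2 appears after c1
```

With these inputs, `is_non_returning(left_service, "c1")` holds, while `is_non_returning(came_back, "c1")` does not, because a different cookie ID shows up on the same host later.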
                                       Case 1    Case 2   Case 3    Case 4
Number of users                        101,427   77,120   67,310    192,057
Percentage of users (%)                23.16     17.61    15.37     43.86
Number of churned new CIDs             101,427   77,147   123,757   544,865
Percentage of churned new CIDs (%)     11.97     9.12     14.60     64.31

Table 5. Breakdown of the churned new cookie IDs into four categories of users.
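As a quick arithmetic check, the user percentages in Table 5 follow directly from the user counts, which sum to the 437,914 users reported in Section 4.3. The short Python sketch below reproduces them; the variable names are illustrative only.

```python
users_per_case = {"Case 1": 101_427, "Case 2": 77_120, "Case 3": 67_310, "Case 4": 192_057}
total_users = sum(users_per_case.values())

# Share of users in each case, in percent, rounded to two decimals as in the table.
user_share = {case: round(100 * n / total_users, 2) for case, n in users_per_case.items()}

# Users that fall into Case 2 or Case 3, in percent.
case2_or_3_share = round(
    100 * (users_per_case["Case 2"] + users_per_case["Case 3"]) / total_users, 2
)
```

Cases 2 and 3 together cover just under 33% of the users with churned new cookies.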

We find that Case 1 accounts for only 11.97% of the churned new CIDs. Thus, despite the high cookie churn rate, the majority (88.03%) of the churned new cookie IDs correspond to returning users who might still be tracked. The behaviors of the non-returning users are examined in detail in Appendix B.

Figure 5. Four cases of cookie churn. $C_1$ is the churned new cookie ID. Horizontal bars denote binding windows for a “host” defined by user IDs.
(a) Case 1: User left the service.
(b) Case 2: User clears cookies.
(c) Case 3: Private browsing mode (one UA). Case 4: Multiple browsers (multiple UAs).

4.3.2 Case 2: Users that Clear Cookies

Cookie churn can also result from users intentionally removing cookies. In this case, a host $h$’s bindings should overlap with CIDs generated consecutively in time (Figure 5(b)). Each CID may be associated with multiple queries that typically belong to a session. Among hosts with new cookie churn, we find 77,120 (17.61%) in this category. Since we observe only cookies issued by the Bing search engine, we cannot distinguish between users who clear all cookies and those who selectively clear cookies from certain domains.

To find whether users clear cookies on exiting browsers, we examine the time intervals between consecutive queries associated with the same CID, and compare them with those between consecutive queries associated with different CIDs. Figure 6 shows that the former is distinctly smaller, with 75% of them below 10 minutes and hence likely to belong to one session. By contrast, 90% of the query intervals between different CIDs are larger than 8 hours. This suggests that most users clear cookies per session, e.g., when they close the browser window.

Figure 6. Distributions of query intervals. [Plot: x-axis query interval, binned from <=5 min to >1 day; y-axis Percentage of queries (%); series Between same CIDs, Across different CIDs.]

We also find a small fraction (3.85%) of users whose cookies are cleared per query, i.e., each of their queries is associated with a different CID. These might be users who take extreme measures to clear cookies for each query to preserve privacy. However, such patterns can become a distinctive feature that makes tracking easier, despite the user’s intention of remaining anonymous.

4.3.3 Case 3: Users with In-Private Browsing Mode

Another reason for cookie churn is the use of the browser’s private browsing mode. As illustrated in Figure 5(c), upon entering private mode, the old cookie ($C_2$) set under public mode is replaced by a new cookie ($C_3$). Upon exiting the private mode, the old cookie ($C_2$) will continue to be used by the same user.

We focus on the hosts whose cookies appear interleaved in their binding windows, where an old cookie continues to appear after the user submits queries with a newer cookie. Since entering private browsing mode does not change the browser used by the host, we identify those associated with a single UA string as users
who utilize private browsing mode, and 15.37% of users To understand the mobile behavior of hosts at a large
belong to this category. scale, we make use of cookie IDs, since they are more
Together with case 2, there are in total around 33%— closely tied to specific devices than other identifiers we
a non-trivial fraction—of users who would like to pre- studied in Section 3. We use the Search dataset for our
serve privacy by either clearing cookies or entering pri- study. Among cookie IDs in this dataset, 7.9 million
vate browsing mode. These users may still be tracked appeared at more than one domain. While the major-
when service providers combine the host-tracking re- ity of these cross-domain activities are associated with
sults from other identifiers (e.g., login IDs) with cookie normal user travel patterns, there also exist unusual or
data. suspicious activities, for example, cookie forwarding of
the kind supported by CookieCooker [1].
In this section, we focus on detecting the following
4.3.4 Case 4: Users with Multiple Browsers two abnormal host mobility patterns:
For the remaining users, we observe multiple cookies
• Some cookie IDs move quickly between multiple
co-existing (as in Section 4.3.3), though they are as-
domains, suggesting that they may not correspond
sociated with different UA strings. Upon examining
to hosts who travel physically. In particular, we
these users more closely, we find around 67% associated
study those cookies that may be associated with
with only two or three UAs. This observation suggests
anonymous routing, such as Tor routing [37].
that these cases correspond to single hosts with multiple
browsers or small home NATs. While it is more difficult
• During an investigation into suspicious user email
to track hosts behind NATs, we note that the anonymity
traffic that do not conform to the general host mo-
sets tend to be too small in such cases to protect user
bility profile, we uncover a stealthy type of mali-
privacy.
cious cookie-forwarding activity.
A small fraction of these cases (3%) are associated
with a large number of UA strings, which suggests that In the following, we first study patterns corresponding to
they are large proxies or NATs. Routing traffic through users traveling across domains in general. We then use
proxies thus provides better means for users who do not those patterns as baseline to identify abnormal activities.
wish to be tracked.
5.1 Host Mobility Patterns
Summary We study the cookie-churn phenomenon
where privacy-aware users may clear cookies or switch Our analysis yields a few key observations on general
to private browsing. We show that by applying host- host mobility patterns. First, as shown in Table 6, ASes
tracking results with other identifiers, service providers associated with cellular networks, i.e., Verizon Wire-
may still be able to identify a large fraction (88%) of the less and Carphone Warehouse Broadband Services, are
“one-time”, churned new cookie IDs as corresponding ranked among the top domains with the largest number
to users who return to the service. of traveling cookies. This fact reflects the proliferation
of smart phones with mobile Internet access. In total, we
5 Application: Host Mobility Study find around 20% of the cookies among the top 500 AS
pairs to be associated with cellular networks (Verizon
Wireless, AT&T Wireless, Vodafone, Sprint, etc.).
In addition to switching between IP addresses within
the same network (for instance, because of DHCP), a AS pair # Cookies Affiliations
host may also travel across different IP ranges. This can AS 17557, 45595 152871 Pakistan Telecom (PK)
occur if the host is a mobile device, or when a virtual AS 6167, 22394 70941 Verizon Wireless (US)
private network (VPN) is used. Above, we track hosts AS 13285, 43234 56600 Opal Telecom, Carphone
within each IP prefix range separately, though it is also Warehouse Broadband (GB)
desirable to study clients that travel across domains, e.g., AS 4134, 4837 52520 ChinaNet (CN)
for traffic engineering or network management. More AS 8228, 15557 36812 Neuf Cegetel (FR)
importantly, host mobility patterns can benefit security
as well. We demonstrate this point by applying our host- Table 6. Top five AS pairs associated with
tracking results to detect abnormal and malicious activi- traveling cookies.
ties.
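As a rough illustration of how a ranking such as Table 6 could be derived, the following Python sketch counts, for each unordered AS pair, the number of distinct cookie IDs seen moving between the two ASes. The record layout (cookie ID, timestamp, AS number), the function name count_traveling_cookies, and the choice to count a pair only when the two ASes appear in consecutive observations are all assumptions made for this example; the text does not specify the log schema or the exact counting rule.

```python
from collections import defaultdict

def count_traveling_cookies(records):
    """Rank unordered AS pairs by the number of distinct cookies moving between them.

    records: iterable of (cookie_id, timestamp, asn) tuples (hypothetical schema).
    A cookie counts toward the pair (a, b) if two of its observations that are
    consecutive in time come from the different ASes a and b (assumed rule).
    """
    by_cookie = defaultdict(list)
    for cookie_id, ts, asn in records:
        by_cookie[cookie_id].append((ts, asn))

    cookies_per_pair = defaultdict(set)
    for cookie_id, events in by_cookie.items():
        events.sort()  # order each cookie's observations by time
        for (_, prev_asn), (_, next_asn) in zip(events, events[1:]):
            if prev_asn != next_asn:
                pair = tuple(sorted((prev_asn, next_asn)))
                cookies_per_pair[pair].add(cookie_id)

    counts = {pair: len(ids) for pair, ids in cookies_per_pair.items()}
    # Largest count first; ties broken by AS numbers for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data only; the AS numbers are borrowed from Table 6 for readability.
sample = [
    ("c1", 1, 6167), ("c1", 2, 22394), ("c1", 3, 6167),
    ("c2", 1, 22394), ("c2", 5, 6167),
    ("c3", 1, 4134), ("c3", 2, 4837),
    ("c4", 1, 4134),  # never leaves its AS, so it is not a traveling cookie
]
ranking = count_traveling_cookies(sample)
```

Using a set per pair means a cookie that bounces repeatedly between the same two ASes is counted once, which matches the "# Cookies" column being a count of cookies rather than of transitions.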
Second, we find traveling hosts to exhibit strong geographic locality. 83% of the cookies move between networks within the same country, and this number is even higher for the U.S. (95.44%). The strong geographic locality pattern can also be observed among cookies that travel across countries. Figure 7 shows the topology of international host travel, also drawn from the top 500 AS pairs. The node “EU” in the figure represents multi-regional networks in the European Union, which are not exclusively part of any European country. The size of each node in the figure is proportional to the number of cookies that originated from that country or region. The edges indicate the direction of travel. The figure shows that host mobility is largely bi-directional, and is commonly localized within the same general region (e.g., Europe).

[Figure 7 graph nodes: BE, GT, EU, AE, HN, FR, PT, NO, DE, IT, ES, GR, US, CN, IR, AR, GB, SA, MX.]

Figure 7. The topology of host mobility by country or region (e.g., “EU”), for top 500 AS pairs.

Third, a closer look at the AS topology of host mobility in the U.S. shows the existence of “hub” ASes that are connected to many smaller “leaf” ASes. The former are commonly associated with DSL broadband Internet services, while the latter include institutional and corporate networks. This star topology could result from clients’ commuting patterns between home and work.

Finally, in addition to the source and sink domains, we are also interested in how far the hosts roam, i.e., how many ASes they travel through. Figure 8 plots the distribution of the number of ASes traveled by each host, with the Y-axis in log scale. The large majority (90%) of cookies are associated with only two domains.

[Figure 8 plot: number of cookies (Y-axis, log scale) against number of ASs traveled (X-axis).]

Figure 8. The distribution of the number of ASes traveled by each CID. Y-axis is in log scale.

These observations, based on aggregate information across the 7.9 million traveling hosts in the Search dataset, reflect general mobility patterns at a large scale. In the following, we investigate specific activities that fall outside this norm, including those that may involve suspicious behavior.

5.2 Identifying Virtual Client Travel

Although the majority of traveling cookies correspond to physical host mobility, such as those associated with cellular networks, some switch between domains faster than seemingly possible for physical travel. These cookies appear consecutively from different ASes within a matter of minutes, and their rapid movement suggests the presence of some form of virtual client travel.

5.2.1 VPN Traffic Patterns

The large majority of hosts that travel rapidly across only two or three domains have likely used VPNs or proxies. Virtual private networks (VPNs) allow traffic to be privately tunneled between two machines that are not in the same subnet. Creating an overlay network of clients that belong to the same organization, they are commonly used to provide corporate resources to remote employees. From the perspective of a web server, a user connecting to her company network from a DSL line at home can generate multiple requests with the same cookie, though they appear from two domains.

Specifically, we find a total of 960,885 (12%) mobile cookies that travel between only two ASes, and that appear at the ASes consecutively within a short interval (i.e., 10 minutes). We call such cookies VPN-style cookies. Table 7 lists the top five AS pairs with the highest number of these cookies, which include institutional and corporate networks, e.g., City University of New York, NTT, and KDDI Corporation. VPN-style cookies comprise around 60% of all traveling cookies between a corporate network and a DSL broadband service provider. This observation indicates that VPNs can be a major explanation for host mobility.
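The VPN-style criterion described above (exactly two ASes, with consecutive appearances at both within 10 minutes) can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout (cookie ID, Unix timestamp, AS number) and the name find_vpn_style_cookies are hypothetical, and the sketch treats a single fast switch between the two ASes as sufficient, which the text does not state explicitly.

```python
from collections import defaultdict

VPN_WINDOW_SECONDS = 10 * 60  # the 10-minute interval mentioned in the text

def find_vpn_style_cookies(records, window=VPN_WINDOW_SECONDS):
    """Return the set of cookie IDs that match the VPN-style pattern.

    records: iterable of (cookie_id, unix_time, asn) tuples (hypothetical schema).
    """
    by_cookie = defaultdict(list)
    for cookie_id, ts, asn in records:
        by_cookie[cookie_id].append((ts, asn))

    vpn_style = set()
    for cookie_id, events in by_cookie.items():
        if len({asn for _, asn in events}) != 2:
            continue  # the cookie must travel between exactly two ASes
        events.sort()  # order observations by time
        for (t0, a0), (t1, a1) in zip(events, events[1:]):
            # Assumed rule: one AS switch within the window is enough.
            if a0 != a1 and t1 - t0 <= window:
                vpn_style.add(cookie_id)
                break
    return vpn_style

# Toy data; cookie names and AS numbers are illustrative only.
sample = [
    ("home_work", 0, 7018), ("home_work", 300, 31822),   # switch after 5 minutes
    ("commuter", 0, 7018), ("commuter", 36000, 31822),   # switch after 10 hours
    ("roamer", 0, 1), ("roamer", 60, 2), ("roamer", 120, 3),  # three ASes
]
vpn_cookies = find_vpn_style_cookies(sample)
```

In this toy input only "home_work" qualifies: "commuter" visits two ASes but too slowly, and "roamer" visits more than two.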
AS pair            # Cookies   Affiliations
AS 6389, 35985     13249       BellSouth, One Ring Net. (US)
AS 702, 2856       8977        Verizon (US), BTnet UK Reg. Net. (GB)
AS 7018, 31822     7878        AT&T, City Univ. N.Y. (US)
AS 174, 701        6630        Cogent, MCI Comm. (US)
AS 4713, 4716      5770        NTT Comm., KDDI Corp. (JP)

Table 7. Top AS pairs associated with VPN cookies.

5.2.2 The Use of Anonymous Routing

Examining the tail of the distribution in Figure 8, we also find a small fraction (0.02%) of cookies that migrate across more than 10 different domains. Stopping in each AS only for short durations, they do not return to a previously visited domain. Focusing on this behavior, we identify 309 cookies that travel across more than 10 ASes, and where the time between consecutive “jumps” to different ASes is less than 10 minutes (which is the default time to use a Tor circuit for new application connections). Compared to the AS peering relationship in Section 5.1, there does not appear to be any clear delineation of geographical regions.

The top ASes in this case are dominated by cable networks, with the previously top cellular networks disappearing completely from the list. Some university networks ranked significantly higher than before (AS 111, associated with Boston University, is on the path of 9% of these cookies). One explanation for the behavior of these cookies is the use of anonymous routing systems, such as Tor [37]. For a user that routes her traffic in this manner, if her traffic exits from different nodes in the mixing network, the same cookie may appear at different domains.

We obtained a list of active Tor nodes [8], including, for each node, its IP address, country, ISP, and whether it is an exit node. Among the 309 wandering cookies, 60 of them traverse through at least one Tor node, and 142 of them traverse through at least one AS that is also shared by a Tor node. We also examine ASes since some Tor nodes may already be assigned different IPs at the time of our lookup. Figure 9 plots the distribution of the percentage of Tor IP addresses or ASes that a cookie traverses. All of the cookies spend at least 12% of their time at a Tor IP address, with the maximum being 83%.

[Figure 9 plot: percentage of cookies (Y-axis) against percentage of Tor IPs or ASs (X-axis), with one curve for cookies overlapping Tor IPs and one for cookies overlapping Tor ASs.]

Figure 9. The percentage of Tor IPs or ASes on the path of wandering cookies.

Using the Tor network hides the network origin of a user, addressing one aspect of online anonymity. However, the use of cookies may still reveal user activity patterns and potentially user network origins, e.g., if a user does not clear cookies prior to using the Tor network. To mitigate such privacy threats, users can install Torbutton [7] to manage their identifying information, for example.

5.3 Detecting Cookie-Forwarding Attacks

Based on the host mobility patterns derived from our analysis, we launch an investigation into abnormal user activities that include 28,208 unique user accounts, provided by the Hotmail web-mail service. These events are sampled over a 24-hour window in November 2010. In each event, a user submitted requests (e.g., checking new emails, listing contacts) from an IP range that was different from the one she used to log into her Hotmail account. One would imagine that this behavior can be attributed to the use of cellular networks, VPNs, or proxies. Surprisingly, we find many users exhibit quite different traveling patterns than those we learned in Sections 5.1 and 5.2.

5.3.1 Detection Methodology

We find two distinct patterns in these events that differ from those of general mobile hosts:

• One-third of the ASes associated with these events are exclusively sinks or sources. This is in contrast to normal host mobility, where the direction
of travel is largely bi-directional. Table 8 lists the dominant sink ASes.

• Among the AS pairs with the largest number of these abnormal events, seven out of the top ten do not appear at all among those associated with normal hosts. These AS pairs are listed in Table 9.

Sink AS      # Cookies   Location
AS 34285     308         Seville, Spain
AS 40430     201         Miami, FL, USA
AS 14141     192         Atlanta, GA, USA
AS 19318     189         Jersey City, NJ, USA
AS 19194     174         Unknown (Satellite provider)

Table 8. Top ASes that are exclusively sinks in the abnormal events.

AS pair            # Cookies   Affiliations
AS 766, 34285      308         RedIRIS AS (EU), SANDETEL (ES)
AS 30736, 25761    235         Easyspeedy Net. (DK), Staminus Comm. (US)
AS 30736, 40430    201         Colo4jax (US)
AS 30736, 1421     198         WANSecurity (US)
AS 30736, 14141    192         WireSix (US)
AS 30736, 29761    192         OC3 Net. & Web Solutions (US)
AS 30736, 19318    188         New Jersey Intl. Internet Exchange (US)

Table 9. Top AS pairs related to abnormal events.

Combining these two observations, we find that the dominant sinks in Table 8 significantly overlap with the sink ASes in Table 9. They share the common source AS 30736, located in Denmark. Upon examination, we find that there is a single IP address generating login events for a large number of users, who then submit subsequent requests from multiple ASes in the U.S., violating the geo-locality travel pattern observed in Figure 7 as well.

We find that the user login IDs associated with this particular source IP address contain more suspicious patterns. In particular, they are groups of bot-user accounts all registered on the same day in November 2010, with the same user age, location information (country, state), and scripted naming patterns. Among the top five dominant sink ASes, four of them are used by these bot groups to submit requests.

By examining all the sink ASes with source AS 30736 in these events, we find a total of 9 bot-user groups, corresponding to 9 sink ASes geographically distributed over the U.S. The activities between some of these ASes are subtle, and would not have been detected without leveraging the normal host mobility patterns described in Section 5.1.

5.3.2 Cookie-Forwarding Bot Users

Table 10 lists the statistics for the 9 detected bot-user groups. Each of these groups includes around 190 users. Each user group is associated with a different /24 subnet and submits requests from that subnet without any explicit login activity there. For each /24, the sink IP rotates among 10 to 14 addresses.

Sink AS      # IP   # Req.   # Acct.   Location
AS 14141     12     262      192       Atlanta, GA
AS 19194     10     225      174       Unknown
AS 19318     11     242      189       Jersey City, NJ
AS 40430     12     269      201       Miami, FL
AS 25761     14     324      235       Fullerton, CA
AS 1421      10     265      198       Bordentown, NJ
AS 29761     10     244      192       Los Angeles, CA
AS 30058     10     261      180       Woodstock, IL
AS 18779     10     246      180       San Francisco, CA

Table 10. Statistics for detected bot-user groups.

From a more recent user login dataset collected by Hotmail in January 2011, we find over 75,000 email accounts associated with the suspicious source IP address in Denmark, all exhibiting similar patterns to the 9 groups we discovered. Manual investigation by Hotmail shows that these accounts were used by attackers for the purpose of receiving and testing spam. After these accounts are logged into from one machine (i.e., one IP address), their cookies are forwarded to multiple locations so that further requests can be submitted in a distributed fashion during the validity period of the cookies, which is 24 hours in our case.

There are at least two possible explanations for such malicious cookie-forwarding activities. First, some web-mail providers identify an account as suspicious if it performs logins from multiple geographic locations within a short time interval. By forwarding cookies to other locations through a private communication channel, attackers can successfully offload the requests to distributed hosts without them performing explicit user logins, hence reducing the likelihood of detection. Second, as a preparation step in launching session-hijacking attacks on real user accounts (e.g., [6]), attackers may be testing the effectiveness of forwarding cookies via stealthy communication channels.

Although the user accounts we identified were all newly created, it is possible that attackers can employ hijacked cookies stolen from actual users and forward them to botnet hosts in the future. Understanding normal host mobility patterns can help detect such stealthy attacks.

6 Related Work

Many efforts on tracking hosts focus on identifying specific hardware characteristics, such as radio frequency [23, 34, 18] or driver [21]. Identifiers such as network names or the IP addresses of frequently accessed services also enable host fingerprinting [32]. However, these approaches require the observer to be in close physical proximity to the target host.

Remote host fingerprinting can leverage packet-level information to identify the differences in software systems [2, 4, 5] or hardware devices [28]. Other works on tracking web clients require probing hosts’ system configurations [20] or the installation order of browser plug-ins [31]. Persistent browser cookies [3, 36] have also been proposed; these systems store several copies of a cookie in different locations and formats, so that they cannot be removed by standard methods.

Compared with these efforts, our work focuses on studying the effectiveness and implications of tracking hosts using existing identifiers, without requiring new information or probes. Although the issue of privacy leakage has been repeatedly discussed, e.g., personally identifiable information in online social networks [29, 30], there has been limited study using large-scale datasets. Our work uses month-long datasets from a large search engine and a popular email provider to quantify the amount of host-identifying information revealed by a variety of common identifiers. To the best of our knowledge, we are also the first to demonstrate applications of host tracking to analyze cookie churn in web services and to detect suspicious cookie-forwarding activities.

Apart from its privacy implications, understanding cookie churn is an important topic for estimating web user population and personalization. Previous studies mostly rely on user surveys or active user participation (e.g., by installing software on user machines) [12, 11, 16, 14]. Their findings show that 30% to 40% of users clear cookies monthly. A separate study by Yahoo! [13] finds that 40% to 60% of users have empty browser caches, so they probably have cleared cookies as well. While our results are consistent with previous findings, the approach we take requires neither user cooperation nor special content setup.

Host mobility studies have been performed in the context of wireless [17, 27, 22, 25], ad hoc [24, 26], and cellular networks [19] to obtain more accurate device moving models or to predict user locations. Simler et al. [35] studied user mobility in terms of session characteristics based on login events to a university email server in order to generate synthetic traces. Recent work [33] proposed a technique for classifying IP addresses into home and travel categories to study host travel and relocation patterns in the U.S. By studying cross-domain cookies, our work focuses on normal host mobility patterns that enable us to observe uncommon phenomena and detect malicious activities.

7 Discussion and Conclusion

In this paper, we perform a large-scale exploration of common identifiers and quantify the amount of host-identifying information that they reveal. Using month-long datasets from Hotmail and Bing, we show that common identifiers can help track hosts with high precision and recall.

Our study also informs service providers of the potential information leakage when they anonymize datasets (e.g., replacing IP addresses with IP prefixes) and release data to third-party collaborators or to the public. For example, we show that hashes of browser information (i.e., the anonymized UA strings) alone can be quite revealing when examined in one network domain. Furthermore, coarse-grained IP prefixes achieve similar host-tracking accuracy to that of precise IP address information when they are combined with hashed UA strings.

Our analysis suggests that users who do not wish to be tracked should do much more than clear cookies. Uncommon behaviors such as clearing cookies for each request may instead distinguish a host from others who do not do so. Users should take notice of their user-agent strings (e.g., modify the default setting [10]), consider the use of proxies, and possibly resort to sophisticated techniques such as anonymous routing [37]. In some cases, several of these techniques should be combined to be effective, e.g., clearing cookies in addition to the use of proxies or Tor.

Finally, despite its privacy implications, we demonstrate the security benefit of host-tracking. Given the
growing concerns over account hijacking and session hijacking, we expect host fingerprinting and tracking techniques can help defend against such attacks in the future.

Acknowledgments

We are grateful to Hotmail, Bing, and Windows Update for providing us with data access that makes this study possible. We thank Zijian Zheng for his guidance and insight on cookie-churn analysis. We thank Keiji Oenoki and Hersh Dangayach for providing us with data related with cookie-forwarding attacks and for the help in the subsequent investigation. We thank the reviewers, and in particular Paul Syverson, for their suggestions of improvements to this paper.

References

[1] CookieCooker. http://www.cookiecooker.de/.
[2] Nmap free security scanner. http://nmap.org.
[3] Project details for evercookie. http://samy.pl/evercookie/.
[4] Project details for p0f. http://lcamtuf.coredump.cx/p0f.shtml.
[5] Project details for xprobe. http://sourceforge.net/projects/xprobe/.
[6] Secure your PC and website from Firesheep session hijacking. http://www.pcworld.com/businesscenter/article/210028/secure_your_pc_and_website_from_firesheep_session_hijacking.html.
[7] Tor Project: Torbutton. https://www.torproject.org/torbutton/.
[8] Tor Proxy List. http://proxy.org/tor.shtml.
[9] U. Oregon Route Views Project. http://www.routeviews.org/.
[10] User-agent switcher. https://addons.mozilla.org/en-US/firefox/addon/59/?id=59.
[11] 40% of consumers zap cookies weekly. http://www.marketingsherpa.com/!newsletters/bestofweekly-4-22-04.htm#topic1, 2004.
[12] Measuring unique visitors: Addressing the dramatic decline in accuracy of cookie-based measurement. White paper, Jupiter Research, 2005.
[13] Yahoo! user interface blog: Performance research, part 2: Browser cache usage exposed! http://yuiblog.com/blog/2007/01/04/performance-research-part-2/, 2007.
[14] Cookie corrected audience data. White paper, Quantcast Corp., 2008.
[15] Protecting consumer privacy in an era of rapid change. Federal Trade Commission Staff Report, 2010.
[16] M. Abraham, C. Meierhoefer, and A. Lipsman. The impact of cookie deletion on the accuracy of site-server and ad-server metrics: an empirical comScore study. White paper, comScore, Inc., 2007.
[17] M. Balazinska and P. Castro. Characterizing mobility and network usage in a corporate wireless local-area network. In Intl. Conf. Mobile Systems, Applications, Services, 2003.
[18] V. Brik, S. Banerjee, M. Gruteser, and S. Oh. Wireless device identification with radiometric signatures. In Intl. Conf. Mobile Computing and Networking, 2006.
[19] I. Constandache, S. Gaonkar, M. Sayler, R. Choudhury, and L. Cox. Energy-efficient localization via personal mobility profiling. In Intl. Conf. Mobile Computing, Applications, and Services, 2009.
[20] P. Eckersley. How unique is your web browser? In Privacy Enhancing Technologies Symp., 2010.
[21] J. Franklin, D. McCoy, P. Tabriz, V. Neagoe, J. V. Randwyk, and D. Sicker. Passive data link layer 802.11 wireless device driver fingerprinting. In USENIX Security Symp., 2006.
[22] J. Ghosh, M. Beal, H. Ngo, and C. Qiao. On profiling mobility and predicting locations of wireless users. In Intl. Workshop on Multi-hop ad hoc networks, 2006.
[23] J. Hall, M. Barbeau, and E. Kranakis. Detection of transient in radio frequency fingerprinting using signal phase. In Intl. Conf. Wireless and Optical Communications, 2003.
[24] X. Hong, M. Gerla, G. Pei, and C. Chiang. A group mobility model for ad hoc wireless networks. In ACM Intl. Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 1999.
[25] N. Husted and S. Myers. Mobile location tracking in metro areas: Malnets and others. In ACM Conf. Computer and Communication Security, 2010.
[26] A. Jardosh, E. Belding-Royer, K. Almeroth, and S. Suri. Towards realistic mobility models for mobile ad hoc networks. In Intl. Conf. Mobile Computing and Networking, 2003.
[27] M. Kim, D. Kotz, and S. Kim. Extracting a mobility model from real user traces. In IEEE Infocom, 2006.
[28] T. Kohno, A. Broido, and K. Claffy. Remote physical device fingerprinting. In IEEE Symp. Security and Privacy, 2005.
[29] B. Krishnamurthy and C. E. Wills. Characterizing privacy in online social networks. In ACM Workshop on Online Social Networks, 2008.
[30] B. Krishnamurthy and C. E. Wills. Privacy leakage in mobile online social networks. In USENIX Conf. Online Social Networks, 2010.
[31] J. R. Mayer. “Any person... a pamphleteer”: Internet anonymity in the age of Web 2.0. Senior Thesis, Stanford University, 2009.
[32] J. Pang, B. Greenstein, R. Gummadi, S. Seshan, and D. Wetherall. 802.11 user fingerprinting. In Intl. Conf. Mobile Computing and Networking, 2007.
[33] A. Pitsillidis, Y. Xie, F. Yu, M. Abadi, G. Voelker, and S. Savage. How to tell an airport from a home: Techniques and applications. In ACM Workshop on Hot Topics in Networks, 2010.
[34] K. Rasmussen and S. Capkun. Implications of radio fingerprinting on the security of sensor networks. In Intl. Conf. Security and Privacy in Comm. Networks, 2007.
[35] K. Simler, S. Czerwinski, and A. Joseph. Analysis of wide area user mobility patterns. In IEEE Workshop on Mobile Computing Systems and Applications, 2004.
[36] A. Soltani, S. Canty, Q. Mayo, L. Thomas, and C. Hoofnagle. Flash cookies and privacy. SSRN preprint, 2009.
[37] P. Syverson, D. Goldschlag, and M. Reed. Anonymous connections and onion routing. In IEEE Symp. Security and Privacy, 1997.
[38] Y. Xie, F. Yu, and M. Abadi. De-anonymizing the internet using unreliable IDs. In ACM SIGCOMM, 2009.

Appendix

A Tracking Stable Hosts

In the presence of NATs, proxies, and dynamic IP addresses, the mapping between a host and an IP address can be extremely volatile. Service providers that are interested in fingerprinting stable hosts may trade coverage for accuracy. We show that the binding window length can serve as a confidence measure for this purpose.

Intuitively, stable and active hosts should have longer binding windows that make them easier to track than hosts that appear infrequently or that change IP addresses often. Indeed, using UA+IP as an example, Figure 10(a) shows the increase in precision and recall with longer binding windows.

However, as we impose increasingly strict requirements on the binding window length, the percentage of fingerprints remaining decreases roughly proportionally, as shown in Figure 10(b). Half of the fingerprints have binding windows no longer than one week. We can thus explore a tradeoff between accuracy and coverage of tracking hosts using the binding window length as an adjustable parameter. In particular, Figure 10 suggests that using a binding window length of five days in practice can achieve both high precision and recall without losing significant coverage.

[Figure 10 plots: (a) precision and recall, as a percentage, against binding window length in days; (b) percentage of fingerprints against binding window length in days.]

Figure 10. Binding length and accuracy tradeoff.

B Non-returning Users

For those 101,427 “one-time” non-returning users that were observed only on the first day of the Search dataset, we are interested in whether this is because they stopped using the service or because they cannot be tracked. We expect users who leave the service to be less engaged than returning users. To test this hypothesis, we examine the average number of queries submitted by each CID and the percentage of CIDs that have clicked on the query results. We compare these two statistics between the set of returning users and the set of non-returning users. We consider only churned new CIDs in this comparison. For example, if a returning user has queries associated first with CID1 and later with CID2, we consider the subset of queries that correspond to CID1 only, as they represent first-time user experience.

Table 11 shows that returning users indeed appear to
be more engaged in the service, generating more queries on average and being more likely to make clicks. Overall, 77.85% of the churned new CIDs that belong to returning users have clicks, while only 60.73% of the churned new CIDs from non-returning users have clicks.

                                     Non-returning users   Returning users
Average number of queries per CID    4.7                   7.0
Percentage of CIDs with clicks       60.73%                77.85%

Table 11. The query and click behaviors of returning and non-returning users from the first day of the log.
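The two engagement statistics compared in Table 11 can be computed with a few lines of Python. The sketch below is illustrative only: the input layout (one (CID, clicked) pair per search query) and the name engagement_stats are assumptions, and the toy inputs are not drawn from the Search dataset.

```python
from collections import defaultdict

def engagement_stats(queries):
    """Compute per-CID engagement for one group of users.

    queries: iterable of (cid, clicked) pairs, one per search query
             (hypothetical schema; clicked is a bool).
    Returns (average number of queries per CID,
             percentage of CIDs with at least one clicked query).
    """
    n_queries = defaultdict(int)
    has_click = defaultdict(bool)
    for cid, clicked in queries:
        n_queries[cid] += 1
        has_click[cid] = has_click[cid] or clicked
    if not n_queries:
        return 0.0, 0.0
    avg_queries = sum(n_queries.values()) / len(n_queries)
    pct_clicked = 100.0 * sum(has_click.values()) / len(n_queries)
    return avg_queries, pct_clicked

# Toy inputs: churned new CIDs of returning and non-returning users.
returning = [("r1", True), ("r1", False), ("r1", True), ("r2", True)]
non_returning = [("n1", False), ("n2", True), ("n2", False)]

returning_stats = engagement_stats(returning)
non_returning_stats = engagement_stats(non_returning)
```

Running the same function on each group separately mirrors the side-by-side comparison in the table.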
We further examine, for each CID, the percentage of search queries that resulted in clicks. For CIDs that belong to returning users, Figure 11 shows a larger percentage of queries have clicks than CIDs that belong to non-returning users. Half of the CIDs associated with returning users have clicks on 80% of their queries, while half of those associated with non-returning users have them on less than 50%.

[Figure 11 plot: percentage of churned cookie IDs (Y-axis) against percentage of queries clicked (X-axis), with one curve for returning users and one for not returning users.]

Figure 11. Cumulative distribution of the fraction of queries per CID that resulted in clicks.

Another question of interest is whether users stop using the service because they are less active and have infrequent online activities. To quantify the degree of activity of the non-returning users, we measure the time interval and the number of login events between the last Bing search query and the last Hotmail login event that fall within the host’s binding windows, shown in Figure 12. We find that though users in our data may have left the search service, many of them have continued online activities. More than 80% of these users are active even after 25 days (Figure 12(a)), and around 60% of them logged in more than 40 times (Figure 12(b)).

Figure 12. (a) The time between the last Bing search query and last Hotmail login. (b) The number of Hotmail logins after the last Bing search query.
