Resident Evil: Understanding Residential IP Proxy As A Dark Service
Xianghang Mi∗ , Xuan Feng∗ , Xiaojing Liao∗ , Baojun Liu† ,
XiaoFeng Wang∗ , Feng Qian∗ , Zhou Li‡ , Sumayah Alrwais§ , Limin Sun¶ , Ying Liu†
∗ Indiana University Bloomington, † Tsinghua University, ‡ IEEE Member,
§ King Saud University, ¶ Institute of Information Engineering, CAS
∗ {xmi, xf1, xliao, xw7, fengqian}@indiana.edu, † [email protected],
Abstract—An emerging Internet business is residential proxy (RESIP) as a service, in which a provider utilizes the hosts within residential networks (in contrast to those running in a datacenter) to relay their customers’ traffic, in an attempt to avoid server-side blocking and detection. With the prominent roles the services could play in the underground business world, little has been done to understand whether they are indeed involved in Cybercrimes and how they operate, due to the challenges in identifying their RESIPs, not to mention any in-depth analysis on them.

In this paper, we report the first study on RESIPs, which sheds light on the behaviors and the ecosystem of these elusive gray services. Our research employed an infiltration framework, including our clients for RESIP services and the servers they visited, to detect 6 million RESIP IPs across 230+ countries and 52K+ ISPs. The observed addresses were analyzed and the hosts behind them were further fingerprinted using a new profiling system. Our effort led to several surprising findings about the RESIP services unknown before. Surprisingly, despite the providers’ claim that the proxy hosts are willingly joined, many proxies run on likely compromised hosts including IoT devices. Through cross-matching the hosts we discovered and labeled PUP (potentially unwanted programs) logs provided by a leading IT company, we uncovered various illicit operations RESIP hosts performed, including illegal promotion, Fast fluxing, phishing, malware hosting, and others. We also reverse engineered RESIP services’ internal infrastructures, uncovered their potential rebranding and reselling behaviors. Our research takes the first step toward understanding this new Internet service, contributing to the effective control of their security risks.

I. INTRODUCTION

In October 2016, a spree of massive distributed denial-of-service (DDoS) attacks temporarily brought down the Domain Name System (DNS) operated by Dyn, a leading DNS provider, causing major Internet platforms and services (such as Amazon, Netflix, Paypal, Twitter et al.) to be unavailable across Europe and North America. What is remarkable about this attack is that the traffic observed was found to originate from 65,000 infected residential hosts, including home routers, web cameras, and digital video recorders [55]. Not only did these hosts jointly produce an overwhelming volume at 600 Gbps, one of the largest on record, but their residential IP addresses made the attack requests they issued less differentiable from legitimate ones, and therefore hard to detect and block by the victim.

Residential IP Proxy as a Service. Recent years have witnessed increasing demands for such residential IPs (those belonging to ISP’s dynamically assigned IPs, particularly to home owners) as intermediaries to circumvent the restrictions imposed by target services, for the purposes such as aggressive resource access (e.g., registering multiple accounts), data scraping, and others. This emerging market gives rise to a new service model we call Residential IP Proxy as a Service (RPaaS), offered by companies like Luminati [3], StormProxies [49], Microleaves [38], etc. These providers all control a large number of residential hosts, which they claim joined their services willingly, to proxy their customers’ communication with any Internet target. Once abused, these residential proxies can outperform conventional public proxies or even anonymity networks to help their clients masquerade as clean and benign sources to communicate with the targets. Such communication may violate the target’s service terms at the very least (e.g., data scraping, blackhat Search Engine Optimization (SEO)) and is likely associated with more sinister events such as the aforementioned DDoS, due to the permissiveness of the RPaaS providers in terms of what can be done through their proxies.

With their importance to the illicit activities, residential proxy (RESIP) services, however, are still less understood. One may ask whether these services indeed use residential hosts as they claim, and if so, how they recruit these hosts, and whether they are involved in malicious activities. Also unclear are their infrastructures and ecosystems, particularly the ways they promote, operate their businesses and also work with each other. Answers to these questions are critical for determining the role these services play in Cybercrimes, which could potentially help identify an effective way to mitigate the threats we are facing today, for example, through controlling accesses to these services.

Our study. Understanding RESIP service is by no means trivial. Unlike open proxies, which can be easily found online, RESIP IPs are not publicized directly and can only be reached through the mediation of a RESIP provider. Even given a proxy’s IPs, no existing techniques can tell us whether they are indeed residential, not to mention finding out whether their hosts are indeed willing participants or just controlled bots. Even more challenging is to determine whether these proxies are malicious and to understand their illicit activities, since all we can observe are just dynamic IPs shared by a set of hosts. As a result, the traffic associated with the IPs describes those
hosts’ collective activities and it is less clear how to separate the good behaviors (when the IP is assigned to a legitimate host) from the bad ones (when it is given to a compromised host). Further without observing the internal operations of a RESIP service, understanding its infrastructure and connections with other services is difficult.

In our research, we addressed these challenges with a suite of innovative techniques, which enabled us to perform a large-scale study, first of its kind, to understand the way RESIP service is utilized for illicit purposes. Our study was based upon a novel framework for automatic discovery of RESIP IPs from related services. More specifically, we first purchased the services from commercial RESIP service providers and ran a set of clients to communicate with our web servers through these services. Traffic in the communication was carefully marked with unique sub-domains and other parameters to help the servers identify the IPs of the RESIPs, to enable our DNS system to find the DNS resolvers, and to ensure the proxied traffic of RESIPs is captured (§IV-C). The IPs found in this way were further analyzed to extract a set of unique Whois and DNS features for determining whether they are indeed residential. Further these IPs were probed by a novel, high-performance host profiling system that concurrently fingerprints the hosts behind millions of IPs, both from the clients and the servers under our control. Our fingerprinting technique ensures that the target of our analysis is always the RESIP, despite its highly fluctuating IP and a potential NAT box standing in the way of a direct profiling. Also we used a set of potentially unwanted programs (PUP) and their traffic logs obtained from a major security company to correlate our clients’ traffic with these PUPs’ activities, leading to the discovery of the RESIP’s illicit operations and their providers’ hidden infrastructural components.

Findings. Using our framework, we analyzed 5 leading RESIP providers including Luminati [3], Proxies Online [5], Geosurf [1], IAPS Security [2] and ProxyRack [6], from which we found 6.18 million unique IPs in a 4-month span. As a result, we were able to conduct the first study on RESIP service. Our analysis reveals the abused RESIPs as attack intermediaries as well as illicit and collusive RESIP service providers. Our key findings are as follows.

• Our discovered RESIPs are distributed across 238 countries and regions, 28,035 /16 network prefixes and 52,905 ISPs. A vast majority of them (95.22%) are believed to be indeed residential and very few of them (2.20%) are reported by blacklists or emerging threat intelligence platforms.
• We discovered the presence of likely compromised hosts as RESIPs, among which, 237,029 IoT devices and 4,141 RESIP hosts running PUP programs were identified, although RESIP service providers typically claim that their proxies are all common users willingly joining their networks. In fact, none of the 5 RESIP providers is a completely consent-based anonymity system and even the most prominent companies like Luminati were found to use suspiciously compromised residential hosts.
• We identified 67 different programs running as RESIPs. Among them, 50 are reported as malicious by anti-virus tools.
• Unlike the bots as reported in prior studies [65], even the RESIPs running PUPs, as discovered in our research, exhibit very different behavior in terms of their traffic patterns, indicating new challenges in detecting them.
• We found the traffic relayed by RESIPs involves ad clicking, promotion, or malicious activities. 9.36% traffic destinations were detected as malicious by popular detection engines. Also surprisingly, we observed other monetizing services also running on the hosts of RESIPs. Examples include Fast fluxing and malicious content services.
• We observed some RESIP providers likely reselling services to (or at least sharing RESIP pools with) other providers. For example, our infiltration traffic from the IAPS proxies was actually relayed by Hola clients controlled by Luminati. We found that unlike Luminati, IAPS conducts no background check and accepts bitcoin payments. Malicious IAPS users might thus be able to abuse Luminati’s network or even to cause denial-of-service for legitimate Luminati customers.
• We identified hidden backend gateways in the RESIP service infrastructure, which decouple the clients and RESIPs in their infrastructure to make illicit activities of the RESIP service stealthier: some backend gateways were labeled as malicious sites and were dropped by the providers, while all of the frontend gateways were clean and enjoyed a long lifetime.

Contributions. The contributions of the paper are as follows.

• New findings. Our findings revealed the infrastructure, scale, malice, and stealthiness of RESIP services. They highlight the security implications of this emerging service and the urgency to regulate its market.
• New methodologies. We designed novel techniques for finding RESIPs, profiling their behaviors, and analyzing the providers. They can be integrated into a holistic system for monitoring RESIPs and detecting/preventing its malicious activities.

II. BACKGROUND

Residential proxy. Residential IP proxy services are a thriving business today. During our study in 2017, we continued to witness the emergence of new RESIP services and a boom in existing businesses: e.g., Proxies Online [5], the first RESIP service we found, has increased their price from $3/GB to $25/GB in 6 months. Like traditional proxy services such as virtual private network (VPN), anonymity networks, and public HTTP/SOCKS proxies, RESIP service is promoted as an anonymity channel, but also characterized by its resilience against server-side detection and blocking. More specifically, residential IPs are often more trusted by the server than those from a data center [4]. Also, they tend to be dynamic, with RESIP services usually running in a back-connect proxy mode, making malicious clients nimble and capable of quickly migrating to other IPs when detected.

Figure 1 illustrates the RESIP service model discussed in the prior works [58], [59], which involves three parties interacting with each other: the main service component including a proxy
gateway and residential hosts, the client, and the server to be visited (the target). Once a client signs up with a RESIP service, it receives a gateway’s IP address or URL for accessing the service. During the communication, the gateway forwards the client’s requests to different residential hosts, which further send them to the target and get responses back. Figure 1 describes what can be observed from the outside, from the client and target’s perspective. The inside view, however, can be more complicated, as discovered later in §V-B.

Fig. 1: The RESIP service from an outsider’s perspective.

There are many RESIP providers on the market, such as Luminati and Geosurf. They offer a variety of service plans with different levels of flexibilities, which can be leveraged to launch cyber attacks. For example, the client is given three different ways to determine how proxies are chosen, based upon whether the gateway attempts to use the same RESIP to send multiple requests to the target: sticky (S), non-sticky (NS), and half-sticky (HS). A sticky gateway always tries to use the same RESIP for communication whenever it can, and when it has to give up on the proxy (when the RESIP gets off-line), the gateway attempts to switch to the next one. The client can also specify the “sticky time”, e.g., changing to a different RESIP after 1 minute. In the non-sticky model, the gateway changes RESIP each time after a request is forwarded. The half-sticky service allows the client to switch between the S and the NS models by adjusting parameters (e.g., a session ID) during the communication. Another service option is to decide where the domain name of the target to be resolved, by the RESIP or the gateway. This is important since the resolver can be observed by the target’s DNS server and may need to be covered under some circumstances. As an example, the RESIP provider Luminati allows its client to move the DNS resolving to the RESIP by using the -dns-remote parameter.

IP Whois Database. The Internet Assigned Numbers Authority (IANA) allocates IP addresses in large chunks to one of five Regional Internet Registries (RIRs), including ARIN, APNIC, AFRINIC, LACNIC and RIPE. Each RIR operates a Whois directory service to manage the registration of IP addresses in their regions (e.g., Europe region for RIPE). A Whois directory is organized in an object-oriented way, containing four types of objects with each assigned a unique ID: inetnum, person, organization, and ASN. Here an inetnum object describes an IP address range and all its attributes; organization and person objects are used to represent the ownership of IP blocks with a set of attributes like email addresses; and ASN identifies the autonomous system an IP address belongs to. All inetnums are created in a hierarchical manner and therefore form an inetnum tree. Given an IP, we define its direct inetnum as the leaf inetnum object whose IP range covers that IP, its direct owner as the organization and person objects associated with its direct inetnum, and its loose owner as all organizations and persons who share the same contact information as the direct owner. In our research, we collected the IP Whois databases from all 5 RIRs everyday since December 2015 using their RDAP and bulk access APIs [40], [46], [23], [24], [45], [44]. Those historical IP Whois databases were used to generate features for our residential IP classifier (§III-B).

III. METHODOLOGY AND DATASET

As shown in Figure 2, the methodology behind our study on RESIP consists of three important parts: an infiltration framework (§III-A) for gaining insider’s views of RESIP services, a classifier (§III-B) for identifying residential IPs, and a host profiling system (§III-C) for fingerprinting the proxy hosts. We elaborate them as follows.

Fig. 2: Our methodology framework. (Panels: The Infiltration Framework; Residential IP Classifier; Host Profiling.)

A. Infiltration Framework

Our infiltration framework includes a client, which is a web crawler sending labeled requests through a RESIP service to its target site, a target server, which is a website receiving the client’s requests forwarded by RESIPs, and our own authoritative DNS server, which is utilized to find out whether DNS resolving happens on the RESIP hosts or on the gateway, and further discover these resolvers. This framework is also illustrated in Figure 2.

We found 17 RESIP services either through search engines or from Blackhat SEO forums [31]. Among them, 5 (Table I) were picked out based upon their claimed scale (> 100K IPs), service models (SOCKS or not, pay by month or traffic, etc.), popularity (heavily promoted online), and the time they were discovered (earliest ones). All 5 services support relaying HTTP/HTTPS traffic and ProxyRack also supports SOCKS4 and SOCKS5 protocols. We then purchased those five RESIP services, and ran our crawler to periodically visit our server with pre-registered domains through these services. Our server recorded each labeled request and extracted its source IP, which was considered to be the address of the RESIP provided by the service. For this purpose, each request produced by our crawler was labeled to avoid recording the requests from other parties, since they may not carry RESIP IPs (e.g., Man in the Middle players record our traffic and replay it). Also, this approach forces the RESIP to query our DNS server, exposing its resolver.

In our framework, a client sends requests to specially crafted subdomains (as part of the HTTP request URL) with the following pattern: uuid.timestamp.providerId.gwId.raap-xx.site, where uuid is a dynamically generated UUID, timestamp is the client’s current Unix timestamp, providerId uniquely identifies
the RESIP service provider, gwId represents the type of the proxy gateway (S, NS or HS) and raap-xx.site represents a set of domains registered for our website, with xx describing various geo-locations (us, eu, etc.). In this way, each request targets at a unique subdomain. Moreover, such crafted requests, once being proxied by the RESIP device, became more likely to be captured by our industry partner’s anomalous traffic gathering module (data collected by the module elaborated in §III-D) due to their newly registered domains carrying the patterns produced by DGA (Domain Generation Algorithms). Through such collected data, we were able to locate the RESIP devices and analyze the traffic they proxied (See §IV-C).

Upon receiving a DNS query for such a domain, our DNS server employed a regular expression to check the pattern of the subdomain, and if correct, resolved it to the IP addresses of our controlled servers. In this way, for each successful request, three log records were generated by the entities under our control: the client (our crawler), the target server, and the DNS server as illustrated in Figure 2. Here the client recorded the labeled request URL, the target server kept the RESIP’s IP, and also the DNS server logged the RESIP’s DNS resolver. Correlating those logs provides us a comprehensive view of a RESIP’s operations, and can also help discover related traffic traces from other sources when they were captured by network monitors (see §IV). As shown in Table I, all RESIP services except Luminati resolve domain names on RESIPs rather than gateways while Luminati can do this on either site through configuration. We came to this conclusion since our DNS server received queries issued by over 82K DNS resolvers from these RESIP services in our study.

TABLE I: RESIP services purchasing details. HS: half-sticky; S: sticky; NS: non-sticky; R: RESIP; G: gateway.
Provider        Price        Payment  Date(s)      Gateway  DNS
Proxies Online  $25/Gb       Paypal   07/06-11/24  HS       R
Geosurf         $300/month   Paypal   09/17-10/22  S/HS     R
ProxyRack       $40/month    Bitcoin  09/18-11/24  S/NS     R
Luminati        $500/month   Paypal   09/25-11/01  HS       R/G
IAPS Security   $500/month   Bitcoin  09/23-11/01  HS       R

During our study, we carefully designed our methodology to ensure that our infiltration and profiling are less detectable by the RESIP services. For this purpose, we deployed multiple crawlers and target servers on Amazon EC2 instances and Aliyun instances located in European, US, South America, Singapore and China, to generate traffic from diverse sources. Further, we used AES-CBC with a 128-bit key to encrypt all traffic between our crawlers and the targets, to prevent potential content inspection. Another implementation issue is the presence of multiple gateways and the different models they are running (S, HS and NS; see §II and Table I). For example, GeoSurf and ProxyRack all run sticky gateways; as a result, our server would not see any new proxy host during a given period of time (1 to 10 minutes); therefore our crawler was implemented to only request once for a while, depending on the sticky time given by the service. For the providers with non-sticky and half-sticky gateways, our implementation took different strategies to generate requests. When there were multiple gateways, we chose a different one for each request in order to reduce redundant requests and cover more RESIPs. Besides, in case RESIP services assigned different gateways to different users, we registered for each service at least two distinct user accounts and found that each account was always linked to the same set of gateways.

Result and evaluation. In total, we ran up to 20 daily crawling jobs, each producing about 50,000 requests, from Jun. 06 to Nov. 24 2017. Our study captured 6,183,876 different RESIP IPs by issuing 62 million requests. Before Sep. 15, we only ran 2 crawling jobs on a single service, Proxies Online. Then starting from Sep. 17, we gradually purchased at least one-month service from all 5 RESIP providers and ran up to 20 crawling jobs daily using 200+ threads to collect RESIP information from all of them. After one month, we have gathered enough RESIPs from Luminati. Meanwhile, our measurement results revealed that IAPS Security was just a reseller of Luminati’s service, and Geosurf and Proxies Online actually share the same infrastructure. Given the above findings, we then stopped crawling the expensive providers, including IAPS Security, Geosurf, and Luminati, but still kept the jobs on Proxies Online and ProxyRack until Nov. 24. Overall, we spent $2800 in purchasing and infiltrating those services.

B. Residential IP Classifier

While RESIP service providers claim to utilize residential hosts for relaying their customers’ traffic, little is known about whether the proxies they use are indeed located in residential networks. Determining whether an IP is residential can be complicated, particularly when the same ISP can also allocate IP blocks to data centers. Although some commercial service (e.g., Maxmind GeoIP2 Precision Insights Service [33]) allows queries on IP’s labels such as residential or cellular for a fee (e.g., $50 for 25K IPs), it cannot scale to a large number of queries (6.2M in our research) and its methodologies are not open (so less known about their reliability). So in our research, we built a new classifier on top of a set of features that characterize residential IPs. Following we elaborate the technique, particularly, our approaches to collect clean ground truth, select robust features, and train and evaluate the classifier.

TABLE II: Datasets for training and testing the residential IP classifier.
Source                Label           # IPs       # /16   # /8  Training
Manual                resi-clean      79          25      19    79
Device Search Engine  resi-clean      89,345      13,525  195   9,921
Trace My IP           resi-noisy      37,480      11,402  213   0
Filtered IP Whois     resi-noisy      23,264,961  394     31    0
IoT Botnets           resi-noisy      1,699,291   20,112  200   0
Public Clouds         non-resi-clean  53,716,321  968     99    5,000
Alexa Top1M           non-resi-clean  442,989     14,365  213   4,481
Commercial Proxies    non-resi-clean  519         71      44    519
Public Proxies        non-resi-noisy  148,509     14,004  204   0

Finding groundtruth. Finding clean labeled residential IPs is challenging due to the absence of public data and the dynamic IP allocation performed by ISPs. To address this issue, we came up with a series of robust methodologies to obtain 4 labeled datasets: residential-clean (resi-clean), non-residential-clean (non-resi-clean), residential-noisy (resi-noisy), and non-residential-noisy (non-resi-noisy). Such groundtruth is summarized in Table II.
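To make the request-labeling scheme of the infiltration framework (§III-A) concrete, the following minimal Python sketch builds a hostname of the form uuid.timestamp.providerId.gwId.raap-xx.site and validates it with a regular expression, as a DNS server might do before answering a query. The function names, the exact regular expression, the lowercase gateway codes (s, ns, hs) and the provider ID format are illustrative assumptions; the paper does not give the actual implementation.

```python
import re
import time
import uuid

# Hypothetical pattern for uuid.timestamp.providerId.gwId.raap-xx.site.
# The field formats below are assumptions made for this sketch.
LABEL_RE = re.compile(
    r"^(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"\.(?P<timestamp>\d{10})"        # Unix timestamp in seconds
    r"\.(?P<provider>[a-z0-9]+)"      # providerId (assumed alphanumeric)
    r"\.(?P<gateway>s|ns|hs)"         # gwId: sticky, non-sticky, half-sticky
    r"\.raap-(?P<region>[a-z]{2})\.site\.?$"
)


def make_label(provider_id, gw_type, region, now=None):
    """Build a unique labeled hostname for one crawler request."""
    ts = int(time.time() if now is None else now)
    return f"{uuid.uuid4()}.{ts}.{provider_id}.{gw_type}.raap-{region}.site"


def parse_label(qname):
    """Return the label fields as a dict, or None if the name is not ours.

    The name is lowercased first because DNS names are case-insensitive;
    a single trailing dot (fully qualified form) is accepted.
    """
    match = LABEL_RE.match(qname.lower())
    return match.groupdict() if match else None


if __name__ == "__main__":
    name = make_label("p1", "hs", "us")
    print(name)
    print(parse_label(name))
    print(parse_label("www.example.com"))  # not a labeled name
```

Under this sketch, a DNS server would answer only names for which parse_label returns a result, and the parsed fields could serve as the key for joining the client, web server, and DNS logs of one request.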
The resi-clean set contained 79 IPs of the personal devices under our control, which were connected to 11 ISPs in 3 countries for identifying these addresses. To find other “clean” IPs, we came up with an idea that leverages device search engines (e.g., Shodan [48], Zoomeye [52] etc.) to search for the network devices typically only utilized in residential environments. Examples include smart home systems such as Amazon Echo [27], Google Home [35], Philips Hue Lights [41], home-related gateways like residential ADSL gateway and broadband residential gateway, and others. A complete list of keywords used in such device queries is presented in Appendix IX-A. These queries return IPs for both devices discovered online and related applications. The former was added to our resi-clean dataset as groundtruth. In this way, we successfully harvested 89,345 residential IPs distributed across 13,525 /16 and 195 /8 network blocks. This data collection was done automatically, which we believe itself is a technical contribution.

We further applied several weaker heuristics to build the resi-noisy dataset. Despite being noisy, the dataset is still useful in validating our classifier. Specifically, its data comes from three sources. (1) We used the query logs of Trace My IP [51], an IP tracing service helping visitors to find their devices’ IPs. The IPs recorded by the logs were selected as potential residential IPs when the ISPs involved are known to be residential Internet service providers (e.g., AT&T and Comcast), queries are from the OSes for consumer devices (e.g., Android and IOS) and common browsers, and the IPs are not labeled as bot or spider. (2) We looked up the owner objects for the 79 clean residential IPs in the IP Whois dataset (see §II), and considered other IPs under those owner objects as residential IPs. This is because as a common practice, ISPs (such as AT&T) typically register the same set of owner objects to manage the IP blocks serving the same purposes. For example, AT&T registers the owner object ATTMO-3 [28] for AT&T Mobility LLC [29] to manage all IPs for mobile usage. (3) We also included the IPs detected from two emerging botnet campaigns Hajime [12] and IoT Reaper [13] that utilize compromised IoT devices (see §III-D), as home IoT devices are much more likely to be compromised than enterprise IoT devices. In total, the resi-noisy dataset contained 25,001,529 IPs.

The non-resi-clean data were collected from cloud providers, high-profile websites (Alexa top 1M websites), and commercial proxies (details in Appendix IX-A). We gathered 54,031,298 such IPs distributed across 14,610 /16 and 213 /8 network blocks. The non-resi-noisy dataset involved the IPs from publicly available proxies (e.g., Tor relays and public free proxies) as detailed in §III-D. The data is noisy since some such proxy services like Tor also recruit home servers to relay traffic [50]. This dataset included 148,509 IPs in 14,004 /16 and 204 /8 networks.

From the above datasets, we built a labeled set with 10K residential IPs and 10K non-residential IPs randomly sampled from resi-clean and non-resi-clean, respectively (see Table II). They were used in feature evaluation and classifier training while the rest datasets were applied to evaluate our classifier.

Feature selection and extraction. We selected a set of unique features to train a classifier to identify residential IPs. Unlike non-residential IPs, residential IPs are typically directly assigned and managed by an ISP (instead of being re-assigned to a business) [66]. Also, ISPs tend to reserve stable IP blocks (belonging to the same inetnum) for home users, while the network blocks given to the business could be more volatile, changing hands over multiple owners during a given period of time [66]. Furthermore, non-residential IPs are more likely to host web services. For example, among 442,989 IPs for the Alexa Top 1M domains, 29% (128,531) are found in our Public Cloud dataset while only 0.01% (36) are also in our resi-clean dataset. Based upon such observations, we leveraged a total 35 features related to IP Whois records or Active DNS records to capture residential IPs’ characteristics. Due to the space limit, we here just elaborate some of them and the rest is presented in Appendix IX-A.

• An Active DNS feature. As an example, the connection between non-residential IPs and web services can be captured by the average number of TLD+3 domains per IP in the direct inetnum (§II). Intuitively, this feature describes the number of domains hosted in the direct inetnum of this IP, which were found from Active DNS dataset [68]. Our evaluation on the labeled set shows that non-residential IPs have 5.49 as the average feature value while residential IPs only have 0.016.
• IP Whois features. We also used phone numbers and email addresses to identify the owners of the inetnum for an IP, and discovered that residential IPs tend to have much more inetnum objects (3,536 on average) than non-residential IPs (1,482 on average). This could happen when the ISP assigns large chunks of continuous IPs to their organizational users. Additionally, we designed the features to profile the size and stability of the direct inetnum of a given IP. Specifically, we retrieved the IP’s historical direct inetnums from 24 IP whois snapshots in the last 2 years, and identified their sizes, depths on the inetnum tree, and further calculated the variations of these parameters to capture their changes in the past 24 months. We observed that 70% of the residential IPs have a size (of historical direct inetnums) below 10^5, while 58% of non-residential IPs have a size above 10^5. Also residential IPs are much more stable in their depths on the inetnum tree, with a variation below 0.16.

Evaluation and results. Over 10K residential IPs and 10K non-residential IPs, we trained a Random Forest (RF) classifier, which achieved an excellent performance in a 5-Fold cross validation (precision of 95.61% and recall of 97.12%). We further evaluated the model over the four labeled datasets as well as the unlabeled dataset (6.2M RESIP IPs we collected) with sampled manual validation. Our study shows that this model made the predictions in line with the natures of these sets (more leaning toward residential or non-residential IPs in the cases of the noisy datasets) and particularly on the unlabeled set, it achieved a precision of 95.80%. When applying the model on 6.2M RESIP IPs we collected, it detected 5.9M (95.22%) residential IPs and 0.3M (4.78%) non-residential IPs. More details about the evaluation process and results can be
found in Appendix IX-A.

Fig. 3: Host fingerprinting. (a) InsideFP vs. OutsideFP. (b) Host fingerprinter’s analysis pipeline.

C. Host Profiling

To further understand RESIPs, it is very important to profile their host devices in addition to their IPs. As mentioned earlier, residential IPs tend to be assigned in a dynamic manner. Thus, once a RESIP IP is captured, host profiling must be conducted and finished before the RESIP host has moved to another IP; otherwise, the result will be invalid. To achieve this, we designed a real-time profiling system that can simultaneously fingerprint newly captured RESIP hosts, measure their relaying time (periods when serving as RESIPs), and detect when they go offline (stop serving as RESIPs) or their IPs change. As illustrated in Figure 2, the system consists of three modules: a host fingerprinter, an IP liveness checker and a relaying time profiler, which work on a given RESIP simultaneously.

In a nutshell, the host fingerprinter will compose and send various probes to a given RESIP IP on commonly opened TCP/UDP ports including 80 for HTTP, 22 for SSH, 23 for Telnet, 443 for HTTPS, 554 for RTSP and 5000 for UPnP. Once responses are received and banners grabbed, the Nmap service detection probe list [16] will be applied to identify device type and vendor information.

This process turns out to be more complicated than it appears to be. A challenge comes from the fact that an IP can be frequently re-assigned to different hosts, often not the RESIP we are interested in. To address this problem, our profiling system immediately started fingerprinting an IP address after it was observed by our web server. This was further confirmed, in the presence of both sticky and half-sticky gateways, through sending another request right after the banners were grabbed: if the same IP was seen by our server again, we were confident that the banner belonged to the same RESIP. We call this process “outside fingerprinting” (outsideFP) as the probing targets the RESIP IP from the outside. Another issue is caused by the presence of a private network the RESIP host often stays in, so a probe to its public IP only gets to the gateway NATs and may not reach the actual RESIP host. Our solution is based upon the observation that many RESIP providers do not inspect the target IP that the client visits, which allows our client to probe the proxy’s loopback address 127.0.0.1 through its connection with the gateway. Our study found that 3 out of the 5 RESIP service providers (Proxies Online, Geosurf and ProxyRack) let this “inside fingerprinting” (insideFP) go through. Note that both inside and outside fingerprinting require the RESIP service running with the sticky or half-sticky gateway. Figure 3(a) illustrates these fingerprinting processes, with IoT devices (printer) being RESIPs in the private network.

To achieve a high performance when profiling a large number of IPs, our system will not conduct insideFP for a RESIP unless its outsideFP result reveals a router/NAT. This is because insideFP has a larger request latency than outsideFP, and is constrained by the rate limitation from RESIP service providers. If insideFP and outsideFP cannot reach a consensus, we regard insideFP’s result as final: e.g., a RESIP was considered to be a printer when its insideFP revealed the printer and outsideFP showed a NAT. We outline the host fingerprinter’s analysis pipeline in Figure 3(b).

The IP liveness checker and the relay profiler scanned a given IP every 30 seconds. The former simply “pinged” the IP through typical TCP and UDP ports to find out periods when the IP was online. The latter sent “heartbeat” requests via a connected RESIP gateway to our web servers to measure the relaying time of a given RESIP IP. This information also helped us improve the accuracy of RESIP fingerprinting: we consider the fingerprinting result as valid only when the relaying time of a given RESIP covers the fingerprinting period.

Evaluation and results. Running on an Amazon EC2 instance with a bandwidth of 60 Mbps, 1GB memory and one-core CPU at 2.40GHz, our system was capable of profiling 800K IPs/h, with each IP being fingerprinted in 63.57 seconds. In total, our profiling system acquired banners from 728,528 (11.78% out of 6.2 million) IPs and identified the device types and vendor information for 547,497 of them. Interestingly, 237,029 (43%) of these IPs turned out to belong to IoT devices such as web cameras, DVRs, and printers. Details of the study are in §IV-B.

D. Datasets

Our study leverages various data sources to characterize multiple dimensions of the RESIP ecosystem. Recall that by now, we have produced or used several datasets: our infiltration generated a large RESIP IP dataset (§III-A). To construct and evaluate our residential IP classifier, we collected several other datasets containing residential and non-residential IPs (§III-B); we also leveraged datasets of IP Whois and Active DNS for the classifier’s feature generation (§III-B). In our host profiling framework, the Nmap service detection probe list is applied to infer devices’ types (§III-C). We next elaborate on the other datasets used in our study. These datasets are jointly leveraged to characterize both individual RESIPs and RESIP services.

PUP traffic. We collaborated with our industry partner (one of the leading IT companies) to utilize the PUP traffic they gathered from their customers’ devices (under proper consent) from June 2017 to November 2017 for our RESIP analysis. The consent was given by the users who agreed to the terms of service when they installed our industry partner’s security software. The users can revoke this consent in the software settings. Each record in the dataset logged a suspicious traffic flow (inbound and outbound) associated with a PUP they detected. For each suspicious flow, the PUP’s MD5, device ID, timestamp, and the flow’s 5-tuple (src IP, src port, dest IP, dest port, transport-layer protocol) were recorded.
Passive DNS. Another dataset we utilized is Passive DNS from 360 Netlab [17], which enabled us to identify Fast flux activities on RESIP IPs, and reveal the hidden infrastructural components inside the RESIP services. Each of the records includes queried domain names, time periods, and their aggregated lookup volumes in the given time period.

IP geolocation. IP2Location DB8 [14] is a commercial IP geolocation database provided by IP2Location. Using this dataset, we retrieved the geolocation information (country, city, latitude, longitude, ISP) for given IPs.

Publicly available proxies. We also collected the IPs related to public network proxies, whose traffic can be easily blocked or degraded by the server-side protection [62]. Specifically, we treated Tor relays (both exit and middle relays) as network proxies and crawled their lists hourly from both the Tor official website [19] and a third-party provider dan.me.uk [20]. We used two different ways to collect publicly available proxies for HTTP/HTTPS/SOCKS4/SOCKS5. We purchased a service called KuaiDaili, which collects proxies from multiple popular proxy aggregators [7], and provides APIs for those still working to its users. In the meantime, we also crawled other popular proxy aggregators [11] [22] to get the working proxies KuaiDaili does not include. This dataset was further complemented using IP2Proxy LITE [15], a service that runs proprietary algorithms to detect the IPs serving VPN anonymizers, open proxies, web proxies and Tor exits.

Dark IPs. Also utilized in our research are popular IP blacklists for identifying RESIP-related malicious activities. Specifically, to track the potential relation between RESIPs and two emerging botnet campaigns, Hajime [12] and IoT Reaper [13], our industry partner ran a detector from Sep 15, 2017 to Nov 07, 2017 to gather bot IPs of these campaigns on a daily basis. Further, we collected 62 Spamhaus EDROP [18] records every day for the last two years. Also, APIs of three threat intelligence platforms were leveraged to retrieve IP indicators of compromise: VirusTotal [21], Cymon OTX [10] and AlienVault OTX [9]. Given the dynamic nature of RESIPs, we only focused on IP indicators whose timestamps are consistent with those of RESIP IPs we observed.

E. Discussion

Potential bias. Due to the challenges in comprehensively identifying RESIP hosts and analyzing their illicit behaviors, our study was based upon the data we were able to get (RESIP IPs observed by our system, hosts we could fingerprint and the PUP data available to us, etc.), which could bring in bias to the study. While we believe that as the first large-scale research on RESIP services, our study offers valuable insights into this new business, we are nevertheless cautious about the conclusions to be drawn. More specifically, the vantage points of our study were limited to five RESIP service providers. Also, from them, only about 10% (still more than 500K) of all the IPs we observed could be fingerprinted and analyzed. Further, our analysis on relayed traffic of RESIPs was based on the PUP traffic logs collected by our industry partner. Even though the PUP traffic logs were linked to 8,886 RESIP IPs (more than 5 million traffic traces) in our research, their coverage is clearly limited. Availability of more comprehensive datasets will certainly help better understand RESIPs and their security implications. In the meantime, note that the RESIP providers we studied are representative and we did find PUPs running behind the RESIP IPs we could not fingerprint. This indicates that some of our results could be applied more broadly, which however needs to be determined by future research.

Ethical issues. To conduct our study, we paid RESIP providers to access their services. During the study, we followed all their terms of service, and took great care to make sure that our study would not harm the owners of RESIP hosts by visiting just our own domains. Also, the users of our industry partner agreed to share related information in exchange for free services. Lastly, regarding our host profiling operations, we limited probing rates to avoid overheads incurred on the remote hosts. Also, we only report aggregated statistics to avoid identity leakage. All the studies were approved by our organization’s IRB.

IV. RESIDENTIAL IP PROXY

We here report a measurement study on the core component of the RESIP service: the residential IP proxy. We analyzed why these RESIPs were used, how they were recruited, and what they served.

A. Proxy Detection Evasion

IP source analysis. In total, we collected 6,183,876 unique RESIP IPs from the five RESIP service providers via the infiltration framework (see §III-A). Our study reveals that RESIP IPs are spread across the world, across 238 countries and regions, 28,035 /16 network prefixes and 52K+ ISPs. Overall, we found that the top 100 ISPs cover 57.4% of the RESIP IPs we discovered, with the ISP involving the most RESIP IPs being Turk Telekom (5.7%). Figure 4(a) illustrates the distribution of the RESIP IPs over countries, as determined by their geolocations. The number of RESIP IPs in each country is ranked and illustrated with various shades of darkness in the figure. As we can see here, most RESIP IPs are located in India (9.42%), followed by Turkey (8.64%) and Ukraine (6.42%).

Fig. 4: Global Distribution of RESIPs

As described in §III-B, we trained a classifier to identify residential IPs. Figure 5(a) illustrates the percentage of non-residential IPs in each RESIP service provider. Overall, 95.22%
of the collected RESIP IPs are indeed residential. Also, ProxyRack was found to have the highest fraction of non-residential IPs (8.82%). Such non-residential IPs tend to be re-assigned by small ISPs to hosting providers.

Fig. 5: Characterizing RESIPs. (a) % of non-residential, blacklisted, and published proxy IPs in RESIP services. (b) The CDF of the relaying time per RESIP. (c) Time lag of RESIPs between being blacklisted and being captured. (d) # of IoT devices observed from each RESIP service provider. In (a) and (d), PO: Proxies Online; GS: Geosurf; LU: Luminati; PR: ProxyRack.

We further explored the dynamics of RESIPs by examining their IPs’ relaying time (see §III-C), whose cumulative distributions are presented in Figure 5(b). As we can see from the figure, a significant portion (90%) of the RESIP IPs exhibit a short relaying time (870 seconds), which renders IP-blacklist-based defense on the server side less effective.

Blacklisting. We further checked whether these residential IPs were ever blacklisted, which would allow the target server to easily block them. In our study, we looked up these addresses on the IP blacklists introduced in §III-D. In total, we observed that 2.20% of RESIP IPs were reported by at least one blacklist. Figure 5(a) shows the percentage of blacklisted RESIP IPs in each service provider. We found that the portion of the blacklisted RESIP IPs is fairly small. Among these services, ProxyRack has the most blacklisted RESIP IPs (2.54%), followed by Luminati (2.32%) and Geosurf (1.73%). When analyzing the malicious activities they were involved in, we found that spamming and malicious website hosting were the two most reported malicious activities. Also of interest, we found that 1,248 RESIP IPs (see Appendix IX-B) served in two IoT botnet campaigns, Hajime [12] and IoT Reaper [13]. Figure 5(c) shows the cumulative distribution of the delay (in days) between when a RESIP IP was observed in our research and when it was blacklisted. We found that 11.57% of blacklisted RESIPs were captured by our infiltration framework before being blacklisted, so their lifetime could be (conservatively) estimated. The average delay we observed is 22 days, with the longest being 136 days.

Top 1-5          # RESIPs   %        Top 6-10           # RESIPs   %
Spam             8,299      36.55%   Malicious Sample   438        1.93%
Malicious URL    7,305      32.17%   Zombie             277        1.22%
Bruteforce       3,325      14.64%   Telnet             249        1.10%
Suspicious       629        2.77%    Trojan             171        0.75%
Dionaea          618        2.72%    EDROP              164        0.72%
TABLE III: Malicious activities related to RESIPs.

Unpublished proxies. When a RESIP IP is on public proxy lists such as the Tor relay list and public proxy aggregators, it can be easily blocked by the target server. To find out whether these proxies were published online, we inspected 4 proxy lists (see §III-D). The percentage of published RESIP IPs in each service provider is presented in Figure 5(a). In total, only 0.06% (3,767) of the 6.2 million RESIP IPs discovered in our research are among the 148,509 public proxies. Among all 5 providers we investigated, even the one with the most reported proxies, ProxyRack, has just 0.16% on these lists.

B. Proxy Recruitment

Volunteer recruitment. If RESIP services are recruiting volunteers, there must be related web pages and software stacks that are accessible to common users. For each service, we carefully went through their websites and read through search engine results for keywords such as luminati recruit, proxyrack volunteer, and geosurf software. Overall, only Luminati was found to explicitly recruit common users [36]. By joining Luminati’s network, users can get their traffic relayed by other members at the cost of proxying others’ traffic. To join the network, users need to install the Hola client [30], which has versions available for multiple platforms including mobile. For other services, we found no recruitment channels or software stacks.

Fingerprinting analysis. To further explore how RESIP services recruit proxies, we analyzed devices behind RESIPs through our real-time profiling system described in §III-C. Specifically, in our study, our profiling system acquired banners from 728,528 (11.78% out of 6.2 million) IPs observed, indicating that these were the hosts with some ports open for probing. Among these responding hosts, 547,497 of them returned device types identified together with their vendor information. Interestingly, 237,029 of them turned out to be IoT systems, such as web cameras, DVRs, and printers. Figure 5(d) presents the percentage of the IoT devices observed from each RESIP provider’s network. Luminati was found to have the most IoT devices (45%), followed by Proxies Online (33%) and ProxyRack (19%).

Table IV presents the top 10 device types and top 10 vendors for the RESIPs identified. We found that most of these RESIPs (69.32%) were profiled as routers, gateways, or WAPs. The manufacturers for most of the RESIP devices were MikroTik, Huawei, Technicolor, ZTE, and Dahua. Particularly, the device vendors MikroTik, Huawei, and BusyBox were associated with 59.93% of the IoT devices involved.

Note that the aforementioned result is a combination of both outside fingerprinting (outsideFP) and inside fingerprinting
(insideFP) results. As mentioned in §III-C, services including Geosurf, Proxies Online, and ProxyRack support insideFP for their sticky and half-sticky gateways. For RESIP IPs captured from those channels, insideFP was performed on a RESIP IP once its outsideFP revealed a NAT device (router, WAP, etc.). Overall, we ran insideFPs on 35,808 RESIP IPs, 12,497 responded to our probings, and 10,964 further had their associated devices identified. Among them, 5,981, which were found to relate to gateways by outsideFP, were considered to host non-gateway devices according to insideFP. One interesting point here is that although outsideFPs on those 35,808 RESIP IPs all received responses, only 12,497 replied to our insideFPs (using similar probings as outsideFP), indicating those unresponsive RESIP hosts may actually reside behind NAT devices. We therefore expect the actual proportion of non-gateway devices to be higher than that in Table IV.

Device Type        Num       (%)     Device Vendor   Num      (%)
router             114,768   48.42   MikroTik        86,593   36.53
firewall           25,088    10.58   Huawei          37,545   15.84
WAP                24,470    10.32   BusyBox         18,337   7.74
gateway            22,003    9.28    Technicolor     16,866   7.12
broadband router   17,358    7.32    SonicWALL       14,122   5.96
webcam             13,024    5.49    Fortinet        9,190    3.88
security-misc      10,608    4.48    Dahua           6,258    2.64
DVR                4,249     1.79    ZyXEL           5,601    2.36
media device       2,589     1.09    AVM             5,272    2.22
storage-misc       1,988     0.84    Cyberoam        4,558    1.92
TABLE IV: List of the top 10 device vendors and device types.

Also, conflicting devices could be found on the same RESIP IP, particularly during host re-profiling. Re-profiling happened rarely in our study, since we did not re-profile the same IP found within 15 days. Still, we observed 195 RESIP IPs hosting different devices, indicating that multiple RESIPs possibly share the same IP. Besides, even in a single fingerprinting, the banners grabbed from different ports associated with the same IP may reveal different devices. However, the scenario is very rare: only 1,083 RESIP IPs (0.20% out of 547,497) were found in our study. When this happened, we simply assigned the IP the most popular device identified when studying the distribution of the devices across IPs (Table IV).

One potential concern is the representativeness of our profiling results, as only 11.78% of RESIP IPs responded to our probings and overall 8.85% of RESIP IPs had their device information identified. However, as shown in previous studies [77] [63] [64] [61], such a low identification rate is quite common. For example, according to the latest large-scale probing conducted by CENSYS [43], among their probes on 0.37 billion alive IPs, only 50 million (13.5%) produced HTTP responses, 3 million (0.8%) produced TELNET responses, 10 million (2.7%) triggered FTP responses, and 13 million (3.5%) led to SSH responses, etc. Besides, as shown in Figure 4(b), RESIP IPs with devices identified are distributed globally in 215 countries and regions (16,516 /16 and 196 /8 networks). This also indicates that our host profiling results are representative.

In summary, our host profiling results indicate that rather than joining RESIP services willingly, at least some RESIP devices are likely “recruited” through stealthy compromise. On one hand, none of the five RESIP services except for Luminati provides software stacks for recruiting users. On the other hand, many IPs fingerprinted were found to host IoT devices. Although some devices like WAPs and routers may serve as the NAT front that covers other hosts behind the scene, others such as cameras, printers, DVRs and media devices, etc., are very unlikely to have been voluntarily joined to the services by their owners.

C. Proxy Traffic Analysis

Proxy traffic collection. In order to understand how the compromised RESIP devices operated, we leveraged the PUP traffic data (see §III-D) to find the illicit activities the PUP-hosting RESIP devices were involved in. Specifically, we first analyzed the traffic logs of these PUPs, searching for the domains (those the PUP communicated with) matching the pattern of our labeled infiltration traffic. As mentioned in §III-A, the packets sent by our client to our target web server through a RESIP service were constructed in a unique way: uuid.timestamp.providerId.gwId.raap-xx.site. This labeling approach ensures that even when all other payload content of these packets was discarded, we could still identify the communication as long as the target domains were recorded. This was exactly the case for the PUP traffic logging, which only kept the domains and a small amount of other information, including the time when the communication was observed. In our study, we correlated the PUP communication with our infiltration traffic based upon the matched one-time domain, their timestamps (within 1 minute), the log on the client side, which is supposed to record the request sent out, and the log on the server side, which should receive the request only once. These checks ensure that there would not be any false hit caused by, for example, traffic replay. In the end, we discovered from the PUP dataset 5,895 traffic records that accurately matched the records on our side. Those records cover 67 different PUPs. To better understand the 67 PUPs, we scanned their MD5s using VirusTotal and found that 50 of them were flagged by at least one anti-virus engine, and each PUP on average received 24.71 alarms. We then submitted these VirusTotal reports to AVClass [75] to get the PUPs’ families. In the end, 17 were labeled as cryptos, 10 as glupteba, and 5 as one of elex, bandit, zusy, wcryg and razy, and the families of the remaining PUPs were not identified.

Name              Providers   # IPs   # Devices
hola_svc.exe      LU, IAPS    2.7K    1.1K
csrss.exe         PR          241     126
svchostwork.exe   GS, PO      226     32
swufeb17.exe      PO          171     28
netmedia.exe      GS, PO      170     95
start.vbs         PO          76      1
cloudnet.exe      PR          55      42
hola_plugin.exe   LU          50      43
produpd.exe       PR          21      8
pprx.exe          PO          2       2
TABLE V: List of the top 10 PUPs with most infected RESIPs.

For all these 67 PUPs, we collected their traffic logs from June 2017 to Nov 2017: in total, 5 million of them covering 8,886 RESIP IPs and 4,141 devices. Table V presents 10 PUP examples from different RESIP providers. Their MD5s are included in Table XIII of Appendix IX. The 5 million PUP
traffic logs were further used in our traffic analysis (elaborated below). Note that the above numbers are only the “lower bounds” for the pervasiveness of PUPs across RESIP services, given the limited device accesses our industry partner has.

Surprisingly, we found that all 5 services studied in our research utilized PUPs to relay traffic: 33 for ProxyRack, 9 for Luminati, 24 for Proxies Online, 10 for Geosurf and 2 for IAPS Security. Particularly, our traffic from Proxies Online and Geosurf went through 9 shared PUPs, which together with other findings (see §V-B) indicates that these services are likely all affiliated with the same company. Also surprisingly, the proxy program used by Luminati, Hola, was marked as a PUP, and some of its samples (2 out of 9) were forwarding our infiltration traffic sent to a different RESIP provider, IAPS. This combined with further analysis in §V-B indicates that IAPS is very likely a reseller for Luminati’s RESIP service.

Traffic target analysis. Our access to the PUP traffic log helped us learn more about other illicit activities performed by RESIPs. Specifically, from the 5 million traffic logs of 67 PUPs, we extracted destination domains, URLs and IPs of their communication, as well as related traffic volume. Manual analysis of the top 1,000 destinations with the largest traffic volume shows most of them reside in the following 5 categories: ads (75%), search engines (8%), shopping (7%), malicious websites (5%) and social networks (2%). Among ads-related domains, the majority are affiliate networks such as tracking.sumatoad.com, click.howdoesin.net, www.alexacn.cc, and click.gowadogo.com. Others are dedicated to different ad services such as mobile advertising, in-app advertising, video advertising, and ad exchanges. Many of those ad domains are reported to install adware on users’ devices, such as ads.stickyadstv.com, counter.yadro.ru, and adskpak.com. The adware altered browser homepages and generated various forms of ads. Further, analysis of corresponding URLs of those domains shows that most of them are in the forms of ads provided by those domains. Examples include click.howdoesin.net, tracking.sumatoad.com/aff_c?, click.gowadogo.com/click? and proleadsmedia.afftrack.com/click?. We also observed many search queries sent to different search engines including Google Search, Bing Search, Baidu Search, and Yandex, and also visits to various shopping websites including amazon.com, ebay.com, sears.com and tmall.com. Given that those proxy services are rather expensive, with 1 GB costing at least $15, using them for daily shopping and online search does not seem to be reasonable. More likely, these activities were related to blackhat SEO or other online promotion operations. What is more, some websites such as lenzmx.com and csgob0t.online were found to be malicious in our manual analysis, in line with the results reported by VirusTotal.

Further, we found from the PUP logs traffic to known malicious domains. Specifically, 9.36% of the destination addresses were reported to be malicious by VirusTotal (68.92% are labeled as malware sites, 29.97% as malicious sites and 2.24% as phishing sites). Examples include ntkrnlpa.cn, gwf-bd.com, fadergolf.com, www.2345jiasu.com, and www.pf11.com, which have been reported by the most detection engines on VirusTotal, like Google Safe Browsing, BitDefender, CLEAN MX, etc.

Fast fluxing. Also surprisingly, we discovered that RESIPs serve as Fast flux proxies for malicious websites to evade IP-based detection. In a fast flux, numerous IP addresses associated with a malicious domain are swapped in and out with high frequency. Applying Passive DNS data and VirusTotal APIs to the sampled 600K RESIPs, we discovered that 1.14% of the proxy IPs were once mapped to malicious domains during the periods when they were RESIPs, and on average, the mapping from these malicious domains to the proxy IPs lasted 86.8 days. However, the median was only 2 days. Table VI lists the top 5 domains resolved to most proxy IPs. Except for opengw.net, which allows volunteers to serve as VPNs for others, all other four are dynamic DNS providers. Some of them were previously reported as being abused by miscreants to conduct various illicit activities [8], which is also confirmed by us, as many of their subdomains are labeled by VirusTotal as malicious, such as yohoy.no-ip.biz, darkjabir.no-ip.info, and 595685744.duckdns.org.

Domain              Usage                  # RESIPs   # Subdomains
noip.com/ddns.net   Dynamic DNS provider   217        225
opengw.net          P2P VPN                206        509
Hopto.org           Dynamic DNS provider   54         73
no-ip.biz           Dynamic DNS provider   35         172
duckdns.org         Dynamic DNS provider   28         42
TABLE VI: List of the top 5 domains resolved to most RESIP IPs.

D. RESIP vs. Bots

Another interesting question is how RESIPs relate to bots, especially whether RESIPs are bots, and whether methodologies for detecting bots work for RESIPs. Regarding whether RESIPs are bots, we identified connections between them. In particular, 1,248 IPs were blacklisted as bots of Hajime or IoT Reaper on the same day when they offered proxy services (see Appendix IX-B); in addition, we also identified devices that were likely recruited through stealthy compromise, as detailed in §IV-B. Both indicate the existence of bots acting as RESIPs. Nevertheless, we also identified channels for volunteer recruitment, suggesting willingly joined users are also part of the RESIP networks.

Meanwhile, compared to bots, RESIPs are observed to exhibit different characteristics that indicate new challenges for detection. Unlike a bot, a RESIP is a proxy to help users access web services in a seemingly legitimate way. Although RESIP services recruit hosts in a highly suspicious manner, they likely also include legitimate volunteer participants. A prominent example is Luminati, which has a recruitment system. Furthermore, identified RESIP programs, including the PUPs, all have limited privileges, while bots usually acquire the highest privilege [74]. Also, unlike the botnet exclusively serving cybercrimes, RESIP services are promoted publicly and are likely also utilized by legitimate users. In addition, botnets are found to flux the addresses (IPs and domains) of their C&C servers or run them on bulletproof hosting to evade detection and blocking [76][54]. In contrast, RESIP services only involve a limited number of server IPs and domains, and most of them belong to popular hosting providers (see §V-B).
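The flow-level comparison reported in Table VII is expressed in per-machine-hour statistics (flows, destination IPs, ports, and IP/port pairs). As a minimal sketch of how such features could be derived from flow records, the following Python groups flows by host and hour and averages the unique counts. The `Flow` record, its field names, and the choice to treat a "unique flow" as a distinct (destination IP, port, protocol) triple are assumptions made for illustration; they are not taken from the study's tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record; field names are illustrative only."""
    host: str      # identifier of the monitored machine
    ts: int        # flow start time, Unix seconds
    dst_ip: str
    dst_port: int
    proto: str


def machine_hour_features(flows):
    """Average unique flows, IPs, ports and IP-port pairs per machine hour."""
    # One bucket per (host, hour) pair, i.e. per machine hour.
    buckets = defaultdict(list)
    for f in flows:
        buckets[(f.host, f.ts // 3600)].append(f)

    keys = ("flows", "ips", "ports", "ip_ports")
    if not buckets:
        return {k: 0.0 for k in keys}

    rows = []
    for fl in buckets.values():
        rows.append((
            # Assumption: a unique flow is a distinct (IP, port, protocol).
            len({(f.dst_ip, f.dst_port, f.proto) for f in fl}),
            len({f.dst_ip for f in fl}),
            len({f.dst_port for f in fl}),
            len({(f.dst_ip, f.dst_port) for f in fl}),
        ))
    n = len(rows)
    return {k: sum(r[i] for r in rows) / n for i, k in enumerate(keys)}
```

Applied separately to the bot, normal-host, and RESIP traces, a function of this shape would yield one row per source, comparable in form to the rows of Table VII.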
Source (# Machine Hours) Flows IPs Ports IP-Ports Provider # RESIP # /24 # /16 # /8 # ASN
Bots (241) 1,365.97 328.34 10.12 330.40 Proxies Online 1,257,418 483,310 19,654 196 7,701
Normal (461) 762.38 30.41 6.41 37.44 Geosurf 432,975 221,747 15,143 194 4,971
RESIPs (64,833) 96.37 53.54 6.27 58.59 ProxyRack 857,178 345,648 19,520 196 8,751
TABLE VII: Comparison of bots, normal hosts and RESIPs. All the Luminati 4,033,418 1,183,841 22,467 197 17,820
statistics here are averaged over the number of machine hours. TABLE VIII: Distribution of RESIPs.
Fig. 6: CDF of # of (IP, Port) pairs visited each machine hour (curves: Bots, Normal, RESIPs; x-axis ticks from 10^0 to 10^4).
Fig. 7: # of RESIPs in each local hour of various time zones (series: UTC-7, UTC-5, UTC+5, UTC+7; x-axis ticks from 0 to 20).

Provider        Top Countries  %     Top ISPs             %    Top ASNs  %
Proxies Online  India          32.2  BSNL                 6.5  9829      8.1
                USA            7.8   Uninet S.A. de C.V.  5.2  8151      5.4
                Mexico         6.7   Deutsche Telekom AG  2.8  24560     4.9
Geosurf         India          27.9  Uninet S.A. de C.V.  6.9  8151      7.2
                Brazil         9.2   BSNL                 4.7  9829      5.8
                Mexico         9.1   Deutsche Telekom AG  2.8  55836     4.5
ProxyRack       Russia         8.6   PT Telkom Indonesia  5.4  17974     5.3
                Indonesia      8.1   Pakistan Telecom     3.7  8452      4.7
                Egypt          6.3   Republican Unitary   3.3  45595     4.0
Luminati        Turkey         12.7  Turk Telekom         8.5  9121      8.5
                Ukraine        7.9   JSC Ukrtelecom       1.7  25019     1.8
                UK             6.1   BT                   1.7  34984     1.8
TABLE IX: Top 3 countries, ASNs and ISPs with most RESIPs

Therefore, intuitively the collective behaviors of a RESIP service can be very different from those of a botnet, which was confirmed by our study based on the RESIP traffic logs (§III-D) and a representative botnet traffic dataset (CTU-13 [65]) with the network flows of both normal hosts and 7 different types of bots. In the study, we looked at the network flow features commonly used for botnet detection [57], [84], [82], [67]. Examples include unique flows per machine hour, unique destination IPs per machine hour, and unique destinations (IP/Port pairs) per machine hour. Figure 6 illustrates the CDFs of the unique destinations visited every machine hour by bots, normal hosts and RESIPs: compared to the bot traffic, the RESIP traffic looks more similar to the normal one, as also observed when comparing other features across the RESIP and botnet datasets (Table VII). This indicates that the mixture of legitimate and illicit traffic of the RESIP service moves its statistical features closer to those of legitimate communication.
Despite the above findings, we must acknowledge the limitations of our approaches. For example, we are not able to exhaustively consider all bot and RESIP types; the traffic data, containing only network flow information, does not allow us to experiment with detection methodologies such as those based on deep packet inspection (DPI). Therefore, we leave a more detailed comparison between RESIPs and bots as our future work.

V. THE RESIP ECOSYSTEM
A. Landscape of RESIP Service
Through infiltrating RESIP services, we were able to collect a pool of RESIP IP addresses. Specifically, every day during the infiltration period, we launched multiple RESIP crawling jobs running across different hours of the day from different locations and accounts, trying to reveal the landscape of the RESIP pool. Overall, we captured 6 million RESIP IPs by sending 62 million requests. Note that due to the IP churn issue, especially in mobile networks, the number of RESIP IPs here should only be considered as an upper bound of the number of RESIP hosts. Table VIII shows the distribution of RESIPs in different network blocks and ASes for each RESIP service provider. We can observe that Luminati has the largest RESIP pool, followed by Proxies Online and ProxyRack.
Table IX lists the top 3 countries, ASNs and ISPs with most RESIPs. They all exhibit long-tailed distributions where a small fraction of countries, ASNs and ISPs contribute the majority of RESIPs, respectively. For example, we find that even though Luminati is located in the United States, most of its RESIPs are from Turkey, possibly because of Turkey’s network censorship, which makes Hola clients a good option to visit blocked websites there. An interesting finding here is that despite Luminati’s claim of having 30 million IPs, we only found 4 million using 16 million probes. It is unclear where this gap comes from.
We also measured how many RESIPs a time zone contributes during its different local hours. As shown in Figure 7, the peak hours across time zones indeed exhibit diurnal patterns, confirming our previous findings that the majority of RESIP devices are indeed residential hosts that are more likely to be powered off or disconnected during the night.
Figure 8(a) shows the evolution of the RESIP pools by plotting the cumulative number of unique RESIP IPs. We observe that a large number of RESIP IPs newly appear every day with an average increase rate of 44%. However, when considering the increase of fresh /16 IP prefixes, we observe a much smaller rise (11%) in Figure 8(b). This is reasonable because a given RESIP host is less likely to migrate from one /16 IP prefix to another than to change from one IP to another.

B. Infrastructure and Service
Backend (hidden) gateways. Under the known infrastructure of the RESIP service as illustrated in Figure 1, we found that there are a series of hidden backend servers intermediating between the frontend gateways and RESIPs, as shown in Figure 8(d). Since those servers can be regarded as gateways from the perspective of RESIPs, we call them backend (hidden) gateways. These gateways were discovered from the connections between the proxy gateway and the RESIP, as documented by our traffic logs, PUP traffic, and Passive DNS datasets. Specifically, using Proxies Online as an example, we observed that before relaying our infiltration traffic, the PUP-hosted RESIPs always communicate with lb-api.lambda.servers.jetstar.media, report-v3.pprx.work, or report-v3.junk.uno instead of
Fig. 8 panels: (a) Cumulative number of RESIPs. (b) Cumulative number of /16 RESIPs. (c) RESIP IP overlap between different service providers. (d) Build up the connection between the frontend gateways and backend gateways.
Fig. 8: The evolution of RESIP pools (a)(b) and the collusion of the service providers (c). In (c), “PO” stands for Proxies Online; “GS” stands for Geosurf; “IP” stands for IAPS; “LU” stands for Luminati; “PR” stands for ProxyRack.
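The cumulative counts behind Fig. 8(a) and (b) can be approximated with a short sketch: keep running sets of the unique IPs and the unique /16 prefixes seen so far, and record their sizes after each day. The input layout (one list of IPv4 strings per day) and the function names are assumptions for this example. The day-over-day relative increase computed by `mean_daily_increase` is only one possible reading of "average increase rate" and may not match the definition behind the reported 44% and 11% figures.

```python
import ipaddress


def cumulative_pool_growth(daily_ips):
    """Return [(unique_ips, unique_slash16_prefixes), ...], one tuple per day.

    daily_ips: list of iterables of IPv4 strings in chronological order.
    This layout is hypothetical and chosen only for illustration.
    """
    seen_ips, seen_prefixes = set(), set()
    growth = []
    for day in daily_ips:
        for ip in day:
            seen_ips.add(ip)
            # strict=False masks the host bits: 10.1.2.3/16 -> 10.1.0.0/16
            seen_prefixes.add(ipaddress.ip_network(f"{ip}/16", strict=False))
        growth.append((len(seen_ips), len(seen_prefixes)))
    return growth


def mean_daily_increase(counts):
    """Mean day-over-day relative increase of a cumulative series (assumed definition)."""
    rates = [(b - a) / a for a, b in zip(counts, counts[1:]) if a > 0]
    return sum(rates) / len(rates) if rates else 0.0
```

Because a host that changes address usually stays within the same /16 block, the prefix series would be expected to grow more slowly than the IP series, which is the contrast the two panels draw.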
Fig. 9 panels (each compares Resi and Non-Resi IPs):
(a) F-2: # of TLD+3 domains resolved to the given IP.
(b) F-3: Percentage of IPs in current direct inetnum with DNS records.
(c) F-4: Mean number of TLD+3 domains resolved to IPs in current direct inetnum.
(d) F-6: Mean number of TLD+3 domains resolved to IPs in current direct inetnum.
(e) F-8: Percentage of IPs in /24 IP prefix with DNS records.
(f) F-9: Mean number of TLD+3 domains resolved to IPs in /24 IP prefix.
(g) F-11: Mean number of TLD+3 domains resolved to IPs in current direct inetnum.
(h) F-17: log10 Mean value of the sizes of historical direct inetnums
(i) F-21: Mean value of the depths of historical direct inetnums
(j) F-25: Assignment type of the current direct inetnum
(k) F-29: # of direct inetnums of the current direct owners
(l) F-30: log10 # of IPs of the current direct owners
(m) F-33: the percent of current loose owners over historical loose owners
(n) F-34: # of direct inetnums of the current loose owners
(o) F-35: log10 # of IPs of the current loose owners
Fig. 9: Cumulative distribution functions of example features on our labeled training dataset.