Resident Evil: Understanding Residential IP Proxy As A Dark Service

Xianghang Mi∗ , Xuan Feng∗ , Xiaojing Liao∗ , Baojun Liu† ,
XiaoFeng Wang∗ , Feng Qian∗ , Zhou Li‡ , Sumayah Alrwais§ , Limin Sun¶ , Ying Liu†
∗ Indiana University Bloomington, † Tsinghua University, ‡ IEEE Member,
§ King Saud University, ¶ Institute of Information Engineering, CAS
∗ {xmi, xf1, xliao, xw7, fengqian}@indiana.edu, † [email protected],

[email protected], ‡ [email protected], § [email protected], ¶ [email protected],

Abstract. An emerging Internet business is residential proxy (RESIP) as a service, in which a provider utilizes the hosts within residential networks (in contrast to those running in a datacenter) to relay their customers’ traffic, in an attempt to avoid server-side blocking and detection. With the prominent roles the services could play in the underground business world, little has been done to understand whether they are indeed involved in Cybercrimes and how they operate, due to the challenges in identifying their RESIPs, not to mention any in-depth analysis on them.

In this paper, we report the first study on RESIPs, which sheds light on the behaviors and the ecosystem of these elusive gray services. Our research employed an infiltration framework, including our clients for RESIP services and the servers they visited, to detect 6 million RESIP IPs across 230+ countries and 52K+ ISPs. The observed addresses were analyzed and the hosts behind them were further fingerprinted using a new profiling system. Our effort led to several surprising findings about the RESIP services unknown before. Surprisingly, despite the providers’ claim that the proxy hosts are willingly joined, many proxies run on likely compromised hosts including IoT devices. Through cross-matching the hosts we discovered and labeled PUP (potentially unwanted programs) logs provided by a leading IT company, we uncovered various illicit operations RESIP hosts performed, including illegal promotion, Fast fluxing, phishing, malware hosting, and others. We also reverse engineered RESIP services’ internal infrastructures, uncovered their potential rebranding and reselling behaviors. Our research takes the first step toward understanding this new Internet service, contributing to the effective control of their security risks.

I. INTRODUCTION

In October 2016, a spree of massive distributed denial-of-service (DDoS) attacks temporarily brought down the Domain Name System (DNS) operated by Dyn, a leading DNS provider, causing major Internet platforms and services (such as Amazon, Netflix, Paypal, Twitter et al.) to be unavailable across Europe and North America. What is remarkable about this attack is that the traffic observed was found to originate from 65,000 infected residential hosts, including home routers, web cameras, and digital video recorders [55]. Not only did these hosts jointly produce an overwhelming volume at 600 Gbps, one of the largest on record, but their residential IP addresses made the attack requests they issued less differentiable from legitimate ones, and therefore hard to detect and block by the victim.

Residential IP Proxy as a Service. Recent years have witnessed increasing demands for such residential IPs (those belonging to ISP’s dynamically assigned IPs, particularly to home owners) as intermediaries to circumvent the restrictions imposed by target services, for the purposes such as aggressive resource access (e.g., registering multiple accounts), data scraping, and others. This emerging market gives rise to a new service model we call Residential IP Proxy as a Service (RPaaS), offered by companies like Luminati [3], StormProxies [49], Microleaves [38], etc. These providers all control a large number of residential hosts, which they claim joined their services willingly, to proxy their customers’ communication with any Internet target. Once abused, these residential proxies can outperform conventional public proxies or even anonymity networks to help their clients masquerade as clean and benign sources to communicate with the targets. Such communication may violate the target’s service terms at the very least (e.g., data scraping, blackhat Search Engine Optimization (SEO)) and is likely associated with more sinister events such as the aforementioned DDoS, due to the permissiveness of the RPaaS providers in terms of what can be done through their proxies.

With their importance to the illicit activities, residential proxy (RESIP) services, however, are still less understood. One may ask whether these services indeed use residential hosts as they claim, and if so, how they recruit these hosts, and whether they are involved in malicious activities. Also unclear are their infrastructures and ecosystems, particularly the ways they promote, operate their businesses and also work with each other. Answers to these questions are critical for determining the role these services play in Cybercrimes, which could potentially help identify an effective way to mitigate the threats we are facing today, for example, through controlling accesses to these services.

Our study. Understanding RESIP service is by no means trivial. Unlike open proxies, which can be easily found online, RESIP IPs are not publicized directly and can only be reached through the mediation of a RESIP provider. Even given a proxy’s IPs, no existing techniques can tell us whether they are indeed residential, not to mention finding out whether their hosts are indeed willing participants or just controlled bots. Even more challenging is to determine whether these proxies are malicious and to understand their illicit activities, since all we can observe are just dynamic IPs shared by a set of hosts. As a result, the traffic associated with the IPs describes those
hosts’ collective activities and it is less clear how to separate the good behaviors (when the IP is assigned to a legitimate host) from the bad ones (when it is given to a compromised host). Further without observing the internal operations of a RESIP service, understanding its infrastructure and connections with other services is difficult.

In our research, we addressed these challenges with a suite of innovative techniques, which enabled us to perform a large-scale study, first of its kind, to understand the way RESIP service is utilized for illicit purposes. Our study was based upon a novel framework for automatic discovery of RESIP IPs from related services. More specifically, we first purchased the services from commercial RESIP service providers and ran a set of clients to communicate with our web servers through these services. Traffic in the communication was carefully marked with unique sub-domains and other parameters to help the servers identify the IPs of the RESIPs, to enable our DNS system to find the DNS resolvers, and to ensure the proxied traffic of RESIPs is captured (§IV-C). The IPs found in this way were further analyzed to extract a set of unique Whois and DNS features for determining whether they are indeed residential. Further these IPs were probed by a novel, high-performance host profiling system that concurrently fingerprints the hosts behind millions of IPs, both from the clients and the servers under our control. Our fingerprinting technique ensures that the target of our analysis is always the RESIP, despite its highly fluctuating IP and a potential NAT box standing in the way of a direct profiling. Also we used a set of potentially unwanted programs (PUP) and their traffic logs obtained from a major security company to correlate our clients’ traffic with these PUPs’ activities, leading to the discovery of the RESIP’s illicit operations and their providers’ hidden infrastructural components.

Findings. Using our framework, we analyzed 5 leading RESIP providers including Luminati [3], Proxies Online [5], Geosurf [1], IAPS Security [2] and ProxyRack [6], from which we found 6.18 million unique IPs in a 4-month span. As a result, we were able to conduct the first study on RESIP service. Our analysis reveals the abused RESIPs as attack intermediaries as well as illicit and collusive RESIP service providers. Our key findings are as follows.

• Our discovered RESIPs are distributed across 238 countries and regions, 28,035 /16 network prefixes and 52,905 ISPs. A vast majority of them (95.22%) are believed to be indeed residential and very few of them (2.20%) are reported by blacklists or emerging threat intelligence platforms.

• We discovered the presence of likely compromised hosts as RESIPs, among which, 237,029 IoT devices and 4,141 RESIP hosts running PUP programs were identified, although RESIP service providers typically claim that their proxies are all common users willingly joining their networks. In fact, none of the 5 RESIP providers is a completely consent-based anonymity system and even the most prominent companies like Luminati were found to use suspiciously compromised residential hosts.

• We identified 67 different programs running as RESIPs. Among them, 50 are reported as malicious by anti-virus tools.

• Unlike the bots as reported in prior studies [65], even the RESIPs running PUPs, as discovered in our research, exhibit very different behavior in terms of their traffic patterns, indicating new challenges in detecting them.

• We found the traffic relayed by RESIPs involves ad clicking, promotion, or malicious activities. 9.36% traffic destinations were detected as malicious by popular detection engines. Also surprisingly, we observed other monetizing services also running on the hosts of RESIPs. Examples include Fast fluxing and malicious content services.

• We observed some RESIP providers likely reselling services to (or at least sharing RESIP pools with) other providers. For example, our infiltration traffic from the IAPS proxies was actually relayed by Hola clients controlled by Luminati. We found that unlike Luminati, IAPS conducts no background check and accepts bitcoin payments. Malicious IAPS users might thus be able to abuse Luminati’s network or even to cause denial-of-service for legitimate Luminati customers.

• We identified hidden backend gateways in the RESIP service infrastructure, which decouple the clients and RESIPs in their infrastructure to make illicit activities of the RESIP service stealthier: some backend gateways were labeled as malicious sites and were dropped by the providers, while all of the frontend gateways were clean and enjoyed a long lifetime.

Contributions. The contributions of the paper are as follows.

• New findings. Our findings revealed the infrastructure, scale, malice, and stealthiness of RESIP services. They highlight the security implications of this emerging service and the urgency to regulate its market.

• New methodologies. We designed novel techniques for finding RESIPs, profiling their behaviors, and analyzing the providers. They can be integrated into a holistic system for monitoring RESIPs and detecting/preventing its malicious activities.

II. BACKGROUND

Residential proxy. Residential IP proxy services are a thriving business today. During our study in 2017, we continued to witness the emergence of new RESIP services and a boom in existing businesses: e.g., Proxies Online [5], the first RESIP service we found, has increased their price from $3/GB to $25/GB in 6 months. Like traditional proxy services such as virtual private network (VPN), anonymity networks, and public HTTP/SOCKS proxies, RESIP service is promoted as an anonymity channel, but also characterized by its resilience against server-side detection and blocking. More specifically, residential IPs are often more trusted by the server than those from a data center [4]. Also, they tend to be dynamic, with RESIP services usually running in a back-connect proxy mode, making malicious clients nimble and capable of quickly migrating to other IPs when detected.

Figure 1 illustrates the RESIP service model discussed in the prior works [58], [59], which involves three parties interacting with each other: the main service component including a proxy
gateway and residential hosts, the client, and the server to be visited (the target). Once a client signs up with a RESIP service, it receives a gateway’s IP address or URL for accessing the service. During the communication, the gateway forwards the client’s requests to different residential hosts, which further send them to the target and get responses back. Figure 1 describes what can be observed from the outside, from the client and target’s perspective. The inside view, however, can be more complicated, as discovered later in §V-B.

Fig. 1: The RESIP service from an outsider’s perspective.

There are many RESIP providers on the market, such as Luminati and Geosurf. They offer a variety of service plans with different levels of flexibility, which can be leveraged to launch cyber attacks. For example, the client is given three different ways to determine how proxies are chosen, based upon whether the gateway attempts to use the same RESIP to send multiple requests to the target: sticky (S), non-sticky (NS), and half-sticky (HS). A sticky gateway always tries to use the same RESIP for communication whenever it can, and when it has to give up on the proxy (when the RESIP gets off-line), the gateway attempts to switch to the next one. The client can also specify the “sticky time”, e.g., changing to a different RESIP after 1 minute. In the non-sticky model, the gateway changes RESIP each time after a request is forwarded. The half-sticky service allows the client to switch between the S and the NS models by adjusting parameters (e.g., a session ID) during the communication. Another service option is to decide where the domain name of the target is resolved, by the RESIP or the gateway. This is important since the resolver can be observed by the target’s DNS server and may need to be covered under some circumstances. As an example, the RESIP provider Luminati allows its client to move the DNS resolving to the RESIP by using the -dns-remote parameter.

IP Whois Database. The Internet Assigned Numbers Authority (IANA) allocates IP addresses in large chunks to one of five Regional Internet Registries (RIRs), including ARIN, APNIC, AFRINIC, LACNIC and RIPE. Each RIR operates a Whois directory service to manage the registration of IP addresses in their regions (e.g., Europe region for RIPE). A Whois directory is organized in an object-oriented way, containing four types of objects with each assigned a unique ID: inetnum, person, organization, and ASN. Here an inetnum object describes an IP address range and all its attributes; organization and person objects are used to represent the ownership of IP blocks with a set of attributes like email addresses; and ASN identifies the autonomous system an IP address belongs to. All inetnums are created in a hierarchical manner and therefore form an inetnum tree. Given an IP, we define its direct inetnum as the leaf inetnum object whose IP range covers that IP, its direct owner as the organization and person objects associated with its direct inetnum, and its loose owner as all organizations and persons who share the same contact information as the direct owner. In our research, we collected the IP Whois databases from all 5 RIRs every day since December 2015 using their RDAP and bulk access APIs [40], [46], [23], [24], [45], [44]. Those historical IP Whois databases were used to generate features for our residential IP classifier (§III-B).

III. METHODOLOGY AND DATASET

As shown in Figure 2, the methodology behind our study on RESIP consists of three important parts: an infiltration framework (§III-A) for gaining insider’s views of RESIP services, a classifier (§III-B) for identifying residential IPs, and a host profiling system (§III-C) for fingerprinting the proxy hosts. We elaborate them as follows.

Fig. 2: Our methodology framework (panels: The Infiltration Framework, Residential IP Classifier, Host Profiling).

A. Infiltration Framework

Our infiltration framework includes a client, which is a web crawler sending labeled requests through a RESIP service to its target site, a target server, which is a website receiving the client’s requests forwarded by RESIPs, and our own authoritative DNS server, which is utilized to find out whether DNS resolving happens on the RESIP hosts or on the gateway, and further discover these resolvers. This framework is also illustrated in Figure 2.

We found 17 RESIP services either through search engines or from Blackhat SEO forums [31]. Among them, 5 (Table I) were picked out based upon their claimed scale (> 100K IPs), service models (SOCKS or not, pay by month or traffic, etc.), popularity (heavily promoted online), and the time they were discovered (earliest ones). All 5 services support relaying HTTP/HTTPS traffic and ProxyRack also supports SOCKS4 and SOCKS5 protocols. We then purchased those five RESIP services, and ran our crawler to periodically visit our server with pre-registered domains through these services. Our server recorded each labeled request and extracted its source IP, which was considered to be the address of the RESIP provided by the service. For this purpose, each request produced by our crawler was labeled to avoid recording the requests from other parties, since they may not carry RESIP IPs (e.g., Man in the Middle players record our traffic and replay it). Also, this approach forces the RESIP to query our DNS server, exposing its resolver.

In our framework, a client sends requests to specially crafted subdomains (as part of the HTTP request URL) with the following pattern: uuid.timestamp.providerId.gwId.raap-xx.site, where uuid is a dynamically generated UUID, timestamp is the client’s current Unix timestamp, providerId uniquely identifies
the RESIP service provider, gwId represents the type of the proxy gateway (S, NS or HS) and raap-xx.site represents a set of domains registered for our website, with xx describing various geo-locations (us, eu, etc.). In this way, each request targets a unique subdomain. Moreover, such crafted requests, once being proxied by the RESIP device, became more likely to be captured by our industry partner’s anomalous traffic gathering module (data collected by the module elaborated in §III-D) due to their newly registered domains carrying the patterns produced by DGA (Domain Generation Algorithms). Through such collected data, we were able to locate the RESIP devices and analyze the traffic they proxied (see §IV-C).

Upon receiving a DNS query for such a domain, our DNS server employed a regular expression to check the pattern of the subdomain, and if correct, resolved it to the IP addresses of our controlled servers. In this way, for each successful request, three log records were generated by the entities under our control: the client (our crawler), the target server, and the DNS server as illustrated in Figure 2. Here the client recorded the labeled request URL, the target server kept the RESIP’s IP, and also the DNS server logged the RESIP’s DNS resolver. Correlating those logs provides us a comprehensive view of a RESIP’s operations, and can also help discover related traffic traces from other sources when they were captured by network monitors (see §IV). As shown in Table I, all RESIP services except Luminati resolve domain names on RESIPs rather than gateways while Luminati can do this on either site through configuration. We came to this conclusion since our DNS server received queries issued by over 82K DNS resolvers from these RESIP services in our study.

Provider        Price       Payment  Date(s)      Gateway  DNS
Proxies Online  $25/GB      Paypal   07/06-11/24  HS       R
Geosurf         $300/month  Paypal   09/17-10/22  S/HS     R
ProxyRack       $40/month   Bitcoin  09/18-11/24  S/NS     R
Luminati        $500/month  Paypal   09/25-11/01  HS       R/G
IAPS Security   $500/month  Bitcoin  09/23-11/01  HS       R

TABLE I: RESIP services purchasing details. HS: half-sticky; S: sticky; NS: non-sticky; R: RESIP; G: gateway.

During our study, we carefully designed our methodology to ensure that our infiltration and profiling are less detectable by the RESIP services. For this purpose, we deployed multiple crawlers and target servers on Amazon EC2 instances and Aliyun instances located in Europe, the US, South America, Singapore and China, to generate traffic from diverse sources. Further, we used AES-CBC with a 128-bit key to encrypt all traffic between our crawlers and the targets, to prevent potential content inspection. Another implementation issue is the presence of multiple gateways and the different models they are running (S, HS and NS; see §II and Table I). For example, Geosurf and ProxyRack both run sticky gateways; as a result, our server would not see any new proxy host during a given period of time (1 to 10 minutes); therefore our crawler was implemented to send a request only once per period, depending on the sticky time given by the service. For the providers with non-sticky and half-sticky gateways, our implementation took different strategies to generate requests. When there were multiple gateways, we chose a different one for each request in order to reduce redundant requests and cover more RESIPs. Besides, in case RESIP services assigned different gateways to different users, we registered for each service at least two distinct user accounts and found that each account was always linked to the same set of gateways.

Result and evaluation. In total, we ran up to 20 daily crawling jobs, each producing about 50,000 requests, from Jun. 06 to Nov. 24 2017. Our study captured 6,183,876 different RESIP IPs by issuing 62 million requests. Before Sep. 15, we only ran 2 crawling jobs on a single service, Proxies Online. Then starting from Sep. 17, we gradually purchased at least one-month service from all 5 RESIP providers and ran up to 20 crawling jobs daily using 200+ threads to collect RESIP information from all of them. After one month, we had gathered enough RESIPs from Luminati. Meanwhile, our measurement results revealed that IAPS Security was just a reseller of Luminati’s service, and Geosurf and Proxies Online actually share the same infrastructure. Given the above findings, we then stopped crawling the expensive providers, including IAPS Security, Geosurf, and Luminati, but still kept the jobs on Proxies Online and ProxyRack until Nov. 24. Overall, we spent $2800 in purchasing and infiltrating those services.

B. Residential IP Classifier

While RESIP service providers claim to utilize residential hosts for relaying their customers’ traffic, little is known about whether the proxies they use are indeed located in residential networks. Determining whether an IP is residential can be complicated, particularly when the same ISP can also allocate IP blocks to data centers. Although some commercial services (e.g., Maxmind GeoIP2 Precision Insights Service [33]) allow queries on IP’s labels such as residential or cellular for a fee (e.g., $50 for 25K IPs), they cannot scale to a large number of queries (6.2M in our research) and their methodologies are not open (so less is known about their reliability). So in our research, we built a new classifier on top of a set of features that characterize residential IPs. Below we elaborate the technique, particularly, our approaches to collect clean ground truth, select robust features, and train and evaluate the classifier.

Finding groundtruth. Finding clean labeled residential IPs is challenging due to the absence of public data and the dynamic IP allocation performed by ISPs. To address this issue, we came up with a series of robust methodologies to obtain 4 labeled datasets: residential-clean (resi-clean), non-residential-clean (non-resi-clean), residential-noisy (resi-noisy), and non-residential-noisy (non-resi-noisy). Such groundtruth is summarized in Table II.

Source                Label           # IPs       # /16   # /8  Training
Manual                resi-clean      79          25      19    79
Device Search Engine  resi-clean      89,345      13,525  195   9,921
Trace My IP           resi-noisy      37,480      11,402  213   0
Filtered IP Whois     resi-noisy      23,264,961  394     31    0
IoT Botnets           resi-noisy      1,699,291   20,112  200   0
Public Clouds         non-resi-clean  53,716,321  968     99    5,000
Alexa Top1M           non-resi-clean  442,989     14,365  213   4,481
Commercial Proxies    non-resi-clean  519         71      44    519
Public Proxies        non-resi-noisy  148,509     14,004  204   0

TABLE II: Datasets for training and testing the residential IP classifier.
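
The counts in Table II can be cross-checked against two figures reported in the text of this section: the balanced labeled set of 10K residential and 10K non-residential IPs, and the 54,031,298 non-resi-clean IPs, of which 128,531 Alexa IPs also appear in the public-cloud data. The following is a minimal sketch of that arithmetic, not part of the authors' tooling. The row transcription and the helper name `total` are ours, and the sketch assumes that the only de-duplication inside non-resi-clean is the Alexa/public-cloud overlap, which the paper does not state explicitly.

```python
# Rows transcribed from Table II: (source, label, number of IPs, IPs used for training).
TABLE_II = [
    ("Manual", "resi-clean", 79, 79),
    ("Device Search Engine", "resi-clean", 89_345, 9_921),
    ("Trace My IP", "resi-noisy", 37_480, 0),
    ("Filtered IP Whois", "resi-noisy", 23_264_961, 0),
    ("IoT Botnets", "resi-noisy", 1_699_291, 0),
    ("Public Clouds", "non-resi-clean", 53_716_321, 5_000),
    ("Alexa Top1M", "non-resi-clean", 442_989, 4_481),
    ("Commercial Proxies", "non-resi-clean", 519, 519),
    ("Public Proxies", "non-resi-noisy", 148_509, 0),
]

# Overlap quoted in the text: Alexa Top 1M IPs that are also in the public-cloud set.
ALEXA_IN_PUBLIC_CLOUD = 128_531

IPS, TRAINING = 2, 3  # column indexes into each row


def total(label, column):
    """Sum one numeric column over all rows carrying the given label."""
    return sum(row[column] for row in TABLE_II if row[1] == label)


resi_training = total("resi-clean", TRAINING)
non_resi_training = total("non-resi-clean", TRAINING)

# Assumption: the only duplicates removed are the Alexa IPs hosted in public clouds.
non_resi_clean_unique = total("non-resi-clean", IPS) - ALEXA_IN_PUBLIC_CLOUD

print(resi_training, non_resi_training, non_resi_clean_unique)
```

Under these assumptions both training totals come to 10,000, matching the balanced labeled set, and the de-duplicated non-resi-clean count comes to 54,031,298, the figure quoted in the text.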
The resi-clean set contained 79 IPs of the personal devices under our control, which were connected to 11 ISPs in 3 countries for identifying these addresses. To find other “clean” IPs, we came up with an idea that leverages device search engines (e.g., Shodan [48], Zoomeye [52] etc.) to search for the network devices typically only utilized in residential environments. Examples include smart home systems such as Amazon Echo [27], Google Home [35], Philips Hue Lights [41], home-related gateways like residential ADSL gateway and broadband residential gateway, and others. A complete list of keywords used in such device queries is presented in Appendix IX-A. These queries return IPs for both devices discovered online and related applications. The former was added to our resi-clean dataset as groundtruth. In this way, we successfully harvested 89,345 residential IPs distributed across 13,525 /16 and 195 /8 network blocks. This data collection was done automatically, which we believe itself is a technical contribution.

We further applied several weaker heuristics to build the resi-noisy dataset. Despite being noisy, the dataset is still useful in validating our classifier. Specifically, its data comes from three sources. (1) We used the query logs of Trace My IP [51], an IP tracing service helping visitors to find their devices’ IPs. The IPs recorded by the logs were selected as potential residential IPs when the ISPs involved are known to be residential Internet service providers (e.g., AT&T and Comcast), queries are from the OSes for consumer devices (e.g., Android and iOS) and common browsers, and the IPs are not labeled as bot or spider. (2) We looked up the owner objects for the 79 clean residential IPs in the IP Whois dataset (see §II), and considered other IPs under those owner objects as residential IPs. This is because as a common practice, ISPs (such as AT&T) typically register the same set of owner objects to manage the IP blocks serving the same purposes. For example, AT&T registers the owner object ATTMO-3 [28] for AT&T Mobility LLC [29] to manage all IPs for mobile usage. (3) We also included the IPs detected from two emerging botnet campaigns Hajime [12] and IoT Reaper [13] that utilize compromised IoT devices (see §III-D), as home IoT devices are much more likely to be compromised than enterprise IoT devices. In total, the resi-noisy dataset contained 25,001,529 IPs.

The non-resi-clean data were collected from cloud providers, high-profile websites (Alexa top 1M websites), and commercial proxies (details in Appendix IX-A). We gathered 54,031,298 such IPs distributed across 14,610 /16 and 213 /8 network blocks. The non-resi-noisy dataset involved the IPs from publicly available proxies (e.g., Tor relays and public free proxies) as detailed in §III-D. The data is noisy since some such proxy services like Tor also recruit home servers to relay traffic [50]. This dataset included 148,509 IPs in 14,004 /16 and 204 /8 networks.

From the above datasets, we built a labeled set with 10K residential IPs and 10K non-residential IPs randomly sampled from resi-clean and non-resi-clean, respectively (see Table II). They were used in feature evaluation and classifier training while the remaining datasets were applied to evaluate our classifier.

Feature selection and extraction. We selected a set of unique features to train a classifier to identify residential IPs. Unlike non-residential IPs, residential IPs are typically directly assigned and managed by an ISP (instead of being re-assigned to a business) [66]. Also, ISPs tend to reserve stable IP blocks (belonging to the same inetnum) for home users, while the network blocks given to the business could be more volatile, changing hands over multiple owners during a given period of time [66]. Furthermore, non-residential IPs are more likely to host web services. For example, among 442,989 IPs for the Alexa Top 1M domains, 29% (128,531) are found in our Public Cloud dataset while only 0.01% (36) are also in our resi-clean dataset. Based upon such observations, we leveraged a total of 35 features related to IP Whois records or Active DNS records to capture residential IPs’ characteristics. Due to the space limit, we here just elaborate some of them and the rest is presented in Appendix IX-A.

• An Active DNS feature. As an example, the connection between non-residential IPs and web services can be captured by the average number of TLD+3 domains per IP in the direct inetnum (§II). Intuitively, this feature describes the number of domains hosted in the direct inetnum of this IP, which were found from Active DNS dataset [68]. Our evaluation on the labeled set shows that non-residential IPs have 5.49 as the average feature value while residential IPs only have 0.016.

• IP Whois features. We also used phone numbers and email addresses to identify the owners of the inetnum for an IP, and discovered that residential IPs tend to have many more inetnum objects (3,536 on average) than non-residential IPs (1,482 on average). This could happen when the ISP assigns large chunks of continuous IPs to their organizational users. Additionally, we designed the features to profile the size and stability of the direct inetnum of a given IP. Specifically, we retrieved the IP’s historical direct inetnums from 24 IP Whois snapshots in the last 2 years, and identified their sizes, depths on the inetnum tree, and further calculated the variations of these parameters to capture their changes in the past 24 months. We observed that 70% of the residential IPs have a size (of historical direct inetnums) below 10^5, while 58% of non-residential IPs have a size above 10^5. Also residential IPs are much more stable in their depths on the inetnum tree, with a variation below 0.16.

Evaluation and results. Over 10K residential IPs and 10K non-residential IPs, we trained a Random Forest (RF) classifier, which achieved an excellent performance in a 5-fold cross validation (precision of 95.61% and recall of 97.12%). We further evaluated the model over the four labeled datasets as well as the unlabeled dataset (6.2M RESIP IPs we collected) with sampled manual validation. Our study shows that this model made the predictions in line with the natures of these sets (more leaning toward residential or non-residential IPs in the cases of the noisy datasets) and particularly on the unlabeled set, it achieved a precision of 95.80%. When applying the model on 6.2M RESIP IPs we collected, it detected 5.9M (95.22%) residential IPs and 0.3M (4.78%) non-residential IPs. More details about the evaluation process and results can be
found in Appendix IX-A.

C. Host Profiling

To further understand RESIPs, it is very important to profile their host devices in addition to their IPs. As mentioned earlier, residential IPs tend to be assigned in a dynamic manner. Then, once a RESIP IP is captured, host profiling must be conducted and finished before the RESIP host has moved to another IP, otherwise, the result will be invalid. To achieve this, we designed a real-time profiling system that can simultaneously fingerprint newly captured RESIP hosts, measure their relaying time (periods when serving as RESIPs), and detect when they get offline (stop serving as RESIPs) or their IPs change. As

Fig. 3: Host fingerprinting. (a) InsideFP vs OutsideFP. (b) Host fingerprinter’s analysis pipeline.

running with the sticky or half-sticky gateway. Figure 3(a) illustrates these fingerprinting processes, with IoT devices (printer) being RESIPs in the private network.

To achieve a high performance when profiling a large number of IPs, our system will not conduct insideFP for a RESIP unless its outsideFP result reveals a router/NAT. This is because insideFP has a larger request latency than the outsideFP, and is constrained by the rate limitation from RESIP service providers. If the insideFP and outsideFP cannot reach a consensus, we regard insideFP’s result as the final: e.g., a RESIP was considered to be a printer when its insideFP revealed the printer and outsideFP showed a NAT. We outline host fingerprinter’s analysis pipeline in Figure 3(b).

The IP liveness checker and the relay profiler scanned a given IP every 30 seconds. The former simply “pinged” the IP through typical TCP and UDP ports to find out periods when the IP was online. And the latter sent “heartbeat” requests via a connected RESIP gateway to our web servers to measure the relaying time of a given RESIP IP. This information also helped us improve the accuracy of RESIP fingerprinting: we consider the fingerprinting result as valid only when the relaying time of a given RESIP covers the fingerprinting period.

Evaluation and results. Running on an Amazon EC2 instance
illustrated in Figure 2, the system consists of three modules: a with a bandwidth of 60 Mbps, 1GB memory and one-core CPU
host fingerprinter, an IP liveness checker and a relaying time at 2.40GHz, our system was capable of profiling 800K IPs/h,
profiler, which work on a given RESIP simultaneously. with each IP being fingerprinted in 63.57 seconds. In total, our
In a nutshell, the host fingerprinter will compose and send profiling system acquired banners from 728,528 (11.78% out
various probes to a given RESIP IP on commonly opened of 6.2 million) IPs and identified the device types and vendor
TCP/UDP ports including 80 for HTTP, 22 for SSH, 23 for information for 547,497 of them. Interestingly, 237,029 (43%)
Telnet, 443 for HTTPS, 554 for RTSP and 5000 for UPNP. of these IPs turned out to belong to IoTs like web camera,
Once response received and banners grabbed, the Nmap service DVR, and printer. Details of the study are in §IV-B.
detection probe list [16] will be applied to identify device type
and vendor information. D. Datasets
This process turns out to be more complicated than it Our study leverages various data sources to characterize
appears to be. A challenge comes from the fact that an IP multiple dimensions of the RESIP ecosystem. Recall that by
can be frequently re-assigned to different hosts, often not now, we have produced or used several datasets: our infiltration
the RESIP we are interested in. To address this problem, generated a large RESIP IP dataset (§III-A). To construct and
our profiling system immediately started fingerprinting an IP evaluate our residential IP classifier, we collected several other
address after it was observed by our web server. This was datasets containing residential and non-residential IPs (§III-B);
further confirmed, in the presence of both sticky and half- we also leveraged datasets of IP Whois and Active DNS for
sticky gateways, through sending another request right after the classifier’s feature generation (§III-B). In our host profiling
the banners were grabbed: if the same IP was seen by our framework, the Nmap service detection probe list is applied to
server again, we were confident that the banner belonged to infer devices’ types (§III-C). We next elaborate other datasets
the same RESIP. We call this process “outside fingerprinting” to be used in our study. These datasets are jointly leveraged
(outsideFP) as the probing targets at the RESIP IP from the to characterize both individual RESIPs and RESIP services.
outside. Another issue is caused by the presence of a private PUP traffic. We collaborated with our industry partner (one of
network the RESIP host often stays in. So a probe to its public leading IT companies) to utilize the PUP traffic they gathered
IP only gets to the gateway NATs and may not reach the from their customers’ devices (under proper consent) from June
actual RESIP host. Our solution is based upon the observation 2017 to November 2017 for our RESIP analysis. The consent
that many RESIP providers do not inspect the target IP that was given from the users who agreed to the terms of service
the client visits, which allows our client to probe the proxy’s when they installed our industry partner’s security software.
loopback address 127.0.0.1 through its connection with the The users can revoke this consent in the software settings. Each
gateway. Our study found that 3 out of the 5 RESIP service record in the dataset logged a suspicious traffic flow (inbound
providers (Proxies Online, Geosurf and ProxyRack) let this and outbound) associated with a PUP they detected. For each
“inside fingerprinting” (insideFP) go through. Note that both suspicious flow, PUP’s MD5, device ID, timestamp, and the
inside and outside fingerprinting require the RESIP service flow’s 5-tuple (src IP, src port, dest IP, dest port, transport-layer
4/1/2018 jVectorMap demo

protocol) are recorded, with additional information added to the


5-tuple for plaintext traffic like HTTP, and FTP. For example,
for HTTP traffic, the host and (truncated) URL fields were
recorded. This dataset served three purposes in our research:
identifying the usage of PUPs as RESIPs, investigating RESIP
traffic, and revealing the hidden infrastructural components
inside the RESIP services. (a) All RESIPs. (b) RESIPs responded to our probings.
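To make the record layout concrete, the following Python sketch models one PUP traffic record with the fields listed above. The class name `PupFlowRecord`, its field names, the timestamp unit, and the helper `touches_resip` are all hypothetical, chosen for illustration; the paper does not describe the industry partner's actual log schema.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class PupFlowRecord:
    """One suspicious flow logged for a PUP (illustrative field names)."""
    pup_md5: str
    device_id: str
    timestamp: float                  # seconds since epoch (assumed unit)
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str                     # transport-layer protocol, e.g. "TCP"
    http_host: Optional[str] = None   # only present for plaintext HTTP flows
    http_url: Optional[str] = None    # possibly truncated


def touches_resip(record: PupFlowRecord, resip_ips: Set[str]) -> bool:
    """Return True if either endpoint of the flow is a known RESIP IP."""
    return record.src_ip in resip_ips or record.dst_ip in resip_ips


# Example record using documentation-only IP ranges.
record = PupFlowRecord(
    pup_md5="d41d8cd98f00b204e9800998ecf8427e",
    device_id="dev-1",
    timestamp=1500000000.0,
    src_ip="203.0.113.7",
    src_port=51234,
    dst_ip="198.51.100.9",
    dst_port=80,
    protocol="TCP",
    http_host="example.com",
    http_url="/index",
)
```

A filter such as `touches_resip` is one plausible first step for linking PUP flows to observed RESIP IPs; the correlation actually used in the study relies on further signals described in §IV-C.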
Fig. 4: Global Distribution of RESIPs. (a) All RESIPs. (b) RESIPs responded to our probings.

Passive DNS. Another dataset we utilized is Passive DNS from 360 Netlab [17], which enabled us to identify Fast flux activities on RESIP IPs, and reveal the hidden infrastructural components inside the RESIP services. Each of the records includes queried domain names, time periods, and their aggregated lookup volumes in the given time period.

IP geolocation. IP2Location DB8 [14] is a commercial IP geolocation database provided by IP2Location. Using this dataset, we retrieved the geolocation information (country, city, latitude, longitude, ISP) for given IPs.

Publicly available proxies. We also collected the IPs related to public network proxies, whose traffic can be easily blocked or degraded by the server-side protection [62]. Specifically, we treated Tor relays (both exit and middle relays) as network proxies and crawled their lists hourly from both the Tor official website [19] and a third-party provider dan.me.uk [20]. We used two different ways to collect publicly available proxies for HTTP/HTTPS/SOCKS4/SOCKS5. We purchased a service called KuaiDaili, which collects proxies from multiple popular proxy aggregators [7], and provides APIs for those still working to its users. In the meantime, we also crawled other popular proxy aggregators [11] [22] to get the working proxies KuaiDaili does not include. This dataset was further complemented using IP2Proxy LITE [15], a service that runs proprietary algorithms to detect the IPs serving VPN anonymizers, open proxies, web proxies and Tor exits.

Dark IPs. Also utilized in our research are popular IP blacklists for identifying RESIP-related malicious activities. Specifically, to track the potential relation between RESIPs and two emerging botnet campaigns Hajime [12] and IoT Reaper [13], our industry partner ran a detector from Sep 15, 2017 to Nov 07, 2017 to gather bot IPs of these campaigns on a daily basis. Further, we collected 62 Spamhaus EDROP [18] records every day for the last two years. Also, APIs of three threat intelligence platforms were leveraged to retrieve IP indicators of compromise: VirusTotal [21], Cymon OTX [10] and AlienVault OTX [9]. Given the dynamic nature of RESIPs, we only focused on IP indicators whose timestamps are consistent with those of RESIP IPs we observed.

E. Discussion
Potential bias. Due to the challenges in comprehensively identifying RESIP hosts and analyzing their illicit behaviors, our study was based upon the data we were able to get (RESIP IPs observed by our system, hosts we could fingerprint and the PUP data available to us, etc.), which could bring in bias to the study. While we believe that as the first large-scale research on RESIP services, our study offers valuable insights into this new business, we are nevertheless cautious about the conclusions to be drawn. More specifically, the vantage points of our study were limited to five RESIP service providers. Also, from them, only about 10% (still more than 500K) of all the IPs we observed could be fingerprinted and analyzed. Further, our analysis on relayed traffic of RESIPs was based on the PUP traffic logs collected by our industry partner. Even though the PUP traffic logs were linked to 8,886 RESIP IPs (more than 5 million traffic traces) in our research, their coverage is clearly limited. Availability of more comprehensive datasets will certainly help better understand RESIPs and their security implications. In the meantime, note that the RESIP providers we studied are representative and we did find PUPs running behind the RESIP IPs we could not fingerprint. This indicates that some of our results could be applied more broadly, which however needs to be determined by future research.

Ethical issues. To conduct our study, we paid RESIP providers to access their services. During the study, we followed all their terms of service, and took great care to make sure that our study would not harm the owners of RESIP hosts by visiting just our own domains. Also, the users of our industry partner agreed to share related information in exchange for free services. Lastly, regarding our host profiling operations, we limited probing rates to avoid overheads incurred on the remote hosts. Also, we only report aggregated statistics to avoid identity leakage. All the studies were approved by our organization’s IRB.

IV. RESIDENTIAL IP PROXY
We here report a measurement study on the core component of the RESIP service: the residential IP proxy. We analyzed why these RESIPs were used, how they were recruited, and what they served.

A. Proxy Detection Evasion
IP source analysis. In total, we collected 6,183,876 unique RESIP IPs from the five RESIP service providers via the infiltration framework (see §III-A). Our study reveals that RESIP IPs are spread across the world, across 238 countries and regions, 28,035 /16 network prefixes and 52K+ ISPs. Overall, we found that the top 100 ISPs cover 57.4% of the RESIP IPs we discovered, with the ISP involving the most RESIP IPs being Turk Telekom (5.7%). Figure 4(a) illustrates the distribution of the RESIP IPs over countries, as determined by their geolocations. The number of RESIP IPs in each country is ranked and illustrated with various shades of darkness in the figure. As we can see here, the largest share of RESIP IPs is in India (9.42%), followed by Turkey (8.64%) and Ukraine (6.42%).

As described in §III-B, we trained a classifier to identify residential IPs. Figure 5(a) illustrates the percentage of non-residential IPs in each RESIP service provider. Overall, 95.22%
of the collected RESIP IPs are indeed residential. Also, ProxyRack was found to have the highest fraction of non-residential IPs (8.82%). Such non-residential IPs tend to be re-assigned by small ISPs to hosting providers.

Fig. 5: Characterizing RESIPs. (a) % of non-residential, blacklisted, published proxy IPs in RESIP services. (b) The CDF of the relaying time per RESIP. (c) Time lag of RESIPs between being blacklisted and being captured. (d) # of IoT devices observed from each RESIP service provider. In (a) and (d), PO: Proxies Online; GS: Geosurf; LU: Luminati; PR: ProxyRack.

We further explored the dynamics of RESIPs by examining their IPs’ relaying time (see §III-C), whose cumulative distributions are presented in Figure 5(b). As we can see from the figure, a significant portion (90%) of the RESIP IPs exhibit a short relaying time (870 seconds), which renders IP-blacklist based defense on the server side less effective.

Blacklisting. We further checked whether these residential IPs were ever blacklisted, which would allow the target server to easily block them. In our study, we looked up these addresses on the IP blacklists introduced in §III-D. In total, we observed 2.20% of RESIP IPs were reported by at least one blacklist. Figure 5(a) shows the percentage of blacklisted RESIP IPs in each service provider. We found that the portion of the blacklisted RESIP IPs is fairly small. Among these services, ProxyRack has the most blacklisted RESIP IPs (2.54%), which is followed by Luminati (2.32%) and Geosurf (1.73%). When analyzing the malicious activities they were involved in, we found that spamming and malicious website hosting were the two most reported malicious activities. Also interesting, we found that 1,248 RESIP IPs (see Appendix IX-B) served in two IoT botnet campaigns, Hajime [12] and IoT Reaper [13]. Figure 5(c) shows the cumulative distribution of the delay (in days) between when a RESIP IP was observed in our research and when it was blacklisted. We found that 11.57% of blacklisted RESIPs were captured by our infiltration framework before being blacklisted, so their lifetime could be (conservatively) estimated. The average delay we observed is 22 days, with the longest being 136 days.

Top 1-5          # RESIPs   %        Top 6-10           # RESIPs   %
Spam             8,299      36.55%   Malicious Sample   438        1.93%
Malicious URL    7,305      32.17%   Zombie             277        1.22%
Bruteforce       3,325      14.64%   Telnet             249        1.10%
Suspicious       629        2.77%    Trojan             171        0.75%
Dionaea          618        2.72%    EDROP              164        0.72%
TABLE III: Malicious activities related to RESIPs.

Unpublished proxies. When a RESIP IP is on public proxy lists such as the Tor relay list and public proxy aggregators, it can be easily blocked by the target server. To find out whether these proxies were published online, we inspected 4 proxy lists (see §III-D). The percentage of published RESIP IPs in each service provider is presented in Figure 5(a). In total, only 0.06% (3,767) of the 6.2 million RESIP IPs discovered in our research are among the 148,509 public proxies. Among all 5 providers we investigated, even the one with the most reported proxies, ProxyRack, has just 0.16% on these lists.

B. Proxy Recruitment
Volunteer recruitment. If RESIP services are recruiting volunteers, there must be related web pages and software stacks that are accessible to common users. For each service, we carefully went through their websites and read through search engine results for keywords such as luminati recruit, proxyrack volunteer, and geosurf software. Overall, only Luminati was found to explicitly recruit common users [36]. By joining Luminati’s network, users can get their traffic relayed by other members at the cost of proxying others’ traffic. To join the network, users need to install the hola client [30], which has versions available for multiple platforms including mobile. For other services, we found no recruitment channels or software stacks.

Fingerprinting analysis. To further explore how RESIP services recruit proxies, we analyzed devices behind RESIPs through our real-time profiling system described in §III-C. Specifically, in our study, our profiling system acquired banners from 728,528 (11.78% out of 6.2 million) IPs observed, indicating that these were the hosts with some ports open for probing. Among these responding hosts, 547,497 of them returned device types identified together with their vendor information. Interestingly, 237,029 of them turned out to be IoT systems, such as web cameras, DVRs, and printers. Figure 5(d) presents the percentage of the IoT devices observed from each RESIP provider’s network. Luminati was found to have the most IoT devices (45%), followed by Proxies Online (33%) and ProxyRack (19%).

Table IV presents the top 10 device types and top 10 vendors for the RESIPs identified. We found that most of these RESIPs (69.32%) were profiled as routers, gateways, or WAP. The manufacturers for most of the RESIP devices were MikroTik, Huawei, Technicolor, ZTE, and Dahua. Particularly, the device vendors MikroTik, Huawei, and BusyBox were associated with 59.93% of the IoT devices involved.

Note that the aforementioned result is a combination of both outside fingerprinting (outsideFP) and inside fingerprinting
(insideFP) results. As mentioned in §III-C, services including Geosurf, Proxies Online, and ProxyRack support insideFP for their sticky and half-sticky gateways. For RESIP IPs captured from those channels, insideFP was performed on a RESIP IP once its outsideFP revealed a NAT device (router, WAP, etc.). Overall, we ran insideFPs on 35,808 RESIP IPs, 12,497 responded to our probings, and 10,964 further had their associated devices identified. Among them, 5,981, which were found to relate to gateways by outsideFP, were considered to host non-gateway devices according to insideFP. One interesting point here is that although outsideFPs on those 35,808 RESIP IPs all received responses, only 12,497 replied to our insideFPs (using similar probings as outsideFP), indicating those unresponsive RESIP hosts may actually reside behind NAT devices. We therefore expect the actual proportion of non-gateway devices to be higher than that in Table IV.

Device Type        Num       (%)     Device Vendor   Num      (%)
router             114,768   48.42   MikroTik        86,593   36.53
firewall           25,088    10.58   Huawei          37,545   15.84
WAP                24,470    10.32   BusyBox         18,337   7.74
gateway            22,003    9.28    Technicolor     16,866   7.12
broadband router   17,358    7.32    SonicWALL       14,122   5.96
webcam             13,024    5.49    Fortinet        9,190    3.88
security-misc      10,608    4.48    Dahua           6,258    2.64
DVR                4,249     1.79    ZyXEL           5,601    2.36
media device       2,589     1.09    AVM             5,272    2.22
storage-misc       1,988     0.84    Cyberoam        4,558    1.92
TABLE IV: List of the top 10 device vendors and device types.

Also, conflicting devices could be found on the same RESIP IP, particularly during host re-profiling. Re-profiling happened rarely in our study, since we did not re-profile the same IP if it was found again within 15 days. Still, we observed 195 RESIP IPs hosting different devices, indicating that multiple RESIPs possibly share the same IP. Besides, even in a single fingerprinting, the banners grabbed from different ports associated with the same IP may reveal different devices. However, the scenario is very rare: only 1,083 RESIP IPs (0.20% out of 547,497) were found in our study. When this happened, we simply assigned the IP the most popular device identified when studying the distribution of the devices across IPs (Table IV).

One potential concern is the representativeness of our profiling results, as only 11.78% of RESIP IPs responded to our probings and overall 8.85% of RESIP IPs had their device information identified. However, as shown in previous studies [77] [63] [64] [61], such a low identification rate is quite common. For example, according to the latest large-scale probing conducted by CENSYS [43], among their probes on 0.37 billion alive IPs, only 50 million (13.5%) produced HTTP responses, 3 million (0.8%) produced TELNET responses, 10 million (2.7%) triggered FTP responses, and 13 million (3.5%) led to SSH responses, etc. Besides, as shown in Figure 4(b), RESIP IPs with devices identified are distributed globally in 215 countries and regions (16,516 /16 and 196 /8 networks). This also indicates that our host profiling results are representative.

In summary, our host profiling results indicate that rather than joining RESIP services willingly, at least some RESIP devices are likely “recruited” through stealthy compromise. On one hand, none of the five RESIP services except for Luminati provides software stacks for recruiting users. On the other hand, many IPs fingerprinted were found to host IoT devices. Although some devices like WAPs and routers may serve as the NAT front that covers other hosts behind the scene, others such as cameras, printers, DVRs and media devices, etc., are very unlikely to have been voluntarily joined to the services by their owners.

C. Proxy Traffic Analysis
Proxy traffic collection. In order to understand how the compromised RESIP devices operated, we leveraged the PUP traffic data (see §III-D) to find the illicit activities the PUP-hosting RESIP devices were involved in. Specifically, we first analyzed the traffic logs of these PUPs, searching for the domains (those the PUP communicated with) matching the pattern of our labeled infiltration traffic. As mentioned in §III-A, the packets sent by our client to our target web server through a RESIP service were constructed in a unique way: uuid.timestamp.providerId.gwId.raap-xx.site. This labeling approach ensures that even when all other payload content of these packets was discarded, we could still identify the communication as long as the target domains were recorded. This was exactly the case for the PUP traffic logging, which only kept the domains and a small amount of other information, including the time when the communication was observed. In our study, we correlated the PUP communication with our infiltration traffic based upon the matched one-time domain, their timestamps (within 1 minute), the log on the client side, which is supposed to record the request sent out, and the log on the server side, which should receive the request only once. These checks ensure that there would not be any false hit caused by, for example, traffic replay. In the end, we discovered from the PUP dataset 5,895 traffic records that accurately matched the records on our side. Those records cover 67 different PUPs. To better understand the 67 PUPs, we scanned their MD5s using VirusTotal and found that 50 of them were flagged by at least one anti-virus engine, and each PUP on average received 24.71 alarms. We then submitted these VirusTotal reports to AVClass [75] to get the PUPs’ families. In the end, 17 were labeled as cryptos, 10 as glupteba, and 5 as one of elex, bandit, zusy, wcryg and razy, and the families of the remaining PUPs were not identified.

Name              Providers   # IPs   # Devices
hola_svc.exe      LU, IAPS    2.7K    1.1K
csrss.exe         PR          241     126
svchostwork.exe   GS, PO      226     32
swufeb17.exe      PO          171     28
netmedia.exe      GS, PO      170     95
start.vbs         PO          76      1
cloudnet.exe      PR          55      42
hola_plugin.exe   LU          50      43
produpd.exe       PR          21      8
pprx.exe          PO          2       2
TABLE V: List of the top 10 PUPs with most infected RESIPs.

For all these 67 PUPs, we collected their traffic logs from June 2017 to Nov 2017: in total, 5 million of them covering 8,886 RESIP IPs and 4,141 devices. Table V presents 10 PUP examples from different RESIP providers. Their MD5s are included in Table XIII of Appendix IX. The 5 million PUP
traffic logs were further used in our traffic analysis (elaborated below). Note that the above numbers are only the “lower bounds” for the pervasiveness of PUPs across RESIP services, given the limited device accesses our industry partner has.

Surprisingly, we found that all 5 services studied in our research utilized PUPs to relay traffic: 33 for ProxyRack, 9 for Luminati, 24 for Proxies Online, 10 for Geosurf and 2 for IAPS Security. Particularly, our traffic from Proxies Online and Geosurf went through 9 shared PUPs, which together with other findings (see §V-B) indicates that these services are likely all affiliated with the same company. Also surprisingly, the proxy program used by Luminati, Hola, was marked as PUPs, and some of them (2 out of 9) were forwarding our infiltration traffic sent to a different RESIP provider, IAPS. This combined with further analysis in §V-B indicates that IAPS is very likely a reseller for Luminati’s RESIP service.

Traffic Target analysis. Our access to the PUP traffic log helped us learn more about other illicit activities performed by RESIPs. Specifically, from the 5-million traffic logs of 67 PUPs, we extracted destination domains, URLs and IPs of their communication, as well as related traffic volume. Manual analysis of the top 1,000 destinations with the largest traffic volume shows most of them reside in the following 5 categories: ad (75%), search engines (8%), shopping (7%), malicious websites (5%) and social networks (2%). Among ads-related domains, the majority are affiliate networks such as tracking.sumatoad.com, click.howdoesin.net, www.alexacn.cc, and click.gowadogo.com. Others are dedicated to different ad services such as mobile advertising, in-app advertising, video advertising, and ad exchanges. Many of those ad domains are reported to install adware on users’ devices, such as ads.stickyadstv.com, counter.yadro.ru, and adskpak.com. Those adware altered browser homepages and generated various forms of ads. Further, analysis of corresponding URLs of those domains shows that most of them are in the forms of ads provided by those domains. Examples include click.howdoesin.net, tracking.sumatoad.com/aff_c?, click.gowadogo.com/click? and proleadsmedia.afftrack.com/click?. We also observed that many search queries were sent to different search engines including Google Search, Bing Search, Baidu Search, and Yandex, and also visits to various shopping websites including amazon.com, ebay.com, sears.com and tmall.com. Given that those proxy services are rather expensive, with 1 GB costing at least $15, using them for daily shopping and online search does not seem to be reasonable. More likely, these activities were related to blackhat SEO or other online promotion operations. What is more, some websites such as lenzmx.com and csgob0t.online were found to be malicious in our manual analysis, in line with the results reported by VirusTotal.

Further, we found from the PUP logs the traffic to known malicious domains. Specifically, 9.36% of the destination addresses were reported to be malicious by VirusTotal (68.92% are labeled as malware sites, 29.97% as malicious sites and 2.24% as phishing sites). Examples include ntkrnlpa.cn, gwf-bd.com, fadergolf.com, www.2345jiasu.com, and www.pf11.com, which have been reported by the most detection engines on VirusTotal like Google Safebrowsing, BitDefender, CLEAN MX, etc.

Fast fluxing. Also surprisingly, we discovered that RESIPs serve as Fast flux proxies for malicious websites to evade IP based detection. In a fast flux, numerous IP addresses associated with a malicious domain are swapped in and out with high frequency. Applying Passive DNS data and VirusTotal APIs to the sampled 600K RESIPs, we discovered that 1.14% of the proxy IPs were once mapped to malicious domains during the periods when they were RESIPs, and on average, the mapping from these malicious domains to the proxy IPs lasted 86.8 days. However, the median was only 2 days. Table VI lists the top 5 domains resolved to most proxy IPs. Except for opengw.net, which allows volunteers to serve as VPNs for others, all other four are dynamic DNS providers. Some of them were previously reported as being abused by miscreants to conduct various illicit activities [8], which is also confirmed by us, as many of their subdomains are labeled by VirusTotal as malicious, such as yohoy.no-ip.biz, darkjabir.no-ip.info, and 595685744.duckdns.org.

Domain              Usage                  # RESIPs   # Subdomains
noip.com/ddns.net   Dynamic DNS provider   217        225
opengw.net          P2P VPN                206        509
Hopto.org           Dynamic DNS provider   54         73
no-ip.biz           Dynamic DNS provider   35         172
duckdns.org         Dynamic DNS provider   28         42
TABLE VI: List of the top 5 domains resolved to most RESIP IPs.

D. RESIP vs. Bots
Another interesting question is how RESIPs relate to bots, especially, whether RESIPs are bots, and whether methodologies for detecting bots work for RESIPs. Regarding whether RESIPs are bots, we identified connections between them. In particular, 1,248 IPs were blacklisted as bots of Hajime or IoT Reaper on the same day when they offered proxy services (see Appendix IX-B); in addition, we also identified devices that were likely recruited through stealthy compromise, as detailed in §IV-B. Both indicate the existence of bots acting as RESIPs. Nevertheless, we also identified channels for volunteer recruitment, suggesting willingly joined users are also part of the RESIP networks.

Meanwhile, compared to bots, RESIPs are observed to exhibit different characteristics that indicate new challenges for detection. Unlike a bot, a RESIP is a proxy to help users access web services in a seemingly legitimate way. Although RESIP services recruit hosts in a highly suspicious manner, they likely also include legitimate volunteer participants. A prominent example is Luminati, which has a recruitment system. Furthermore, identified RESIP programs, including the PUPs, all have limited privileges, while bots usually acquire the highest privilege [74]. Also, unlike the botnet exclusively serving cybercrimes, RESIP services are promoted publicly and are likely also utilized by legitimate users. In addition, botnets are found to flux the addresses (IPs and domains) of their C&C servers or run them on bulletproof hosting to evade detection and blocking [76][54]. In contrast, RESIP services only involve a limited number of server IPs and domains, and most of them belong to popular hosting providers (see §V-B).
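The same-day overlap mentioned at the start of this subsection (IPs blacklisted as bots on the day they also served as proxies) can be sketched as a set intersection over (IP, day) pairs. This is a minimal illustration under assumed inputs, not the authors' pipeline: the function name `same_day_overlap`, the tuple layout, and the sample addresses and dates are all hypothetical.

```python
from datetime import date
from typing import Iterable, Set, Tuple

Observation = Tuple[str, date]  # (IP address, calendar day)


def same_day_overlap(resip_seen: Iterable[Observation],
                     bot_listed: Iterable[Observation]) -> Set[str]:
    """Return IPs observed as RESIPs on a day they were also listed as bots."""
    bot_pairs = set(bot_listed)
    return {ip for ip, day in resip_seen if (ip, day) in bot_pairs}


# Hypothetical sample data using documentation-only IP ranges.
resip_seen = [
    ("203.0.113.7", date(2017, 10, 1)),
    ("198.51.100.9", date(2017, 10, 2)),
]
bot_listed = [
    ("203.0.113.7", date(2017, 10, 1)),   # same IP, same day: counts
    ("198.51.100.9", date(2017, 10, 5)),  # same IP, different day: ignored
]
overlap = same_day_overlap(resip_seen, bot_listed)
```

Matching on the day as well as the IP matters here because residential IPs are reassigned frequently, so an IP-only match could link a bot and a proxy that were in fact different hosts.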
Source (# Machine Hours)   Flows      IPs      Ports   IP-Ports
Bots (241)                 1,365.97   328.34   10.12   330.40
Normal (461)               762.38     30.41    6.41    37.44
RESIPs (64,833)            96.37      53.54    6.27    58.59
TABLE VII: Comparison of bots, normal hosts and RESIPs. All the statistics here are averaged over the number of machine hours.

Provider         # RESIP     # /24       # /16    # /8   # ASN
Proxies Online   1,257,418   483,310     19,654   196    7,701
Geosurf          432,975     221,747     15,143   194    4,971
ProxyRack        857,178     345,648     19,520   196    8,751
Luminati         4,033,418   1,183,841   22,467   197    17,820
TABLE VIII: Distribution of RESIPs.

Fig. 6: CDF of # of (IP, Port) pairs visited each machine hour (series: Bots, Normal, RESIPs).
Fig. 7: # of RESIPs in each local hour of various time zones (series: UTC-7, UTC-5, UTC+5, UTC+7).

Provider         Top Countries   %      Top ISPs              %     Top ASNs   %
Proxies Online   India           32.2   BSNL                  6.5   9829       8.1
                 USA             7.8    Uninet S.A. de C.V.   5.2   8151       5.4
                 Mexico          6.7    Deutsche Telekom AG   2.8   24560      4.9
Geosurf          India           27.9   Uninet S.A. de C.V.   6.9   8151       7.2
                 Brazil          9.2    BSNL                  4.7   9829       5.8
                 Mexico          9.1    Deutsche Telekom AG   2.8   55836      4.5
ProxyRack        Russia          8.6    PT Telkom Indonesia   5.4   17974      5.3
                 Indonesia       8.1    Pakistan Telecom      3.7   8452       4.7
                 Egypt           6.3    Republican Unitary    3.3   45595      4.0
Luminati         Turkey          12.7   Turk Telekom          8.5   9121       8.5
                 Ukraine         7.9    JSC Ukrtelecom        1.7   25019      1.8
                 UK              6.1    BT                    1.7   34984      1.8
TABLE IX: Top 3 countries, ASNs and ISPs with most RESIPs

Therefore, intuitively the collective behaviors of a RESIP service can be very different from those of a botnet, which was confirmed by our study based on the RESIP traffic logs (§III-D)
and a representative botnet traffic dataset (CTU-13 [65]) with the network flows of both normal hosts and 7 different types of bots. In the study, we looked at the network flow features commonly used for botnet detection [57] [84] [82] [67]. Examples include unique flows per machine hour, unique destination IPs per machine hour, and unique destinations (IP/Port pairs) per machine hour. Figure 6 illustrates the CDFs of the unique destinations visited every machine hour by bots, normal hosts and RESIPs: compared to the bot traffic, the RESIP traffic looks more similar to the normal one, as also observed when comparing other features across the RESIP and botnet datasets (Table VII). This indicates that the mixture of legitimate and illicit traffic of the RESIP service moves its statistical features closer to those of legitimate communication. Despite the above findings, we must acknowledge the limitations of our approaches. For example, we are not able to exhaustively consider all bot and RESIP types; the traffic data containing only the network flow information does not allow us to experiment with detection methodologies such as those based on deep packet inspection (DPI). Therefore, we leave more detailed comparison analysis between RESIPs and bots as our future work.

V. THE RESIP ECOSYSTEM
A. Landscape of RESIP Service
Through infiltrating RESIP services, we were able to collect a pool of RESIP IP addresses. Specifically, every day during the infiltration period, we launched multiple RESIP crawling jobs running across different hours of the day from different locations and accounts, trying to reveal the landscape of the RESIP pool. Overall, we captured 6 million RESIP IPs by sending 62 million requests. Note that due to the IP churn issue, especially in mobile networks, the number of RESIP IPs here should only be considered as an upper bound of the number of RESIP hosts. Table VIII shows the distribution of RESIPs across network blocks and ASes for each RESIP service provider. We can observe that Luminati has the largest RESIP pool, followed by Proxies Online and ProxyRack.
Table IX lists the top 3 countries, ASNs and ISPs with most RESIPs. They all exhibit long-tailed distributions where a small fraction of countries, ASNs and ISPs contribute the majority of RESIPs, respectively. For example, we find that even though Luminati is located in the United States, the largest share of its RESIPs comes from Turkey, possibly because of Turkey's network censorship, which makes Hola clients a good option to visit blocked websites there. An interesting finding here is that despite Luminati's claim of having 30 million IPs, we found only 4 million using 16 million probes. It is unclear where this gap comes from.
We also measured how many RESIPs a time zone contributes during its different local hours. As shown in Figure 7, the peak hours across time zones indeed exhibit diurnal patterns, confirming our previous findings that the majority of RESIPs are indeed residential hosts that are more likely to be powered off or disconnected during the night.
Figure 8(a) shows the evolution of the RESIP pools by plotting the cumulative number of unique RESIP IPs. We observe that a large number of RESIP IPs newly appear every day with an average increase rate of 44%. However, when considering the increase of fresh /16 IP prefixes, we observe a much smaller rise (11%) in Figure 8(b). This is reasonable because a given RESIP host is less likely to migrate from one /16 IP prefix to another than to change from one IP to another.

B. Infrastructure and Service
Backend (hidden) gateways. Under the known infrastructure of the RESIP service as illustrated in Figure 1, we found that there are a series of hidden backend servers intermediating between the frontend gateways and RESIPs, as shown in Figure 8(d). Since those servers can be regarded as gateways from the perspective of RESIPs, we call them backend (hidden) gateways. These gateways were discovered from the connections between the proxy gateway and the RESIP, as documented by our traffic logs, PUP traffic, and Passive DNS datasets. Specifically, using Proxies Online as an example, we observed that before relaying our infiltration traffic, the PUP-hosted RESIPs always communicate with lb-api.lambda.servers.jetstar.media, report-v3.pprx.work, or report-v3.junk.uno instead of
(a) Cumulative number of RESIPs. Y axis: cumulative # of RESIPs (k); X axis: date, 0706 to 1124; one curve each for Luminati, Geosurf, Proxies Online, and ProxyRack.
(b) Cumulative number of /16 RESIPs. Y axis: cumulative # of /16 networks (K); same X axis and providers.
(c) RESIP IP overlap between different service providers (rows and columns are providers):
   | PO | GS | IP | LU | PR
PO | - | 12.5% | 0% | 0.06% | 0.09%
GS | 36.3% | - | 0% | 0.23% | 1.7%
IP | 0% | 0% | - | 66% | 0.07%
LU | 0.02% | 0.02% | 0.07% | - | 0.04%
PR | 0.14% | 0.86% | 0% | 0.2% | -
(d) Build up the connection between the frontend gateways and backend gateways. Domains shown: gw.proxies.online (frontend gateway), dist.jetstar.media, servers.jetstar.media, report-v2.pprx.work, report-v3.pprx.work, junk.uno, report-v3.junk.uno. IPs shown: 173.244.163.58, 107.23.85.127, 52.0.109.110.
Fig. 8: The evolution of RESIP pools (a)(b) and the collusion of the service providers (c). In (c), “PO” stands for Proxies Online; “GS” stands for Geosurf; “IP” stands for IAPS; “LU” stands for Luminati; “PR” stands for ProxyRack.
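To make the overlap metric of Fig. 8(c) concrete, the following is a minimal sketch, not the authors' implementation, of the intersection rate |A∩B| / |A| under the strict criterion that a RESIP counts as shared only if both infiltrations captured it in the same hour. The observation format (a list of `(ip, hour)` pairs per provider) and the function names are assumptions introduced here for illustration.

```python
from collections import defaultdict
from typing import List, Set, Tuple

Observation = Tuple[str, int]  # (RESIP IP, hour index of capture); hypothetical format


def shared_resips(obs_a: List[Observation], obs_b: List[Observation]) -> Set[str]:
    """IPs captured from both providers within the same hour."""
    hours_a = defaultdict(set)
    for ip, hour in obs_a:
        hours_a[ip].add(hour)
    shared = set()
    for ip, hour in obs_b:
        # Strict criterion: same IP AND same capture hour on both sides.
        if ip in hours_a and hour in hours_a[ip]:
            shared.add(ip)
    return shared


def intersection_rate(obs_a: List[Observation], obs_b: List[Observation]) -> float:
    """|A ∩ B| / |A|, where A is the set of IPs captured from provider A."""
    ips_a = {ip for ip, _ in obs_a}
    if not ips_a:
        return 0.0
    return len(shared_resips(obs_a, obs_b)) / len(ips_a)
```

The rate is asymmetric because the denominator is the size of the first provider's pool, which is why a small pool that largely overlaps a much larger one shows a high rate in one direction and a low rate in the other.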

gw.proxies.online, which is the frontend gateway. We then investigated the PassiveDNS and found that the subdomains of jetstar.media, pprx.work, junk.uno, and proxies.online share a set of IPs as shown in Figure 8(d). This strongly indicates that jetstar.media, pprx.work, and junk.uno also belong to Proxies Online, and some of its subdomains act as backend gateways to communicate with the RESIPs. Table X lists the hidden backend gateways obtained from PUP traffic for all providers. Interestingly, we found that some hidden backend gateways (pprx.com) were labeled by VirusTotal as malicious sites (at least three indicators) while all of the frontend gateways were clean. This indicates that decoupling different components actually makes the ecosystem more robust.

Provider | Frontend gateway | Backend gateway
Proxies Online | gw.proxies.online | servers.jetstar.media; pprx.work; junk.uno
Geosurf | gw1.geosurf.io | servers.jetstar.media; pprx.work; junk.uno
Luminati | zproxy.luminati.io | zserver.hola.org
TABLE X: Frontend and backend gateways of RESIP services.

Collusion. The study of RESIP traffic in §IV-C reveals that RESIP service providers Proxies Online and Geosurf shared 9 PUPs. Here we further explore the relations among different RESIP service providers in terms of their shared RESIPs. We calculate the intersection rate (|A∩B| / |A|) between the RESIPs captured from different service providers, and further define a very strict criterion to decide whether a RESIP can be considered as shared by two providers. Specifically, we consider a RESIP as shared only if it has ever been captured in the same hour by independent infiltrations on both providers’ services. As shown in Figure 8(c), we found a number of RESIPs spanning different RESIP service providers. The most popular one, Luminati, shares 813 RESIPs with Proxies Online, 983 with Geosurf, 2,783 with IAPS Security, and 1,718 with ProxyRack. Besides, given that Proxies Online and Geosurf share a large portion of their RESIPs, they are likely two brands of the same company, while IAPS is probably a reseller of Luminati as most of its RESIP IPs come from Luminati.
Infrastructure Profiling. After identifying the infrastructure of RESIP services including the frontend websites/gateways and the backend gateways, we conducted further profiling to find the potential features for detecting those infrastructures. For this purpose, we first collected the IPs associated with those infrastructures by sending DNS queries from multiple locations to 48 identified domains and got 915 IPs. Then we ran periodic port scanning on those IPs and found that those frontend and backend gateways tend to open lots of consecutive ports. Specifically, Luminati has ports 23000-23999, 52225 and 52951 opened for frontend gateways and 6861-7009 for backend gateways. Geosurf/Proxies Online have 8010-8237 for frontend gateways and 11211 for backend gateways. Also, ProxyRack opens 1200-1250 and 1500-1750 for frontend gateways. We also randomly scanned the IPs of popular web services and found that none of them open such unusual ports. These ports are related to different proxy services provided by ProxyRack and Geosurf/Proxies Online. However, we do not know how Luminati uses those consecutive ports.

C. Case Study: Luminati
Luminati claimed to be a network where users join willingly by installing client software such as browser extensions or Hola VPN, in order to contribute their network resources while enjoying traffic relaying through other participants. Actually, when we purchased their service, Luminati indeed performed a background check that asked for photo ID and explained to us their traffic policy through a video chat (although only crawling Google is stated to be forbidden). Surprisingly, we found that Luminati (1) proxies through IoT devices that do not support Hola client software, (2) likely resells services to other providers such as IAPS that conduct no background check, and (3) involves RESIPs that host malicious content or are associated with suspicious domains. Specifically, leveraging our IP profiling infrastructure as described in §III-A, we performed real-time device fingerprinting for newly captured RESIPs from Luminati, and identified lots of IoT devices associated with Luminati’s RESIPs like webcam (4.31%), DVR (1.93%), printer (0.13%), VoIP (0.09%) and NAS (1.24%). As Luminati did not provide any Hola clients for these types of devices, our findings undermine its claim to be a network consisting of only willing participants. Instead, IoT devices appear to be an important RESIP source of Luminati.
Our findings in §IV-C and §V-B indicate that IAPS likely resells Luminati’s RESIP service: the PUP traffic logs show that our infiltration traffic from the IAPS proxies was actually relayed by the Hola clients believed to be controlled by Luminati; further, 66% of the RESIPs captured from IAPS were also discovered by our infiltration targeting Luminati during the same hour. We found that IAPS conducts no background
check, accepts various payment methods such as bitcoin, and applies no traffic restrictions. Therefore, IAPS users might be able to abuse Luminati’s network, or even to deny the services for legitimate Luminati customers. We also found that 2.32% of Luminati’s RESIPs were hosting malicious content or having suspicious domains resolved to them while acting as proxies. Examples of such domains include the scam site tummytickle.com and the drive-by-download site www.iwys.cc, and malicious samples downloaded from those RESIPs include PUP, Trojan and exploit code.

VI. DISCUSSION
Mitigation. Our measurements have identified numerous security issues including compromised devices and abuse of RESIP services for malicious activities. A key prerequisite for mitigating such security issues is effective detection of RESIP services and RESIPs, which we plan to pursue as future work. We discuss potential features that are useful for detection.
We first consider detecting RESIP services. We propose to detect three components: their websites, frontend gateways, and backend gateways. (1) Based on our experiences, RESIP websites typically contain noticeable keywords such as “residential IP”, “never blocked” and “HTTP/HTTPS/SOCKS”, which can be used by a search engine or forum crawler for automated content analysis. (2) Frontend gateways are oftentimes co-located with RESIP websites with the same domain names or even IP addresses. Furthermore, as described in §V, frontend gateways tend to open a large number of TCP ports to serve traffic with various proxy requirements. This feature can also be leveraged for detection. (3) Several features can possibly be leveraged to detect backend gateways: opening a large number of TCP ports, having globally distributed sources of DNS queries for a low-reputation domain, and being co-located directly or indirectly with the frontend gateways.
Detecting RESIPs seems challenging. Their discovery can be facilitated using the detected backend gateways as “stepping stones”, since RESIPs have to communicate with the backend gateways. Besides, the visiting patterns and targeted domains of traffic relayed by RESIPs may deviate from those of normal traffic, and can possibly be considered by a detection scheme.
Datasets and Code release. We will release related datasets and source code, as detailed in Appendix IX-C.

VII. RELATED WORK
Dark Web Proxy. The security issue on web proxy services is attracting increasing attention from researchers. In particular, Weaver et al. [81] conducted a measurement study to understand the purpose of free proxy services based on how they modify traffic. Chung et al. [58] studied a paid proxy service to uncover content manipulation in end-to-end connection. O’Neill et al. [72] measured the prevalence of TLS proxies and identified thousands of malware intercepting TLS communications. Carnavalet et al. [62] disclosed security vulnerabilities in TLS proxies, allowing attackers to mount man-in-the-middle attacks. Recently, [80] and [73] showed the content modification behavior of Open HTTP proxy services and free HTTP/HTTPS proxy services. In contrast to the above studies on web proxies and content manipulation, our research studies an emerging online gray business, the RESIP service, and focuses on the abused RESIPs as attack intermediaries and collusive RESIP service providers.
Compromised Host Detection. Detecting compromised hosts has long been studied. Techniques have been developed to analyze web content, redirection chains, and traffic patterns. Examples of the content-based detection include a system [56] monitoring the evolution of web content to identify an infection using signatures generated from such modifications, and a framework [70] conducting semantic differential analysis to identify the infection of the website. Other studies focus on malicious redirectors and attack infrastructures. Examples include JsRED [69] that used a differential analysis to automatically detect malicious redirect scripts hosts, and Shady Paths [79] that captured a compromised host by looking at its redirection graph. Also, a large number of studies detected compromised hosts using traffic analysis via active or passive probing. [71] detected P2P bots by remotely probing the hosts and analyzing the response traffic. [83] combined binary analysis and traffic analysis for P2P bot detection. In our study, we perform best-effort identification and characterization of RESIPs using novel methods. We also compare RESIPs to other types of compromised hosts such as bots, and reveal several challenges for accurately detecting the RESIPs on today’s Internet.
Empirical study of botnet. Botnets have long been studied. For example, [53] revealed structural and behavioral features of botnets such as the high churn rate within a botnet. [60] studied the relationship between botnet and spamming activities. [78] characterized the personal data theft behavior of the Torpig botnet. In contrast, our study focuses on RESIP services that show different characteristics from botnets in their hosts, users and network behaviors, as detailed in §IV-D.

VIII. CONCLUSION
RESIP service is an emerging online gray business, whose security implications have never been studied before. In the paper, we report the first systematic research on this new service, based upon a suite of techniques that address the challenges in collecting RESIP host information and finding illicit activities these proxies are involved in. Specifically, through infiltrating 5 representative services, we gathered over 6.2 million RESIP IPs and further successfully profiled more than 500K hosts, identifying more than 200K IoT devices likely to be compromised to serve as proxies. Further, by linking the IPs to the PUP traffic data provided by our industry partner, we gained a rare look inside the operations of these residential proxies. Our study shows that RESIPs tend to be part of such illicit activities as blackhat SEO, fast fluxing, phishing, malware hosting, etc. Our infiltration analysis also discovered the hidden layer of their infrastructure and the collusions across different services. Moving forward, we believe that unregulated RESIP services indeed pose new threats to the Internet users and further research is needed to get a more comprehensive view of the services and develop effective solutions to mitigate their security risks.
ACKNOWLEDGMENT
We are grateful to our shepherd Professor Matthew Smith and the anonymous reviewers for their insightful and helpful comments. The IU authors are supported in part by NSF 1408874, 1527141, 1618493, 1618898 and ARO W911NF1610127. Also, authors from Tsinghua University are supported in part by the National Natural Science Foundation of China (grant 61772307) and CERNET Innovation Project NGII20160403.

REFERENCES
[1] Geosurf: Residential and data center proxy network. https://www.geosurf.com/.
[2] Iaps security. https://www.intl-alliance.com/.
[3] Luminati: largest business proxy service. http://luminati.io/.
[4] The netflix vpn ban can be bypassed – here’s how it can be done responsibly.
[5] Proxies online. http://proxies.online.
[6] Proxyrack. https://www.proxyrack.com/.
[7] Public proxy service. www.kuaidaili.com/.
[8] On the trail of malicious dynamic dns domains. https://umbrella.cisco.com/blog/2013/04/15/on-the-trail-of-malicious-dynamic-dns-domains/, 2013.
[9] Alienvault otx. https://otx.alienvault.com, 2017.
[10] Cymon otx. https://cymon.io/, 2017.
[11] Free proxy list. http://www.freeproxylists.com, 2017.
[12] Hajime - netlab opendata project. http://data.netlab.360.com/hajime/, 2017.
[13] Iot reaper: A rappid spreading new iot botnet. http://blog.netlab.360.com/iot_reaper-a-rappid-spreading-new-iot-botnet-en/, 2017.
[14] Ip2location db8. https://www.ip2location.com/databases/db8-ip-country-region-city-latitude-longitude-isp-domain, 2017.
[15] Ip2proxy lite. https://lite.ip2location.com/database/px1-ip-country, 2017.
[16] Nmap service detection probe list. https://svn.nmap.org/nmap/nmap-service-probes, 2017.
[17] Passive dns from 360 netlab. https://passivedns.cn, 2017.
[18] Spamhaus edrop. https://www.spamhaus.org/drop/, 2017.
[19] Tor exit nodes. https://check.torproject.org/exit-addresses, 2017.
[20] Tor node list from dan. https://www.dan.me.uk/tornodes, 2017.
[21] Virustotal. https://www.virustotal.com, 2017.
[22] Webanet free proxy list. https://webanetlabs.net/publ/24, 2017.
[23] Access to apnic whois data. https://www.apnic.net/manage-ip/using-whois/bulk-access/, 2018.
[24] Afrinic bulk whois data. https://www.afrinic.net/library/membership-documents/207-bulk-whois-access-form-, 2018.
[25] Aliyun ip ranges. https://ipinfo.io/AS37963, 2018.
[26] Amazon aws ip address ranges. https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html, 2018.
[27] Amazon echo. https://en.wikipedia.org/wiki/Amazon_Echo, 2018.
[28] At&t mobility llc. https://whois.arin.net/rest/org/ATTMO-3, 2018.
[29] At&t mobility llc. https://en.wikipedia.org/wiki/AT%26T_Mobility, 2018.
[30] Available hola clients. https://hola.org/download, 2018.
[31] Blackhat seo forum: Proxies for sale. https://www.blackhatworld.com/forums/proxies-for-sale.112/, 2018.
[32] Cloudflare ip ranges. https://www.cloudflare.com/ips/, 2018.
[33] Geoip2 precision insights service. https://www.maxmind.com/en/geoip2-precision-insights, 2018.
[34] Google compute engine ip ranges. https://cloud.google.com/compute/docs/faq#where_can_i_find_product_name_short_ip_ranges, 2018.
[35] Google home. https://en.wikipedia.org/wiki/Google_Home, 2018.
[36] Hola faq. https://hola.org/faq#intro-cost, 2018.
[37] Ibm cloud ip ranges. https://console.bluemix.net/docs/infrastructure/hardware-firewall-dedicated/ips.html#ibm-cloud-ip-ranges, 2018.
[38] Microleaves. https://microleaves.com/, 2018.
[39] Microsoft azure datacenter ip ranges. https://www.microsoft.com/en-us/download/details.aspx?id=41653, 2018.
[40] Obtaining bulk whois data from arin. https://www.arin.net/resources/request/bulkwhois.html, 2018.
[41] Philips hue lights. https://en.wikipedia.org/wiki/Philips_Hue, 2018.
[42] Pure vpn. https://www.purevpn.com/, 2018.
[43] Raw scan data of censys. https://censys.io/data, 2018.
[44] Rdap protocol. https://about.rdap.org/, 2018.
[45] Request for bulk whois of lacnic. http://www.lacnic.net/en/web/lacnic/manual-8, 2018.
[46] Ripe whois apis. https://www.ripe.net/analyse/archived-projects/ris-tools-web-interfaces/riswhois, 2018.
[47] Salesforce ip ranges. https://help.salesforce.com/articleView?id=000003652&type=1, 2018.
[48] Shodan. https://www.shodan.io/, 2018.
[49] Storm proxies. http://stormproxies.com/, 2018.
[50] Tor volunteer. https://www.torproject.org/getinvolved/volunteer.html.en, 2018.
[51] Trace my ip. http://www.tracemyip.org/, 2018.
[52] Zoomeye. https://www.zoomeye.org/, 2018.
[53] M. Abu Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 41–52. ACM, 2006.
[54] S. Alrwais, X. Liao, X. Mi, P. Wang, X. Wang, F. Qian, R. Beyah, and D. McCoy. Under the shadow of sunshine: Understanding and detecting bulletproof hosting on legitimate service provider networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 805–823. IEEE, 2017.
[55] M. Antonakakis, T. April, M. Bailey, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, D. Menscher, C. Seaman, N. Sullivan, et al. Understanding the mirai botnet. 2017.
[56] K. Borgolte, C. Kruegel, and G. Vigna. Delta: automatic identification of unknown web-based infection campaigns. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 109–120. ACM, 2013.
[57] L. Carl et al. Using machine learning techniques to identify botnet traffic. In Local Computer Networks, Proceedings 2006 31st IEEE Conference on. IEEE, 2006.
[58] T. Chung, D. Choffnes, and A. Mislove. Tunneling for transparency: A large-scale analysis of end-to-end violations in the internet. In Proceedings of the 2016 ACM on Internet Measurement Conference, pages 199–213. ACM, 2016.
[59] T. Chung, R. van Rijswijk-Deij, B. Chandrasekaran, D. Choffnes, D. Levin, B. M. Maggs, A. Mislove, and C. Wilson. A longitudinal, end-to-end view of the dnssec ecosystem. 2017.
[60] M. P. Collins, T. J. Shimeall, S. Faber, J. Janies, R. Weaver, M. De Shon, and J. Kadane. Using uncleanliness to predict future botnet addresses. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 93–104. ACM, 2007.
[61] A. Cui and S. J. Stolfo. A quantitative analysis of the insecurity of embedded network devices: results of a wide-area scan. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 97–106. ACM, 2010.
[62] X. d. C. de Carnavalet and M. Mannan. Killed by proxy: Analyzing client-end tls interception software. In Network and Distributed System Security Symposium, 2016.
[63] Z. Durumeric, D. Adrian, A. Mirian, M. Bailey, and J. A. Halderman. A search engine backed by internet-wide scanning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 542–553. ACM, 2015.
[64] Z. Durumeric, E. Wustrow, and J. A. Halderman. Zmap: Fast internet-wide scanning and its security applications. In USENIX Security Symposium, volume 8, pages 47–53, 2013.
[65] S. Garcia, M. Grill, J. Stiborek, and A. Zunino. An empirical comparison of botnet detection methods. computers & security, 45:100–123, 2014.
[66] E. J. Hernandez-Valencia. Architectures for broadband residential ip services over catv networks. IEEE Network, 11(1):36–43, 1997.
[67] P. Kalaivani and M. Vijaya. Mining based detection of botnet traffic in network flow.
[68] A. Kountouras, P. Kintis, C. Lever, Y. Chen, Y. Nadji, D. Dagon, M. Antonakakis, and R. Joffe. Enabling network security through active dns datasets. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 188–208. Springer, 2016.
[69] Z. Li, S. Alrwais, X. Wang, and E. Alowaisheq. Hunting the red fox online: Understanding and detection of mass redirect-script injections. In Security and Privacy (SP), 2014 IEEE Symposium on, pages 3–18. IEEE, 2014.
[70] X. Liao, K. Yuan, X. Wang, Z. Pei, H. Yang, J. Chen, H. Duan, K. Du, E. Alowaisheq, S. Alrwais, et al. Seeking nonsense, looking for trouble: Efficient promotional-infection detection through semantic inconsistency
search. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 707–723. IEEE, 2016.
[71] A. Nappa, Z. Xu, M. Z. Rafique, J. Caballero, and G. Gu. Cyberprobe: Towards internet-scale active detection of malicious servers. In Proceedings of the 2014 Network and Distributed System Security Symposium (NDSS 2014), pages 1–15, 2014.
[72] M. O’Neill, S. Ruoti, K. Seamons, and D. Zappala. Tls proxies: Friend or foe? In Proceedings of the 2016 ACM on Internet Measurement Conference, pages 551–557. ACM, 2016.
[73] D. Perino, M. Varvello, and C. Soriente. Proxytorrent: Untangling the free http (s) proxy ecosystem. 2018.
[74] D. Plohmann, E. Gerhards-Padilla, and F. Leder. Botnets: Detection, measurement, disinfection & defence. European Network and Information Security Agency (ENISA), 1(1):1–153, 2011.
[75] M. Sebastián, R. Rivera, P. Kotzias, and J. Caballero. Avclass: A tool for massive malware labeling. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 230–253. Springer, 2016.
[76] S. Soltani, S. A. H. Seno, M. Nezhadkamali, and R. Budiarto. A survey on real world botnets and detection mechanisms. International Journal of Information and Network Security, 3(2):116, 2014.
[77] D. Springall, Z. Durumeric, and J. A. Halderman. Ftp: The forgotten cloud. In Dependable Systems and Networks (DSN), 2016 46th Annual IEEE/IFIP International Conference on, pages 503–513. IEEE, 2016.
[78] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your botnet is my botnet: analysis of a botnet takeover. In Proceedings of the 16th ACM conference on Computer and communications security, pages 635–647. ACM, 2009.
[79] G. Stringhini, C. Kruegel, and G. Vigna. Shady paths: Leveraging surfing crowds to detect malicious web pages. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 133–144. ACM, 2013.
[80] G. Tsirantonakis, P. Ilia, S. Ioannidis, E. Athanasopoulos, and M. Polychronakis. A large-scale analysis of content modification by open http proxies. 2018.
[81] N. Weaver, C. Kreibich, M. Dam, and V. Paxson. Here be web proxies. In International Conference on Passive and Active Network Measurement, pages 183–192. Springer, 2014.
[82] U. Wijesinghe, U. Tupakula, and V. Varadharajan. An enhanced model for network flow based botnet detection. In Proceedings of the 38th Australasian Computer Science Conference (ACSC 2015), volume 27, page 30, 2015.
[83] Z. Xu, L. Chen, G. Gu, and C. Kruegel. Peerpress: utilizing enemies’ p2p strength against them. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 581–592. ACM, 2012.
[84] H. R. Zeidanloo, A. B. A. Manaf, R. B. Ahmad, M. Zamani, and S. S. Chaeikar. A proposed framework for p2p botnet detection. International Journal of Engineering and Technology, 2(2):161, 2010.

IX. APPENDIX
A. Residential Classifier
Crafted residential device names and types. The crafted residential device names and types are listed in Table XI. They are either consumer devices exclusively used in home network environment or network function devices usually working as components of residential network facilities.

Device Names | Philips Hue Light; Amazon Echo; Wemo Switch; Nest Thermostat; Amazon Fire TV
Device Types | Broadband Residential Gateway; Residential ADSL Gateway; VoIP Phone Adapter; Media Device; DVR
TABLE XI: Crafted residential device names and types

Sources of non-residential ground truth. Here we provide more details about our non-residential datasets as introduced in §III-B. To collect IPs from cloud services, we gathered lists of IP CIDRs published by popular cloud providers including Amazon AWS [26], Google Cloud [34], Microsoft Azure [39], IBM Cloud [37], Aliyun [25], CloudFlare [32], and Salesforce [47]. All those together contribute 53 million IPs distributed in 210K /24 and 968 /16 network blocks. We further looked up the Active DNS database for Alexa top 1 million websites and gathered 442K IPs. Another 519 IPs are collected from PureVPN [42], a popular commercial VPN service.
Features. Before going through all 35 features, we first recap the following definitions (introduced in §II) used in our features. For each IP address, we define Direct Inetnum as the leaf inetnum node where this IP resides, and Inetnum Tree Path as the inetnum path from the root inetnum node (0.0.0.0/0) to its Direct Inetnum. We also define two kinds of owners: one is Direct Owner, represented by the organization ID or person ID referred to in its direct inetnum; the other is Loose Owner, represented by all org and person objects sharing with the direct owner the same contact information, including either phone numbers or email addresses. As introduced in §III-B, 35 features are used in our residential classifier and they can be grouped into two categories by the datasets used to generate them: IP Whois and Active DNS.
Features from Active DNS. We retrieve DNS records from the latest ActiveDNS database for the following targets: the given IP, its current direct inetnum, its /24 IP prefix. Then, we profile each target using TLD+2/TLD+3 domains resolved to the IP range of the target. Specifically, we designed the following 12 features.
• F-1: # of TLD+2 domains resolved to the given IP.
• F-2: # of TLD+3 domains resolved to the given IP.
• F-3: Percentage of IPs in current direct inetnum with DNS records.
• F-4/F-5: Mean/Maximum number of TLD+3 domains resolved to IPs in current direct inetnum.
• F-6/F-7: Mean/Maximum number of TLD+2 domains resolved to IPs in current direct inetnum.
• F-8: Percentage of IPs in /24 IP prefix with DNS records.
• F-9/F-10: Mean/Maximum number of TLD+3 domains resolved to IPs in /24 IP prefix.
• F-11/F-12: Mean/Maximum number of TLD+2 domains resolved to IPs in /24 IP prefix.
Features from IP Whois. The remaining 23 features are derived from IP Whois, specifically the 24 historical snapshots of IP Whois captured in the last 24 months. Here, historical direct inetnums means the 24 direct inetnums in the corresponding 24 historical snapshots, while historical direct owners and historical loose owners share similar meanings.
• F-13: # of unique historical direct inetnums
• F-14 to F-18: Current/Maximum/Mean/Minimum/Standard deviation of the sizes of historical direct inetnums.
• F-19 to F-23: Current/Maximum/Mean/Minimum/Standard deviation of the depths of historical direct inetnums.
• F-24: # of unique assignment types of historical direct inetnums
• F-25: Assignment type of the current direct inetnum
• F-26: # of current direct owners
• F-27: # of historical direct owners
• F-28: the percent of current direct owners over historical direct owners
• F-29: # of direct inetnums of the current direct owners
• F-30: # of IPs of the current direct owners
• F-31: # of current loose owners
• F-32: # of historical loose owners
• F-33: the percent of current loose owners over historical loose owners
• F-34: # of direct inetnums of the current loose owners
• F-35: # of IPs of the current loose owners
Figure 9 shows the CDFs for some example features on our labeled training set including 10K residential and 10K non-residential IPs.

Dataset | Label | % resi | % non-resi
Device Search Engines | resi-clean | 98.47% | 1.53%
Trace My IP | resi-noisy | 94.36% | 5.64%
Filtered IP Whois | resi-noisy | 99.10% | 0.90%
IoT Botnets | resi-noisy | 98.82% | 1.18%
Public Clouds | non-resi-clean | 0.39% | 99.61%
Alexa Top 1M | non-resi-clean | 2.45% | 97.55%
Public Proxies | non-resi-noisy | 63.54% | 36.46%
RESIP IPs | Unknown | 95.22% | 4.78%
TABLE XII: Evaluation results of our residential classifier on various datasets. Last two columns show the percentage of IPs in the given dataset being predicted as residential or non-residential.

MD5 | Name | Providers
74ac25ba1fa653041b3e2a3d60ceb1d0 | hola_svc.exe | LU, IAPS
707ffb5567bf730136614d3356a7d3c5 | csrss.exe | PR
7971ebdb5da5c60d0b3f3d8523d94ec7 | svchostwork.exe | GS, PO
6925e54c4aecd522230f5765aa6e5a29 | swufeb17.exe | PO
2639cd8da42d90a2e112c3d7d3e35540 | netmedia.exe | GS, PO
7b024bb2efa5428bbd04f513849cc185 | start.vbs | PO
e7dca36767fadfded989ed67e23c2eda | cloudnet.exe | PR
b4b595be616779d4a557cdb49b1350d0 | hola_plugin.exe | LU
d85dab7b7112af3feda144bbbffa9b49 | produpd.exe | PR
c0a3b6dbbb454a7f3f345d7a87f8e487 | pprx.exe | PO
TABLE XIII: List of the top 10 PUPs with their MD5.

Evaluation and results. Using the training data of 10K residential IPs and 10K non-residential IPs, we train classifiers of three types: Support Vector Machine (SVM), Random Forest (RF) and Decision Tree (DT). We further evaluate the
effectiveness of the models by 5-fold cross validation, testing automated queries. Our validation shows that the classifier
them on the rest of the four labeled datasets as well as the achieved a high precision, 95.80%.
unlabeled dataset (the RESIP IP dataset) with sampled manual
validation. B. Botnet Connections
• 5-Fold cross validation. We explored the three classifiers We studied whether IoT botnets are involved in RESIP
with various parameters. 5-fold cross validation reveals random services. Through cross-matching our RESIP IP database with
forest with 50 trees outperforms others, achieving the precision two botnet IP blacklists (Hajime [12] and IoT Reaper [13],
of 95.61% and the recall of 97.12%. see §III-D), we found 1,248 IPs reported by at least one
blacklist on the same way when serving as RESIPs. We further
• Testing on the labeled set. We test the random forest model on
discovered 28,097 RESIP IPs blacklisted between July 2017
all ground truth sets shown in Table II (only those not selected
and Nov 2017. These findings indicate that at least some
for training). As shown in Table XII, overall the classifier
resources are shared between RESIP services and botnets, due
works well. However, surprisingly, it detects 2.45% of IPs in
to either co-hosting of both bots and RESIP software on the
Alexa top 1M set as residential IPs. We find that the domains of
same residential system or co-existence of the RESIP system
those IPs often belong to small local organizations (e.g., local
and the bot-infected system behind the same NAT.
governments or small education institutions) who access the
network through residential ISP networks. Another interesting C. Others
finding is that 65.81% of public proxies (most are either Tor
Datasets and Code release. We will continue collecting
relays or proxy IPs from KuaiDaili service) are predicted as
and profiling more RESIP services and their RESIPs. Using
residential, indicating Tor network’s effective recruitment of
the techniques developed in this paper, we are working on
relay volunteers, and also the suspicious proxy sources of
publishing a service at http://rpaas.site where users can query
KuaiDaili service.
using a network prefix and obtain a comprehensive report on
• Manually validating on the unlabeled set. We also apply how the prefix has been used as RESIPs. We will also release
the random forest model on 6.2M RESIP IPs we collected weekly snapshots of our RESIP dataset, groundtruth datasets
(see §III-A). We detect 5.9M (95.22%) residential IPs and for our residential IP classifier, and all source code of this
0.3M (4.78%) non-residential IPs. To evaluate the results, we work once this paper is published.
randomly sampled and manually validated 1K RESIP IPs.
Our validation was based upon a set of indicators identified
manually. In particular, we searched the Internet to find out
whether the owner of a given IP, as indicated in its Whois
record, is an ISP or an organization; further we searched the
IP itself, which if utilized for a hosting service, most likely
was analyzed and reported by the IP information websites
such as http://whatismyipaddress.com/ip. The reason we used
those as indicators instead of classification features for manual
validation is the former are easier for human to tell. Also
some of the services have rate limits, prohibiting large-scale
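The prefix query mentioned under "Datasets and Code release" could be sketched as follows. This is only an illustration using Python's standard `ipaddress` module, not the interface of the service itself: the function names `build_index` and `query_prefix`, the `(ip, provider, date)` record layout, and the sample addresses (documentation ranges, not real RESIPs) are all assumptions.

```python
import ipaddress
from collections import defaultdict


def build_index(resip_sightings):
    """Group hypothetical (ip, provider, date) sightings by IP address."""
    index = defaultdict(list)
    for ip, provider, date in resip_sightings:
        index[ipaddress.ip_address(ip)].append((provider, date))
    return index


def query_prefix(index, prefix):
    """Return the sightings of every indexed IP that falls inside `prefix`."""
    network = ipaddress.ip_network(prefix, strict=False)
    # Sort by (version, integer value) so IPv4 and IPv6 keys never get compared directly.
    ordered = sorted(index.items(), key=lambda kv: (kv[0].version, int(kv[0])))
    return {str(ip): records for ip, records in ordered if ip in network}


# Sample data from the RFC 5737 documentation ranges (assumed, not measured).
sightings = [
    ("192.0.2.77", "provider-b", "2017-08-03"),
    ("192.0.2.10", "provider-a", "2017-08-01"),
    ("198.51.100.5", "provider-a", "2017-08-02"),
]
report = query_prefix(build_index(sightings), "192.0.2.0/24")
```

A linear scan over the index is enough for a sketch; a dataset of millions of IPs would more likely call for a prefix tree or a sorted integer array with binary search.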
[Figure 9: fifteen CDF panels, each comparing Resi and Non-Resi IPs]
(a) F-2: # of TLD+3 domains resolved to the given IP.
(b) F-3: Percentage of IPs in current direct inetnum with DNS records.
(c) F-4: Mean number of TLD+3 domains resolved to IPs in current direct inetnum.
(d) F-6: Mean number of TLD+3 domains resolved to IPs in current direct inetnum.
(e) F-8: Percentage of IPs in /24 IP prefix with DNS records.
(f) F-9: Mean number of TLD+3 domains resolved to IPs in /24 IP prefix.
(g) F-11: Mean number of TLD+3 domains resolved to IPs in current direct inetnum.
(h) F-17: log10 Mean value of the sizes of historical direct inetnums
(i) F-21: Mean value of the depths of historical direct inetnums
(j) F-25: Assignment type of the current direct inetnum
(k) F-29: # of direct inetnums of the current direct owners
(l) F-30: log10 # of IPs of the current direct owners
(m) F-33: the percent of current loose owners over historical loose owners
(n) F-34: # of direct inetnums of the current loose owners
(o) F-35: log10 # of IPs of the current loose owners
Fig. 9: Cumulative distribution functions of example features on our labeled training dataset.
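As an illustration of the IP Whois features listed in Appendix A, the size statistics over historical direct inetnums (the number of unique inetnums plus the current, maximum, mean, minimum and standard deviation of their sizes) could be computed along the following lines. This is a minimal sketch under several assumptions that the paper does not state: inetnums are given as CIDR strings, the list is ordered from oldest to newest so the last entry is the current inetnum, and the standard deviation is the population one. The function name and the dictionary keys are hypothetical.

```python
import ipaddress
import statistics


def inetnum_size_features(historical_inetnums):
    """Summarize the sizes of historical direct inetnums (oldest first)."""
    if not historical_inetnums:
        raise ValueError("at least one snapshot is required")
    networks = [ipaddress.ip_network(n) for n in historical_inetnums]
    # Size of an inetnum = number of addresses it covers (assumed definition).
    sizes = [net.num_addresses for net in networks]
    return {
        "unique_inetnums": len(set(networks)),
        "current_size": sizes[-1],  # assumes the last snapshot is the current one
        "max_size": max(sizes),
        "mean_size": statistics.mean(sizes),
        "min_size": min(sizes),
        "std_size": statistics.pstdev(sizes),  # population std dev (assumption)
    }


# Four made-up snapshots: a /24 that was later narrowed to a /25.
snapshots = ["192.0.2.0/24", "192.0.2.0/24", "192.0.2.0/24", "192.0.2.0/25"]
features = inetnum_size_features(snapshots)
```

The same pattern would apply to the depth statistics, given a depth value per snapshot instead of a size.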
