

SpeedReader: Reader Mode Made Fast and Private

Mohammad Ghasemisharif
University of Illinois at Chicago
Chicago, IL, USA

Peter Snyder
Brave Software
San Francisco, CA, USA

Andrius Aucinas
Brave Software
London, UK

Benjamin Livshits
Brave Software & Imperial College London
London, UK

ABSTRACT
Most popular web browsers include “reader modes” that improve the user experience by removing un-useful page elements. Reader modes reformat the page to hide elements that are not related to the page’s main content. Such page elements include site navigation, advertising related videos and images, and most JavaScript. The intended end result is that users can enjoy the content they are interested in, without distraction.

In this work, we consider whether the “reader mode” can be widened to also provide performance and privacy improvements. Instead of its use as a post-render feature to clean up the clutter on a page, we propose SpeedReader as an alternative multistep pipeline that is part of the rendering pipeline. Once the tool decides during the initial phase of a page load that a page is suitable for reader mode use, it directly applies document tree translation before the page is rendered. Based on our measurements, we believe that SpeedReader can be continuously enabled in order to drastically improve end-user experience, especially on slow mobile connections. Combined with our approach to predicting which pages should be rendered in reader mode with 91% accuracy, SpeedReader achieves average speedups and bandwidth reductions of up to 27× and 84×, respectively. We further find that our novel “reader mode” approach brings with it significant privacy improvements to users. Our approach effectively removes all commonly recognized trackers, issues 115 fewer requests to third parties, and interacts with 64 fewer trackers on average, on transformed pages.

CCS CONCEPTS
• Human-centered computing → Web-based interaction; • Information systems → Browsers; Clustering and classification; Content analysis and feature selection; • Security and privacy → Privacy protections.

KEYWORDS
Reader Mode; Boilerplate Removal; Web Document Classification; Web Performance; Ad Blocking

ACM Reference Format:
Mohammad Ghasemisharif, Peter Snyder, Andrius Aucinas, and Benjamin Livshits. 2019. SpeedReader: Reader Mode Made Fast and Private. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3308558.3313596

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’19, May 13–17, 2019, San Francisco, CA, USA
© 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-6674-8/19/05.
https://doi.org/10.1145/3308558.3313596

1 INTRODUCTION
“Web bloat” is a colloquial term that describes the trend in websites to accumulate size and visual complexity over time. The phenomenon has been measured in many dimensions, including total page size [7], page load time [5, 44, 45], memory needed [30], number of network requests [16, 28], amount of scripts executed [26, 34, 37, 39] and third parties contacted [25, 26, 28]. This work suggests that growth in page size and complexity is outpacing improvements in device hardware. All of this has a predictably negative impact on user experience.

Web users and browser vendors have reacted to this “bloat” in a variety of ways, all partially helpful, but with significant downsides. Ad and tracking blockers are a popular and useful tool for reducing the size and complexity of sites. Prior work has shown that these tools can be effective in reducing privacy leaks [31] and network use, and in extending device memory life. Such tools, which use filter lists, are inherently limited in the scope of improvements they can achieve. While these filter lists are large [42], they are small as a proportion of all URLs on the web. Similarly, while these lists are updated often, they are updated slowly compared to URL updates on the web.

Similarly, “reader mode” tools, provided in many popular browsers and browser extensions, are an effort to reduce the growing visual complexity of websites. Such tools attempt to extract the subset of page content useful to users, and remove advertising, animations, boilerplate code, and other non-core content. Current “reader modes” do not provide the user with resource savings since the referenced resources have already been fetched and rendered. The growth and popularity of such tools suggest they are useful to browser users looking to address the problem of page clutter and visual “bloat”.

In this work, we propose a novel strategy called SpeedReader for dealing with resource use and bloat on websites. Our technique provides a user experience similar to existing “reader mode” tools, but with network, performance, and privacy improvements that exceed existing ad and tracking blocking tools, on a significant portion of websites. Significantly, SpeedReader differs from existing deployed reader mode tools by operating before page rendering,
which allows it to determine which resources are needed for the page’s core content before fetching.

How we achieve speedups. SpeedReader achieves its performance improvements through a two-step pipeline:

(1) SpeedReader uses a classifier to determine whether there is a readable subset of the initial, fetched page HTML. This classifier is trained on a labeled corpus of 2,833 websites (see Section 3), and determines whether a page can be displayed in reader mode with 91% accuracy.
(2) If the classifier has determined that the page is readable, SpeedReader extracts the readable subset of the document before rendering, using a variety of heuristics developed in prior research [24] and by browser vendors [9, 23], and passes the simplified, reader mode document to the browser’s render layer. This tree translation step is described in Section 4.

Deployment. Combined with a highly accurate classifier of “readable” pages, the drastic improvements in performance, reduction in bandwidth use and elimination of trackers in reader mode make the approach practical for continuous use. We therefore propose SpeedReader as a sticky feature that a user can toggle to be always on. This approximates the experience of using an e-book reader, but with the strengths of content availability on the web. It is also a suitable strategy for content prerendering or prefetching that could be implemented by web browser vendors, automatically delivering graceful performance degradation in poor connectivity areas or on underpowered mobile devices until the rest of the page content can be fetched for a complete render.

Contributions.

• Novel approach to Reader Mode - combining a machine-learning driven approach to checking whether content can be transformed to a text-focused representation for end-user consumption.
• Applicability - we demonstrate that 22.0% of web pages are convertible to reader mode style in a dataset of pages reported popular by Alexa. We further demonstrate that 46.27% of pages shared on social networks are readable.
• Privacy - we demonstrate that using reader mode in the proposed design provides superior privacy protection, effectively removing all trackers from the tested pages, and dramatically reducing communication with third parties.
• Ad Blocking - we show that our unique reader mode approach blocks ads at least as well as existing ad blocking tools, blocking 100% of resources labeled as advertising related by EasyList in a crawl of 91,439 pages, without the need to use hand curated, hard-coded filter lists.
• Speed - we find that the lightweight nature of reader mode content results in huge performance gains, with up to 27× page load time speedup on average, together with up to 84× bandwidth and 2.4× memory reduction on average.

Paper organization. The rest of this paper is structured as follows. Section 2 provides background information to place SpeedReader in context. Section 3 describes the design, evaluation and accuracy of the classifying step in the SpeedReader pipeline, and Section 4 gives the design and evaluation of the reader mode extraction step in the SpeedReader pipeline. Section 3.4 measures how many websites users encounter that are amenable to SpeedReader, under several browser use scenarios. Section 5 provides some discussion of how our findings can inform future readability, privacy and performance work, Section 6 places this work in the context of prior research, and Section 7 concludes.

2 BACKGROUND
2.1 Terminology
This subsection presents several terms that are not standardized. We present them up front, to ease the understanding of the rest of the work.

Reader mode. We use the term “reader mode” to describe any tool that attempts to extract a useful subset of a website for a simplified presentation. These tools can be either included in the web browser by the browser vendor, added by users through browser extensions, or provided by third parties as a web service. Our use of the term “reader mode” is generic to the concept, and should not be confused with any specific tool.

Classification and transduction. Reader mode tools generally include both a technique for determining whether a page is readable, which we refer to as “classification”, and a strategy for converting the initial HTML tree into a simplified reader mode tree, which we refer to as “tree transduction”. Though most reader mode tools include both steps within a single tool or library, they are conceptually distinct.

Readable. We use the term “readable” to describe whether a web page contains a subset of content that would be useful to display in a reader mode presentation. Reader mode presentation works best on pages that are text and image focused, and that are mostly static (i.e. few interactive page elements). Examples of such readable pages include articles, blog posts, and news sites. Reader mode presentation does not work well on websites that are highly interactive, or when a page’s structure is significant to the page’s content. Examples of such non-readable pages include web applications (e.g. Google Mail, Google Maps) or pages that are indexes of other content.

2.2 Existing Reader Modes
Several popular web browsers include reader modes designed to simplify a page’s presentation, so that browser users can read the page’s contents without the distraction of visual clutter such as advertisements, page animations, and unnecessary page boilerplate (e.g. footers, page navigation, comments).

In this section, we give a brief description of several existing reader mode tools, how they’re deployed by their authors, and how they are used in the evaluations given in the rest of this paper.

Readability.js. Readability.js [9] is an open source reader mode library, implemented in JavaScript. It is maintained by Mozilla, and is used for the reader mode function in Firefox. The code is closely related to “Readability” [2], an open sourced library developed by Arc90 and used for their now-defunct readability.com web service. Classification works by looking for the element on the page with the highest density of text and link nodes. If the number and density of text and link nodes in that element exceed a given threshold, the library treats the page as readable. Tree transduction works
by normalizing the contents of the text-and-link dense element (to remove styling and other mark up), looking for nearby images for inclusion, and using text patterns in the document that identify the page’s author, source and publication date.

Significant to SpeedReader, Readability.js does not consider any display or presentation information when performing either the classification or tree transduction steps. This means that the page does not need to be loaded and rendered to generate a reader mode presentation (though in practice Firefox does not use this library in this way).

Safari Reader View. Safari Reader View is a JavaScript library that implements the reader mode presentation in Safari. Like Readability.js, it is also a fork from Arc90’s “Readability”, though Apple has changed how the library works in significant ways. In addition to looking for elements with high text and anchor density, Safari Reader View also uses presentation-level heuristics, including where elements appear on the page and what elements are hidden from display by default. Relevant to SpeedReader, this means that Safari Reader View must load a page and at least some of its resources (e.g. images, CSS, JavaScript) to perform either the classification or tree transduction level decisions. Because significant portions of Safari Reader View require a document be fetched and rendered before being evaluated, we do not consider it further in this work (for reasons that are detailed in Sections 3 and 4).

BoilerPipe. BoilerPipe is an academic research project from Kohlschütter et al. [24], and is implemented in Java. BoilerPipe has not been deployed directly by any browser vendor. BoilerPipe does not provide functionality for (readability) classification, and assumes that any HTML document contains a readable subset. For tree transduction, BoilerPipe considers number of words and link density features. Like Readability.js, it does not require a browser to load and render a page in order to do reader mode extraction. Their analysis reveals a strong correlation between short text and boilerplate, as well as between long text and actual content text (of the textual content) on the Web. Using features with low calculation cost such as number of words enables BoilerPipe to lower the overhead while maintaining high accuracy.

DOM Distiller. DOM Distiller is a JavaScript and C++ library maintained by Google, and used to implement reader mode in recent versions of Chrome. The project is based on BoilerPipe, though it has been significantly changed by Google. The classification step in DOM Distiller uses a classifier based approach, and considers features such as whether the page’s URL contains certain keywords (e.g. “forum”, “.php”, “index”), if the page’s markup contains Facebook open graph or Google AMP identifiers, or the number of “/” characters used in the URL’s path, in addition to the text-and-link density measures used by Readability.js. At a high level, the tree transduction step also looks at text-and-link dense elements in the page, as well as special-cased embedded elements, such as YouTube or Vimeo videos.

DOM Distiller considers some render-level information in both the classification and tree transduction steps. For example, any elements that are hidden from display are not included in the text-and-link density measurements. These render-level checks are a small part of DOM Distiller’s strategy. We modified DOM Distiller to remove these display level checks, so that DOM Distiller could be applied to prerendered HTML documents. We note that the evaluation of DOM Distiller in this work uses this modified version of DOM Distiller, and not the version that Google ships with Chrome. We expect this modification to have minimal effects on the discussed measurements, but draw the reader’s attention to this change for completeness.

Figure 1: Comparison of SpeedReader (left) with other existing reader modes (right)

2.3 Comparison to SpeedReader
The reader mode functionality shipped with all current major browsers is applied after the document is fully fetched and rendered.1 This greatly restricts the possible performance, network and privacy improvements existing reader modes can achieve. In fact, in some reader mode implementations we measured, using reader modes increased network use, as some resources were fetched twice, i.e. once for the initial page loading, and then again when presenting images in the reader mode display.

1 While Readability.js does not require that the page be rendered before making reader mode evaluations, in practice Firefox does not expose reader mode functionality to users until after the page is fetched and loaded.

Most significantly, SpeedReader differs from existing reader mode techniques in that it is implemented strictly before the display, rendering, and resource-fetching steps in the browser. SpeedReader can therefore be thought of as a function that sits between the browser’s network layer and its render layer: it takes as input the initial HTML document, and returns either the received HTML (when there is no readable subset), or a greatly simplified HTML document, representing the reader mode presentation (when there is a readable subset). Figure 1 provides a high level comparison of how SpeedReader functions, compared to existing reader modes.

The fact that SpeedReader only considers features available in the initial HTML and URL enables SpeedReader to achieve performance orders of magnitude above existing approaches. Figure 2
provides a strawman example of a news page as delivered to a standard client: including portal branding and content, but also a range of links to different articles, images and trackers, for a total of 2.7 MB of data transferred and 53 scripts executed. Figure 3 demonstrates the functionality of SpeedReader when applying existing reader mode transducers to just the initial HTML document. Therefore, for documents SpeedReader determines are readable, the sources of SpeedReader improvements include:

• Never fetching or executing script or CSS.
• Fetching far fewer images or videos (since images and videos not core to the page’s presentation are never retrieved).
• Performing network requests to far fewer third parties (zero, in the common case).
• Saving processing power from not rendering animations, videos or complex layout operations, since reader mode presentations of page content are generally simple.

The above are just some of the ways that SpeedReader is able to achieve considerable performance improvements. The following sections describe how SpeedReader’s classification and tree transduction steps were designed and evaluated, and what percentage of websites are amenable to SpeedReader’s approach.

Figure 2: An example page loaded with Google Chrome browser with no modifications

Figure 3: The example page transformed with each of the evaluated SpeedReader transducers

3 PAGE CLASSIFICATION
SpeedReader uses a two stage pipeline for generating reader mode versions of websites. This section presents the design and evaluation of the first half of the pipeline, the classification step.

3.1 Classifier Design
The classification step of SpeedReader uses a random forest classifier, trained on a hand-labeled data set of 2,833 websites. Our classifier takes as input a string, depicting an HTML document, and returns a boolean label of whether there is a readable subset of the document. We note that the input to the classifier is the initial HTML returned by the server, and not the final state of the website after JavaScript execution.

Our classifier is designed to execute quickly, since document rendering is delayed during classification. The classifier is trained using 50 estimators; it expands the nodes until all leaves are pure or contain fewer than 2 samples, and considers 21 features, each selected to be extractable quickly. Selected features include the number of text nodes, number of words, the presence of Facebook open graph or Google AMP markup, and counts for a variety of other tags.

Our classifier considers the following features. We have made the source code for our classifier available publicly as well.2

• Counts of the following tags: <p>, <ul>, <ol>, <dl>, <div>, <pre>, <table>, <select>, <article>, <section>, <blockquote>, <a>, <img>, <script>
• Count of block elements that contain at least 400 words.
• Number of words in block elements that match the above condition.
• Number of path segments in the URL.
• Boolean determination if the page has any of the following metatags: amphtml, fb_pages, og_article.
• Boolean determination if the page has a plaintext match for any of the schema.org markup for Article, NewsArticle or APIReference.

2 https://github.com/brave/speedreader-paper-materials.git

3.2 Classifier Accuracy
The goal of the classifier in SpeedReader is to predict whether the end result of a page’s fetching and execution will result in a readable page, based on the initial HTML of the page. This section describes the data set we used to both train the SpeedReader classifier, and

to evaluate its accuracy against existing popular, deployed reader mode tools.

Data Set. To assess the accuracy of our classifier, we first gathered 2,833 websites, summarized in Table 1. Our data set is made up of three smaller sets of crawled data, each containing 1,000 URLs, each meant to focus on a different kind of page, with a different expected distribution of readability. 1,000 pages were URLs selected from the RSS feeds of popular news sites (e.g. The New York Times, ArsTechnica), which we expected to be frequently readable. The second 1,000 pages were the landing pages from the Alexa 1K, which we expected to rarely be readable. The final 1,000 pages were selected randomly from non-landing pages linked from the landing pages of the Alexa 5K, which we expected to be occasionally readable. We built a crawler that, given a URL, recorded both the initial HTML response, and a screenshot of the final rendered page (i.e. after all resources had been fetched and rendered, and after JavaScript had executed). We applied our crawler to each of the 3,000 selected URLs. 167 pages did not respond to our crawler, accounting for the difference between the 3,000 selected URLs and the 2,833 pages in our data set.

Table 1: Description of data set used for evaluating and training “readability” classifiers.

Data set         Number of pages    % Readable
Article pages    956                91.8%
Landing pages    932                1.5%
Random pages     945                22.0%
Total            2,833              38.8%

Finally, we manually considered each of the final page screenshots, and gave each a boolean label of whether there was a subset of page content that was readable. We considered a page readable if it met the following criteria:

(1) The primary utility of the page was its text and image content (i.e. not interactive functionality).
(2) The page contained a subset of content that was useful, without being sensitive to its placement on the page.
(3) The usefulness of the page’s content was not dependent on its specific presentation or layout on the website.

This meant that single page applications, index pages, and pages with complex layout were generally labeled as not-readable, while pages with generally static content, and lots of text and content-depicting media, were generally labeled readable. We also share our labeled data,3 and a guide to the meaning behind the labels,4 to make our results transparent and reproducible.

3 https://github.com/brave/speedreader-paper-materials/blob/master/labels.csv
4 https://github.com/brave/speedreader-paper-materials/blob/master/labels-legend.txt

Evaluation. We evaluated our classifier on our hand labeled corpus of 2,833 websites, performing a standard ten-fold cross-validation. For comparison’s sake, we also evaluated the accuracy of the classification functionality in Readability.js and our modified version of DOM Distiller when applied to the same data set, to judge their ability to predict the final readability state of each document, given its initial HTML. We note that Readability.js is designed to be used this way, but that this prediction point is slightly different than how DOM Distiller is deployed in Chrome. In Chrome, DOM Distiller labels a page as readable based on its final rendered state. This evaluation of DOM Distiller’s classification capabilities should therefore not be seen as an evaluation of DOM Distiller’s overall quality, but only its ability to achieve the kinds of optimizations sought by SpeedReader. Table 2 presents the results of this measurement. As the table shows, SpeedReader strictly outperforms the classification capabilities of both DOM Distiller and Readability.js. DOM Distiller has a higher false positive rate than our classifier, while Readability.js has a higher false negative rate.

Table 2: Accuracy measurements for three classifiers attempting to replicate the manual labels described in Table 1.

Classifier                Precision    Recall
ReadabilityJS             68%          85%
DOM Distiller             90%          75%
SpeedReader Classifier    91%          87%

3.3 Classifier Usability
Problem Statement. Our classifier operates on complete HTML documents, before they are rendered. As a result, the browser is not able to render the document until the entire initial HTML document is fetched. This is different from how current browsers operate, where websites are progressively rendered as each segment of the HTML document is received and parsed. This entails a trade off between rendering delay (since rendering is delayed until the initial HTML document is fully fetched) and network and device resource use (since, when a page is classified as readable, far fewer resources will be fetched and processed).

In this sub-section, we evaluate the rendering delay caused by our classifier, under several representative network conditions. The rendering delay is equal to the time to fetch the entire initial HTML document. We find that the rendering delay imposed is small, especially compared to the dramatic performance improvements delivered when a page is readable (discussed in more detail in Section 4).

Figure 4: Time to fetch initial HTML document. (Plot of share of pages, 0% to 100%, against time in ms on a logarithmic axis from 1 to 100000, with series for prediction time, replayed trace, curl on domestic broadband, and curl on simulated 3G. Annotated values: Classification = 1.9, Replay = 15.5, Broadband = 652, 3G = 2606.)

Classification Time. We evaluated the rendering delay imposed by our classifier by measuring the time taken to fetch the initial HTML for a page, under different network conditions, and compared it against the time taken for document classification.
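As a rough illustration of what the "document classification" time covers, the sketch below extracts a few of the Section 3.1 features from an initial HTML string and times the extraction. It is a sketch under stated assumptions, not the authors' implementation: it uses Python's standard-library HTML parser instead of myhtml, it omits the random forest prediction step, and the names `extract_features`, `timed_ms`, and the `schema.org/<Type>` substring check are hypothetical choices made for this example.

```python
import time
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags whose counts are listed as features in Section 3.1.
COUNTED_TAGS = ("p", "ul", "ol", "dl", "div", "pre", "table", "select",
                "article", "section", "blockquote", "a", "img", "script")
# Assumption: a "plaintext match" is approximated by a substring search.
SCHEMA_TYPES = ("Article", "NewsArticle", "APIReference")


class TagCounter(HTMLParser):
    """Counts opening tags of interest while streaming through the HTML."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in COUNTED_TAGS:
            self.counts[tag] += 1


def extract_features(url, document):
    """Return a subset of the classifier features for one initial HTML string."""
    parser = TagCounter()
    parser.feed(document)
    parser.close()
    features = {f"count_{tag}": parser.counts[tag] for tag in COUNTED_TAGS}
    # Number of non-empty path segments in the URL.
    path = urlparse(url).path
    features["url_path_segments"] = len([s for s in path.split("/") if s])
    # Hypothetical stand-in for the schema.org plaintext match.
    features["has_schema_org_article"] = any(
        f"schema.org/{name}" in document for name in SCHEMA_TYPES)
    return features


def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0
```

Calling `timed_ms(extract_features, url, document)` over a corpus and comparing the resulting durations with the time needed to download each document mirrors the comparison described above, although absolute numbers from this sketch would not be comparable to the paper's measurements.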
First, we determined how long our classifier took to determine if a parsed HTML document was readable. We did so by parsing each HTML string with myhtml, a fast, open source, C++ HTML parser [4]. We then measured the execution time taken to extract the relevant features from the document, and to return the predicted label. Our classifier took 2.8 ms on average and 1.9 ms in the median case. Next, we measured the fixed simulation cost of serving each web page from a locally hosted web server, which allowed us to account for the fixed overhead in establishing the network connection, and similar unrelated browser bookkeeping operations. This time was 22.3 ms on average, and 15.5 ms median.

Finally, we selected two network environments to represent different network conditions and device capabilities web users are likely to encounter: a fast, domestic broadband link, with 50 Mbps uplink/downlink bandwidth and 2 ms latency as indicated by a popular network speed testing utility,5 and a simulated 3G network, created using the operating system’s Network Link Conditioner.6 We use a default 3G preset with 780 kbps downlink, 330 kbps uplink, 100 ms packet delay in either direction and no additional packet loss. Downloading the documents on these connections took 1,372 ms / 652 ms (average/median) and 4,023 ms / 2,516 ms for the two cases respectively. Figure 4 summarizes the results of those measurements. Overall, the approximately 2.8 ms taken for an average document classification is a tiny cost compared to just the initial HTML download on reasonably fast connections. It could potentially be further optimized by classifying earlier, i.e. when only a chunk of the initial document is available. Initial tests show promising results; however, this adds significant complexity to patching the rendering pipeline and we leave it for future work.

5 speedtest.net: a web service that provides analysis of Internet access performance metrics, such as connection data rate and latency
6 Network Link Conditioner is a tool released by Apple with Hardware IO Tools for Xcode developer tools to simulate different connection bandwidth, latency and packet loss rates

3.4 Applicability to the Web
While subsequent sections will demonstrate the significant performance and privacy improvements provided by SpeedReader, these improvements are only available on a certain type of web document, those that have readable subsets. The performance improvements possible through SpeedReader are therefore bounded by the number of websites users visit that are readable.

In this subsection, we determine how much of the web is amenable to SpeedReader, by applying our classifier to a sampling of websites, representing different common browsing scenarios. Doing so allows us to estimate the benefits SpeedReader can deliver as well as its relevance to the web. As presented in Table 3, we find that a significant number of visited URLs are readable, suggesting that SpeedReader can deliver significant privacy and performance improvements to users. This subsection continues by describing how we selected URLs in each browsing scenario.

Table 3: Measurements of how applicable our readability strategy is under common browser use scenarios.

Measurement            # measured    # readable    % readable
Popular pages          42,986        9,653         22.5%
Unpopular pages        40,908        8,794         21.5%
Total: Random crawl    83,894        18,457        22.0%
Reddit linked          3,035         1,260         41.51%
Twitter linked         494           276           31.2%
RSS linked             506           331           65%
Total: OSN             4,035         1,867         46.27%

Websites by popularity. We first estimated how many pages hosted on popular and unpopular domains are readable. To do so, we first created two sets of domains, a popular set, consisting of the most popular 5,000 domains, as determined by Alexa, and an unpopular set, comprising a random sample of pages ranked 5,001–100,000. For each domain, we conducted a breadth three, depth three crawl. We first visited the landing page for the domain, and recorded all URLs linked to pages with the same TLD+1 domain. Then we selected up to three URLs from this set, and repeated the above process another time, giving a maximum of 13 URLs per domain (1 landing page + 3 + 9), and a total data set of 91,439 pages. The crawl was conducted from AWS IP addresses on 17-20 October 2018.

Social network shared content. We next estimated how much content linked to from online social networks is readable, to simulate a user that spends most of their browsing time on popular online social networks, and generally only browses away to view shared content. We gathered URLs shared from Reddit and Twitter. We gathered links shared on Reddit by using RedditList [32] to obtain the top 125 subreddits ranked based on their number of subscribers. We then visited the 25 posts of each popular subreddit and extracted any shared URLs. For Twitter, we extracted shared links from the top 10 worldwide Twitter trends by crawling and extracting shared links from their Tweets.

RSS / feed readers. Finally, we estimated how much content shared from RSS feeds is readable, to simulate a user who finds content mainly through an RSS (or similar) aggregation service. We built a list of RSS-shared content by crawling the Alexa 1K, identifying websites that included RSS feeds, and fetching the five most recent pages of content in each RSS feed.

3.5 Conclusion
In this section we have described how SpeedReader determines whether a page should be rendered in reader mode, based on its initial HTML. We find that SpeedReader outperforms the classification capabilities of existing, deployed reader mode tools. We also find that the overhead imposed by our classification strategy is small and acceptable in most cases, and dwarfed by the performance improvements delivered by SpeedReader, for cases when a page is judged readable.

4 PAGE TREE TRANSDUCTION
This section describes how SpeedReader generates a reader mode presentation of a page, for pages that have been classified as readable. Our evaluation includes three possible reader mode renderings, each presenting a different trade off between amount of media included, performance and privacy improvements.

Generating a reader mode presentation of an HTML document can be thought of as translating one tree structure to another: taking the document represented by the page’s initial HTML and generating the document containing a simplified reader mode version. This process of tree mapping is generally known as tree transduction.
6
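As a rough illustration of the tree-to-tree mapping described above, the following Python sketch parses HTML into a small tree and maps it to a simplified tree. It is a toy under stated assumptions, not the logic of Readability.js, DOM Distiller or BoilerPipe: the tag sets DROPPED_TAGS and KEPT_TAGS and the helper names (Node, TreeBuilder, transduce, reader_mode) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical tag sets, chosen only for illustration.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
DROPPED_TAGS = {"script", "style", "nav", "aside", "iframe", "form"}
KEPT_TAGS = {"h1", "h2", "h3", "p", "ul", "ol", "li", "blockquote"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node or str items


class TreeBuilder(HTMLParser):
    """Builds the input tree from the initial HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def transduce(node):
    """Map one input node to a list of output nodes (possibly empty)."""
    if isinstance(node, str):
        return [node]
    if node.tag in DROPPED_TAGS:
        return []  # the whole subtree is discarded
    out = [c for child in node.children for c in transduce(child)]
    if node.tag in KEPT_TAGS:
        return [Node(node.tag, out)] if out else []
    return out  # other containers (div, span, a, ...) are unwrapped


def render(node):
    if isinstance(node, str):
        return node
    inner = " ".join(render(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


def reader_mode(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return "".join(render(n) for n in transduce(builder.root))


page = ("<html><head><script>track()</script></head><body>"
        "<nav><a href='/'>Home</a></nav>"
        "<div><h1>Title</h1><p>Body text</p></div>"
        "<aside>Ad</aside></body></html>")
print(reader_mode(page))  # prints <h1>Title</h1><p>Body text</p>
```

Because the mapping runs on the parsed initial HTML alone, nothing in a dropped subtree would ever be requested, which is the property the later performance and privacy results rely on.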
We evaluate tree transduction by comparing the performance and privacy improvements of the three techniques (Readability.js, DOM Distiller and BoilerPipe) described in detail in Section 2.2.

4.1 Limitations and Bounds
We note that we did not attempt any evaluation of how users perceive or enjoy the reader mode versions of pages rendered by each considered technique. We made this decision for several reasons. First, two of the techniques (Readability.js and DOM Distiller) are developed and deployed by large browser vendors, with millions or billions of users. We assume that these large companies have conducted their own usability evaluation of their reader mode techniques, and found them satisfactory to users.
Second, the third considered tree transduction technique, Kohlschütter et al.’s BoilerPipe [24], is an academic work that includes its own evaluation, showing that the technique can successfully extract useful contents from HTML documents. We assume that the authors’ evaluation is comprehensive and sufficient, and that their technique can successfully render pages in reader mode presentations. Finally, we are planning to deploy a tree transducer different from existing techniques, and a more thorough subjective evaluation of its presentation is left for future study.

4.2 Evaluation Methodology
We compared the performance and privacy improvements achieved through SpeedReader’s novel application of three tree transduction techniques: Readability.js, DOM Distiller and BoilerPipe. We conducted this evaluation in three stages.
First, we fetched the HTML of each URL in the random crawl data set outlined in Table 3, again from an AWS IP. The HTML considered here is only the initial HTML response, not the state of the document after script execution. We evaluated whether each of the 91,439 fetched pages was readable by applying the SpeedReader classifier to each page. We then reduced the data set to the 19,765 pages (21.62%) that were readable.

Table 4: Description of data set used for evaluating the performance implications of different content extraction strategies.

Measurement                      Value
Measurement date                 17-20 October 2018
# crawled domains                10,000
# crawled pages                  91,439
# domains with readable pages    4,931
# readable pages                 19,765
% readable pages                 21.62%

Second, we revisited each URL classified as readable to collect a complete version of the page. To minimize variations in page performance and content during the testing, we collected the "replay archive" for each page using the "Web Page Replay" (WPR) [22] performance tool. WPR is used in Chrome’s testing framework for benchmarking purposes and works as a proxy that either records network requests or responds to them instead of letting them through to the source, depending on whether it works in "record" or "replay" mode.
Finally, we applied each of the three tree transduction techniques to the remaining 19,765 HTML documents, and compared the network, resource use, and privacy characteristics of each transformed page, against the full version of each page. We evaluate performance and privacy characteristics of each page by visiting the URL as replayed from its archive. These findings are described in detail in the next subsections.
We note that using a replay proxy with a snapshot of content often underestimates the costs of a page load. Despite taking care to mitigate the effects of non-determinism by injecting a small script that overrides date functions to use a fixed date and random number generator functions to use a fixed seed and produce a predictable sequence of numbers, it cannot account for all sources of non-determinism. For all requests that the proxy cannot match, it responds with a Not Found response. We notice that this results in a small number of requests being missed, primarily those responsible for dynamic ad loading or tracking. It also occasionally interferes with a site publisher’s custom resource fetching retry logic, where the same request is retried a number of times unsuccessfully, before the entire page load times out and the measurement is omitted.

4.3 Results: Performance
We measured four performance metrics: number of resources requested, amount of data fetched, memory used and page load time. These results are summarized in Table 5 and Figure 5.

Table 5: Performance comparisons of three popular readability tree transducer strategies, as applied to the data set described in Table 4. Values are given as Average, Median. Gain multiplier (×) is calculated for each page load and Average and Median values are reported.

Transducer      Resources (#)   Data (KB)       Memory (MB)   Load Time (ms)
                Avg    Med      Avg     Med     Avg    Med    Avg     Med
Default         144    91       2,283   1,461   197    174    1,813   1,069
ReadabilityJS   5      2        186     61      85     79     583     68
Dom Distiller   5      2        186     61      84     79     550     63
BoilerPipe      2      2        101     61      81     77     545     44
Gain (×)
ReadabilityJS   51     28       84      24      2.4    2.1    20      11
Dom Distiller   52     32       84      24      2.4    2.1    21      12
BoilerPipe      77     48       84      24      2.4    2.1    27      15

We ran all measurements on AWS m5.large EC2 instances. For performance measurements, one test was executed at a time, per instance. For each evaluation, we fetched the page from a previously collected record-replay archive, with performance tracing enabled. Once the page was loaded and the performance metrics were recorded, we closed the browser and proxy, and started the next test. No further steps were taken to minimize the likelihood of test VM performance being impacted by interfering workloads on the underlying hardware. For all tests, we used an unmodified Google Chrome browser, version 70.0.3538.67, rendered in Xvfb.⁷ Although profiling has overheads of its own [33], in particular for memory use and load times, we used a consistent measurement strategy across all tests, and therefore expect the impact to also be consistent and minor compared to relative performance gains.

⁷ While Chrome "headless" mode is available, it effectively employs a different page rendering pipeline with different load time characteristics and memory footprint.

We measured a page’s load time as the difference between navigationStart and loadEventEnd events [46] in the main frame (i.e. the time until all sub-resources have been downloaded
and the page is fully rendered). Since page content is replayed from a local proxy, the impact of network bandwidth and latency variation is minimized and the reported load time is a very optimistic figure, especially for bigger pages with more sub-resources, as illustrated in Figure 4. Although network cost is still non-zero, the number primarily reflects the time taken to process and render the entire page.

[Figure 5 panels: share of pages (0% to 100%) against Data Downloaded (KB), Memory Footprint (MB) and Load Time (ms), for Normal Page, DOMDistiller, Firefox and BoilerPipe.]
Figure 5: Performance characteristics of the different tree transducer strategies applied, showing the distribution of the key performance metrics.

We also recorded the number of resources fetched and the amount of data downloaded during each test. Note that the amount of data downloaded for all of the tree transduction strategies reflects the size of the initial HTML rather than that of the transformed document, as the transformation happens on the client and does not result in additional network traffic. All measured transducers discard the majority of page content (both in-page content like text and markup, and referenced content like images, video files, and JavaScript). Figures 2 and 3 provide an example of how tree transduction techniques simplify page content.
For memory consumption, we measure the overall memory used by the browser and its subprocesses. Google Chrome uses a multi-process model, where each tab and frame may run in a separate process, and the content of each page also affects what runs in the main browser process. We note that our testing scenario does not consider the case of multiple pages open simultaneously in the same browsing session, as some of the resources are reused. The reported number is therefore that of the entire browser rather than the specific page alone, with some fixed browser runtime overheads. Memory snapshots are collected with an explicit trigger after the page load is complete, with the disabled-by-default-memory-infra tracing category enabled. Despite including a level of fixed browser memory costs, we still see average memory reduction of up to 2.4× in average or median cases. Overall, depending on the chosen transducer, we show:
• average speedups ranging from 20× to 27×
• average bandwidth savings on the order of 84×
• number of requests is reduced 51× to 77×
• average memory reduction of 2.4×

4.4 Results: Privacy
SpeedReader achieves substantial privacy improvements, because it applies the tree transduction step before rendering the document, and thus before any requests to third parties have been initiated. The privacy improvements gained by SpeedReader are threefold: a reduction in third party requests, a reduction in script execution (an often necessary, though not sufficient, part of fingerprinting online), and a complete elimination of ad and tracking related requests (as labeled by EasyList and EasyPrivacy). This last measure is particularly important, since 92.8% of the 19,765 readable pages in our data set loaded resources labeled as advertising or tracking related by EasyList and EasyPrivacy [10, 11].
This subsection proceeds by describing both how we measured the privacy improvements provided by SpeedReader, and the results of that measurement. These findings are presented in Table 6.

Table 6: Comparisons of the privacy implications of three popular readability tree transducer strategies, as applied to the data set described in Table 4. Values are given as Average and Median values.

Transducer      # third-party   # scripts    Ads & Trackers
                Avg    Med      Avg   Med    Avg    Med
Default         117    63       83    51     63     24
ReadabilityJS   3      1        0     0      0      0
Dom Distiller   3      1        0     0      0      0
BoilerPipe      1      1        0     0      0      0

We measured the privacy gains provided by SpeedReader by first generating reader mode versions of each of the 19,765 readable URLs in our dataset, and counting the number of third parties, script resources, and ad and tracking resources in each generated reader mode page. We determined the number of ad and tracking resources by applying EasyList and EasyPrivacy with an open-source ad-block Node library [21] to each resource URL included in the page. We then compared these measurements to the number of third parties, script units, and ad and tracking resource requests made in the typical, non-reader mode rendering of each URL.
We found that all three of the evaluated tree transduction techniques dramatically reduced the number of third parties communicated with, and removed all script execution and ad and tracking resource requests from the page. Put differently, SpeedReader is able to achieve privacy improvements at least as good as, and almost certainly exceeding, those of existing ad and tracking blockers on readable pages. This claim is based on the observation that ad and tracking blockers do not achieve the same significant reduction in third party communication and script execution as SpeedReader achieves.
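The per-page counting described in Section 4.4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the measurement code used in the paper: the function names are hypothetical, the "last two host labels" rule is a naive stand-in for a proper registrable-domain lookup (which would consult the Public Suffix List), and the plain substring blocklist stands in for real EasyList and EasyPrivacy rule matching.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Naive registrable domain: the last two labels of the hostname.

    Assumption for illustration only; a real measurement would use the
    Public Suffix List to handle suffixes such as co.uk.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def privacy_counts(page_url, requests, blocked_substrings):
    """Count third parties, scripts and blocklisted requests for one page.

    requests: iterable of (resource_url, resource_type) pairs.
    blocked_substrings: hypothetical stand-in for filter-list rules.
    """
    requests = list(requests)
    first_party = site_of(page_url)
    third_parties = {site_of(url) for url, _ in requests} - {first_party}
    scripts = sum(1 for _, kind in requests if kind == "script")
    ads_trackers = sum(
        1 for url, _ in requests
        if any(rule in url for rule in blocked_substrings)
    )
    return {
        "third_parties": len(third_parties),
        "scripts": scripts,
        "ads_trackers": ads_trackers,
    }


# Hypothetical request log for one page load.
log = [
    ("https://news.example.com/app.js", "script"),
    ("https://cdn.example.com/hero.jpg", "image"),
    ("https://ads.tracker.net/pixel.gif", "image"),
    ("https://static.widgets.org/w.js", "script"),
]
counts = privacy_counts("https://news.example.com/story", log, ["ads.tracker.net"])
print(counts)  # {'third_parties': 2, 'scripts': 2, 'ads_trackers': 1}
```

Running the same function over the request log of the default rendering and of each reader mode rendering would give per-page values of the kind that Table 6 averages.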
5 DISCUSSION AND FUTURE WORK

5.1 Reader Mode as a Content Blocker
Most existing reader mode tools function to improve the presentation of page content for readers, by removing distracting content and reformatting text for the browser user’s benefit. While the popularity of existing reader modes suggests that this is a beneficial use case, the findings in this work suggest an additional use case for reader modes: blocking advertising and tracking related content.
As discussed in Section 4.4, SpeedReader prevents all ad and tracking related content from being fetched and rendered, as identified by EasyList and EasyPrivacy (Table 6). SpeedReader also loads between 51 and 77 times fewer resources than typical page rendering and reader modes (Table 5), a non-trivial number of which are likely also ad and tracking related. SpeedReader differs fundamentally from existing content blocking strategies. Existing popular tools, like uBlock Origin [20] and AdBlock Plus [15], aim to identify malicious or undesirable content, and prevent it from being loaded or displayed; all unlabeled content is treated as desirable and loaded as normal. SpeedReader, and (at least conceptually) reader modes in general, take the opposite approach. Reader modes try to identify desirable content, and treat all other page content as undesirable, or, at least, unneeded.
Our results suggest that the reader mode technique can achieve ad and tracking blocking quality at least as good as existing content blocking tools, but with dramatic performance improvements. We expect that SpeedReader actually outperforms content blocking tools (as content blockers suffer from false-negative problems, for a variety of reasons), but lack a ground truth basis to evaluate this claim further. We suggest evaluating the content blocking capabilities of reader mode-like tools as a compelling area for future work.

5.2 Comparison to Alternatives
SpeedReader exists among other systems that aim to improve the user experience of viewing content on the web. While a full evaluation of these systems is beyond the scope of this work (mainly because the compared systems have different goals and place different restrictions on users), we note them here for completeness.

AMP. Accelerated Mobile Pages (AMP) [17] is a system developed by Google that improves website performance in a number of ways. Website authors opt in to the AMP system by limiting their content to a subset of HTML, JavaScript and CSS functionality, which allows for optimized loading and execution. AMP pages are also served from Google’s servers, which provide network level improvements.
AMP differs from SpeedReader and other reader mode systems in that users only achieve performance improvements when site authors design their pages for AMP; AMP offers no improvement on existing, traditional websites.

Server-Assisted Rendering. Other browser vendors attempt to improve the user experience by moving page loading, rendering and execution from the client to a server. The client then fetches a rendered version of the page from the server (generally either as rendered HTML or as a bitmap). The most popular such system is likely Amazon Silk [1]. While there are significant performance upsides with this thin-client technique, they come with significant downsides too. First, user privacy is harmed, since the rendering server must manage and observe all client secrets when interacting with the destination server on the client’s behalf. Additionally, while the server may be able to improve the loading and rendering of the page, it is limited in the kinds of performance improvements it can achieve. Server-assisted rendering does not provide any of the presentation simplification or content blocking benefits provided by SpeedReader.

5.3 SpeedReader Deployment Strategies

Always On. SpeedReader as described in this work is designed to be “always on”, attempting to provide a readable presentation of every page fetched. Although Safari Reader View also supports an “always on” functionality, it lacks the performance and privacy enhancements provided by SpeedReader (Section 2). While this decision maximizes the amount of privacy and performance improvements provided, it entails an overhead while loading each page (Figure 4), which may not be worthwhile in some browsing patterns such as interacting with application-like sites. Additionally, there may be times when users want to maintain a page’s interactive functionality (e.g. JavaScript), even when SpeedReader has determined that the page is readable. Ensuring the user’s ability to disable SpeedReader would be important in such cases. The system described in this work does not preclude such an option, but only imagines changing the default page loading behavior.⁸

⁸ Current browsers and reader modes load all pages in the standard manner, and allow the users to enable a reader mode presentation, while SpeedReader would load pages in the optimized reader mode presentation by default, when possible, and allow users to enable the standard loading behavior.

Tree Transduction Improvements. The three evaluated techniques in Section 4, which are adapted from existing tools and research, can provide a reader mode presentation with different performance and privacy improvements. Users of SpeedReader could select which tree transduction technique best suited their needs. However, we expect that ML and similar techniques could be applied to the tree transduction problem, to provide a reader mode presentation that exceeds existing techniques. An improved tree transduction algorithm would achieve equal or greater performance and privacy improvements, while doing a better job of maintaining the meaning and information of the extracted content. We are currently exploring several options in this area, but have found the problem large enough to constitute its own unique work.

6 RELATED WORK

Content Extraction. The problem of removing boilerplate and extracting relevant content from a webpage has been extensively studied. Previous approaches primarily focused on the code structure, visual representation and the link between the two. Lin et al. [29] proposed a method to detect content blocks using <TABLE> tags and calculate their entropy to distinguish the informative blocks from the redundant ones. Laber et al. [27] proposed a heuristic method for extracting textual sections and title from news articles using <a>, <p> and <title> tags. Other studies have tried to detect useful segments in a web page using structural and positional information. Gupta et al. [18] introduced a DOM-based method to modify and remove irrelevant DOM nodes to extract the main content. Their
approach utilized filters to remove DOM nodes with advertisements, and link and text ratio thresholds to remove unwanted table cells. While the proposed rule-based method was simple, it performed poorly on link-rich pages where the main content contained many links. Weninger et al. [47] introduced a fast algorithm which calculated the HTML tag ratio of each line to cluster and extract text content. Their algorithm did not perform well on home pages and suffered from high recall and low precision. Cai et al. [8] introduced a tag-free vision-based page segmentation algorithm to segment a webpage and extract its web content structure using the link between the visual layout and the content. Fan et al. [13] introduced Article Clipper, a web content extractor that leveraged visual cues in addition to HTML tags to extract non-textual and textual content and detect multi-page articles. Their approach underperformed in extracting captions which were links as well as images and captions that were outside of the main content.
Heuristic methods are limited by their lack of adaptability. Some have proposed learning-based methods to overcome this rigidity. Pasternack and Roth [35] described a semi-supervised algorithm, Maximum Subsequence Segmentation, which tokenized HTML into a list of tags, words and symbols, and attempted to classify each block as either "in article" or "out of article" text. Kohlschütter et al. [24] developed BoilerPipe to classify text elements using both structural and text features, such as average word length and average sentence length. Sun et al. [40] proposed Content Extraction via Text Density (CETD) to extract the text content from pages using a variety of text density measurements. Their method relied on the observation that the amount of text in content sections is large, and the text in boilerplate sections contains more noise and links. Sluban and Grčar [38] introduced an unsupervised and language-independent method for extracting content from streams of HTML pages, by detecting commonalities in page structure. While their method outperformed other open-source content extractor algorithms, it suffered from high memory consumption and poor performance on diverse and small HTML data sets.
Wu et al. [48] proposed a machine learning model using DOM tree node features such as position, area, font, text and tag properties to select and group content related nodes and their children. In their recent paper, Vogels et al. [43] presented an algorithm combining a hidden Markov model and convolutional neural networks (CNNs). Their model first preprocessed an HTML page into a Collapsed DOM (CDOM) tree where each single-child parent node was merged with its child. CDOM was then segmented into blocks of main content and boilerplate using sequence labeling of DOM leaves. The features were then used to train two CNNs, obtain potentials and finally find the optimal labeling. Their approach outperformed previous studies on the CleanEval benchmark [3].

Web Complexity. While content extraction has attracted much attention in the scientific literature, fewer studies have been conducted to understand website complexity and its impact on page load time and user experience. Gibson et al. [14] analyzed webpage template evolution using site-level template detection algorithms and found that templates, with little raw content value, represented 40-50% of the data on the Web and continued to grow at a rate of about 6% per year. Butkiewicz et al. [7] showed that modern websites, regardless of their popularity, were complex and such complexity could affect user experience. Moreover, their analysis demonstrated that the number of loaded objects and servers could indicate page load time, and both numbers were significant in news websites.

Performance and User Experience. While complexity of webpages can affect page load time, their visual complexity can impact user experience. Harper et al. showed that visual complexity in webpages, defined as diversity, density, and positioning of the elements, could increase cognitive load [19] and even have detrimental cognitive and emotional impact on users [41]. For many websites, online advertisements are the only source of income. Nonetheless, online ads, especially intrusive ads, have usability consequences [6]. As Pujol et al. [36] observed, 22% of the most active users of a major European ISP use Adblock Plus. As a result, providing the main content in a clutter-free page, such as Reader Mode, not only decreases the complexity of a page, but also preserves privacy by limiting the number of requests for third-party services and trackers [12, 25], as well as improving user experience.

7 CONCLUSION
The modern web’s progress has led us to a point far beyond Hypertext Markup for document discovery, to having full-fledged, media-rich experiences and dynamic applications. With this growth in capability, there has been a growth in page “bloat”, making pages expensive to load, and bringing with it ubiquitous advertising and tracking. In this work, we propose SpeedReader as an approach broadening the applicability of “reader mode” browser features to deliver huge improvements to the end-user browsing experience. Unique among reader mode tools, SpeedReader determines if a page is readable based only on the page’s initial HTML, before the HTML is parsed and rendered, and before sub-resources are fetched. Our classifier can classify within 2 ms and with 91% accuracy, which makes it practical as an always-on part of the rendering pipeline to transform all suitable pages at load time. We find that SpeedReader is widely applicable, and can deliver performance and privacy improvements to 22% of pages on popular and unpopular websites, and a larger proportion of pages linked to from online social networks like Reddit (42%) and Twitter (31%). Since SpeedReader makes its modifications before sub-resources are fetched, it uses 84× less network bandwidth than traditional page rendering (and current reader mode techniques). This results in page load time improvements, important in a range of scenarios from poor connectivity or low-end devices, to expensive data connectivity or simply wanting a clean and simple interaction with primarily textual content. SpeedReader also delivers page loading speedups of 20× to 27× and average memory reduction of 2.4×, while maintaining a pleasant, reader mode style user experience. Finally, when SpeedReader was applied to 19,765 readable webpages, it prevented 100% of advertising and tracking related resources from being fetched (as labeled by EasyList and EasyPrivacy).

8 ACKNOWLEDGEMENT
This research was supported by Brave Software. We would like to thank David Temkin for his practical feedback and helpful comments on the project. We also would like to thank anonymous reviewers for their time and effort in reviewing this paper.
REFERENCES
[1] Amazon. [n. d.]. Amazon Silk Documentation. docs.aws.amazon.com/silk/index.html
[2] Arc90. [n. d.]. Readability - An Arc90 Lab Experiment. http://ejucovy.github.io/readability/
[3] Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. 2008. Cleaneval: a Competition for Cleaning Web Pages. In LREC.
[4] Alexander Borisov. [n. d.]. myHTML - Fast C/C++ HTML 5 Parser. Using threads. https://github.com/lexborisov/myhtml
[5] Anna Bouch, Allan Kuchinsky, and Nina Bhatti. 2000. Quality is in the Eye of the Beholder: Meeting Users’ Requirements for Internet Quality of Service. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’00). ACM, New York, NY, USA, 297–304. https://doi.org/10.1145/332040.332447
[6] Giorgio Brajnik and Silvia Gabrielli. 2010. A review of online advertising effects on the user experience. International Journal of Human-Computer Interaction 26, 10 (2010), 971–997.
[7] Michael Butkiewicz, Harsha V. Madhyastha, and Vyas Sekar. 2011. Understanding Website Complexity: Measurements, Metrics, and Implications. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC ’11). ACM, New York, NY, USA, 313–328. https://doi.org/10.1145/2068816.2068846
[8] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. 2003. VIPS: a Vision-based Page Segmentation Algorithm. (November 2003), 28. https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/
[9] Mozilla Corporation. 2018. Readability.js. https://github.com/mozilla/readability
[10] EasyList. 2018. About EasyList. https://easylist.to/pages/about.html
[11] EasyList. 2018. EasyList Github repository. https://github.com/easylist/easylist
[12] Steven Englehardt and Arvind Narayanan. 2016. Online Tracking: A 1-million-site Measurement and Analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16). ACM, New York, NY, USA, 1388–1401. https://doi.org/10.1145/2976749.2978313
[13] Jian Fan, Ping Luo, Suk Hwan Lim, Sam Liu, Parag Joshi, and Jerry Liu. 2011. Article Clipper: A System for Web Article Extraction. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’11). ACM, New York, NY, USA, 743–746. https://doi.org/10.1145/2020408.2020525
[14] David Gibson, Kunal Punera, and Andrew Tomkins. 2005. The Volume and Evolution of Web Page Templates. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (WWW ’05). ACM, New York, NY, USA, 830–839. https://doi.org/10.1145/1062745.1062763
[15] Eyeo GmbH. 2018. Adblock Plus. https://adblockplus.org/
[16] Utkarsh Goel, Moritz Steiner, Mike P Wittie, Martin Flack, and Stephen Ludin. 2017. Measuring What is Not Ours: A Tale of 3rd Party Performance. In International Conference on Passive and Active Network Measurement. Springer, 142–155.
[17] Google. [n. d.]. Accelerated Mobile Pages Project. https://www.ampproject.org
[18] Suhit Gupta, Gail Kaiser, David Neistadt, and Peter Grimm. 2003. DOM-based Content Extraction of HTML Documents. In Proceedings of the 12th International Conference on World Wide Web (WWW ’03). ACM, New York, NY, USA, 207–214. https://doi.org/10.1145/775152.775182
[19] Simon Harper, Eleni Michailidou, and Robert Stevens. 2009. Toward a Definition of Visual Complexity As an Implicit Measure of Cognitive Load. ACM Trans. Appl. Percept. 6, 2, Article 10 (March 2009), 18 pages. https://doi.org/10.1145/1498700.1498704
[20] Raymond Hill. 2018. uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock
[21] Brave Software Inc. 2018. Brave Ad Block. https://github.com/brave/ad-block

Extracting Relevant Content from News Webpages. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09). ACM, New York, NY, USA, 1685–1688. https://doi.org/10.1145/1645953.1646204
[28] Timothy Libert. 2015. Exposing the Invisible Web: An Analysis of Third-Party HTTP Requests on 1 Million Websites. International Journal of Communication 9, 0 (2015). https://ijoc.org/index.php/ijoc/article/view/3646
[29] Shian-Hua Lin and Jan-Ming Ho. 2002. Discovering Informative Content Blocks from Web Documents. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02). ACM, New York, NY, USA, 588–593. https://doi.org/10.1145/775047.775134
[30] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. 2017. Block me if you can: A large-scale study of tracker-blocking tools. In Security and Privacy (EuroS&P), 2017 IEEE European Symposium on. IEEE, 319–333.
[31] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. 2017. Block Me if You Can: A Large-Scale Study of Tracker-Blocking Tools. Proceedings - 2nd IEEE European Symposium on Security and Privacy, EuroS and P 2017 (2017), 319–333. https://doi.org/10.1109/EuroSP.2017.26
[32] mikesizz. [n. d.]. RedditList - Tracking the top 5000 subreddits. http://redditlist.com/
[33] Thomas Nagele. 2015. Client-side performance profiling of JavaScript for web applications. Master Thesis. Radboud University Nijmegen.
[34] Nick Nikiforakis, Luca Invernizzi, Alexandros Kapravelos, Steven Van Acker, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. 2012. You Are What You Include: Large-scale Evaluation of Remote Javascript Inclusions. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS ’12). ACM, New York, NY, USA, 736–747. https://doi.org/10.1145/2382196.2382274
[35] Jeff Pasternack and Dan Roth. 2009. Extracting Article Text from the Web with Maximum Subsequence Segmentation. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09). ACM, New York, NY, USA, 971–980. https://doi.org/10.1145/1526709.1526840
[36] Enric Pujol, Oliver Hohlfeld, and Anja Feldmann. 2015. Annoyed Users: Ads and Ad-Block Usage in the Wild. In Proceedings of the 2015 Internet Measurement Conference (IMC ’15). ACM, New York, NY, USA, 93–106. https://doi.org/10.1145/2815675.2815705
[37] Paruj Ratanaworabhan, Benjamin Livshits, and Benjamin G. Zorn. 2010. JSMeter: Comparing the Behavior of JavaScript Benchmarks with Real Web Applications. In Proceedings of the 2010 USENIX Conference on Web Application Development (WebApps’10). USENIX Association, Berkeley, CA, USA, 3–3. http://dl.acm.org/citation.cfm?id=1863166.1863169
[38] Borut Sluban and Miha Grčar. 2013. URL tree: efficient unsupervised content extraction from streams of web documents. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management (CIKM ’13). ACM, New York, NY, USA, 2267–2272. https://doi.org/10.1145/2505515.2505654
[39] Peter Snyder, Lara Ansari, Cynthia Taylor, and Chris Kanich. 2016. Browser Feature Usage on the Modern Web. In Proceedings of the 2016 Internet Measurement Conference (IMC ’16). ACM, New York, NY, USA, 97–110. https://doi.org/10.1145/2987443.2987466
[40] Fei Sun, Dandan Song, and Lejian Liao. 2011. DOM Based Content Extraction via Text Density. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11). ACM, New York, NY, USA, 245–254. https://doi.org/10.1145/2009916.2009952
[41] Alexandre N. Tuch, Javier A. Bargas-Avila, Klaus Opwis, and Frank H. Wilhelm. 2009. Visual complexity of websites: Effects on users’ experience, physiology, performance, and memory. International Journal of Human-Computer Studies 67,
[22] Google Inc. [n. d.]. Catapult - Web Page Replay. https://github.com/
9 (2009), 703 – 715. https://doi.org/10.1016/j.ijhcs.2009.04.002
catapult-project/catapult.git
[42] Antoine Vastel, Peter Snyder, and Benjamin Livshits. 2018. Who Filters the
[23] Google Inc. 2018. DOM Distiller. https://github.com/chromium/dom-distiller
Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced
[24] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate
Ad Blocking. (2018). http://arxiv.org/abs/1810.09160
Detection Using Shallow Text Features. In Proceedings of the Third ACM Interna-
[43] Thijs Vogels, Octavian-Eugen Ganea, and Carsten Eickhoff. 2018. Web2Text:
tional Conference on Web Search and Data Mining (WSDM ’10). ACM, New York,
Deep Structured Boilerplate Removal. In European Conference on Information
NY, USA, 441–450. https://doi.org/10.1145/1718487.1718542
Retrieval. Springer, 167–179.
[25] Balachander Krishnamurthy and Craig Wills. 2009. Privacy Diffusion on the Web:
[44] Xiao Sophia Wang, Aruna Balasubramanian, Arvind Krishnamurthy, and David
A Longitudinal Perspective. In Proceedings of the 18th International Conference
Wetherall. 2013. Demystifying Page Load Performance with WProf. In Presented
on World Wide Web (WWW ’09). ACM, New York, NY, USA, 541–550. https:
as part of the 10th USENIX Symposium on Networked Systems Design and Imple-
//doi.org/10.1145/1526709.1526782
mentation (NSDI 13). USENIX, Lombard, IL, 473–485. https://www.usenix.org/
[26] Deepak Kumar, Zane Ma, Zakir Durumeric, Ariana Mirian, Joshua Mason,
conference/nsdi13/technical-sessions/presentation/wang_xiao
J. Alex Halderman, and Michael Bailey. 2017. Security Challenges in an In-
[45] Xiao Sophia Wang, Arvind Krishnamurthy, and David Wetherall. 2016. Speeding
creasingly Tangled Web. In Proceedings of the 26th International Conference
up Web Page Loads with Shandian. In 13th USENIX Symposium on Networked
on World Wide Web (WWW ’17). International World Wide Web Conferences
Systems Design and Implementation (NSDI 16). USENIX Association, Santa Clara,
Steering Committee, Republic and Canton of Geneva, Switzerland, 677–684.
CA, 109–122. https://www.usenix.org/conference/nsdi16/technical-sessions/
https://doi.org/10.1145/3038912.3052686
presentation/wang
[27] Eduardo Sany Laber, Críston Pereira de Souza, Iam Vita Jabour, Evelin Car-
[46] Zhiheng Wang. 2012. Navigation Timing. W3C Recommendation. W3C.
valho Freire de Amorim, Eduardo Teixeira Cardoso, Raúl Pierre Rentería, Lú-
http://www.w3.org/TR/2012/REC-navigation-timing-20121217/.
cio Cunha Tinoco, and Caio Dias Valentim. 2009. A Fast and Simple Method for

[47] Tim Weninger, William H. Hsu, and Jiawei Han. 2010. CETR: Content Extraction via Tag Ratios. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10). ACM, New York, NY, USA, 971–980. https://doi.org/10.1145/1772690.1772789
[48] Shanchan Wu, Jerry Liu, and Jian Fan. 2015. Automatic Web Content Extraction by Combination of Learning and Grouping. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1264–1274. https://doi.org/10.1145/2736277.2741659
