…remove styling and other markup), looking for nearby images for inclusion, and using text patterns in the document that identify […].

Safari Reader View. Safari Reader View is a JavaScript library that implements the reader mode presentation in Safari. Like Readability.js, it looks for elements with high text and anchor density; it also uses presentation-level heuristics, including where elements appear on the page and which elements are hidden from display by default. Relevant to SpeedReader, this means that Safari Reader View must load a page and at least some of its resources (e.g. images, CSS, JavaScript) to perform either the classification or tree transduction level decisions. Because significant portions of Safari Reader View require a document to be fetched and rendered before being evaluated, we do not consider it further in this work (for reasons detailed in Sections 3 and 4).

BoilerPipe. BoilerPipe is an academic research project from Kohlschütter et al. [24], implemented in Java. BoilerPipe has not been deployed directly by any browser vendor. BoilerPipe does not provide functionality for (readability) classification, and assumes that any HTML document contains a readable subset. For tree transduction, BoilerPipe considers number-of-words and link-density features. Like Readability.js, it does not require a browser to load and render a page in order to do reader mode extraction. The authors' analysis reveals a strong correlation between short text and boilerplate, and between long text and actual content, on the web. Using features with low calculation cost, such as the number of words, lets BoilerPipe keep overhead low while maintaining high accuracy.

DOM Distiller. DOM Distiller is a JavaScript and C++ library maintained by Google, and used to implement reader mode in recent versions of Chrome. The project is based on BoilerPipe, though it has been significantly changed by Google. The classification step in DOM Distiller uses a classifier-based approach, and considers features such as whether the page's URL contains certain keywords (e.g. "forum", ".php", "index"), whether the page's markup contains Facebook Open Graph or Google AMP identifiers, and the number of "/" characters in the URL's path, in addition to the text-and-link density measures used by Readability.js. At a high level, the tree transduction step also looks at text-and-link-dense elements in the page, as well as special-cased embedded elements, such as YouTube or Vimeo videos.

DOM Distiller considers some render-level information in both the classification and tree transduction steps. For example, any elements that are hidden from display are not included in the text-and-link density measurements. These render-level checks are a small part of DOM Distiller's strategy. We modified DOM Distiller to remove these display-level checks, so that DOM Distiller could be applied to HTML documents before they are rendered. We note that the evaluation of DOM Distiller in this work uses this modified version, and not the version that Google ships with Chrome. We expect this modification to have minimal effect on the discussed measurements, but draw the reader's attention to this change for completeness.
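For intuition, the sketch below computes a few features of the kind described above using only the URL and the initial HTML. It is our illustration, not DOM Distiller's (or SpeedReader's) actual feature set or implementation; the keyword list and the checks are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical keyword list, in the spirit of the examples above.
URL_KEYWORDS = ("forum", ".php", "index")

def url_and_markup_features(url: str, initial_html: str) -> dict:
    """Illustrative pre-render features computed from URL + initial HTML only."""
    path = urlparse(url).path.lower()
    html = initial_html.lower()
    return {
        "url_keyword_hits": sum(k in path for k in URL_KEYWORDS),
        "path_depth": path.count("/"),                 # number of "/" in the URL path
        "has_open_graph": 'property="og:' in html,     # Facebook Open Graph markup
        "has_amp_markup": "amphtml" in html,           # crude Google AMP signal
        "document_length": len(html),                  # crude stand-in for text density
    }

print(url_and_markup_features(
    "https://example.org/blog/2018/10/post.php",
    '<html><head><meta property="og:type" content="article"></head></html>'))
```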
2.3 Comparison to SpeedReader

The reader mode functionality shipped with all current major browsers is applied after the document is fully fetched and rendered.1 This greatly restricts the performance, network, and privacy improvements existing reader modes can achieve. In fact, in some reader mode implementations we measured, using reader mode increased the amount of network used, as some resources were fetched twice: once for the initial page load, and then again when presenting images in the reader mode display.

Most significantly, SpeedReader differs from existing reader mode techniques in that it is implemented strictly before the display, rendering, and resource-fetching steps in the browser. SpeedReader can therefore be thought of as a function that sits between the browser's network layer and its rendering pipeline: it takes as input the initial HTML document and returns either the received HTML (when there is no readable subset) or a greatly simplified HTML document representing the reader mode presentation (when there is a readable subset). Figure 1 provides a high-level comparison of how SpeedReader functions, compared to existing reader modes.

[Figure 1: Comparison of SpeedReader (left) with other existing reader modes (right).]

The fact that SpeedReader only considers features available in the initial HTML and URL enables SpeedReader to achieve performance orders of magnitude above existing approaches. Figure 2 […]

1 While Readability.js does not require that the page be rendered before making reader mode evaluations, in practice Firefox does not expose reader mode functionality to users until after the page is fetched and loaded.
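Conceptually, the whole system reduces to a single function inserted between the network layer and the renderer. The sketch below is ours, not the paper's implementation; `is_probably_readable` and `transduce` are placeholder stand-ins for the classifier (Section 3) and the tree transducer (Section 4).

```python
def is_probably_readable(initial_html: str, url: str) -> bool:
    # Placeholder classifier: real feature extraction and a trained model
    # would go here (Section 3).
    return "<article" in initial_html.lower()

def transduce(initial_html: str) -> str:
    # Placeholder transducer: a real implementation extracts the readable
    # subset and rebuilds a minimal document around it (Section 4).
    return initial_html

def speedreader(initial_html: str, url: str) -> str:
    """Sits between the network layer and the renderer.

    Input: only the initial HTML response (no subresources fetched, no
    JavaScript executed). Output: either the original document, or a
    simplified reader-mode document when a readable subset is detected.
    """
    if is_probably_readable(initial_html, url):
        return transduce(initial_html)
    return initial_html
```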
[Figure 3: The example page transformed with each of the evaluated SpeedReader transducers.]
…to evaluate its accuracy against existing popular, deployed reader mode tools.

Data Set. To assess the accuracy of our classifier, we first gathered 2,833 websites, summarized in Table 1. Our data set is made up of three smaller sets of crawled data, each containing 1,000 URLs and each meant to focus on a different kind of page, with a different expected distribution of readability. 1,000 pages were URLs selected from the RSS feeds of popular news sites (e.g. The New York Times, ArsTechnica), which we expected to be frequently readable. The second 1,000 pages were the landing pages from the Alexa 1K, which we expected to rarely be readable. The final 1,000 pages were selected randomly from non-landing pages linked from the landing pages of the Alexa 5K, which we expected to be occasionally readable. We built a crawler that, given a URL, recorded both the initial HTML response and a screenshot of the final rendered page (i.e. after all resources had been fetched and rendered, and after JavaScript had executed). We applied our crawler to each of the 3,000 selected URLs. 167 pages did not respond to our crawler, accounting for the difference between the 3,000 selected URLs and the 2,833 pages in our data set.

Table 1: Description of data set used for evaluating and training "readability" classifiers.

  Data set        Number of pages    % Readable
  Article pages   956                91.8%
  Landing pages   932                1.5%
  Random pages    945                22.0%
  Total           2,833              38.8%

Finally, we manually considered each of the final page screenshots, and gave each a boolean label of whether there was a subset of page content that was readable. We considered a page readable if it met the following criteria:

(1) The primary utility of the page was its text and image content (i.e. not interactive functionality).
(2) The page contained a subset of content that was useful, without being sensitive to its placement on the page.
(3) The usefulness of the page's content was not dependent on its specific presentation or layout on the website.

This meant that single-page applications, index pages, and pages with complex layout were generally labeled as not readable, while pages with generally static content, and lots of text and content-depicting media, were generally labeled readable. We also share our labeled data,3 and a guide to the meaning behind the labels,4 to make our results transparent and reproducible.

3 https://github.com/brave/speedreader-paper-materials/blob/master/labels.csv
4 https://github.com/brave/speedreader-paper-materials/blob/master/labels-legend.txt

Evaluation. We evaluated our classifier on our hand-labeled corpus of 2,833 websites, performing a standard ten-fold cross-validation.
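A minimal sketch of such an evaluation using scikit-learn; the paper does not specify its tooling or model family here, so the random-forest model and the placeholder feature matrix and labels below are illustrative assumptions. With real features and labels in place of the placeholders, the same few lines mirror the ten-fold precision/recall protocol described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier   # model choice is illustrative
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

# X: one feature vector per page (from the initial HTML and URL only)
# y: the manual boolean "readable" labels
rng = np.random.default_rng(0)
X = rng.random((2833, 20))            # placeholder feature matrix
y = rng.random(2833) < 0.388          # placeholder labels (~38.8% readable)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = cross_val_predict(clf, X, y, cv=10)   # standard ten-fold cross-validation

print("precision:", precision_score(y, pred))
print("recall:   ", recall_score(y, pred))
```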
For comparison's sake, we also evaluated the accuracy of the classification functionality in Readability.js and our modified version of DOM Distiller when applied to the same data set, to judge their ability to predict the final readability state of each document given its initial HTML. We note that Readability.js is designed to be used this way, but that this prediction point is slightly different from how DOM Distiller is deployed in Chrome. In Chrome, DOM Distiller labels a page as readable based on its final rendered state. This evaluation of DOM Distiller's classification capabilities should therefore not be seen as an evaluation of DOM Distiller's overall quality, but only of its ability to achieve the kinds of optimizations sought by SpeedReader. Table 2 presents the results of this measurement. As the table shows, SpeedReader strictly outperforms the classification capabilities of both DOM Distiller and Readability.js. DOM Distiller has a higher false positive rate than our classifier, while Readability.js has a higher false negative rate.

Table 2: Accuracy measurements for three classifiers attempting to replicate the manual labels described in Table 1.

  Classifier               Precision    Recall
  ReadabilityJS            68%          85%
  DOM Distiller            90%          75%
  SpeedReader Classifier   91%          87%

3.3 Classifier Usability

Problem Statement. Our classifier operates on complete HTML documents, before they are rendered. As a result, the browser is not able to render the document until the entire initial HTML document is fetched. This is different from how current browsers operate, where websites are progressively rendered as each segment of the HTML document is received and parsed. This entails a trade-off between rendering delay (since rendering is delayed until the entire initial HTML document is fetched) and network and device resource use (since, when a page is classified as readable, far fewer resources will be fetched and processed).

In this subsection, we evaluate the rendering delay caused by our classifier under several representative network conditions. The rendering delay is equal to the time to fetch the entire initial HTML document. We find that the rendering delay imposed is small, especially compared to the dramatic performance improvements delivered when a page is readable (discussed in more detail in Section 4).

Classification Time. We evaluated the rendering delay imposed by our classifier by measuring the time taken to fetch the initial HTML for a page under different network conditions, and compared it against the time taken for document classification.
First, we determined how long our classifier took to decide whether a parsed HTML document was readable. We did so by parsing each HTML string with myhtml, a fast, open-source C++ HTML parser [4]. We then measured the execution time taken to extract the relevant features from the document and to return the predicted label. Our classifier took 2.8 ms on average, and 1.9 ms in the median case. Next, we measured the fixed simulation cost of serving each web page from a locally hosted web server, which allowed us to account for the fixed overhead of establishing the network connection and similar unrelated browser bookkeeping operations. This time was 22.3 ms on average, and 15.5 ms median.
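A rough Python analogue of this timing loop (the paper's implementation parses with myhtml in C++; the parser, features, and decision rule below are stand-ins):

```python
import statistics
import time

def classify_document(html):
    # Stand-ins for: parse the HTML, extract features, run the trained model.
    paragraphs = html.count("<p")
    anchors = html.count("<a")
    return paragraphs > anchors          # placeholder decision rule

def time_classification(pages):
    """Return (average, median) classification time in milliseconds."""
    timings = []
    for html in pages:
        start = time.perf_counter()
        classify_document(html)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings), statistics.median(timings)

pages = ["<html><body>" + "<p>text</p>" * 100 + "</body></html>"] * 50
avg_ms, med_ms = time_classification(pages)
print(f"avg {avg_ms:.3f} ms, median {med_ms:.3f} ms")
```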
Finally, we selected two network environments to represent different network conditions and device capabilities web users are likely to encounter: a fast, domestic broadband link, with 50 Mbps uplink/downlink bandwidth and 2 ms latency as indicated by a popular network speed testing utility,5 and a simulated 3G network, created using the operating system's Network Link Conditioner.6 We use the default 3G preset with 780 kbps downlink, 330 kbps uplink, 100 ms packet delay in either direction, and no additional packet loss. Downloading the documents took 1,372 ms / 652 ms (average/median) over the broadband connection and 4,023 ms / 2,516 ms over the simulated 3G connection. Figure 4 summarizes the results of those measurements. Overall, the approximately 2.8 ms taken for an average document classification is a tiny cost compared to just the initial HTML download, even on reasonably fast connections. It could potentially be further optimized by classifying earlier, i.e. when only a chunk of the initial document is available. Initial tests show promising results; however, this adds significant complexity to patching the rendering pipeline, and we leave it for future work.

[Figure 4: Time to fetch initial HTML document. Distribution over pages (share of pages vs. time in ms) for prediction time, the replayed trace, and curl over domestic broadband and simulated 3G; annotated medians: Classification = 1.9, Replay = 15.5, Broadband = 652, 3G = 2606.]

5 speedtest.net, a web service that provides analysis of Internet access performance metrics, such as connection data rate and latency.
6 Network Link Conditioner is a tool released by Apple with the Hardware IO Tools for Xcode developer tools, to simulate different connection bandwidth, latency, and packet loss rates.
3.4 Applicability to the Web

While subsequent sections will demonstrate the significant performance and privacy improvements provided by SpeedReader, these improvements are only available on a certain type of web document: those that have readable subsets. The performance improvements possible through SpeedReader are therefore bounded by the share of websites users visit that are readable.

In this subsection, we determine how much of the web is amenable to SpeedReader by applying our classifier to a sampling of websites representing different common browsing scenarios. Doing so allows us to estimate the benefits SpeedReader can deliver, as well as its relevance to the web. As presented in Table 3, we find that a significant number of visited URLs are readable, suggesting that SpeedReader can deliver significant privacy and performance improvements to users. This subsection continues by describing how we selected URLs in each browsing scenario.

Table 3: Measurements of how applicable our readability strategy is under common browser use scenarios.

  Measurement           # measured    # readable    % readable
  Popular pages         42,986        9,653         22.5%
  Unpopular pages       40,908        8,794         21.5%
  Total: Random crawl   83,894        18,457        22.0%
  Reddit linked         3,035         1,260         41.51%
  Twitter linked        494           276           31.2%
  RSS linked            506           331           65%
  Total: OSN            4,035         1,867         46.27%

Websites by popularity. We first estimated how many pages hosted on popular and unpopular domains are readable. To do so, we created two sets of domains: a popular set, consisting of the 5,000 most popular domains as determined by Alexa, and an unpopular set, comprising a random sample of domains ranked 5,001–100,000. For each domain, we conducted a breadth-three, depth-three crawl. We first visited the landing page for the domain and recorded all URLs linking to pages on the same TLD+1 domain. We then selected up to three URLs from this set and repeated the above process one more time, giving a maximum of 13 URLs per domain and a total data set of 91,439 pages. The crawl was conducted from AWS IP addresses on 17-20 October 2018.
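A sketch of this crawl policy, assuming a hypothetical `fetch_links` helper that returns same-site (TLD+1) links found on a page; it yields at most 1 + 3 + 9 = 13 URLs per domain:

```python
import random

def fetch_links(url):
    # Placeholder: fetch `url` and return linked URLs on the same TLD+1 domain.
    # Real code needs HTTP fetching, link extraction, and public-suffix matching.
    return []

def crawl_domain(landing_page, breadth=3, depth=3):
    """Breadth-3, depth-3 crawl: 1 landing page + up to 3 + up to 9 = 13 URLs."""
    collected = [landing_page]
    frontier = [landing_page]
    for _ in range(depth - 1):                  # two expansion rounds below the landing page
        next_frontier = []
        for page in frontier:
            links = fetch_links(page)
            picked = random.sample(links, min(breadth, len(links)))
            next_frontier.extend(picked)
        collected.extend(next_frontier)
        frontier = next_frontier
    return collected
```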
Social network shared content. We next estimated how much content linked to from online social networks is readable, to simulate a user who spends most of their browsing time on popular online social networks and generally only browses away to view shared content. We gathered URLs shared on Reddit and Twitter. We gathered links shared on Reddit by using RedditList [32] to obtain the top 125 subreddits, ranked by their number of subscribers. We then visited 25 posts from each popular subreddit and extracted any shared URLs. For Twitter, we extracted shared links from the top 10 worldwide Twitter trends by crawling their Tweets and extracting the links they shared.

RSS / feed readers. Finally, we estimated how much content shared through RSS feeds is readable, to simulate a user who finds content mainly through an RSS (or similar) aggregation service. We built a list of RSS-shared content by crawling the Alexa 1K, identifying websites that included RSS feeds, and fetching the five most recent pages of content in each RSS feed.

3.5 Conclusion

In this section we have described how SpeedReader determines whether a page should be rendered in reader mode, based on its initial HTML. We find that SpeedReader outperforms the classification capabilities of existing, deployed reader mode tools. We also find that the overhead imposed by our classification strategy is small and acceptable in most cases, and dwarfed by the performance improvements delivered by SpeedReader when a page is judged readable.

4 PAGE TREE TRANSDUCTION

This section describes how SpeedReader generates a reader mode presentation of a page, for pages that have been classified as readable. Our evaluation includes three possible reader mode renderings, each presenting a different trade-off between the amount of media included and the performance and privacy improvements achieved.

Generating a reader mode presentation of an HTML document can be thought of as translating one tree structure to another: taking the document represented by the page's initial HTML and generating the document containing a simplified reader mode version. This process of tree mapping is generally known as tree transduction.
We evaluate tree transduction by comparing the performance and privacy improvements of the three techniques (Readability.js, DOM Distiller, and BoilerPipe) described in detail in Section 2.2.
4.1 Limitations and Bounds

We note that we did not attempt any evaluation of how users perceive or enjoy the reader mode versions of pages rendered by each considered technique. We made this decision for several reasons. First, two of the techniques (Readability.js and DOM Distiller) are developed and deployed by large browser vendors, with millions or billions of users. We assume that these large companies have conducted their own usability evaluations of their reader mode techniques, and found them satisfactory to users.

Second, the third considered tree transduction technique, Kohlschütter et al.'s BoilerPipe [24], is an academic work that includes its own evaluation, showing that the technique can successfully extract useful content from HTML documents. We assume that the authors' evaluation is comprehensive and sufficient, and that their technique can successfully render pages in reader mode presentations. Finally, we are planning to deploy a tree transducer different from existing techniques, and a more thorough subjective evaluation of its presentation is left for future study.

4.2 Evaluation Methodology

We compared the performance and privacy improvements achieved through SpeedReader's novel application of three tree transduction techniques: Readability.js, DOM Distiller, and BoilerPipe. We conducted this evaluation in three stages.

First, we fetched the HTML of each URL in the random crawl data set outlined in Table 3, again from an AWS IP. The HTML considered here is only the initial HTML response, not the state of the document after script execution. We then determined which of the 91,439 fetched pages were readable by applying the SpeedReader classifier to each page, and reduced the data set to the 19,765 pages (21.62%) classified as readable. Table 4 summarizes this data set.

Table 4: Description of data set used for evaluating the performance implications of different content extraction strategies.

  Measurement                     Value
  Measurement date                17-20 October 2018
  # crawled domains               10,000
  # crawled pages                 91,439
  # domains with readable pages   4,931
  # readable pages                19,765
  % readable pages                21.62%

Second, we revisited each URL classified as readable to collect a complete version of the page. To minimize variations in page performance and content during testing, we collected a "replay archive" for each page using the "Web Page Replay" (WPR) [22] performance tool. WPR is used in Chrome's testing framework for benchmarking purposes and works as a proxy that, depending on whether it is in "record" or "replay" mode, either records network requests or responds to them itself instead of letting them through to the source.

Finally, we applied each of the three tree transduction techniques to the remaining 19,765 HTML documents, and compared the network, resource use, and privacy characteristics of each transformed page against the full version of each page. We evaluate the performance and privacy characteristics of each page by visiting the URL as replayed from its archive. These findings are described in detail in the next subsections.

We note that using a replay proxy with a snapshot of content often underestimates the costs of a page load. Despite taking care to mitigate the effects of non-determinism by injecting a small script that overrides date functions to use a fixed date, and random number generator functions to use a fixed seed and produce a predictable sequence of numbers, it cannot account for all sources of non-determinism. For all requests that the proxy cannot match, it responds with a Not Found response. We observed that this results in a small number of requests being missed, primarily those responsible for dynamic ad loading or tracking. It also occasionally interferes with a site publisher's custom resource-fetching retry logic, where the same request is retried a number of times unsuccessfully before the entire page load times out and the measurement is omitted.

4.3 Results: Performance

We measured four performance metrics: the number of resources requested, the amount of data fetched, the memory used, and the page load time. These results are summarized in Table 5 and Figure 5.

Table 5: Performance comparisons of three popular readability tree transducer strategies, as applied to the data set described in Table 4. Values are given as Average (A) and Median (M). The gain multiplier (×) is calculated for each page load, and average and median values are reported.

  Transducer      Resources (#)    Data (KB)        Memory (MB)    Load Time (ms)
                  A       M        A       M        A      M       A        M
  Default         144     91       2,283   1,461    197    174     1,813    1,069
  ReadabilityJS   5       2        186     61       85     79      583      68
  Dom Distiller   5       2        186     61       84     79      550      63
  BoilerPipe      2       2        101     61       81     77      545      44
  Gain (×)
  ReadabilityJS   51      28       84      24       2.4    2.1     20       11
  Dom Distiller   52      32       84      24       2.4    2.1     21       12
  BoilerPipe      77      48       84      24       2.4    2.1     27       15

We ran all measurements on AWS m5.large EC2 instances. For performance measurements, one test was executed at a time per instance. For each evaluation, we fetched the page from a previously collected record-replay archive, with performance tracing enabled. Once the page was loaded and the performance metrics were recorded, we closed the browser and proxy and started the next test. No further steps were taken to minimize the likelihood of test VM performance being impacted by interfering workloads on the underlying hardware. For all tests, we used an unmodified Google Chrome browser, version 70.0.3538.67, rendered in Xvfb.7 Although profiling has overheads of its own [33], in particular for memory use and load times, we used a consistent measurement strategy across all tests, and therefore expect the impact to also be consistent and minor compared to the relative performance gains.

7 While Chrome "headless" mode is available, it effectively employs a different page rendering pipeline, with different load time characteristics and memory footprint.
[Figure 5: Performance characteristics of the different tree transducer strategies applied, showing the distribution of the key performance metrics: share of pages vs. data downloaded (KB), memory footprint (MB), and load time (ms), for Normal Page, DOMDistiller, Firefox, and BoilerPipe.]

We measured a page's load time as the difference between the navigationStart and loadEventEnd events [46] in the main frame (i.e. the time until all sub-resources have been downloaded and the page is fully rendered). Since page content is replayed from a local proxy, the impact of network bandwidth and latency variation is minimized, and the reported load time is a very optimistic figure, especially for bigger pages with more sub-resources, as illustrated in Figure 4. Although the network cost is still non-zero, the number primarily reflects the time taken to process and render the entire page.
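A sketch of that measurement through the W3C Navigation Timing API, here driven with Selenium; the tooling choice and the local replay URL are our assumptions, not the paper's harness.

```python
from selenium import webdriver

NAV_TIMING_JS = (
    "return performance.timing.loadEventEnd - performance.timing.navigationStart;"
)

def measure_load_time_ms(url):
    """Load `url` in Chrome and report main-frame load time in milliseconds."""
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)   # requires chromedriver on PATH
    try:
        driver.get(url)  # returns after the load event fires in the main frame
        # Note: loadEventEnd may still be 0 on some pages if read too early;
        # production code would poll briefly until it is populated.
        return driver.execute_script(NAV_TIMING_JS)
    finally:
        driver.quit()

# Example (assumes a local replay proxy is already serving the page):
# print(measure_load_time_ms("http://127.0.0.1:8080/replayed-page"))
```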
We also recorded the number of resources fetched and the amount of data downloaded during each test. Note that the amount of data downloaded for all of the tree transduction strategies reflects the size of the initial HTML rather than that of the transformed document, as the transformation happens on the client and does not result in additional network traffic. All measured transducers discard the majority of page content (both in-page content like text and markup, and referenced content like images, video files, and JavaScript). Figures 2 and 3 provide an example of how tree transduction techniques simplify page content.

For memory consumption, we measure the overall memory used by the browser and its subprocesses. Google Chrome uses a multi-process model, where each tab and frame may run in a separate process, and the content of each page also affects what runs in the main browser process. We note that our testing scenario does not consider the case of multiple pages open simultaneously in the same browsing session, in which some resources would be reused. The reported number is therefore that of the entire browser rather than the specific page alone, and includes some fixed browser runtime overheads. Memory snapshots are collected with an explicit trigger after the page load is complete, with the disabled-by-default-memory-infra tracing category enabled. Despite including a level of fixed browser memory costs, we still see average memory reductions of up to 2.4× in the average and median cases. Overall, depending on the chosen transducer, we show:

• average speedups ranging from 20× to 27×;
• average bandwidth savings on the order of 84×;
• a 51× to 77× reduction in the number of requests;
• an average memory reduction of 2.4×.

4.4 Results: Privacy

SpeedReader achieves substantial privacy improvements because it applies the tree transduction step before rendering the document, and thus before any requests to third parties have been initiated. The privacy improvements gained by SpeedReader are threefold: a reduction in third-party requests, a reduction in script execution (an often necessary, though not sufficient, part of fingerprinting online), and a complete elimination of ad- and tracking-related requests (as labeled by EasyList and EasyPrivacy). This last measure is particularly important, since 92.8% of the 19,765 readable pages in our data set loaded resources labeled as advertising or tracking related by EasyList and EasyPrivacy [10, 11].

This subsection proceeds by describing both how we measured the privacy improvements provided by SpeedReader and the results of that measurement. These findings are presented in Table 6.

Table 6: Comparisons of the privacy implications of three popular readability tree transducer strategies, as applied to the data set described in Table 4. Values are given as averages and medians.

  Transducer      # third-party    # scripts     Ads & Trackers
                  Avg     Med      Avg    Med    Avg    Med
  Default         117     63       83     51     63     24
  ReadabilityJS   3       1        0      0      0      0
  Dom Distiller   3       1        0      0      0      0
  BoilerPipe      1       1        0      0      0      0

We measured the privacy gains provided by SpeedReader by first generating reader mode versions of each of the 19,765 readable URLs in our dataset, and counting the number of third parties, script resources, and ad and tracking resources in each generated reader mode page. We determined the number of ad and tracking resources by applying EasyList and EasyPrivacy, with an open-source ad-block Node library [21], to each resource URL included in the page. We then compared these measurements to the number of third parties, script units, and ad and tracking resource requests made in the typical, non-reader mode rendering of each URL.
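A sketch of this counting step; it uses the Python adblockparser package rather than the Node library the paper used, matches third parties by hostname rather than by registered (TLD+1) domain, and identifies scripts by file extension, all of which are simplifying assumptions.

```python
from urllib.parse import urlparse
from adblockparser import AdblockRules  # pip install adblockparser

def load_rules(*filter_files):
    """Load EasyList/EasyPrivacy-style filter files into one rule set."""
    lines = []
    for path in filter_files:            # e.g. easylist.txt, easyprivacy.txt
        with open(path, encoding="utf-8") as f:
            lines.extend(f.read().splitlines())
    return AdblockRules(lines)

def privacy_counts(page_url, resource_urls, rules):
    """Count third-party, script, and ad/tracker resources for one page load."""
    page_host = urlparse(page_url).hostname or ""
    third_party = sum(1 for r in resource_urls
                      if (urlparse(r).hostname or "") != page_host)
    scripts = sum(1 for r in resource_urls if urlparse(r).path.endswith(".js"))
    ads_trackers = sum(1 for r in resource_urls if rules.should_block(r))
    return {"third_party": third_party, "scripts": scripts,
            "ads_trackers": ads_trackers}

# Example:
# rules = load_rules("easylist.txt", "easyprivacy.txt")
# print(privacy_counts(page_url, resource_urls, rules))
```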
We found that all three of the evaluated tree transduction techniques dramatically reduced the number of third parties communicated with, and removed all script execution and all ad and tracking resource requests from the page. Put differently, on readable pages SpeedReader achieves privacy improvements at least as good as, and almost certainly exceeding, those of existing ad and tracking blockers. This claim is based on the observation that ad and tracking blockers do not achieve the same significant reduction in third-party communication and script execution as SpeedReader achieves.
5 DISCUSSION AND FUTURE WORK

5.1 Reader Mode as a Content Blocker

Most existing reader mode tools aim to improve the presentation of page content for readers, by removing distracting content and reformatting text for the browser user's benefit. While the popularity of existing reader modes suggests that this is a beneficial use case, the findings in this work suggest an additional use case for reader modes: blocking advertising- and tracking-related content.

As discussed in Section 4.4, SpeedReader prevents all ad- and tracking-related content from being fetched and rendered, as identified by EasyList and EasyPrivacy (Table 6). SpeedReader also loads between 51 and 77 times fewer resources than typical page rendering and reader modes (Table 5), a non-trivial number of which are likely also ad and tracking related. SpeedReader differs fundamentally from existing content blocking strategies. Existing popular tools, like uBlock Origin [20] and AdBlock Plus [15], aim to identify malicious or undesirable content and prevent it from being loaded or displayed; all unlabeled content is treated as desirable and loaded as normal. SpeedReader, and (at least conceptually) reader modes in general, take the opposite approach: reader modes try to identify desirable content, and treat all other page content as undesirable or, at least, unneeded.
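The difference in philosophy can be caricatured in a few lines (ours, not either tool's real logic): content blockers default-allow and drop what a filter list flags, while reader modes default-drop and keep only what is identified as content.

```python
# Content blocker: default-allow, remove what a filter list flags as bad.
def content_blocker(requests, is_flagged):
    return [r for r in requests if not is_flagged(r)]

# Reader mode: default-drop, keep only what is identified as the content.
def reader_mode(page_elements, is_content):
    return [e for e in page_elements if is_content(e)]
```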
Our results suggest that the reader mode technique can achieve ad and tracking blocking of quality at least as good as existing content blocking tools, but with dramatic performance improvements. We expect that SpeedReader actually outperforms content blocking tools (as content blockers suffer from false-negative problems, for a variety of reasons), but we lack a ground-truth basis to evaluate this claim further. We suggest evaluating the content blocking capabilities of reader-mode-like tools as a compelling area for future work.

5.2 Comparison to Alternatives

SpeedReader exists among other systems that aim to improve the user experience of viewing content on the web. While a full evaluation of these systems is beyond the scope of this work (mainly because the compared systems have different goals and place different restrictions on users), we note them here for completeness.

[…] First, user privacy is harmed, since the rendering server must manage and observe all client secrets when interacting with the destination server on the client's behalf. Additionally, while the server may be able to improve the loading and rendering of the page, it is limited in the kinds of performance improvements it can achieve. Server-assisted rendering does not provide any of the presentation simplification or content blocking benefits provided by SpeedReader.

5.3 SpeedReader Deployment Strategies

Always On. SpeedReader as described in this work is designed to be "always on", attempting to provide a readable presentation of every page fetched. Although Safari Reader View also supports an "always on" functionality, it lacks the performance and privacy enhancements provided by SpeedReader (Section 2). While this decision maximizes the amount of privacy and performance improvement provided, it entails an overhead while loading each page (Figure 4), which may not be worthwhile in some browsing patterns, such as interacting with application-like sites. Additionally, there may be times when users want to maintain a page's interactive functionality (e.g. JavaScript), even when SpeedReader has determined that the page is readable. Ensuring the user's ability to disable SpeedReader would be important in such cases. The system described in this work does not preclude such an option, but only imagines changing the default page loading behavior.8

Tree Transduction Improvements. The three techniques evaluated in Section 4, which are adapted from existing tools and research, can provide a reader mode presentation with different performance and privacy improvements. Users of SpeedReader could select which tree transduction technique best suits their needs. However, we expect that machine learning and similar techniques could be applied to the tree transduction problem, to provide a reader mode presentation that exceeds existing techniques. An improved tree transduction algorithm would achieve equal or greater performance and privacy improvements, while doing a better job of maintaining the meaning and information of the extracted content. We are currently exploring several options in this area, but have found the problem large enough to constitute its own unique work.
[47] Tim Weninger, William H. Hsu, and Jiawei Han. 2010. CETR: Content Extraction via Tag Ratios. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, NY, USA, 971–980. https://doi.org/10.1145/1772690.1772789
[48] Shanchan Wu, Jerry Liu, and Jian Fan. 2015. Automatic Web Content Extraction by Combination of Learning and Grouping. In Proceedings of the 24th International Conference on World Wide Web (WWW '15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1264–1274. https://doi.org/10.1145/2736277.2741659