Social Streams Blog Crawler
Abstract: Weblogs, and other forms of social media, differ from traditional web content in many ways. One of the most important differences is the highly temporal nature of the content. Applications that leverage social media content must, to be effective, have access to this data with minimal publication/acquisition latency. An effective weblog crawler should satisfy the following requirements: low latency, high scalability, high data quality and appropriate network politeness. In this paper, we outline the weblog crawler implemented in the Social Streams project and summarize the challenges faced during development.
I. INTRODUCTION
Social media (weblogs, message boards, customer reviews, etc.) is distinguished from traditional web content in a number of important ways. Firstly, the content is authored by an identifiable contributor; secondly, the content has a well defined temporal nature; and thirdly, the discrete documents, while composed into HTML pages, are generally subpages (e.g. posts in a thread on a message board, or posts on the home page of a weblog). Each of these aspects presents opportunities (modelling a specific author provides valuable features for understanding the content in a number of scenarios) as well as challenges.

Let us illustrate some of those challenges with an example found in the two leading weblog search engines, Google's Blog Search and Technorati. Weblog content is highly temporal: it is important to the author that it be published immediately, and it is important to the reader to be able to read it as soon as it is published. An integral part of the blogging ecosystem is the ping/feed mechanism which supports these requirements. A ping is produced when a weblog is published, and a feed is used to syndicate the content. However, as the author has clear motivations to get the reader to the web version of their blog (where monetization often occurs, and where access to other related content can be found), many bloggers opt for a partial feed that includes only a limited amount of the entire blog post. The ping/feed mechanism makes aggregating weblog content easy, but at the expense of missing a lot of the content due to these partial feeds. A solution taken by Google and Technorati is to crawl and index the permalink found in the feed: the URL of the HTML version of the blog post. However, these HTML versions of the blog post contain all the ancillary elements providing advertising, navigation, blog rolls, etc. Thus, a search on these engines will often return what appears to be timely information but is in fact only ancillary content.
These challenges are multiplied across the large variety of social media types, requiring custom crawling solutions for weblogs, message boards, micro-blogs, reviews and so on. Some blog hosts provide public APIs which make crawling very easy. For example, Six Apart streams its data through an endless HTTP connection. This mechanism makes crawling trivial and removes the load imposed by crawlers from the blog host. Unfortunately, very few blog hosting providers have adopted this approach.

Social Streams is an ongoing project at Microsoft's Live Labs. The project aims to acquire all social media content and make it generally available. This paper outlines the weblog crawling system currently deployed in our production environment.

II. REQUIREMENTS
From a very high level, a blog crawler is a black box which outputs a stream of documents. Each document is a blog post. A document includes fixed and optional fields. The fixed fields are publication date, author, title and body. Optional fields may include tags, language, etc. While most modern programming languages have a library for fetching documents over HTTP, the specifics of social media aggregation pose some unique challenges.

Real-time. The information in blogs is time-sensitive. In most scenarios, it is very important to obtain and handle a blog post within a short time period after it was published, often minutes. By contrast, a regular web crawler does not have this requirement: in the general web crawling scenario, it is much more important to fetch a large collection of high-quality documents.

Coverage. It is important to fetch the entire blogosphere. However, if resources do not allow this, it is more desirable to get all data from a limited set of blogs rather than less data from a bigger set of blogs (these two aspects of coverage may be termed comprehension and completeness).

Scale. As of now, the size of the blogosphere is on the order of a few hundred million blogs. According to [5], 184 million people worldwide have started a blog. Technorati estimates the number of blog posts at about 900,000 per day [4].

Data Quality. The crawler should output good quality, uncorrupted data. There should be no web spam in the output, text should be properly decoded, etc.
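As a concrete illustration of the output document described above, the sketch below shows one possible shape for a crawler output record. The class and field names are our own illustrative assumptions, not the schema actually used by Social Streams.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class BlogPost:
    """One output document of the crawler: fixed fields plus optional metadata."""
    # Fixed fields
    publication_date: datetime
    author: str
    title: str
    body: str
    # Optional fields
    tags: List[str] = field(default_factory=list)
    language: Optional[str] = None
    permalink: Optional[str] = None  # URL of the HTML version of the post
```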
III. OBJECTIVES AND HIGH-LEVEL DESIGN
We assume that the objective of the crawler is to fetch feeds, extract new blog posts from the feeds and send the blog posts to the downstream system for further processing. The crawler maintains a list of feeds which it is working on. The crawler runs under a set of constraints (memory, CPU, network bandwidth, politeness, etc.); therefore, there is an upper bound on the number of feeds which can be downloaded. A scheduling subsystem ensures that these resources are spent in the best possible way. The fetching subsystem performs the actual download of the feeds. The parsing subsystem extracts blog posts from the feeds. The list of feeds is created in a separate list creation subsystem. The scheduler, the fetcher and the parser run in the same process on the same machine (or set of machines); the list creation system runs asynchronously on a separate machine. Fig. 1 shows the architecture of the blog crawler.
Fig. 1: System Architecture (each machine runs its own Scheduler, Fetcher and Parser)

The crawler can be trivially distributed over multiple machines. In the distributed configuration, each machine runs its own instances of the scheduler, the fetcher and the parser. Each scheduler instance has its own partition of a master blog list. The partition is based on the IP addresses of the blog hosts; each IP address is assigned to a single machine. There is no particular criterion for the initial grouping of IP addresses into partitions. The partitioning can be fine-tuned later to ensure robust load balancing.

A. Scheduling Subsystem
The scheduling subsystem interfaces with the fetching subsystem by issuing URL requests one by one whenever the fetcher has the capacity to download one more URL. The scheduler's algorithm is to return the URL with the highest temporal and static priority. Therefore, the system is self-regulated. If the fetcher runs in a high-bandwidth environment, the bandwidth will be utilized and the rate of (simultaneous) fetching will be high. In a low-bandwidth environment, only high-priority URLs will be fetched.

The priority of a URL has a temporal and a static part. The static part is provided by the list creation system and reflects the blog's quality and importance. The static priority can also be set by an operator. The temporal priority of a blog is the probability that a new post has been published on the blog. For simplicity, we use a tiered approach for the static priority: we maintain multiple tiers of blogs (high, medium and low). It is relatively easy to fine-tune the system in different environments by adjusting the size of each tier to ensure that all URLs from a higher tier are served before the lower tier.

A good approximation of the temporal priority ensures the efficiency of the system. A false positive in the estimate of the probability of a new post results in a wasted download. A false negative results in higher latency or even data loss (usually, feeds contain the last 10 or 20 posts, so a missed post will be downloaded on the next retry; however, if we accumulate more than 20 false negatives for the same blog, we will lose some posts). A model in which each blog notifies the crawler whenever it has a new post reduces both false positives and false negatives. Fortunately for us, the ping server infrastructure enables such a model.

A ping server is a service which aggregates blog updates. Whenever a blog receives a new post, its blog host sends this information to a ping server. A few public ping servers publish a list of blogs which were updated every 5 minutes; the most popular ones are weblogs.com and blo.gs. Social Streams also hosts its own ping server, and we have an agreement with some blog hosts to ping it. Pinging is beneficial for both parties: the blog aggregator and the blog host. The blog aggregator benefits from better crawl efficiency, as we have seen earlier. The blog host benefits from a lower load imposed on its system by blog aggregators. In addition, a blog aggregator is more likely to publish links to a blog which pings it, which is a clear benefit for the blog host.

The temporal priority equals the time elapsed since the last ping from the blog; it is zero once we have scheduled the blog after the last ping. For blogs which do not ping, we predict the approximate time of a new post based on the publication dates of the last few posts. We model the distribution of the time interval elapsed between posts as normal. For each blog, we compute the mean and standard deviation of the time between posts. The approximate time of the next post is the time of the previous post plus the mean time between posts plus several standard deviations.
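A minimal sketch of this prediction, assuming the normal model of inter-post gaps described above; the function name and the choice of k standard deviations are illustrative assumptions, not the deployed implementation.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import List

def predict_next_post(post_times: List[datetime], k: float = 1.0) -> datetime:
    """Estimate when the next post will appear from past publication dates.

    Models inter-post gaps as roughly normal and returns
    last_post + mean_gap + k * stddev_gap.
    """
    if not post_times:
        raise ValueError("need at least one post time")
    times = sorted(post_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        # Not enough history to estimate a deviation; assumed fallback interval.
        return times[-1] + timedelta(days=1)
    mu, sigma = mean(gaps), stdev(gaps)
    return times[-1] + timedelta(seconds=mu + k * sigma)
```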
A study on a set of a thousand blogs was performed. We found that 95% of all entries are posted before the mean plus one standard deviation after the previous entry. Fig. 2 shows a histogram of the z-scores of the times between posts; we compute the z-score by subtracting the mean and dividing by the standard deviation for each blog. The distribution has a clear peak before the mean.

Fig. 2: Distribution of z-scores of elapsed time between posts

The scheduler loads the entire list of URLs into memory. The list can be quickly accessed using the URL hash to index a hash table. Heap-based priority queues are used to manage priority-based ordering [13]. As we will see below, politeness constraints allow only a few simultaneous connections per IP address. To facilitate this, the scheduler groups URLs into buckets by their IP; there is a priority queue for each IP bucket and a priority queue of buckets. This scheduler design was able to schedule several hundred thousand URLs per second.
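The two-level queue structure can be sketched as follows. This is a simplified illustration under our own assumptions (priorities are plain numbers with smaller meaning more urgent, and politeness timing is omitted), not the production scheduler.

```python
import heapq
from collections import defaultdict

class Scheduler:
    """Two-level heap: a priority queue per IP bucket, and a heap of buckets."""

    def __init__(self):
        self.buckets = defaultdict(list)   # ip -> heap of (priority, url)
        self.bucket_heap = []              # heap of (best_priority_seen, ip)

    def add(self, ip, url, priority):
        heapq.heappush(self.buckets[ip], (priority, url))
        # Register the bucket with its current best priority (lazy updates).
        heapq.heappush(self.bucket_heap, (self.buckets[ip][0][0], ip))

    def next_url(self):
        """Return the highest-priority URL across all IP buckets, or None."""
        while self.bucket_heap:
            prio, ip = heapq.heappop(self.bucket_heap)
            bucket = self.buckets[ip]
            if not bucket:
                continue                   # stale entry for an emptied bucket
            if bucket[0][0] != prio:
                # Stale priority; re-register the bucket with its current best.
                heapq.heappush(self.bucket_heap, (bucket[0][0], ip))
                continue
            _, url = heapq.heappop(bucket)
            if bucket:
                heapq.heappush(self.bucket_heap, (bucket[0][0], ip))
            return url
        return None

# Usage sketch (hypothetical IPs and feed URLs):
sched = Scheduler()
sched.add("10.0.0.1", "http://blog-a.example.com/feed", priority=0.2)
sched.add("10.0.0.2", "http://blog-b.example.com/feed", priority=0.7)
print(sched.next_url())  # http://blog-a.example.com/feed (smallest number first)
```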
B. Fetching Subsystem
The goal of the fetching subsystem is to fetch as many URLs as possible while observing standard politeness constraints. The fetcher uses a single thread which serves multiple connections; non-blocking socket APIs are used. Therefore, the fetcher is capable of saturating the network stack while maintaining a low CPU load. In case the CPU becomes a bottleneck, several threads may be used. HTTP persistent connections are used, i.e., the same TCP connection is reused to fetch several feeds from the same IP.

C. Parser
The parser receives feeds from the fetcher, extracts individual posts, decodes them and extracts metadata. The published date is extracted from each post; the sequence of dates is used by the scheduler to predict the time of the next post. Parsed information is sent to the content store for archiving (not discussed here).

D. List Creation Subsystem
The list creation subsystem creates the list of blogs to be fetched by the crawler. By itself, blog discovery and ranking is an important research area; here we present preliminary thoughts and results. We obtain new URLs from a multitude of sources:
- Ping servers. It is important to note that ping servers expose lots of spam and non-blog feeds.
- The Live Search crawl.
- The Social Streams content store. It is customary for bloggers to include links to other blogs in their blog posts.

The ping infrastructure is heavily abused by spammers trying to promote their spam pages. There are also many non-blog sites (such as news and photo sites) which use ping servers to publish their updates. Currently, we use rule-based and content-based classifiers to filter out spam pages and non-blogs. For blogs with a known URL but an unknown feed URL, the feed URL may be discovered by following <link> and <a href> tags in some vicinity of the initial URL (autodiscovery). Any method of feed URL harvesting may result in multiple feeds which contain the same posts. Some blogs publish a variety of feeds for the same blog; for instance, a blog may have a feed with all new posts, a feed with posts from a particular author or a feed with posts on a particular topic. These duplicate feeds should be removed to reduce the unneeded load on the system.

IV. CHALLENGES AND HOW OUR CRAWLER ADDRESSES THEM
A. Scale and Real-Timeliness
The top priority for a regular web crawler is to download as many documents as possible. This type of web crawler discovers new URLs as it goes. Obviously, the list of URLs is so huge that it cannot be kept in memory and is kept on disk. Disk-based databases cannot be queried randomly in an efficient way; therefore, the URL database is sorted and the queries are batched and sorted as well. The batch query process iterates over both the database and the list of batched queries, so the complexity is linear regardless of the size of the query batch. It is clearly advantageous to accumulate as many queries as possible before running them, since iteration over a disk is a slow operation. For instance, IRLbot [2] re-queries the URLSeen database every 1-2 days. There is a tradeoff between lateness and the number of URLs.

Fortunately for us, the number of active blogs in the blogosphere is on the order of 50 million or less [10]. On average, each URL is about 80 characters, so the entire list will fit into 4GB of memory. There are a number of approaches one can take if more than 50 million URLs need to be crawled. Using tries yields roughly threefold compression. If some latency is allowed, URLs may be stored on disk and a minimal perfect hash [11] may be used to index into a file. In case we only work with blogs that ping, we do not need to store the full list of URLs; a Bloom filter may be used to determine whether or not a URL belongs to the list. It is also very easy to distribute the crawl over multiple machines if needed, just by partitioning the list and running each partition on a separate machine. However, different partitions should not contain URLs from the same host, so that politeness can be enforced.
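For the ping-only case mentioned above, a Bloom filter answers "does this URL belong to our list?" without storing the URLs themselves. The following is a minimal sketch under assumed parameters (50 million expected URLs, 1% false-positive rate); it is an illustration, not the project's implementation.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter for URL membership tests (illustrative sketch)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.size = max(1, int(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round((self.size / expected_items) * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, url):
        # Derive k independent positions by salting the URL before hashing.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

# Usage: load the known blog list once, then test incoming pings.
known_blogs = BloomFilter(expected_items=50_000_000, false_positive_rate=0.01)
known_blogs.add("http://example.wordpress.com/feed/")
print("http://example.wordpress.com/feed/" in known_blogs)  # True
print("http://spam.example.com/feed/" in known_blogs)       # False with high probability
```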
B. Politeness
There are two important reasons for observing politeness when crawling. First, any internet or web host has denial-of-service attack protection. If we start bombarding a host with requests and connections, our requests will be temporarily denied, which will significantly slow down the fetching process. Second, the owner of the site may be unhappy that we consume excessive bandwidth and may turn off our access permanently.

There are two aspects to politeness: robots.txt and per-IP politeness. Robots.txt is a standard which allows site owners to exclude certain parts of their site from being crawled [12]. Excluded URLs are removed from the list either by the list creation system or by the scheduler. We take a slightly different approach to per-IP politeness than most crawlers. Usually, crawlers limit the rate of requests to a specific IP; we cannot apply this approach directly since we issue multiple requests on the same connection. Instead, we try to model the behaviour of the crawler on a user browsing through the web site. Usually, opening a single page results in a few requests being sent back to the server, and most web browsers open two connections to issue these requests. Therefore, we open no more than two connections per server, retrieve a few feeds on each connection, and retrieve feeds from the same IP again only after a pause.

A disproportionately large number of blogs are hosted on the same host. For instance, 14% of all pings on our system come from blogs on qq. Other major blog hosts include LiveJournal, Blogger and WordPress. If we assume that we can politely fetch about one page per second from a given host, we can make on the order of a hundred thousand requests per 24 hours. This should be accounted for when prioritizing blogs for the crawl.

C. Feed Variety
There is a large variety of formats for web feeds. There are several major versions of RSS and two versions of Atom, and not all blogs choose to strictly follow the specifications; some blogs expose malformed feeds. The parser does not rely on a feed conforming to a particular standard. Instead, a set of XPath expressions probes the acquired feeds robustly. Some hosts do not maintain the uniqueness of permalinks: every time the feed is loaded, all permalinks are regenerated and do not match the permalinks from the previous load. We had to exclude such feeds from the crawl.

D. Partial Feed Detection and Full Feed Extraction
As we said earlier, many blogs put only abbreviated posts in their feeds; the full content of a post is available only on the post's HTML page. The blog crawler detects such blogs and extracts the full post content from the HTML page. The detection algorithm is rule-based. It uses heuristics such as the length of each post in a feed, the presence of an ellipsis at the end of feed posts, etc. Most of the time, the post page contains lots of navigational and other content in addition to the post content. The entire post page could simply be crawled, but this would result in data quality degradation.
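As a rough illustration of the detection rules mentioned above, the following sketch applies a length threshold and a truncation-marker check to a feed's entries. The specific threshold, markers and function names are our own assumptions for illustration, not the production rules.

```python
# Illustrative heuristics for spotting abbreviated feed entries.
TRUNCATION_MARKERS = ("...", "\u2026", "[...]", "Read more", "Continue reading")
MIN_FULL_POST_CHARS = 500  # assumed threshold, not the deployed value

def looks_partial(post_text: str) -> bool:
    """Return True if a single feed entry looks like an abbreviated post."""
    text = post_text.strip()
    if len(text) < MIN_FULL_POST_CHARS:
        return True
    return any(text.endswith(marker) for marker in TRUNCATION_MARKERS)

def feed_is_partial(entries, threshold=0.8):
    """Flag a feed as partial if most of its entries look abbreviated."""
    if not entries:
        return False
    partial = sum(1 for entry in entries if looks_partial(entry))
    return partial / len(entries) >= threshold
```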
A simple, heuristic-based technique may be used to extract the full blog post instead. The HTML page is loaded into a DOM hierarchy, and a textual node (a <div> or <span>) which spans the entire snippet from the feed is identified. This node is most likely to contain the blog post.

V. CONCLUSION
This paper presented the challenges of building a blog crawler together with a description of how these challenges are overcome. The crawler acquires posts from a list of blogs; effectively, the problem of blog crawling is reduced to finding a good set of blogs. The problem of discovering and ranking blogs is still work in progress, and we have identified some approaches to it. The presented crawler is successfully used by the Social Streams project. On average it crawled about 100 posts per second, with peaks of up to 400 posts per second. The crawler supplied data to Blews [6], the Social Streams Political Portal [9] and other systems. Future work includes improving blog list creation, including better spam filtering. We will also work on formalizing the data acquisition of other types of social media (Usenet, microblogs, forums, etc.).

ACKNOWLEDGEMENTS
We would like to thank the entire Social Streams team and especially Justin Rudd for his contribution to the data acquisition system.

VI. REFERENCES
[1] A. Heydon and M. Najork, "Mercator: A Scalable, Extensible Web Crawler," World Wide Web, vol. 2, no. 4, pp. 219-229, Dec. 1999.
[2] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," WWW, April 2008 (best paper award).
[3] K. C. Sia, J. Cho, and H.-K. Cho, "Efficient Monitoring Algorithm for Fast News Alerts," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, July 2007.
[4] Technorati, State of the Blogosphere report, http://technorati.com/blogging/state-of-the-blogosphere/
[5] Universal McCann, "Power to the People: Social Media Tracker, Wave 3."
[6] M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, and A. C. König, "BLEWS: Using Blogs to Provide Context for News Articles," Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2008.
[7] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," Department of Computer Science, Texas A&M University.
[8] N. Glance, "Indexing the Blogosphere One Post at a Time," Third International Workshop on Web Document Analysis (WDA 2005), 2005.
[9] Live Labs Social Streams, http://livelabs.com/social-streams/
[10] Technorati, State of the Blogosphere Report, 2008.
[11] K. Chellapilla, A. Mityagin, and D. Charles, "GigaHash: Scalable Minimal Perfect Hashing for Billions of URLs," Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 2007, pp. 1165-1166.
[12] The Web Robots Pages, http://www.robotstxt.org/
[13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed., MIT Press and McGraw-Hill, ISBN 0-262-53196-8, Chapter 19: Binomial Heaps.