Lecture 3


Web Crawling

Based on the slides by Filippo Menczer (Indiana University School of Informatics), accompanying Web Data Mining by Bing Liu

Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts

Q: How does a search engine know that all these pages contain the query terms? A: Because all of those pages have been crawled

Crawler: basic idea

[Figure: the crawl expands outward from the starting pages (seeds) by following links]

Many names
Crawler, spider, robot (or bot), Web agent, wanderer, worm, ...
And famous instances: googlebot, scooter, slurp, msnbot, ...

Googlebot & you

Motivation for crawlers


Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
Business intelligence: keep track of potential competitors, partners
Monitor Web sites of interest
Evil: harvest emails for spamming, phishing
Can you think of some others?

A crawler within a search engine


[Figure: googlebot fetches pages from the Web into a page repository; text & link analysis build the text index and PageRank, which the ranker uses to return hits for a query]

One taxonomy of crawlers


Crawlers
  Universal crawlers
  Preferential crawlers
    Focused crawlers
    Topical crawlers
      Adaptive topical crawlers
        Evolutionary crawlers
        Reinforcement learning crawlers
        etc...
      Static crawlers
        Best-first
        PageRank
        etc...

Many other criteria could be used:
Incremental, interactive, concurrent, etc.

Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts


Basic crawlers
This is a sequential crawler
Seeds can be any list of starting URLs
Order of page visits is determined by the frontier data structure
Stop criterion can be anything

Graph traversal (BFS or DFS?)


Breadth First Search
Implemented with a QUEUE (FIFO)
Finds pages along shortest paths
If we start with good pages, this keeps us close; maybe other good stuff is nearby

Depth First Search
Implemented with a STACK (LIFO)
Tends to wander away ("lost in cyberspace")


A basic crawler in Perl


Queue: a FIFO list (shift and push)

# read_seeds, fetch, add_to_index, extract_links and process
# are helper subs defined elsewhere in the full example
my ($tot, $max) = (0, 100);            # stop after $max pages
my @frontier = read_seeds($file);      # seed URLs
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;   # dequeue the next URL
    my $page = fetch($next_link);      # download the page
    add_to_index($page);               # index its text
    my @links = extract_links($page, $next_link);
    push @frontier, process(@links);   # enqueue the new URLs
    $tot++;
}

A workable example

Implementation issues
Don't want to fetch the same page twice!
Keep a lookup table (hash) of visited pages (see the sketch after this list)
What if a page has not been visited yet but is already in the frontier?

The frontier grows very fast!


May need to prioritize for large crawls

Fetcher must be robust!


Don't crash if a download fails
Use a timeout mechanism

Determine file type to skip unwanted files


Can try using file extensions, but they are not reliable
Can issue HEAD HTTP requests to get the Content-Type (MIME) header, but this adds the overhead of extra Internet requests
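A minimal sketch of the duplicate-detection idea above, in the same style as the earlier Perl crawler and reusing its assumed helper subs; the %seen hash records every URL that has ever been fetched or placed in the frontier:

my %seen;                                   # URLs already fetched or enqueued
my ($tot, $max) = (0, 100);
my @frontier = grep { !$seen{$_}++ } read_seeds($file);
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;
    my $page = fetch($next_link) or next;   # soft fail on errors
    add_to_index($page);
    for my $url (process(extract_links($page, $next_link))) {
        push @frontier, $url unless $seen{$url}++;   # skip duplicates
    }
    $tot++;
}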


More implementation issues


Fetching
Get only the first 10-100 KB per page
Take care to detect and break redirection loops
Soft fail for timeouts, server not responding, file not found, and other errors (see the sketch below)
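A sketch of one possible fetch helper using the standard LWP::UserAgent module; the size cap, timeout, and redirect limit are illustrative values, not prescriptions from the slides:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    timeout      => 10,         # give up on unresponsive servers
    max_size     => 100_000,    # fetch at most ~100 KB per page
    max_redirect => 5,          # break long redirection chains
);

sub fetch {
    my ($url) = @_;
    my $res = $ua->get($url);
    return $res->is_success ? $res->decoded_content : undef;   # soft fail
}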


More implementation issues: Parsing


HTML has the structure of a DOM (Document Object Model) tree
Unfortunately, actual HTML is often incorrect in a strict syntactic sense
Crawlers, like browsers, must be robust/forgiving
Fortunately there are tools that can help, e.g. tidy.sourceforge.net

Must pay attention to HTML entities and Unicode in text
What to do with a growing number of other formats?
Flash, SVG, RSS, AJAX, ...
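To make the parsing step concrete, here is a sketch of the extract_links helper assumed in the earlier Perl code, using the HTML::LinkExtor and URI modules; it tolerates sloppy markup and resolves relative links against the page's base URL:

use HTML::LinkExtor;
use URI;

sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && $attr{href};   # keep anchor hrefs only
    });
    $parser->parse($html);
    $parser->eof;
    return map { URI->new_abs($_, $base)->as_string } @links;     # make URLs absolute
}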

More implementation issues


Stop words
Noise words that do not carry meaning should be eliminated ("stopped") before they are indexed
E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc.
Typically syntactic markers
Typically the most common terms
Typically kept in a negative dictionary
10-1,000 elements
E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words

Parser can detect these right away and disregard them
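A minimal sketch of such a negative dictionary, assuming a plain one-word-per-line stop word file like the Glasgow list referenced above:

# load the negative dictionary into a hash for constant-time lookup
open my $fh, '<', 'stop_words.txt' or die "cannot open stop list: $!";
my %stop = map { chomp; lc($_) => 1 } <$fh>;
close $fh;

# keep only tokens that are not stop words
sub remove_stopwords {
    my @tokens = @_;
    return grep { !$stop{lc $_} } @tokens;
}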


More implementation issues


Conflation and thesauri
Idea: improve recall by merging words with the same meaning

1. We want to ignore superficial morphological features, thus merge semantically similar tokens
{student, study, studying, studious} => studi

2. We can also conflate synonyms into a single form using a thesaurus


30-50% smaller index
Doing this in both pages and queries allows us to retrieve pages about automobile when the user asks for car
The thesaurus can be implemented as a hash table


More implementation issues


Stemming
Morphological conflation based on rewrite rules
Language dependent!
Porter stemmer very popular for English
http://www.tartarus.org/~martin/PorterStemmer/

Context-sensitive grammar rules, e.g.:
IES except (EIES or AIES) --> Y
Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.

Porter has also developed Snowball, a language to create stemming algorithms in any language
http://snowball.tartarus.org/
Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
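A brief sketch using the Lingua::Stem::Snowball module mentioned above; the word list is just an illustration:

use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new(lang => 'en');
my @words   = qw(student study studying studious);
my @stems   = $stemmer->stem(\@words);   # morphological variants collapse toward a common root
print "@stems\n";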


More implementation issues


Static vs. dynamic pages
Is it worth trying to eliminate dynamic pages and only index static pages?
Examples:
http://www.census.gov/cgi-bin/gazetteer
http://informatics.indiana.edu/research/colloquia.asp
http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452

http://www.imdb.com/Name?Menczer,+Erico
http://www.imdb.com/name/nm0578801/

Why or why not?
How can we tell if a page is dynamic?
What about "spider traps"?
What do Google and other search engines do?


More implementation issues


Relative vs. Absolute URLs
The crawler must translate relative URLs into absolute URLs
Need to obtain the base URL from the HTTP header, or an HTML meta tag, or else the current page path by default
Examples:
Base: http://www.cnn.com/linkto/
Relative URL: intl.html → Absolute URL: http://www.cnn.com/linkto/intl.html
Relative URL: /US/ → Absolute URL: http://www.cnn.com/US/
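A sketch of this translation with the URI module, reusing the examples above:

use URI;

my $base = 'http://www.cnn.com/linkto/';
print URI->new_abs('intl.html', $base), "\n";   # http://www.cnn.com/linkto/intl.html
print URI->new_abs('/US/', $base), "\n";        # http://www.cnn.com/US/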


More implementation issues


URL canonicalization
All of these:
http://www.cnn.com/TECH
http://WWW.CNN.COM/TECH/
http://www.cnn.com:80/TECH/
http://www.cnn.com/bogus/../TECH/

Are really equivalent to this canonical form:


http://www.cnn.com/TECH/

In order to avoid duplication, the crawler must transform all URLs into a canonical form
The definition of "canonical" is arbitrary, e.g.:
Could always include the port
Or only include the port when it is not the default :80


More on Canonical URLs


Some transformations are trivial, for example:
http://informatics.indiana.edu → http://informatics.indiana.edu/
http://informatics.indiana.edu/index.html#fragment → http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/ → http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/%7Efil/ → http://informatics.indiana.edu/~fil/
http://INFORMATICS.INDIANA.EDU/fil/ → http://informatics.indiana.edu/fil/
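A sketch of a canonicalizer built on the URI module; its canonical method handles case and default-port normalization, while the fragment removal and the simplified dot-segment handling below are illustrative additions:

use URI;

sub canonicalize {
    my ($url) = @_;
    my $u = URI->new($url)->canonical;   # lowercase scheme/host, drop default port
    $u->fragment(undef);                 # remove #fragment
    # collapse "." and ".." path segments (simplified; ignores corner cases)
    my @segs;
    for my $s (split m{/}, $u->path, -1) {
        if    ($s eq '.')  { next }
        elsif ($s eq '..') { pop @segs if @segs > 1 }
        else               { push @segs, $s }
    }
    $u->path(join '/', @segs);
    return $u->as_string;
}

print canonicalize('http://WWW.CNN.COM:80/bogus/../TECH/#top'), "\n";
# expected: http://www.cnn.com/TECH/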


More on Canonical URLs


Other transformations require heuristic assumptions about the intentions of the author or the configuration of the Web server:

1. Removing the default file name
http://informatics.indiana.edu/fil/index.html → http://informatics.indiana.edu/fil/
This is reasonable in general, but would be wrong in this case because the default happens to be default.asp instead of index.html

2. Trailing directory
http://informatics.indiana.edu/fil → http://informatics.indiana.edu/fil/
This is correct in this case, but how can we be sure in general that there isn't a file named fil in the root dir?


More implementation issues


Spider traps
Misleading sites: an indefinite number of pages dynamically generated by CGI scripts
Paths of arbitrary depth created using soft directory links and path-rewriting features in the HTTP server
Only heuristic defensive measures (see the sketch below):
Check URL length; assume a spider trap above some threshold, for example 128 characters
Watch for sites with a very large number of URLs
Eliminate URLs with non-textual data types
May disable crawling of dynamic pages, if they can be detected
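A sketch of such heuristic filters; the length threshold, the per-site cap, and the list of skipped extensions are illustrative choices, not values from the slides:

# return true if a URL looks safe to add to the frontier
my %urls_per_site;
sub url_passes_filters {
    my ($url) = @_;
    return 0 if length($url) > 128;                          # suspiciously long URL
    return 0 if $url =~ /\.(jpe?g|gif|png|zip|exe|mp3)$/i;   # non-textual data types
    my ($host) = $url =~ m{^https?://([^/]+)}i;
    return 0 unless defined $host;
    return 0 if ++$urls_per_site{$host} > 10_000;            # too many URLs from one site
    return 1;
}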


More implementation issues


Page repository
Naïve: store each page as a separate file
Can map each URL to a unique filename using a hashing function, e.g. MD5
This generates a huge number of files, which is inefficient from the storage perspective

Better: combine many pages into a single large file, using some XML markup to separate and identify them
Must map URL to {filename, page_id}

Database options
Any RDBMS: large overhead
Light-weight, embedded databases such as Berkeley DB
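A sketch of the naïve URL-to-filename mapping using the core Digest::MD5 module; the repository directory name is just a placeholder:

use Digest::MD5 qw(md5_hex);

# map a URL to a file path inside the page repository
sub page_path {
    my ($url) = @_;
    return 'repository/' . md5_hex($url) . '.html';   # 32 hex chars, effectively unique per URL
}

sub store_page {
    my ($url, $content) = @_;
    open my $fh, '>', page_path($url) or die "cannot write page: $!";
    print {$fh} $content;
    close $fh;
}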


Concurrency
A crawler incurs several delays:
Resolving the host name in the URL to an IP address using DNS
Connecting a socket to the server and sending the request
Receiving the requested page in response

Solution: Overlap the above delays by fetching many pages concurrently


Architecture of a concurrent crawler


Concurrent crawlers
Can use multi-processing or multi-threading (see the sketch below)
Each process or thread works like a sequential crawler, except that they share data structures: frontier and repository
Shared data structures must be synchronized (locked for concurrent writes)
Speedups by a factor of 5-10 are easy to obtain this way
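A minimal multi-threaded sketch using Perl's threads and Thread::Queue modules; it reuses the fetch, add_to_index, extract_links and process helpers and the seed file assumed earlier, and the worker count and idle timeout are illustrative:

use threads;
use Thread::Queue;

my $frontier = Thread::Queue->new();      # shared, synchronized queue
$frontier->enqueue(read_seeds($file));
my $n_workers = 5;

sub worker {
    # dequeue_timed returns undef after 10 idle seconds, ending the worker
    while (defined(my $url = $frontier->dequeue_timed(10))) {
        my $page = fetch($url) or next;
        add_to_index($page);              # must itself be thread-safe
        $frontier->enqueue(process(extract_links($page, $url)));
    }
}

my @workers = map { threads->create(\&worker) } 1 .. $n_workers;
$_->join for @workers;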


Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts


Universal crawlers
Support universal search engines
Large-scale
The huge cost (network bandwidth) of a crawl is amortized over many queries from users
Incremental updates to existing index and other data repositories


Large-scale universal crawlers


Two major issues:

1. Performance
Need to scale up to billions of pages

2. Policy
Need to trade off coverage, freshness, and bias (e.g. toward "important" pages)


Large-scale crawlers: scalability


Need to minimize the overhead of DNS lookups
Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
Use asynchronous sockets:
Multi-processing or multi-threading do not scale up to billions of pages
Non-blocking: hundreds of network connections open simultaneously
Polling sockets to monitor completion of network transfers


High-level architecture of a scalable universal crawler

[Figure annotations: huge farm of crawl machines; several parallel queues to spread load across servers (keep connections alive); DNS server using UDP (less overhead than TCP), a large persistent in-memory cache, and prefetching; optimize use of network bandwidth; optimize disk I/O throughput]


Universal crawlers: Policy


Coverage
New pages get added all the time
Can the crawler find every page?

Freshness
Pages change over time, get removed, etc.
How frequently can a crawler revisit?

Trade-off!
Focus on the most important pages (crawler bias)?
Importance is subjective


Web coverage by search engine crawlers


[Chart: estimated Web coverage by search engine crawlers, 1997-2000; reported estimates range from 16% to 50%]

This assumes we know the size of the entire Web. Do we?
Can you define "the size of the Web"?

Maintaining a fresh collection


Universal crawlers are never done
High variance in rate and amount of page changes
HTTP headers are notoriously unreliable:
Last-Modified
Expires

Solution
Estimate the probability that a previously visited page has changed in the meanwhile
Prioritize revisits by this probability estimate (see the sketch below)
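A hedged sketch of one way to turn this into a revisit priority, assuming a Poisson change model (a common modeling choice in this literature, not something the slides prescribe); the URLs, rates, and ages below are made up:

# probability that a page changed since its last crawl, under an
# assumed Poisson model with rate $lambda changes per day
sub change_probability {
    my ($lambda, $days_since_visit) = @_;
    return 1 - exp(-$lambda * $days_since_visit);
}

my @visited = (
    { url => 'http://example.org/news',  lambda => 2.0,  age => 1  },
    { url => 'http://example.org/about', lambda => 0.01, age => 30 },
);

# revisit the pages most likely to have changed first
my @by_priority = sort {
    change_probability($b->{lambda}, $b->{age})
        <=> change_probability($a->{lambda}, $a->{age})
} @visited;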


Estimating page change rates


Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
Brewington & Cybenko; Cho, Garcia-Molina & Page

Assumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)
Frequency of change is not a good predictor
Degree of change is a better predictor


Do we need to crawl the entire Web?


If we cover too much, it will get stale
There is an abundance of pages in the Web
For PageRank, pages with very low prestige are largely useless
What is the goal?
General search engines: pages with high prestige
News portals: pages that change often
Vertical portals: pages on some topic

What are appropriate priority measures in these cases? Approximations?


Breadth-first crawlers
A BF crawler tends to crawl high-PageRank pages very early
Therefore, a BF crawler is a good baseline to gauge other crawlers
But why is this so? (Najork and Wiener 2001)

Bias of breadth-first crawlers


The structure of the Web graph is very different from a random network
Power-law distribution of indegree
Therefore there are hub pages with very high PageRank and many incoming links
These are attractors: you cannot avoid them!

Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts


Crawler ethics and conflicts


Crawlers can cause trouble, even unwillingly, if not properly designed to be polite and ethical
For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
Server administrators and users will be upset
The crawler developer's/admin's IP address may be blacklisted


Crawler etiquette (important!)


Identify yourself
Use the User-Agent HTTP header to identify the crawler and point to a website with a description of the crawler and contact information for its developer
Use the From HTTP header to specify the crawler developer's email
Do not disguise the crawler as a browser by using a browser's User-Agent string

Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
redirection loops
spider traps
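A sketch of the identification step with LWP::UserAgent; the crawler name, URL, and email address are placeholders:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# identify the crawler and point to a page describing it
$ua->agent('ExampleCrawler/0.1 (+http://example.org/crawler.html)');
# contact address of the crawler's operator
$ua->from('crawler-admin@example.org');

my $res = $ua->get('http://example.org/');
warn 'request failed: ' . $res->status_line unless $res->is_success;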


Crawler etiquette (important!)


Spread the load; do not overwhelm a server
Make sure to send no more than some maximum number of requests to any single server per unit time, say < 1/second

Honor the Robot Exclusion Protocol


A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named robots.txt placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
A crawler should always check, parse, and obey this file before sending any requests to a server
More info at:
http://www.google.com/robots.txt
http://www.robotstxt.org/wc/exclusion.html


More on robot exclusion


Make sure URLs are in canonical form before checking them against robots.txt
Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
Let's look at some examples to understand the protocol (a code sketch follows the examples)


www.apple.com/robots.txt
# robots.txt for http://www.apple.com/
User-agent: *
Disallow:

All crawlers can go anywhere!


www.microsoft.com/robots.txt
# Robots.txt file for http://www.microsoft.com

All crawlers are not allowed in these paths:

User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
#etc


www.springer.com/robots.txt
# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

Google's crawler is allowed everywhere except these paths
Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down
AltaVista (scooter) has no limits
Everyone else: keep off!
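Putting the protocol and the caching advice above into practice, here is a sketch using the WWW::RobotRules module that ships with LWP; the crawler name is a placeholder, and springer.com is reused from the example above:

use LWP::Simple qw(get);
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('ExampleCrawler/0.1');   # matches our User-Agent

# fetch and parse robots.txt once per host; the object caches the rules
my $robots_url = 'http://www.springer.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# consult the cached rules before every request to that host
for my $url ('http://www.springer.com/', 'http://www.springer.com/uk/catalog') {
    printf "%s => %s\n", $url, $rules->allowed($url) ? 'allowed' : 'disallowed';
}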


More crawler ethics issues


Is compliance with robot exclusion a matter of law?
No! Compliance is voluntary, but if you do not comply, you may be blocked
Someone (unsuccessfully) sued the Internet Archive over a robots.txt-related issue

Some crawlers disguise themselves


Using a false User-Agent
Randomizing access frequency to look like a human/browser
Example: click fraud for ads


More crawler ethics issues


Servers can disguise themselves, too
Cloaking: present different content based on the User-Agent
E.g. stuff keywords into the version of the page shown to the search engine crawler
Search engines do not look kindly on this type of spamdexing and remove sites that perform such abuse from their index
Case of bmw.de made the news


Gray areas for crawler ethics


If you write a crawler that unwillingly follows links to ads, are you just being careless, or are you violating the terms of service, or are you violating the law by defrauding advertisers?
Is non-compliance with Google's robots.txt in this case equivalent to click fraud?

If you write a browser extension that performs some useful service, should you comply with robot exclusion?

