Lecture 3
Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Many names
Crawler
Spider
Robot (or bot)
Web agent
Wanderer, worm, ...
And famous instances: googlebot, scooter, slurp, msnbot, ...
[Figure: taxonomy of crawlers, including focused crawlers, topical crawlers, and static crawlers, with strategies such as best-first and PageRank]
Outline
Motivation and taxonomy of crawlers Basic crawlers and implementation issues Universal crawlers Crawler ethics and conflicts
Basic crawlers
This is a sequential crawler.
Seeds can be any list of starting URLs.
The order of page visits is determined by the frontier data structure.
The stop criterion can be anything.
A workable example
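Since the original example is not reproduced here, the following is a minimal sketch of such a sequential crawler in Python. The FIFO frontier, the standard-library HTML parsing, and the stop criterion (a fixed page budget) are illustrative assumptions, not the original code:

# Minimal sequential crawler sketch (illustrative, not the original example).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects href attributes from anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)        # order of visits is determined by this data structure
    visited = set()                # lookup table of fetched URLs
    while frontier and len(visited) < max_pages:   # stop criterion: page budget
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip pages that fail to download
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links:
            frontier.append(urljoin(url, link))    # resolve relative links against the page URL
    return visited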
Implementation issues
We don't want to fetch the same page twice!
Keep a lookup table (hash) of visited pages. But what if a page has not been visited yet but is already in the frontier? (See the sketch below.)
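One hedged way to handle both cases, assuming a set of visited URLs plus a second set mirroring the frontier's contents (the names and the list-based frontier are illustrative):

# Sketch: avoid re-fetching and avoid re-enqueuing the same URL.
visited = set()        # URLs already fetched
in_frontier = set()    # URLs currently waiting in the frontier

def maybe_enqueue(url, frontier):
    # Enqueue url only if it has been neither fetched nor already queued.
    if url not in visited and url not in in_frontier:
        frontier.append(url)
        in_frontier.add(url)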
Must pay attention to HTML entities and Unicode in the text.
What to do with a growing number of other formats? Flash, SVG, RSS, AJAX, ...
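A small sketch of the HTML-entity and Unicode point, using only the Python standard library (the sample byte string is an invented illustration):

# Sketch: decode bytes and resolve HTML entities before indexing text.
import html

raw_bytes = b"Fran&ccedil;ois &amp; the caf&eacute; \xe2\x98\x95"
text = raw_bytes.decode("utf-8", errors="replace")   # handle Unicode in the raw byte stream
text = html.unescape(text)                           # resolve &amp;, &ccedil;, &eacute;, ...
print(text)   # François & the café ☕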
Porter has also developed Snowball, a language to create stemming algorithms in any language.
http://snowball.tartarus.org/
Example Perl modules: Lingua::Stem and Lingua::Stem::Snowball
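The Perl modules above are the ones named on the slide; as an assumed alternative in Python, the NLTK package exposes the same Snowball stemmers:

# Sketch using NLTK's Snowball stemmers (assumes the nltk package is installed).
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("crawling"))      # crawl
print(stemmer.stem("connections"))   # connect

# The same interface covers the other languages Snowball supports, e.g.:
SnowballStemmer("italian").stem("camminando")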
Example: the same page reachable via a dynamic URL and a static URL:
http://www.imdb.com/Name?Menczer,+Erico
http://www.imdb.com/name/nm0578801/
Should a crawler fetch dynamic pages? Why or why not? How can we tell if a page is dynamic? What about spider traps? What do Google and other search engines do?
In order to avoid duplicates, the crawler must transform all URLs into a canonical form.
The definition of "canonical" is arbitrary, e.g.: one could always include the port, or only include the port when it is not the default :80.
1. Guessing the directory index file (e.g., assuming index.html): this is reasonable in general, but would be wrong in this case because the default happens to be default.asp instead of index.html.
2. Trailing directory slash:
http://informatics.indiana.edu/fil
http://informatics.indiana.edu/fil/
Adding the slash is correct in this case, but how can we be sure in general that there isn't a file named fil in the root directory?
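A hedged sketch of a canonicalizer implementing a few such rules (lower-casing the scheme and host, dropping the default port, and using "/" for an empty path); as noted above, the exact rule set is an arbitrary design choice:

# Sketch of URL canonicalization; the rule set is a design choice.
from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Only include the port when it is not the default :80 for http.
    if parts.port and not (parts.scheme.lower() == "http" and parts.port == 80):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"          # an empty path becomes "/"
    return urlunparse((parts.scheme.lower(), host, path, "", parts.query, ""))

print(canonicalize("HTTP://Informatics.Indiana.edu:80/fil"))
# http://informatics.indiana.edu/fil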
Better: combine many pages into a single large file, using some XML markup to separate and identify them
Must map URL to {filename, page_id}
Database options:
Any RDBMS: large overhead
Light-weight, embedded databases such as Berkeley DB
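A minimal sketch of the "many pages in one big file" idea, with an in-memory dict as the URL-to-{filename, page_id} map; in practice the map would live in an embedded database such as Berkeley DB, and the element names and file name here are assumptions:

# Sketch: append pages to one large XML-marked file and index them by URL.
from xml.sax.saxutils import escape, quoteattr

index = {}   # url -> (filename, page_id); in practice an embedded database such as Berkeley DB

def store_page(url, content, page_id, filename="pages-0001.xml"):
    # Append one crawled page to a large shared file and remember where it went.
    with open(filename, "a", encoding="utf-8") as f:
        f.write(f"<page id=\"{page_id}\" url={quoteattr(url)}>\n")
        f.write(escape(content))
        f.write("\n</page>\n")
    index[url] = (filename, page_id)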
Concurrency
A crawler incurs several delays:
Resolving the host name in the URL to an IP address using DNS
Connecting a socket to the server and sending the request
Receiving the requested page in response
Concurrent crawlers
Can use multi-processing or multi-threading.
Each process or thread works like a sequential crawler, except that they share data structures: the frontier and the repository.
Shared data structures must be synchronized (locked for concurrent writes).
Speedups by a factor of 5-10 are easy to obtain this way.
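A hedged multi-threaded sketch: each thread runs the same fetch loop, the frontier is a synchronized queue, and a lock guards the shared visited set and repository. The seed URL and thread count are placeholders:

# Sketch of a multi-threaded crawler with shared, synchronized data structures.
import threading
from queue import Queue
from urllib.request import urlopen

frontier = Queue()        # shared frontier; Queue handles its own locking
visited = set()           # shared lookup table
repository = {}           # url -> page content
lock = threading.Lock()   # guards visited and repository against concurrent writes

def worker():
    while True:
        url = frontier.get()       # blocks until a URL is available
        with lock:
            seen = url in visited
            if not seen:
                visited.add(url)
        if not seen:
            try:
                page = urlopen(url, timeout=10).read()
                with lock:
                    repository[url] = page
                # ...parse the page and frontier.put(...) newly found URLs here...
            except Exception:
                pass
        frontier.task_done()

for _ in range(8):                    # each thread works like a sequential crawler
    threading.Thread(target=worker, daemon=True).start()

frontier.put("http://example.com/")   # seed
frontier.join()                       # wait until all queued URLs are processed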
Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts
Universal crawlers
Support universal search engines
Large-scale
Huge cost (network bandwidth) of the crawl is amortized over many queries from users
Incremental updates to the existing index and other data repositories
Two major issues:
1. Performance
2. Policy (coverage, freshness, bias)
Several parallel queues to spread the load across servers (keep connections alive)
DNS server using UDP (less overhead than TCP), a large persistent in-memory cache, and prefetching
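A rough sketch of only the in-memory DNS cache idea; a real high-performance crawler would use asynchronous UDP lookups and prefetching as described above, whereas this just memoizes the standard blocking resolver:

# Sketch: memoize DNS lookups so repeated URLs on the same host skip resolution.
import socket

dns_cache = {}   # hostname -> IP address

def resolve(hostname):
    ip = dns_cache.get(hostname)
    if ip is None:
        ip = socket.gethostbyname(hostname)   # one blocking lookup per host
        dns_cache[hostname] = ip
    return ip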
Freshness
Pages change over time, get removed, etc. How frequently can a crawler revisit them?
Trade-off!
Focus on the most important pages (crawler bias)? Importance is subjective.
This assumes we know the size of the entire Web. Do we? Can you define the size of the Web?
Solution
Estimate the probability that a previously visited page has changed in the meantime.
Prioritize by this probability estimate.
Assumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)
The frequency of change is not a good predictor; the degree of change is a better predictor.
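One hedged way to operationalize this: keep a per-page history of observed change fractions and revisit pages in order of their estimated degree of change. The scoring function below is an illustrative assumption, not the scheme from Ntoulas, Cho & Olston:

# Sketch: prioritize revisits by estimated degree of change (illustrative scoring).
import heapq

def change_score(history):
    # history: observed change fractions in [0, 1] from past revisits of one page.
    if not history:
        return 1.0                      # never revisited: assume it may well have changed
    return sum(history) / len(history)  # recent past predicts the future

def revisit_order(pages):
    # pages: dict url -> history list. Returns URLs, most-likely-changed first.
    heap = [(-change_score(h), url) for url, h in pages.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

print(revisit_order({"a.example": [0.8, 0.6], "b.example": [0.0, 0.1], "c.example": []}))
# ['c.example', 'a.example', 'b.example']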
Breadth-first crawlers
A BF crawler tends to crawl high-PageRank pages very early; therefore, a BF crawler is a good baseline against which to gauge other crawlers. But why is this so? (Intuition: important pages have many in-links, so they are reached early from almost any set of seeds.)
(Najork and Wiener 2001)
Outline
Motivation and taxonomy of crawlers
Basic crawlers and implementation issues
Universal crawlers
Crawler ethics and conflicts
Always check that HTTP requests are successful; in case of error, use the HTTP error code to determine and immediately address the problem.
Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
redirection loops
spider traps
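A hedged sketch of both points: inspect the outcome of every request and cap the number of requests sent to any one host (the cap and the helper name are arbitrary placeholders):

# Sketch: check HTTP errors and bound requests per host (limits are arbitrary).
from collections import Counter
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen

requests_per_host = Counter()
MAX_REQUESTS_PER_HOST = 1000   # crude guard against spider traps and redirection loops

def fetch(url):
    host = urlparse(url).hostname
    if requests_per_host[host] >= MAX_REQUESTS_PER_HOST:
        return None                        # stop sending requests to this server
    requests_per_host[host] += 1
    try:
        response = urlopen(url, timeout=10)
        return response.read()             # urlopen only returns on success
    except HTTPError as err:
        print(f"HTTP error {err.code} for {url}")   # use the code to decide how to react
    except URLError as err:
        print(f"Failed to reach {url}: {err.reason}")
    return None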
www.apple.com/robots.txt
# robots.txt for http://www.apple.com/
User-agent: *
Disallow:
All crawlers can go anywhere!
www.microsoft.com/robots.txt
# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc.
All crawlers (User-agent: *) are excluded from the paths listed above.
www.springer.com/robots.txt
# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /
The Google crawler is allowed everywhere except these paths.
Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down (Crawl-delay: 2).
AltaVista (scooter) has no limits.
Everyone else: keep off!
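Python's standard library can parse such files directly; a short sketch using urllib.robotparser with rules like the "slurp" and catch-all records above (the fabricated user agent "MyLittleCrawler" is illustrative):

# Sketch: honoring robots.txt with the standard-library parser.
from urllib import robotparser

rules = """\
User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: *
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)   # normally: rp.set_url("http://www.springer.com/robots.txt"); rp.read()

print(rp.can_fetch("slurp", "http://www.springer.com/anything"))         # True
print(rp.can_fetch("MyLittleCrawler", "http://www.springer.com/home"))   # False: the catch-all rule disallows /
print(rp.crawl_delay("slurp"))                                           # 2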
If you write a browser extension that performs some useful service, should you comply with robot exclusion?