Ms. Poonam Sinai Kenkre
[Architecture diagram: URL Frontier → DNS → Fetch (www) → Parse → Content Seen? → URL Filter → Dup URL Elim]
URL Frontier: contains the URLs yet to be fetched
in the current crawl. At first, a seed set is stored
in the URL Frontier, and the crawler begins by taking
a URL from the seed set.
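A minimal sketch of a frontier as a FIFO queue, in Python; the seed URLs below are placeholders chosen only for illustration:

```python
from collections import deque

# Hypothetical seed URLs, used purely for illustration.
seed_set = ["https://en.wikipedia.org/wiki/Main_Page",
            "https://www.example.com/"]

# URL frontier: URLs discovered but not yet fetched (FIFO order).
frontier = deque(seed_set)

# The crawler begins by taking a URL from the frontier.
url = frontier.popleft()
# Newly extracted links are later appended back onto the frontier:
# frontier.append(new_url)
```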
DNS: domain name resolution. Look up the IP
address for a domain name.
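A one-line sketch of the DNS step using Python's standard library (the host name is only an example):

```python
import socket

# Resolve a domain name to an IPv4 address (example host).
ip_address = socket.gethostbyname("en.wikipedia.org")
print(ip_address)
```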
Fetch: generally uses the HTTP protocol to fetch
the page at the URL.
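A sketch of the fetch step with Python's standard library; the URL and timeout are illustrative assumptions:

```python
import urllib.request

# Fetch a page over HTTP(S); URL chosen only as an example.
with urllib.request.urlopen("https://en.wikipedia.org/wiki/Main_Page",
                            timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
```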
Parse: the page is parsed. Text, media (images,
videos, etc.), and links are extracted.
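A sketch of link extraction with Python's built-in HTML parser, fed here with the Wikipedia anchor shown later in the URL Filter example:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="/wiki/Wikipedia:General_disclaimer" '
            'title="Wikipedia:General disclaimer">Disclaimers</a>')
print(parser.links)   # ['/wiki/Wikipedia:General_disclaimer']
```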
Content Seen?: test whether a web page
with the same content has already been seen
at another URL. This requires a way to compute
a fingerprint of a web page's content.
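One possible sketch of such a fingerprint check, using a plain SHA-256 hash (an assumption for illustration; it only catches byte-identical pages, whereas production crawlers typically use shingling or SimHash for near-duplicates):

```python
import hashlib

seen_fingerprints = set()   # fingerprints of pages crawled so far

def content_seen(page_text):
    """Return True if a page with identical content was already seen."""
    fingerprint = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```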
URL Filter:
Decides whether an extracted URL should be
excluded from the frontier (e.g., disallowed by
robots.txt).
URLs should also be normalized: relative links
are resolved against the URL of the page on
which they appear.
For example, on en.wikipedia.org/wiki/Main_Page
the link
<a href="/wiki/Wikipedia:General_disclaimer"
title="Wikipedia:General
disclaimer">Disclaimers</a>
is relative and must be expanded to an absolute URL.
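A sketch of normalizing that relative link against its base page with Python's urllib.parse:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"
href = "/wiki/Wikipedia:General_disclaimer"   # relative link from the page

absolute = urljoin(base, href)
# -> 'https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer'
```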
Dup URL Elim: the URL is checked against the set
of URLs already in the frontier (or already
crawled), and duplicates are eliminated.
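A minimal sketch of duplicate-URL elimination with a seen-set, reusing the frontier queue from the earlier sketch:

```python
seen_urls = set()   # every URL ever added to the frontier

def add_url(url, frontier):
    """Append url to the frontier only if it has not been seen before."""
    if url not in seen_urls:
        seen_urls.add(url)
        frontier.append(url)
```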
What is a web crawler?
Why is web crawler required?
How does web crawler work?
Crawling strategies
Breadth first search traversal
Depth first search traversal
Architecture of web crawler
Crawling policies
Distributed crawling
Selection Policy that states which pages to
download.
Re-visit Policy that states when to check for
changes to the pages.
Politeness Policy that states how to avoid
overloading Web sites.
Parallelization Policy that states how to
coordinate distributed Web crawlers.
Search engines cover only a fraction of the Internet.
This requires downloading the most relevant pages,
hence a good selection policy is very important.
Common Selection policies:
Restricting followed links
Path-ascending crawling
Focused crawling
Crawling the Deep Web
The Web is dynamic; crawling takes a long time.
Cost factors play an important role in crawling.
Freshness and age are commonly used cost functions.
Objective of the crawler: high average freshness
and low average age of web pages.
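One common formalization of these two cost functions, shown here as a sketch (following Cho and Garcia-Molina; the symbol m_p, the modification time of the earliest un-fetched change to page p, is notation introduced only for this sketch):

```latex
% Freshness of the local copy of page p at time t
F_p(t) = \begin{cases}
           1 & \text{if the local copy of } p \text{ is up to date at } t \\
           0 & \text{otherwise}
         \end{cases}

% Age of the local copy of page p at time t
A_p(t) = \begin{cases}
           0       & \text{if the local copy of } p \text{ is up to date at } t \\
           t - m_p & \text{otherwise}
         \end{cases}
```

The crawler then tries to maximize the time-averaged freshness and minimize the time-averaged age over all pages in its repository.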
Two re-visit policies:
Uniform policy
Proportional policy
Crawlers can have a crippling impact on the
overall performance of a site.
The costs of using Web crawlers include:
Network resources
Server overload
Server/router crashes
Network and server disruption
A partial solution to these problems is the robots
exclusion protocol.
How to control those robots!
Web sites and pages can specify that robots
should not crawl/index certain areas.
Two components:
Robots Exclusion Protocol (robots.txt): Site wide
specification of excluded directories.
Robots META Tag: Individual document tag to
exclude indexing or following links.
Site administrator puts a “robots.txt” file at
the root of the host’s web directory.
http://www.ebay.com/robots.txt
http://www.cnn.com/robots.txt
http://clgiles.ist.psu.edu/robots.txt
File is a list of excluded directories for a
given robot (user-agent).
Exclude all robots from the entire site:
User-agent: *
Disallow: /
Newer Allow: directive permits exceptions to a
broad Disallow, for example:
User-agent: *
Disallow: /
Allow: /public/
Use blank lines only to separate the disallow
records for different User-agents.
One directory per “Disallow” line.
No regex (regular expression) patterns in
directories.
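A sketch of honoring robots.txt with Python's standard-library parser; the crawler name and the path being checked are placeholders, while the robots.txt URL is one of the examples above:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.cnn.com/robots.txt")   # example file from above
rp.read()                                     # fetch and parse the file

# Ask whether a given user-agent may fetch a given URL.
allowed = rp.can_fetch("MyCrawler", "http://www.cnn.com/weather/")
```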
The crawler runs multiple processes in parallel.
The goal is:
To maximize the download rate.
To minimize the overhead from parallelization.
To avoid repeated downloads of the same page.
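A minimal sketch of parallel fetching with a shared seen-set, which addresses the third goal (avoiding repeated downloads); the worker count is an assumption, and politeness delays and error handling are omitted:

```python
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor

seen_urls = set()              # URLs already handed to a worker
seen_lock = threading.Lock()   # guards the shared set

def fetch(url):
    """Download one page; runs in a worker thread."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

def schedule(urls, max_workers=8):
    """Fetch many URLs in parallel, skipping duplicates."""
    to_fetch = []
    with seen_lock:
        for url in urls:
            if url not in seen_urls:   # avoid repeated downloads
                seen_urls.add(url)
                to_fetch.append(url)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, to_fetch))
```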