Jeppiaar Institute of Technology: Department of Computer Science and Engineering
DEPARTMENT
OF
COMPUTER SCIENCE AND ENGINEERING
LECTURE NOTES
CS8080 – INFORMATION RETRIEVAL
(Regulation - 2017)
Year/Semester: IV/08
Academic Year: 2020 – 2021
Prepared by
Introduction - History of IR - Components of IR - Issues – Open source Search engine Frameworks - The
impact of the web on IR - The role of artificial intelligence (AI) in IR – IR Versus Web Search -
Components of a Search engine - Characterizing the Web
INTRODUCTION
Information
• Cookie Monster’s definition: news or facts about something.
Types of Information
• Text
• XML and structured documents
• Images
• Audio
• Video
• Source Code
• Applications/Web services
Retrieval
• “Fetch something” that’s been stored
Main objective of IR
• Provide the users with effective access to and interaction with information resources.
Goal of IR
• The goal is to search large document collections to retrieve small subsets relevant to the user’s
information need.
Purpose/role of an IR system
An information retrieval system is designed to retrieve the documents or information required by the
user community. It should make the right information available to the right user. Thus, an information retrieval
system aims at collecting and organizing information in one or more subject areas in order to provide it to the
user as soon as possible. Thus it serves as a bridge between the world of creators or generators of information
and the users of that information.
History of IR
• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law
and business documents.
– Development of the basic Boolean and vector-space models of retrieval.
– Prof. Salton and his students at Cornell University were the leading researchers in the area.
• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
– Learning to Rank
Related Areas
1. Database Management
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.
2. Library and Information Science
• Focused on the human user aspects of information retrieval (human-computer interaction, user interface,
visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of information).
Information retrieval is concerned with representing, searching, and manipulating large collections of
electronic text and other human-language data.
The user’s query is processed by a search engine, which may be running on the user’s local machine, on
a large cluster of machines in a remote geographic location, or anywhere in between. A major task of a search
engine is to maintain and manipulate an inverted index for a document collection. This index forms the
principal data structure used by the engine for searching and relevance ranking. As its basic function, an
inverted index provides a mapping between terms and the locations in the collection in which they occur.
To support relevance ranking algorithms, the search engine maintains collection statistics associated
with the index, such as the number of documents containing each term and the length of each document. In
addition the search engine usually has access to the original content of the documents in order to report
meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts queries from its
users, processes these queries, and returns ranked lists of results. To perform relevance ranking, the search
engine computes a score, sometimes called a retrieval status value (RSV), for each document. After sorting
documents according to their scores, the result list must be subjected to further processing, such as the removal
of duplicate or redundant results. For example, a Web search engine might report only one or two results from a
single host or domain, eliminating the others in favor of pages from different sources.
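The following minimal Python sketch illustrates these ideas: it builds an inverted index mapping each term to the documents (and term frequencies) in which it occurs, keeps document lengths as a simple collection statistic, and combines them into a crude retrieval status value for ranking. The toy documents, the whitespace tokenizer, and the length-normalized tf score are assumptions made only for this example, not the method of any particular engine.

# Minimal sketch of an inverted index with simple collection statistics.
from collections import defaultdict

docs = {
    1: "information retrieval is the science of searching for information",
    2: "a search engine maintains an inverted index for a document collection",
    3: "the inverted index maps terms to the documents in which they occur",
}

index = defaultdict(dict)   # term -> {doc_id: term frequency}
doc_lengths = {}            # doc_id -> number of tokens (collection statistic)

for doc_id, text in docs.items():
    tokens = text.lower().split()
    doc_lengths[doc_id] = len(tokens)
    for term in tokens:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query):
    """Score documents by summing length-normalized term frequencies of the
    matching query terms (a stand-in for a real retrieval status value)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf / doc_lengths[doc_id]
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(search("inverted index"))   # ranked list of (doc_id, score) pairs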
Traditional IRS
Three major components of Traditional IRS
1. Document subsystem
a) Acquisition
b) Representation
c) File organization
2. User subsystem
a) Problem
b) Representation
c) Query
3. Searching /Retrieval subsystem
a) Matching
b) Retrieved objects
An information retrieval system thus has three major components: the document subsystem, the user
subsystem, and the searching/retrieval subsystem. These divisions are quite broad and each one is designed to
serve one or more functions, such as:
• Analysis of documents and organization of information (creation of a document database)
• Analysis of user’s queries, preparation of a strategy to search the database
• Actual searching or matching of user queries with the database, and finally
• Retrieval of items that fully or partially match the search statement.
Acquisition (Document subsystem)
• Selection of documents & other objects from various web resources.
• Mostly text based documents
o full texts, titles, abstracts...
o but also other objects:
▪ data, statistics, images, maps, trademarks, sounds ...
• The data are collected by a web crawler and stored in a database.
Representation of documents and objects (Document subsystem)
• Indexing – many ways :
o free text terms (even in full texts)
o controlled vocabulary - thesaurus
o manual & automatic techniques.
• Abstracting; summarizing
• Bibliographic description:
o author, title, sources, date…
o metadata
• Classifying, clustering
• Organizing in fields & limits
o Basic index, additional index, limits
File organization (Document subsystem)
• Sequential
Three Big Issues in IR
1. Relevance
• It is the fundamental concept in IR.
• A relevant document contains the information that a person was looking for when she submitted a query
to the search engine.
• There are many factors that go into a person’s decision as to whether a document is relevant.
• These factors must be taken into account when designing algorithms for comparing text and ranking
documents.
• Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system, produces very poor results in terms of relevance.
To address the issue of relevance, retrieval models are used.
• A retrieval model is a formal representation of the process of matching a query and a document. It is the
basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents.
• A good retrieval model will find documents that are likely to be considered relevant by the person who
submitted the query.
• The retrieval models used in IR typically model the statistical properties of text rather than the linguistic structure. For example, the ranking algorithms are concerned more with the counts of word occurrences than with whether the word is a noun or an adjective (a small ranking sketch based on word counts follows).
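As a concrete illustration of such a statistical retrieval model, the sketch below ranks documents with the classical vector space model: tf-idf term weights and cosine similarity between the query vector and each document vector. The toy collection and the simple whitespace tokenization are assumptions made only for this example.

# Illustrative vector-space ranking: tf-idf weights and cosine similarity.
import math
from collections import Counter

docs = {
    "d1": "news about the economy and the stock market",
    "d2": "sports news football results",
    "d3": "stock market crash hits the economy",
}

N = len(docs)
tokenized = {d: text.lower().split() for d, text in docs.items()}
# document frequency of each term (number of documents containing it)
df = Counter(term for toks in tokenized.values() for term in set(toks))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df.get(t, N)) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

doc_vectors = {d: tfidf_vector(toks) for d, toks in tokenized.items()}
query_vec = tfidf_vector("economy stock market".lower().split())
ranked = sorted(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
print(ranked)   # d3 and d1 rank above the unrelated d2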
2. Evaluation
• Two of the evaluation measures are precision and recall.
Precision is the proportion of retrieved documents that are relevant.
Recall is the proportion of relevant documents that are retrieved.
Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|
Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|
• When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents, this measure can be useful. It is not suitable for large volumes of log data. (A small worked computation of both measures follows.)
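For example, with a hypothetical set of relevance judgments and a hypothetical ranked result list:

# Worked example of precision and recall for a single query,
# using made-up document identifiers.
relevant = {"d1", "d4", "d7", "d9"}          # judged relevant for the query
retrieved = ["d1", "d2", "d4", "d5", "d7"]   # result list returned by the engine

hits = [d for d in retrieved if d in relevant]
precision = len(hits) / len(retrieved)        # 3 / 5 = 0.6
recall = len(hits) / len(relevant)            # 3 / 4 = 0.75
print(f"precision={precision:.2f} recall={recall:.2f}")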
3. Emphasis on users and their information needs
• The users of a search engine are the ultimate judges of quality. This has led to numerous studies on how
people interact with search engines and in particular, to the development of techniques to help people
express their information needs.
• Text queries are often poor descriptions of what the user actually wants, compared to a request to a database system, such as a query for the balance of a bank account.
Open source
Open source software is software whose source code is available for modification or enhancement by
anyone. "Source code" is the part of software that most computer users don't ever see; it's the code computer
programmers can manipulate to change how a piece of software—a "program" or "application"—works.
Programmers who have access to a computer program's source code can improve that program by adding
features to it or fixing parts that don't always work correctly.
Advantage of open source
• The right to use the software in any way.
• There is usually no license cost; the software is free of charge.
• The source code is open and can be modified freely.
• Open standards.
• It provides higher flexibility.
Disadvantage of open source
• There is no guarantee that development will happen.
• It is sometimes difficult to know that a project exists, or what its current status is.
• There is no guaranteed follow-up development strategy.
Closed software
Closed software is a term for software whose license does not allow the release or distribution of the software’s source code. Generally, only the binaries of a computer program are distributed, and the license provides no access to the program’s source code. The source code of such programs is usually regarded as a trade secret of the company. Access to the source code by third parties commonly requires the party to sign a non-disclosure agreement.
Search Engine
A search engine is a document retrieval system designed to help find information stored in a computer
system, such as on the WWW. The search engine allows one to ask for content meeting specific criteria and
retrieves a list of items that match those criteria.
Lists of open source search engines
1. Apache Lucene
2. Sphinx
3. Whoosh
4. Carrot2
2. Sphinx
Sphinx is a full-text search engine, publicly distributed under GPL version 2. Technically, Sphinx is a standalone software package that provides fast and relevant full-text search functionality to client applications. It was specially designed to integrate well with SQL databases storing the data, and to be easily accessed by scripting languages. However, Sphinx neither depends on nor requires any specific database to function.
Applications can access the Sphinx search daemon (searchd) using any of three different access methods (a small example follows this list):
a) via Sphinx’s own implementation of the MySQL network protocol (using a small SQL subset called SphinxQL; this is the recommended way);
b) via the native search API (SphinxAPI); or
c) via a MySQL server with a pluggable storage engine (SphinxSE).
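As a rough illustration of option (a), the sketch below issues a SphinxQL query over the MySQL protocol from Python. It assumes a searchd instance already running and listening on the default SphinxQL port 9306, and an index named documents; the index name is hypothetical, and the exact columns and functions available (such as WEIGHT()) can vary between Sphinx versions.

# Hedged sketch: querying a running searchd over the MySQL protocol (SphinxQL).
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="", charset="utf8")
try:
    with conn.cursor() as cur:
        # SphinxQL is a small SQL subset; MATCH() carries the full-text query.
        cur.execute("SELECT id, WEIGHT() AS score FROM documents "
                    "WHERE MATCH('information retrieval') LIMIT 10")
        for doc_id, score in cur.fetchall():
            print(doc_id, score)
finally:
    conn.close()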
Starting from version 1.10-beta, Sphinx supports two different indexing backends:
a) "Disk" index backend - Disk indexes support online full-text index rebuilds, but online updates can
only be done on non-text (attribute) data.
b) "Realtime" (RT) index backend - RT indexes additionally allow for online full-text index updates.
Previous versions only supported disk indexes.
Data can be loaded into disk indexes using a so-called data source. Built-in sources can fetch data directly from MySQL, PostgreSQL, MSSQL, ODBC-compliant databases (Oracle, etc.), or from a pipe in TSV or a custom XML format. Adding new data source drivers (e.g. to natively support other DBMSes) is designed to be as easy as possible. RT indexes, as of 1.10-beta, can only be populated using SphinxQL. As for the name, Sphinx is an acronym officially decoded as SQL Phrase Index.
Sphinx features
Key Sphinx features are:
• high indexing and searching performance;
• advanced indexing and querying tools;
• advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY,
HAVING etc over text search results);
• proven scalability up to billions of documents, terabytes of data, and thousands of queries per second;
• easy integration with SQL and XML data sources, and SphinxQL, SphinxAPI, or SphinxSE search
interfaces;
• easy scaling with distributed searches.
To expand a bit, Sphinx:
• has high indexing speed (up to 10-15 MB/sec per core on an internal benchmark);
• has high search speed (up to 150-250 queries/sec per core against 1,000,000 documents, 1.2 GB of data on an internal benchmark);
• has high scalability (biggest known cluster indexes over 3,000,000,000 documents, and busiest one
peaks over 50,000,000 queries/day);
• provides good relevance ranking through a combination of phrase proximity ranking and statistical (BM25) ranking (a minimal BM25 scoring sketch follows this list);
• provides distributed searching capabilities;
• provides document excerpts (snippets) generation;
• provides searching from within application with SphinxQL or SphinxAPI interfaces, and from within
MySQL with pluggable SphinxSE storage engine;
• supports boolean, phrase, word proximity and other types of queries;
• supports multiple full-text fields per document (up to 32 by default);
• supports multiple additional attributes per document (i.e. groups, timestamps, etc.);
• supports stopwords;
• supports morphological word forms dictionaries;
• supports tokenizing exceptions;
• supports UTF-8 encoding;
• supports stemming (stemmers for English, Russian, Czech and Arabic are built-in; and stemmers for
French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish,
Hungarian are available by building the third-party libstemmer library);
• supports MySQL natively (all types of tables, including MyISAM, InnoDB, NDB, Archive, etc are
supported);
• supports PostgreSQL natively;
• supports ODBC compliant databases (MS SQL, Oracle, etc) natively;
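To make the statistical (BM25) part of relevance ranking concrete, here is a minimal, self-contained BM25 scorer. It is not Sphinx’s implementation; the toy collection, tokenization, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen only for illustration.

# Minimal BM25 scoring sketch.
import math
from collections import Counter

docs = {
    1: "sphinx is a full text search engine",
    2: "full text search with relevance ranking",
    3: "databases store structured data",
}
tokenized = {d: t.lower().split() for d, t in docs.items()}
N = len(docs)
avgdl = sum(len(t) for t in tokenized.values()) / N
df = Counter(term for toks in tokenized.values() for term in set(toks))

def bm25(query, doc_tokens, k1=1.2, b=0.75):
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.lower().split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        score += idf * norm
    return score

ranking = sorted(tokenized, key=lambda d: bm25("full text search", tokenized[d]), reverse=True)
print(ranking)   # documents 1 and 2 outrank the unrelated document 3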
Performance and scalability
• Indexing speed of up to 10-15 MB/sec per core and HDD.
• Searching speed of over 500 queries/sec against 1,000,000 document/1.2 GB collection using a 2-core
desktop system with 2 GB of RAM
• The biggest known installation using Sphinx, Boardreader.com, indexes 16 billion documents
• The busiest known installation, Craigslist, serves over 300,000,000 queries/day and more than 50 billion page views/month.
3.Whoosh
Whoosh was created by Matt Chaput. It started as a quick and dirty search server for the online
documentation of the Houdini 3D animation software package. Whoosh is a fast, featureful full-text indexing
and searching library implemented in pure Python. Programmers can use it to easily add search functionality to
their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your
needs exactly.
Whoosh’s features include:
• Pythonic API.
• Pure-Python. No compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval – faster than any other pure-Python, scoring, full-text search solution its author knows of.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure-Python spell-checker (as far as its author knows, the only one).
Whoosh might be useful in the following circumstances (a short usage sketch follows this list):
• Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries.
• As a research platform
• When an easy-to-use Pythonic interface is more important to you than raw speed.
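The short sketch below shows the typical Whoosh workflow of defining a schema, indexing a few documents, and running a query. The index directory name, field names, and sample documents are assumptions made only for this example.

# Minimal Whoosh indexing and searching sketch.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a",
                    content=u"This is the first document we have added.")
writer.add_document(title=u"Second document", path=u"/b",
                    content=u"The second one is even more interesting.")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("first")
    for hit in searcher.search(query):
        print(hit["title"], hit["path"])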
The Impact of the Web on IR
Tim Berners-Lee conceived the Web in 1989, tested it successfully in December of 1990, and released the first Web server early in 1991. His system was called the World Wide Web, and is now referred to simply as the Web. At that time, no one could have imagined the impact that the Web would have. The Web boom, characterized by exponential growth in the volume of data and information, implies that various daily tasks such as e-commerce, banking, research, entertainment, and personal communication can no longer be done outside the Web if convenience and low cost are expected.
The amount of textual data available on the Web is estimated in the order of petabytes. In addition, other
media, such as images, audio, and video, are also available in even greater volumes. Thus, the Web can be seen
as a very large, public and unstructured but ubiquitous data repository, which triggers the need for efficient
tools to manage, retrieve, and filter information from the Web. As a result, Web search engines have become
one of the most used tools in the Internet. Additionally, information finding is also becoming more important in
large Intranets, in which one might need to extract or infer new information to support a decision process, a task
called data mining (or Web mining for the particular case of the Web).
The very large volume of data available, combined with the fast pace of change, make the retrieval of
relevant information from the Web a really hard task. To cope with the fast pace of change, efficient crawling of
the Web has become essential. In spite of recent progress in image and non-textual data search in general, the
existing techniques do not scale up well on the Web. Thus, search on text continues to be the most popular and
most studied alternative. While most search engines are based in the United States and focus on documents in
English, there are important non-US search engines that have been designed for specific languages and can
handle various scripts and alphabets, such as Kanji or Cyrillic. Examples include Baidu in China, Yandex in
Russia, and Naver in South Korea.
The search continues to be dominated by a “syntactic” paradigm in which documents that contain user
specified words or patterns are retrieved in response to a query. An alternative approach to syntactic search is to
do a natural language analysis of the text. Techniques to preprocess natural language and extract the text
semantics have been around for a while, yet they are not very effective. In fact, except for some fast entity
extraction tools that have been devised recently, these techniques are too costly to be used with large amounts
of data. In addition, in most cases they are only effective with well written and structured text, in combination
with a thesaurus, or other contextual information such as a specific knowledge domain.
There are basically two main forms of exploring the Web.
• Issue a word-based query to a search engine that indexes a portion of the Web documents.
• Browse the Web, which can be seen as a sequential search process of following hyperlinks, as
embodied for example, in Web directories that classify selected Web documents by subject.
Additional methods exist, such as taking advantage of the hyperlink structure of the Web, yet they are
not fully available, and likely less well known and also much more complex.
A Challenging Problem
Let us now consider the main challenges posed by the Web with respect to search. We can divide them into two classes: those that relate to the data itself, which we refer to as data-centric, and those that relate to the users and their interaction with the data, which we refer to as interaction-centric.
Data-centric challenges
• Distributed data. Due to the intrinsic nature of the Web, data spans over a large number of computers
and platforms. These computers are interconnected with no predefined topology and the available
bandwidth and reliability on the network interconnections vary widely.
• High percentage of volatile data. Due to Internet dynamics, new computers and data can be added or
removed easily. To illustrate, early estimations showed that 50% of the Web changes in a few months.
Search engines are also confronted with dangling (or broken) links and relocation problems when
domain or file names change or disappear.
• Large volume of data. The fast growth of the Web poses scaling issues that are difficult to cope with,
as well as dynamic Web pages, which are in practice unbounded.
• Unstructured and redundant data. The Web is not a huge distributed hypertext system, as some might
think, because it does not follow a strict underlying conceptual model that would guarantee consistency.
Indeed, the Web is not well structured either at the global or at the individual HTML page level. HTML
pages are considered by some as semi-structured data in the best case. Moreover, a great deal of Web
data are duplicated either loosely (such as the case of news originating from the same news wire) or
strictly, via mirrors or copies. Approximately 30% of Web pages are (near) duplicates. Semantic
redundancy is probably much larger.
• Quality of data. The Web can be considered as a new publishing media. However, there is, in most
cases, no editorial process. So, data can be inaccurate, plain wrong, obsolete, invalid, poorly written or,
as is often the case, full of errors, either innocent (typos, grammatical mistakes, OCR errors, etc.) or malicious. Typos and mistakes, especially in foreign names, are quite common.
• Heterogeneous data. Data not only originates from various media types, each coming in different
formats, but it is also expressed in a variety of languages, with various alphabets and scripts (e.g. India),
which can be pretty large (e.g. Chinese or Japanese Kanji).
Many of these challenges, such as the variety of data types and poor data quality, cannot be solved by
devising better algorithms and software, and will remain a reality simply because they are problems and issues
(consider, for instance, language diversity) that are intrinsic to human nature.
Interaction-centric challenges
• Expressing a query. Human beings have needs or tasks to accomplish, which are frequently not easy to express as “queries”. Queries, even when expressed in a more natural manner, are just a reflection of the underlying information need.
The Role of Artificial Intelligence (AI) in IR
Information Retrieval
• The amount of available information is growing at an incredible rate, for example the Internet and
World Wide Web.
• Information is stored in many forms, e.g. images, text, video, and audio.
• Information Retrieval is a way to separate relevant data from irrelevant data.
• IR field has developed successful methods to deal effectively with huge amounts of information.
o Common methods include the Boolean, Vector Space and Probabilistic models.
Artificial Intelligence
• Study of how to construct intelligent machines and systems that can simulate or extend the development
of human intelligence.
Integration of Artificial Intelligence and Information Retrieval
• Both the IR and AI fields developed in parallel during the early days of computers.
• The fields of artificial intelligence and information retrieval share a common interest in developing more
capable computer systems.
• The integration of Artificial Intelligence and Information Retrieval has led to the following
development:
o Development of methods to learn user's information needs.
o Extract information based on what has been learned.
o Represent the semantics of information.
What is Intelligence?
According to Cook et al.:
• Acquisition: the ability to acquire new knowledge.
• Automatization: the ability to refine procedures for dealing with a novel situation into an efficient
functional form.
• Comprehension: the ability to know, understand, and deal with novel problems.
• Memory management: the ability to represent knowledge in memory, to map knowledge on to that
memory representation, and to access the knowledge in memory.
• Metacontrol: the ability to control various processes in intelligent behavior.
• Numeric ability: the ability to perform arithmetic operations.
• Reasoning: the ability to use problem-solving knowledge.
• Social competence: the ability to interact with and understand other people, machines or programs.
• Verbal perception: the ability to recognize natural language.
• Visual perception: the ability to recognize visual images.
What are Intelligent IR Systems?
• The concept of 'intelligent' information retrieval was first suggested in the late 1970s.
• It was not pursued by the IR community until the early 1990s.
Definitions
• An intelligent IR system can simulate the human thinking process on information processing and
intelligence activities to achieve information and knowledge storage, retrieval and reasoning, and to
provide intelligence support.
• In an Intelligent IR system, the functions of the human intermediary are performed by a program,
interacting with the human user.
• Intelligent IR is performed by a computer program (intelligent agent), which acts on (minimal or no
explicit) instructions from a human user, retrieves and presents information to the user without any other
interaction.
How to introduce AI into IR systems?
• We cannot take the “human factor” completely out of the equation.
o A program which takes a query as input, and returns documents as output, without affording the
opportunity for judgment, modification and especially interaction with text, or with the program,
is one which would not qualify as an IR system at all.
o Ignores user interaction and relevance feedback.
• Some processes cannot be performed by any component other than the user.
o “Judgment” is a process which can only be performed by the user.
• The question is, “where” should AI be introduced into the IR system?
• Levels of user and system involvement, according to Bates ‘90:
o Level 0 – No system involvement (User comes up with a tactic, formulating a query, coming up
with a strategy and thinking about the outcome)
o Level 1 – User can ask for information about searching (System suggests tactics that can be used
to formulate queries e.g. help)
o Level 2 – User simply enters a query, suggests what needs to be done, and the system executes
the query to return results.
o Level 3 – First signs of AI. System actually starts suggesting improvements to user.
o Level 4 – Full Automation. User queries are entered and the rest is done by the system.
Barriers to Intelligent IR Systems
• Common Sense Reasoning.
• Natural Language Processing.
• Knowledge Acquisition, Representation, and Maintenance.
• Difficulty in Scaling Up Prototypes to Operational Systems.
• Level of Effort, Technical Expertise, and Expense.
Some AI methods currently used in Intelligent IR Systems
• Web Crawlers (for information extraction)
• Mediator Techniques (for information integration)
• Ontologies (for intelligent information access by making semantics of information explicit and machine
readable)
• Neural Networks (for document clustering & preprocessing)
o Kohonen Neural Networks - Self Organizing maps
o Hopfield Networks
o Semantic Networks
Neural Networks in IR
• Based on Neural Networks
o Document clustering can be viewed as classification in the document × document space
o Thesaurus construction can be viewed as laying out a coordinate system in the index × index space
o Indexing itself can be viewed as mappings in the document × index space
o Searching can be conceptualized as connections and activations in the index × document space
• Applying Neural Networks to Information Retrieval will likely produce information systems that will be
able to:
o recall memories despite failed individual memory units
o modify stored information in response to new inputs from the user
o retrieve "nearest neighbor" data when no exact data match exists
o associatively recall information despite noise or missing pieces in the input
o categorize information by their associative patterns
Conclusion
• AI offers us a powerful set of tools, especially when they are combined with conventional and other
innovative computing tools. However, it is not an easy task to master those tools and employ them
skillfully to build truly significant intelligent systems.
• By recognizing the limitations of modern artificial intelligence techniques, we can establish realistic
goals for intelligent information retrieval systems and devise appropriate system development strategies.
• AI models like the neural network will probably not replace traditional IR approaches anytime soon.
However, the application of neural network models can make an IR system more powerful.
IR Versus Web Search
Aspect | Classical IR | Web IR
Volume | Large | Huge
Data quality | Clean, no duplicates | Noisy, duplicates present
Data change rate | Infrequent | In constant flux
Data accessibility | Accessible | Partially accessible
Format diversity | Homogeneous | Widely diverse
Documents | Text | HTML
No. of matches | Small | Large
IR techniques | Content based | Link based
Components of a Search Engine
A search engine is the practical application of information retrieval techniques to large-scale text
collections. Search engines come in a number of configurations that reflect the applications they are designed
for. Web search engines, such as Google and Yahoo!, must be able to capture, or crawl, many terabytes of data,
and then provide subsecond response times to millions of queries submitted every day from around the world.
Search engine components support two major functions:
1. Indexing process
2. Query process
1. Indexing process
Characterizing the Web
Characteristics
• Measuring the Internet, and in particular the Web, is a difficult task due to its highly dynamic nature.
o There were more than 750 million computers in more than 200 countries directly connected to
the Internet, many of them hosting Web servers.
o Further, the estimated number of Web servers exceeded 206 million (Netcraft Web survey of June 2010).
o Considering these numbers, we can say that there is about one Web server for every four computers directly connected to the Internet.
• How many different institutions (not Web servers) maintain Web data?
o This number is smaller than the number of servers, because many places have multiple servers.
o The exact number is unknown, but should be larger than 40% of the number of Web servers.
• More recent studies on the size of search engines estimated that there were over 20 billion pages in
2005, and that the size of the static Web is roughly doubling every eight months.
• Nowadays, the Web is infinite for practical purposes, as we can generate an infinite number of dynamic
pages (e.g. consider an on-line calendar).
• The most popular formats of Web documents are HTML, followed by GIF and JPG (both images),
ASCII text, and PDF, in that order.
• The most popular compression tools used are GNU zip, Zip, and Compress.
• There are many characteristics and statistics of HTML pages, some of which are as follows.
o First, most HTML pages are not standard, meaning that they do not comply with all the HTML
specifications.
o Second, HTML pages are small (around 10KB) and usually contain few images.
o The average page contains between 5 and 15 hyperlinks and most of them are local, that is, they
point to pages in their own Web server hierarchy. On average, the number of external pages
pointing to any given page is close to zero!
• The most referenced sites are the main Internet companies.
• In the year 2000, it was estimated that around 70% of the pages were written in English, and that the number of words available in other languages was growing faster than the number of words in English. In January 2003, Google Zeitgeist showed that around 50% of the queries to Google were using English, down from around 60% in 2001.
• In all the (bounded) studies of the Web, the CORE component is composed of a minority of the Web
sites. On the other hand, it has a heavier density of Web pages.
• From link analysis,
o There is a correlation between structure and quality of the content.
o Some studies hint that the number of ISLANDS is much larger than we may think, as most of
them are not connected to the rest of the Web and hence are not easy to find unless they are registered with the
search engines.
This is true not only for the Web graph, but also for the host-graph, which is the connectivity graph at
the level of Web sites.
Table 11.1 of the reference text summarizes the main findings. For the page size, there are two exponents: one for pages with less than 20KB, and one for the rest. The same holds for the out-degree: one exponent for pages with roughly fewer than 20 out-links, and one for pages with more out-links. For the page-size distribution, the average (μ) and standard deviation are 9.357 and 1.318, respectively.