Search Engines Information Retrieval in Practice: W. Bruce Croft Donald Metzler Trevor Strohman
Information Retrieval
in Practice
W. BRUCE CROFT
University of Massachusetts, Amherst
DONALD METZLER
Yahoo! Research
TREVOR STROHMAN
Google Inc.
Boston Columbus Indianapolis New York San Francisco Upper Saddle River
Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto
Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo
Contents
Crawls and Feeds 31
3.1 Deciding What to Search 31
3.2 Crawling the Web 32
3.2.1 Retrieving Web Pages 33
3.2.2 The Web Crawler 35
3.2.3 Freshness 37
3.2.4 Focused Crawling 41
3.2.5 Deep Web 41
3.2.6 Sitemaps 43
3.2.7 Distributed Crawling 44
3.3 Crawling Documents and Email 46
3.4 Document Feeds 47
3.5 The Conversion Problem 49
3.5.1 Character Encodings 50
3.6 Storing the Documents 52
3.6.1 Using a Database System 53
3.6.2 Random Access 53
3.6.3 Compression and Large Files 54
3.6.4 Update 56
3.6.5 BigTable 57
3.7 Detecting Duplicates 60
3.8 Removing Noise 63
Processing Text 75
4.1 From Words to Terms 75
4.2 Text Statistics 77
4.2.1 Vocabulary Growth 82
4.2.2 Estimating Collection and Result Set Sizes 85
4.3 Document Parsing 88
4.3.1 Overview 88
4.3.2 Tokenizing 89
4.3.3 Stopping 92
4.3.4 Stemming 93
4.3.5 Phrases and N-grams 99
4.4 Document Structure and Markup 103
4.5 Link Analysis 106
4.5.1 Anchor Text 107
4.5.2 PageRank 107
4.5.3 Link Quality 113
4.6 Information Extraction 115
4.6.1 Hidden Markov Models for Extraction 117
4.7 Internationalization 120
References 491
Index 517