Getting Started With Apache Nutch
Apache Nutch
By Akaram Siddiqui
And
Abdulbasit F Shaikh
Installing and Configuring Nutch
Figure 1 : Grab a distribution of Nutch 2.X from
http://apache.claz.org/nutch/2.2/
Figure 2 : Download HBase. You can get it from
http://archive.apache.org/dist/hbase/hbase-0.90.4/ .
Figure 3 : Extract it.
Figure 4 : Go to hbase-site.xml (/../hbase-0.90.4/conf) and modify it as below:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
  </property>
</configuration>
Figure 5 : Specify the GORA backend in nutch-site.xml
(/../apache-nutch-2.2.1/conf)
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
Figure 6 : Ensure the HBase gora-hbase dependency is available in
ivy/ivy.xml (/../apache-nutch-2.2.1/ivy)
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
(This line is commented out by default, so uncomment it.)
Figure 7 : Ensure that HBaseStore is set as the default datastore in
gora.properties (/../apache-nutch-2.2.1/conf)
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
(This line will not be there, so add it at the top.)
Figure 8 : Go to the Apache Nutch home directory (/../apache-nutch-2.2.1) and run
the command below from a terminal:
ant runtime
Figure 9 : Make sure HBase is started and working properly.
Figure 10 : To check, go to the HBase home directory (/../hbase-0.90.4) from a
terminal and type the command below:
./bin/hbase shell
If all succeeds, you will get output like this:
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
hbase(main):001:0>
Figure 11 : You should then be able to use it by going to
/../apache-nutch-2.2.1/runtime/local/bin
You should find more details in the logs at
/../apache-nutch-2.2.1/runtime/local/logs/hadoop.log.
Verify your Nutch installation
1) Go to the local directory of Apache Nutch
(/../apache-nutch-2.2.1/runtime/local) from a terminal and type the command below:
bin/nutch
If all succeeds, you will get the output below:
Usage: nutch COMMAND
..
..
..
Most commands print help when invoked w/o parameters.
2) Run the following command if you are seeing "Permission denied":
chmod +x bin/nutch
3) Set up JAVA_HOME if you are seeing "JAVA_HOME not set". On Mac, you can
run the following command or add it to ~/.bashrc:
export JAVA_HOME=<Your Java path>
Crawl your first website
1) Add your agent name in the value field of the http.agent.name property in
nutch-site.xml (/../apache-nutch-2.2.1/runtime/local/conf), for example:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
2) Go to the local directory (/../apache-nutch-2.2.1/runtime/local) of Apache
Nutch and create a directory called urls.
mkdir -p urls
3) cd urls
4) Type the command below to create seed.txt under urls/. The file will hold
one URL per line for each site you want Nutch to crawl.
touch seed.txt
5) Modify the file by adding the content below:
http://nutch.apache.org/
6) Edit the file regex-urlfilter.txt (/../apache-nutch-2.2.1/conf) and replace
# accept anything else
+.
with a regular expression matching the domain you wish to crawl. For example,
if you wish to limit the crawl to the nutch.apache.org domain, the line should
read:
+^http://([a-z0-9]*\.)*nutch.apache.org/
This will include any URL in the domain nutch.apache.org.
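As a rough illustration of how the rule in step 6 behaves, the sketch below applies the same regular expression with plain java.util.regex. The class and method names are hypothetical, and the sketch assumes the pattern is searched for inside the URL with find(); that matching mode is an assumption, and this is not Nutch's own filter code.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of Nutch.
class UrlFilterSketch {
    // The expression from regex-urlfilter.txt without the leading '+',
    // which only marks the rule as an "include" rule.
    static final Pattern INCLUDE =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Assumption: a URL is accepted when the pattern is found in it.
    static boolean accepts(String url) {
        return INCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/page",
            "http://example.com/"
        };
        for (String url : samples) {
            System.out.println(url + " accepted: " + accepts(url));
        }
    }
}
```

Note that the unescaped dots in nutch.apache.org match any character; writing nutch\.apache\.org in the rule would be stricter.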
Crawling website using the crawl script
P.S : I have tested this with Solr 3.6.2. If you want to run it with a higher
version, then you need to configure it accordingly.
1) Download Solr from http://apache.mirrors.hoobly.com/lucene/solr/3.6.2/
2) Extract it.
3) Go to the example directory (/../apache-solr-3.6.2/example) from a terminal.
4) Type the command below:
java -jar start.jar
If all succeeds, you will get the output below:
...
...
...
5948 [main] INFO org.eclipse.jetty.server.AbstractConnector - Started
SocketConnector@0.0.0.0:8983
5) Verify the Solr installation by opening the URL below in a browser:
http://localhost:8983/solr/admin/
6) We now have both Nutch and Solr installed and set up correctly, and Nutch
has already created crawl data from the seed URL(s). Below are the steps to
delegate searching to Solr so that links become searchable:
7) cp /../apache-nutch-2.2.1/conf/schema.xml /../apache-solr-3.6.2/example/solr/conf/
8) Restart Solr with the command "java -jar start.jar" under
/../apache-solr-3.6.2/example
9) Go to the home directory of HBase (/../hbase-0.90.4) from a terminal and start
HBase with the command below:
./bin/start-hbase.sh
If all succeeds, you will get the output below:
starting Master, logging to logs/hbase-user-master-example.org.out
If you get the output below, HBase is already started and there is no need to
start it.
master running as process 2948. Stop it first.
10) Go to the local directory (/../apache-nutch-2.2.1/runtime/local) from a
terminal and type the command below:
bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
If all succeeds, you will get the output below:
...
...
...
Adding 2 documents
SOLR dedup -> http://localhost:8983/solr/
The crawl script has a lot of parameters set, and you can modify them to suit
your needs. It would be ideal to understand the parameters before setting up
big crawls.
Crawling the web, the CrawlDb, and URL filters
Crawling the web is already explained above. You can add more URLs to the
seed.txt file and crawl them in the same way.
Crawling the CrawlDb is done automatically by the crawl script, as shown
above. Previously we had to do it manually, but the Apache Nutch developers
replaced those manual steps with the crawl script. Below are the steps that the
crawl script follows when crawling the CrawlDb.
1) Generate : $bin/nutch generate $commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId $CRAWL_ID -batchId $batchId
2) Fetch : $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 50
3) Parse : $bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId $CRAWL_ID
4) Update : $bin/nutch updatedb $commonOptions -crawlId $CRAWL_ID
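To make the order of one crawl round easier to see, here is a small sketch that assembles the four command lines listed above as argument lists. The class name, method name and the sample values passed in main are hypothetical placeholders; $commonOptions and $skipRecordsOptions are left out because their contents are not shown in this guide. The sketch only builds and prints the commands and does not run Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the generate/fetch/parse/update order.
class CrawlRoundSketch {
    static List<List<String>> roundCommands(String crawlId, String batchId,
            int sizeFetchlist, int addDays, int timeLimitFetch) {
        List<List<String>> commands = new ArrayList<>();
        // 1) Generate a fetch list for this batch.
        commands.add(Arrays.asList("bin/nutch", "generate",
            "-topN", String.valueOf(sizeFetchlist), "-noNorm", "-noFilter",
            "-adddays", String.valueOf(addDays),
            "-crawlId", crawlId, "-batchId", batchId));
        // 2) Fetch the generated batch.
        commands.add(Arrays.asList("bin/nutch", "fetch",
            "-D", "fetcher.timelimit.mins=" + timeLimitFetch,
            batchId, "-crawlId", crawlId, "-threads", "50"));
        // 3) Parse the fetched batch.
        commands.add(Arrays.asList("bin/nutch", "parse",
            batchId, "-crawlId", crawlId));
        // 4) Update the database with the results.
        commands.add(Arrays.asList("bin/nutch", "updatedb",
            "-crawlId", crawlId));
        return commands;
    }

    public static void main(String[] args) {
        // All values below are made-up placeholders.
        for (List<String> command
                : roundCommands("TestCrawl", "batch-1", 50000, 0, 180)) {
            System.out.println(String.join(" ", command));
        }
    }
}
```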
URL filters are also explained above. For reference, follow the 6th step in the
"Crawl your first website" topic above.
Parsing and Parse Filters
Parsing produces the parsed text of each URL, the outlink URLs used to update
the crawldb, and the outlinks and metadata parsed from each URL.
Parsing is also done by the crawl script, as explained above. To do it manually,
you first need to execute the inject, generate and fetch commands, in that order.
Go to the local directory of apache-nutch (/../apache-nutch-2.2.1/runtime/local)
and type the commands below:
For Inject : bin/nutch inject urls (you can pass different arguments as you need)
For Generate : bin/nutch generate -topN 2 (you can pass different arguments as you need)
For Fetch : bin/nutch fetch -all (you can pass different arguments as you need)
For Parse : bin/nutch parse -all (you can pass different arguments as you need)
Parse Filters :
HtmlParseFilter -- Permits one to add additional metadata to HTML parses.
Analysis, Link analysis, and scoring
Nutch provides a link analysis program (LinkRank) that converges to stable
global scores for each URL.
WebGraph
The WebGraph program is the first job that must be run once all segments are
fetched and ready to be processed. WebGraph is found at
org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the
program's usage.
usage: WebGraph
 -help                      show this help message
 -segment <segment>         the segment(s) to use
 -webgraphdb <webgraphdb>   the web graph database to use
The WebGraph program can take multiple segments to process and requires an
output directory in which to place the completed web graph components. The
WebGraph creates three different components: an inlink database, an outlink
database, and a node database. The inlink database is a listing of each URL and
all of its inlinks. The outlink database is a listing of each URL and all of its
outlinks. The node database is a listing of each URL with node meta information,
including the number of inlinks and outlinks, and eventually the score for that
node.
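The three components described above can be pictured with a small in-memory sketch. Everything here (the class name, the edge-list input, the use of sorted maps) is an illustrative assumption; the real WebGraph job reads segments and writes its databases to the webgraphdb directory.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory picture of the inlink, outlink and node listings.
class WebGraphSketch {
    final Map<String, Set<String>> outlinks = new TreeMap<>();
    final Map<String, Set<String>> inlinks = new TreeMap<>();

    // Each edge is a {fromUrl, toUrl} pair.
    WebGraphSketch(List<String[]> edges) {
        for (String[] edge : edges) {
            outlinks.computeIfAbsent(edge[0], k -> new TreeSet<>()).add(edge[1]);
            inlinks.computeIfAbsent(edge[1], k -> new TreeSet<>()).add(edge[0]);
        }
    }

    // Node information: number of inlinks and outlinks for a URL.
    int numInlinks(String url) {
        return inlinks.getOrDefault(url, Collections.<String>emptySet()).size();
    }

    int numOutlinks(String url) {
        return outlinks.getOrDefault(url, Collections.<String>emptySet()).size();
    }

    public static void main(String[] args) {
        WebGraphSketch graph = new WebGraphSketch(Arrays.asList(
            new String[] {"A", "B"},
            new String[] {"A", "C"},
            new String[] {"B", "C"}));
        for (String url : new String[] {"A", "B", "C"}) {
            System.out.println(url + " inlinks=" + graph.numInlinks(url)
                + " outlinks=" + graph.numOutlinks(url));
        }
    }
}
```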
Loops
Once the web graph is built we can begin the process of link analysis. Loops is
an optional program that attempts to help weed out spam sites by determining
link cycles in a web graph. An example of a link cycle would be sites A, B, C, and
D, where A links to B which links to C which links to D which links back to A. This
program is computationally expensive and usually, due to time and space
requirements, can't be run on more than a three or four level depth. While it
does identify sites which appear to be spam, and those links are then discounted
in the later LinkRank program, its benefit to cost ratio is very low. It is included
in this package for completeness and because there may be a better way to
perform this function with a different algorithm. But on current large production
webgraphs, its use is discouraged. Loops is found at
org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the program's
usage.
usage: Loops
 -help                      show this help message
 -webgraphdb <webgraphdb>   the web graph database to use
LinkRank
With the web graph built we can now run LinkRank to perform an iterative link
analysis. LinkRank is a PageRank-like link analysis program that converges to
stable global scores for each URL. Similar to PageRank, the LinkRank program
starts with a common score for all URLs. It then creates a global score for each
URL based on the number of incoming links, the scores of those links, and the
number of outgoing links from the page. The process is iterative and scores tend
to converge after a given number of iterations. It differs from PageRank in
that nepotistic links, such as links internal to a website and reciprocal links
between websites, can be ignored. The number of iterations can also be
configured; by default 10 iterations are performed. Unlike the previous OPIC
scoring, the LinkRank program does not keep scores from one processing time
to another. The web graph and the link scores are recreated at each processing
run, so we don't have the problem of ever-increasing scores. LinkRank
requires the WebGraph program to have completed successfully, and it stores its
output scores for each URL in the node database of the webgraph. LinkRank is
found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of
the program's usage.
usage: LinkRank
 -help                      show this help message
 -webgraphdb <webgraphdb>   the web graph db to use
ScoreUpdater
Once the LinkRank program has been run and link analysis is completed, the
scores must be updated into the crawl database to work with the current Nutch
functionality. The ScoreUpdater program takes the scores stored in the node
database of the webgraph and updates them into the crawldb. If a URL exists in
the crawldb that doesn't exist in the webgraph, then its score is cleared in the
crawldb. The ScoreUpdater requires that the WebGraph and LinkRank programs
have both been run and requires a crawl database to update. ScoreUpdater is
found at org.apache.nutch.scoring.webgraph.ScoreUpdater. Below is a printout
of the program's usage.
usage: ScoreUpdater
 -crawldb <crawldb>         the crawldb to use
 -help                      show this help message
 -webgraphdb <webgraphdb>   the webgraphdb to use
Scoring
P.S : Apache Nutch 2.2.1 does not support this, so I have configured it with
apache-nutch-1.7. You can install apache-nutch-1.7 the same way as
apache-nutch-2.2.1.
The new scoring functionality can be found in
org.apache.nutch.scoring.webgraph. This package contains multiple programs
that build web graphs, perform a stable convergent link analysis, and update the
crawldb with those scores. To do scoring, go to the local directory of
apache-nutch (/../apache-nutch-1.7/runtime/local) from a terminal and type the
commands below:
bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb/ crawl/segments
bin/nutch fetch crawl/segments/xxxxxxxxxxxxxx/
bin/nutch updatedb crawl/crawldb/ crawl/segments/xxxxxxxxxxxxxx/
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/xxxxxxxxxxxxxx/ -webgraphdb crawl/webgraphdb
One thing to point out here is that WebGraph is meant to be used on larger web
crawls to create web graphs. By default it ignores outlinks to pages in the same
domain, including subdomains, and pages with the same hostname. It also
limits links to the same page or the same domain to one outlink per page. All
of these options are changeable through the following configuration options:
<!-- linkrank scoring properties -->
<property>
  <name>link.ignore.internal.host</name>
  <value>true</value>
  <description>Ignore outlinks to the same hostname.</description>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>true</value>
  <description>Ignore outlinks to the same domain.</description>
</property>
<property>
  <name>link.ignore.limit.page</name>
  <value>true</value>
  <description>Limit to only a single outlink to the same page.</description>
</property>
<property>
  <name>link.ignore.limit.domain</name>
  <value>true</value>
  <description>Limit to only a single outlink to the same domain.</description>
</property>
But by default, if you are only crawling pages within a domain or within a set of
subdomains, all outlinks will be ignored and you will come up with an empty
webgraph. This in turn will throw an error while processing through the
LinkRank job. The flip side is that by NOT ignoring links to the same domain/host
and by not limiting those links, the webgraph becomes much, much more dense, and
hence there are a lot more links to process, which probably won't affect
relevancy as much.
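The effect of ignoring same-host links can be illustrated with a short sketch. The class and method names are hypothetical, and the host comparison via java.net.URI is a simplification chosen for this example; it only mimics the idea behind link.ignore.internal.host and is not how Nutch implements it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter showing why a single-host crawl can end up
// with no outlinks when internal links are ignored.
class OutlinkFilterSketch {
    static List<String> keptOutlinks(String fromUrl, List<String> outlinks,
            boolean ignoreInternalHost) {
        String fromHost = URI.create(fromUrl).getHost();
        List<String> kept = new ArrayList<>();
        for (String target : outlinks) {
            String targetHost = URI.create(target).getHost();
            boolean sameHost =
                fromHost != null && fromHost.equalsIgnoreCase(targetHost);
            if (ignoreInternalHost && sameHost) {
                continue; // drop links that stay on the same host
            }
            kept.add(target);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://nutch.apache.org/about.html",
            "http://nutch.apache.org/downloads.html");
        // With the option on, every link from this page is dropped.
        System.out.println(
            keptOutlinks("http://nutch.apache.org/", links, true));
        // With the option off, both links are kept.
        System.out.println(
            keptOutlinks("http://nutch.apache.org/", links, false));
    }
}
```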
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -scores -topn 1000 -webgraphdb crawl/webgraphdb/ -output crawl/webgraphdb/dump/scores
bin/nutch readdb crawl/crawldb/ -stats
--------------------------------------------------------------------------------
CrawlDb statistics start: crawl/crawldb/
Use GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 16711
retry 0: 16686
retry 1: 25
min score: 0.0
avg score: 0.022726654
max score: 0.495
status 1 (db_unfetched): 15739
status 2 (db_fetched): 677
status 3 (db_gone): 75
status 4 (db_redir_temp): 143
status 5 (db_redir_perm): 77
CrawlDb statistics: done
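The LinkRank and ScoreUpdater steps run in this section can be pictured with a small in-memory sketch. It is a simplified PageRank-style iteration, not the formula Nutch uses: the damping factor of 0.85, the starting score of 1.0 and the clear score of 0.0 are assumptions made for the example, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of iterative link scoring followed by a crawldb update.
class LinkScoreSketch {
    // Every URL starts with the same score; each iteration rebuilds the
    // scores from the inlink contributions of the previous iteration.
    static Map<String, Double> linkScores(Map<String, List<String>> outlinks,
            int iterations, double damping) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            scores.put(e.getKey(), 1.0);
            for (String target : e.getValue()) {
                scores.put(target, 1.0);
            }
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String url : scores.keySet()) {
                next.put(url, 1.0 - damping);
            }
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue;
                }
                // A page shares its score equally among its outlinks.
                double share = damping * scores.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            scores = next;
        }
        return scores;
    }

    // URLs missing from the node scores get the clear score.
    static void updateCrawlDb(Map<String, Double> crawlDb,
            Map<String, Double> nodeScores, double clearScore) {
        for (Map.Entry<String, Double> entry : crawlDb.entrySet()) {
            entry.setValue(nodeScores.getOrDefault(entry.getKey(), clearScore));
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("A", Arrays.asList("C"));
        outlinks.put("B", Arrays.asList("C"));
        outlinks.put("C", Arrays.asList("A"));
        Map<String, Double> scores = linkScores(outlinks, 10, 0.85);

        Map<String, Double> crawlDb = new HashMap<>();
        crawlDb.put("A", 1.0);
        crawlDb.put("Z", 1.0); // not in the web graph, so it is cleared
        updateCrawlDb(crawlDb, scores, 0.0);
        System.out.println(crawlDb);
    }
}
```

Because the scores are rebuilt from scratch on every run, they do not accumulate from one run to the next, which matches the behaviour described for LinkRank earlier.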
Apache Nutch plugin:
This example is for Nutch 1.4
Plugin use : I simply would like to add a new field to the index. This new field
should indicate the length of the parsed content of the respective web page and
is therefore called "pageLength".
As a first step, you need to create all the necessary new files. Let's say we call
the plugin "myPlugin". Then you need to create the new folder
$NUTCH_HOME/src/plugin/myPlugin. Next, simply copy and paste all the files
from the urlmeta plugin ($NUTCH_HOME/src/plugin/urlmeta) to the myPlugin
folder. Now, rename and delete the adequate files and directories in order to
get the following structure (you can do this within Eclipse as well as directly on
the file system):
The myPlugin directory ($NUTCH_HOME/src/plugin/myPlugin) should contain
plugin.xml, build.xml, ivy.xml and the directory
src/java/org/apache/nutch/indexer. Add the AddField.java class file inside
src/java/org/apache/nutch/indexer/.
Your plugin.xml should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myPlugin" name="Add Field to Index"
    version="1.0.0" provider-name="your name">
  <runtime>
    <library name="myPlugin.jar">
      <export name="*"/>
    </library>
  </runtime>
  <extension id="org.apache.nutch.indexer.myPlugin"
      name="Add Field to Index"
      point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="myPlugin"
        class="org.apache.nutch.indexer.AddField"/>
  </extension>
</plugin>
4. You need to change your build.xml file as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project name="myPlugin" default="jar">
  <import file="../build-plugin.xml"/>
</project>
5. Add the following code to your AddField.java class:
package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class AddField implements IndexingFilter {
  private static final Log LOG = LogFactory.getLog(AddField.class);
  private Configuration conf;

  // implements the filter method, which gives you access to important
  // objects like NutchDocument
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    String content = parse.getText();
    // adds the new field to the document
    doc.add("pageLength", content.length());
    return doc;
  }

  // Boilerplate
  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}
6. In the ./src/plugin/build.xml file, add:
<ant dir="myPlugin" target="deploy" />
In ./conf/nutch-site.xml, add the following property:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|myPlugin</value>
  <description>Added myPlugin</description>
</property>
In $SOLR_HOME/.../solr/conf/schema.xml, add the following line:
<field name="pageLength" type="long" stored="true" indexed="true"/>
In $NUTCH_HOME/conf/solrindex-mapping.xml, add the line below:
<field dest="pageLength" source="pageLength"/>
Now, as a last step, you need to build Nutch by running ant with
$NUTCH_HOME/build.xml.
Apache Nutch plugin:
This example is for Nutch 1.3
This section covers the integral components required to develop and use a
plugin. As you can see inside the $NUTCH_HOME/src/plugin directory, the
plugin folder urlmeta contains the following:
A plugin.xml file that tells Nutch about your plugin.
A build.xml file that tells ant how to build your plugin.
An ivy.xml file containing either the description of the dependencies of a
module, its published artifacts and its configurations, or else the location of
another file which does specify this information.
A /src directory containing the source code of our plugin with the directory
structure shown in the hierarchical view below.
plugin.xml
build.xml
ivy.xml
src/java/org/apache/nutch/indexer/
    package.html
    URLMetaIndexingFilter.java
src/java/org/apache/nutch/scoring/
    package.html
    URLMetaScoringFilter.java
4) Your plugin.xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
    id="urlmeta"
    name="URL Meta Indexing Filter"
    version="1.0.0"
    provider-name="sgonyea">
  <runtime>
    <library name="urlmeta.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.indexer.urlmeta"
      name="URL Meta Indexing Filter"
      point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="indexer-urlmeta"
        class="org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter"/>
  </extension>
  <extension id="org.apache.nutch.scoring.urlmeta"
      name="URL Meta Scoring Filter"
      point="org.apache.nutch.scoring.ScoringFilter">
    <implementation id="scoring-urlmeta"
        class="org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter" />
  </extension>
</plugin>
5) Your build.xml looks like this:
<?xml version="1.0"?>
<project name="recommended" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>
6) ivy.xml
This file is used to describe the dependencies of the plugin on other
libraries. It looks like this:
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team"
        url="http://nutch.apache.org"/>
    <description>
      Apache Nutch
    </description>
  </info>
  <configurations>
    <include file="${nutch.root}/ivy/ivy-configurations.xml"/>
  </configurations>
  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>
  <dependencies>
  </dependencies>
</ivy-module>
7) The Indexer Extension
This is the source code for the IndexingFilter extension. Meta tags that are
included in your crawl URLs during injection will be propagated
throughout the outlinks of those crawl URLs. This means that when you
index your URLs, the meta tags that you specified with your URLs will be
indexed alongside those URLs and can be directly queried.
package org.apache.nutch.indexer.urlmeta;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class URLMetaIndexingFilter implements IndexingFilter {

  private static final Log LOG = LogFactory
      .getLog(URLMetaIndexingFilter.class);
  private static final String CONF_PROPERTY = "urlmeta.tags";
  private static String[] urlMetaTags;
  private Configuration conf;

  /**
   * This will take the metatags that you have listed in your "urlmeta.tags"
   * property, and looks for them inside the CrawlDatum object. If they exist,
   * this will add it as an attribute inside the NutchDocument.
   *
   * @see IndexingFilter#filter
   */
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    if (conf != null)
      this.setConf(conf);

    if (urlMetaTags == null || doc == null)
      return doc;

    for (String metatag : urlMetaTags) {
      Text metadata = (Text) datum.getMetaData().get(new Text(metatag));

      if (metadata != null)
        doc.add(metatag, metadata.toString());
    }

    return doc;
  }

  /** Boilerplate */
  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;

    if (conf == null)
      return;

    urlMetaTags = conf.getStrings(CONF_PROPERTY);
  }
}
8) The Scoring Extension
The following is the code for the URLMetaScoringFilter extension. If the
document being indexed had a recommended meta tag, this extension
adds a Lucene text field to the index called "recommended" with the
content of that meta tag.
package org.apache.nutch.scoring.urlmeta;

import java.util.Collection;
import java.util.Map.Entry;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class URLMetaScoringFilter extends Configured implements
    ScoringFilter {

  private static final Log LOG =
      LogFactory.getLog(URLMetaScoringFilter.class);
  private static final String CONF_PROPERTY = "urlmeta.tags";
  private static String[] urlMetaTags;
  private Configuration conf;

  public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
      ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
      CrawlDatum adjust, int allCount) throws ScoringFilterException {
    if (urlMetaTags == null || targets == null || parseData == null)
      return adjust;

    Iterator<Entry<Text, CrawlDatum>> targetIterator = targets.iterator();

    while (targetIterator.hasNext()) {
      Entry<Text, CrawlDatum> nextTarget = targetIterator.next();

      for (String metatag : urlMetaTags) {
        String metaFromParse = parseData.getMeta(metatag);

        if (metaFromParse == null)
          continue;

        nextTarget.getValue().getMetaData().put(new Text(metatag),
            new Text(metaFromParse));
      }
    }
    return adjust;
  }

  public void passScoreBeforeParsing(Text url, CrawlDatum datum,
      Content content) {
    if (urlMetaTags == null || content == null || datum == null)
      return;

    for (String metatag : urlMetaTags) {
      Text metaFromDatum = (Text) datum.getMetaData().get(new Text(metatag));

      if (metaFromDatum == null)
        continue;

      content.getMetadata().set(metatag, metaFromDatum.toString());
    }
  }

  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    if (urlMetaTags == null || content == null || parse == null)
      return;

    for (String metatag : urlMetaTags) {
      String metaFromContent = content.getMetadata().get(metatag);

      if (metaFromContent == null)
        continue;

      parse.getData().getParseMeta().set(metatag, metaFromContent);
    }
  }

  /** Boilerplate */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    return initSort;
  }

  /** Boilerplate */
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    return initScore;
  }

  public void initialScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    return;
  }

  public void injectedScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    return;
  }

  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List inlinked) throws ScoringFilterException {
    return;
  }

  public void setConf(Configuration conf) {
    super.setConf(conf);

    if (conf == null)
      return;

    urlMetaTags = conf.getStrings(CONF_PROPERTY);
  }

  public Configuration getConf() {
    return conf;
  }
}
9) Getting Nutch to Use Your Plugin
In order to get Nutch to use your plugin, you need to edit your
conf/nutch-site.xml file and add in a block like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
10) You'll want to edit the regular expression so that it includes the
name of the urlmeta plugin.
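To check an edited expression before running a crawl, a quick stand-alone test like the one below can help. The class and method names are hypothetical, and the sketch assumes a plugin is included when its whole directory name matches the expression; that is an assumption about the matching mode, not a statement of how Nutch evaluates plugin.includes.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of a plugin.includes value.
class PluginIncludesSketch {
    // The value from the property block above, with urlmeta appended.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta");

    // Assumption: the whole plugin directory name must match.
    static boolean isIncluded(String pluginDirName) {
        return INCLUDES.matcher(pluginDirName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"urlmeta", "parse-tika", "parse-pdf", "myPlugin"};
        for (String name : names) {
            System.out.println(name + " included: " + isIncluded(name));
        }
    }
}
```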
11) Getting Ant to Compile Your Plugin
In order for ant to compile and deploy your plugin you need to edit the
src/plugin/build.xml file (NOT the build.xml in the root of your checkout
directory). You'll see a number of lines that look like the one below.
Edit this block to add a line for your plugin before the </target> tag.
<ant dir="urlmeta" target="deploy" />
Running 'ant' in the root of your checkout directory should get everything
compiled and jarred up. The next time you run a crawl, both the scoring and
indexing extensions will be used, which will enable us to search for meta tags
within our Solr index.
Run the following command from a terminal:
bin/nutch crawl ./urls/seed.txt/ -solr http://localhost:8983/solr/ -depth 3 -topN 5