ALLPATHS-LG Manual

Computational Research and Development Group
Genome Sequencing and Analysis Program
Broad Institute of MIT and Harvard
Cambridge, MA

Manual Revision: (09 Jan 12 1:31:00 PM)

Table of Contents
ALLPATHS-LG Manual
  Conventions
  Introduction
  Capabilities and limitations
  Staying up to date with our blog
  Requirements
  Availability
  Getting Help
  Installation
    Troubleshooting
    Environment
  ALLPATHS pipeline overview
    RunAllPathsLG module
    ALLPATHS pipeline directory structure
      REFERENCE (organism) directory
      DATA (project) directory
      RUN (assembly pre-processing) directory
      ASSEMBLIES directory
      SUBDIR (assembly) directory
  Required ALLPATHS arguments
  Preparing data for ALLPATHS
    Supported library constructions
    Read orientation
    ALLPATHS input files
      Base, quality score, and pairing information files
      The ploidy file
    Preparing ALLPATHS input files
      Accepted data file formats
      in_groups.csv file
      in_libs.csv file
      Running conversion script
    Import reference
  Running ALLPATHS in brief
    Example
    Pipeline errors
  The ALLPATHS assembly
    Assembly as a graph
    Graph features
      Repeats
      Homopolymers
      SNPs and base errors
    Flattening and Scaffolding
    Assembly Results
      fasta format
      efasta format
ALLPATHS Reference
  ALLPATHS Cache for power users
    Creating the ALLPATHS Cache
    Using the ALLPATHS Cache
  ALLPATHS compilation options
  ALLPATHS pipeline in detail
    Key Features
    Directory structure - ALLPATHS_BASE
    Targets
      Pseudo targets
      Target files
    Evaluation mode
    K-mer size, K
    Parallelization
      Cross-module parallelization
      Parallelization of individual modules
    Logging
  References

Conventions
The following conventions are used in this manual.
Commands, file names, directories and arguments are typeset in Courier.
Command-line arguments are normally split one per line for clarity, listed below the actual command.
For example:
RunAllPathsLG PRE=/assemblies DATA=datadir RUN=rundir SUBDIR=attempt1
becomes
RunAllPathsLG
PRE=/assemblies
DATA=datadir
RUN=rundir
SUBDIR=attempt1
User-supplied values are indicated by <description>. In the example below, the user should
provide a value for the target name.
TARGETS=<target name>
For example:
TARGETS=import

Introduction
ALLPATHS-LG is a whole-genome shotgun assembler that can generate high-quality genome assemblies
using short reads (~100 bp) such as those produced by the new generation of sequencers. The significant
difference between ALLPATHS and traditional assemblers such as Arachne is that ALLPATHS assemblies
are not necessarily linear, but instead are presented in the form of a graph. This graph representation
retains ambiguities, such as those arising from polymorphism, uncorrected read errors, and unresolved
repeats, thereby providing information that has been absent from previous genome assemblies.

Capabilities and limitations
ALLPATHS-LG is a short-read assembler. It has been designed to use reads produced by new sequencing
technology machines such as the Illumina Genome Analyzer. The version described here has been
optimized for, but not necessarily limited to, reads of length 100 bases.
ALLPATHS is not designed to assemble Sanger or 454 FLX reads, or a mix of these with short reads.

ALLPATHS-LG requires high sequence coverage of the genome in order to compensate for the shortness
of the reads. The precise coverage required depends on the length and quality of the paired reads, but
typically is of the order of 100x or above. This is raw read coverage, before any error correction or filtering.
For small bacterial-sized genomes, this translates to a fraction of an Illumina lane (a lane being the
minimum the machine is capable of without multiplexing). For larger genomes this translates into roughly
one Illumina HiSeq flowcell.
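The coverage arithmetic above can be sketched as follows. This is only an illustration: the function names and the 5 Mb genome size are assumptions made for the example and are not part of ALLPATHS-LG.

```python
import math


def raw_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Raw read coverage: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: int, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target coverage."""
    total_bases = target_coverage * genome_size
    return math.ceil(total_bases / read_length)


if __name__ == "__main__":
    # Hypothetical bacterial-sized genome of 5 Mb sequenced with 100-base reads.
    print(reads_needed(100, 100, 5_000_000))        # reads required for 100x
    print(raw_coverage(5_000_000, 100, 5_000_000))  # coverage those reads give
```

Note that this counts raw bases only, matching the statement that the requirement applies before error correction or filtering.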
ALLPATHS-LG requires a minimum of 2 paired-end libraries: one short and one long. The short library
average separation size must be slightly less than twice the read size, such that the reads from a pair will
likely overlap; for example, for 100-base reads the insert size should be 180 bases. The distribution of
sizes should be as narrow as possible, with a standard deviation of less than 20%. The long library insert
size should be approximately 3000 bases and can have a larger size distribution. Additional optional
longer-insert libraries can be used to help disambiguate larger repeat structures and may be generated
at lower coverage.
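As a rough sanity check of a short library against the guidance above, one might compute the expected read overlap and the relative spread of the insert sizes. The sketch below assumes that the 20% limit is relative to the mean insert size; the function names are illustrative and not part of ALLPATHS-LG.

```python
def expected_overlap(read_length: int, insert_size: int) -> int:
    """Bases shared by the two reads of a pair; zero or less means no overlap."""
    return 2 * read_length - insert_size


def short_library_ok(read_length: int, insert_size: int, insert_stddev: float,
                     max_relative_stddev: float = 0.20) -> bool:
    """True if the reads are expected to overlap and the size spread is narrow."""
    overlaps = expected_overlap(read_length, insert_size) > 0
    narrow = insert_stddev < max_relative_stddev * insert_size
    return overlaps and narrow


if __name__ == "__main__":
    # The manual's example: 100-base reads from 180-base inserts.
    print(expected_overlap(100, 180))      # 20 bases of overlap
    print(short_library_ok(100, 180, 10))  # True
```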
The libraries must be pure, that is, they must consist of reads that do not contain any non-genomic
portions from stuffers or similar constructions. Reads from jumping libraries may be chimeric, that is,
they may cross the junction point between the two ends of the insert that occurs in libraries produced
using the Illumina sheared library protocol.

Staying up to date with our blog
The best source of current news and information on ALLPATHS-LG is our blog:
http://www.broadinstitute.org/software/allpaths-lg/blog/
Here you will find announcements, an FAQ, links to the latest code, manual and test data, build
requirements and instructions, and information on how to get help from the developers of ALLPATHS-LG.
We recommend that our blog page be your starting point whenever you have problems or
questions, or are just looking for the latest version.

Requirements
To compile and run ALLPATHS-LG you will need a Linux/UNIX system with at least 16 GB of RAM. We
suggest a minimum of 32 GB for small genomes, and 512 GB for mammalian-sized genomes.
For the up-to-date list of requirements please see our General Build help here:
https://www.broadinstitute.org/science/programs/genome-biology/computational-rd/general-instructions-building-our-software
You will need:
The g++ compiler from GCC, version 4.3.2 or higher. We use version 4.4.3.
http://gcc.gnu.org/
The GMP library compiled with the C++ interface. Your GCC installation may already include GMP.
http://gmplib.org/
The Picard set of Java-based command-line utilities for SAM file manipulation, available at
http://picard.sourceforge.net/
The graph command dot from the graphviz package. We use version 2.16.1.
http://www.graphviz.org/

Availability
The ALLPATHS source code is available for download via our blog at:
http://www.broadinstitute.org/software/allpaths-lg/blog/
We do not issue official releases. Instead, please download the latest version from our nightly builds.
Only builds that pass our internal tests are made available in this way; we do not release broken builds.

Getting Help
Please first consult the FAQ available on our blog at:
http://www.broadinstitute.org/software/allpaths-lg/blog/
If you encounter difficulties that cannot be resolved using this manual or the FAQ you can contact the
ALLPATHS development team via:
[email protected]
If you get past installation and compilation and you have problems with the assembly process itself (e.g.
RunAllPathsLG crashes) we ask you to run the following from the REFERENCE directory:
HelpMe.pl SUBDIR=<DATA>/<RUN>/ASSEMBLIES/<SUBDIR>
This will generate the file help_me.tar.gz that you should attach to your email to us. Please see
the ALLPATHS pipeline overview section below for the description of REFERENCE, DATA, RUN, and
SUBDIR.

Installation
For the up-to-date build instructions please see our General Build Instructions here:
https://www.broadinstitute.org/science/programs/genome-biology/computational-rd/general-instructions-building-our-software
After you have downloaded the latest build, unpack it using tar. Then you can simply compile the
source code with configure and make. All of the source code will be in its own directory called
allpathslg-<revision>; we will refer to this as the AllPaths directory. For example, starting
from the root directory (the location of the downloaded file):
% tar xzf allpathslg-<revision>.tar                    // expand tarball
% cd allpathslg-<revision>                             // move into the source directory
% ./configure --prefix=/path/to/install/directory      // run configure
% make                                                 // build ALLPATHS-LG
% make install                                         // install ALLPATHS-LG

Troubleshooting
Of the above steps, the one most likely to fail is configure, which checks for the existence of various
commands and libraries in your environment. You may need to change your PATH or your
LD_LIBRARY_PATH. You may also need to run configure with flags. For a listing of all such
available flags, run configure --help.

Environment
After compilation, the executable binary files will be in the subdirectory bin of the
allpathslg-<revision> directory. You may want to add this directory to your PATH so that you can call the
ALLPATHS binaries from anywhere. Also modify your PATH to include the directories containing
addr2line and your chosen version of g++. You may need to change your LD_LIBRARY_PATH as
well.

ALLPATHS pipeline overview
ALLPATHS consists of a series of modules. Each module performs a step of the assembly process.
Different modules may be run, and in varying order, depending on the assembly parameters. A single
module called RunAllPathsLG controls the entire pipeline, deciding which modules to run and how
to run them. Although it is possible to run the individual modules manually, you should be able to
accomplish everything you need through RunAllPathsLG.

RunAllPathsLG module
RunAllPathsLG uses the Unix make utility to control the assembly pipeline. It does not call each
module itself, but instead creates a special makefile that does. Within RunAllPathsLG each
module is defined in terms of its source and target files, and the command line used to call it. A module
is only run if its target files don't exist, or are out of date compared to its source files, or if the command
used to call the module has changed. In this way RunAllPathsLG can be run again and again, with
different parameters, and only those modules that need to be called will be. This is efficient and ensures
that all intermediate files are always correct, regardless of how many times RunAllPathsLG has been
called on a particular set of source data and how many times a module fails or aborts part way through.
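The three conditions under which a module is rerun can be illustrated with a small make-style check. This is a sketch of the general rule only, not the makefile that RunAllPathsLG generates; the function name and the idea of passing in a remembered command string are assumptions made for the example.

```python
import os


def needs_run(sources, targets, command, last_command):
    """Return True if a module should be (re)run under the three stated rules.

    sources and targets are lists of file paths (sources are assumed to exist);
    last_command is the command line recorded the last time the module ran.
    """
    # Rule 1: at least one target file does not exist.
    if any(not os.path.exists(t) for t in targets):
        return True
    # Rule 2: a target is out of date compared to a source file.
    oldest_target = min((os.path.getmtime(t) for t in targets), default=float("inf"))
    newest_source = max((os.path.getmtime(s) for s in sources), default=0.0)
    if newest_source > oldest_target:
        return True
    # Rule 3: the command used to call the module has changed.
    return command != last_command
```

Because the decision depends only on file timestamps and the recorded command, repeating the same call after a successful run does no work, which is the behaviour the paragraph above describes.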

ALLPATHS pipeline directory structure
The assembly pipeline uses the following directory structure to store its inputs, intermediates, and
outputs. The pipeline automatically creates the directories (if they don't already exist) and populates
them. The names shown here are commonly used to refer to the directories, although command-line
arguments determine the actual directory names.
REFERENCE/DATA/RUN/ASSEMBLIES/SUBDIR
The meaning of each directory is given below. The data separation described is the ideal and
occasionally this is broken for convenience. Some files are duplicated between directories, but only in
the downward direction. All files within this directory structure are under the control of the pipeline.
The location of the pipeline directory structure is specified with the RunAllPathsLG command-line
argument PRE.
Typically in the directory PRE there will be a number of REFERENCE directories, one for each organism
being assembled by ALLPATHS.
REFERENCE (organism) directory
The REFERENCE directory is so called because there should be one for each reference genome you use.
It is used to separate assembly projects by organism and possibly also by isolate (if, for example, you
want to use two different E. coli references) and is typically named after the organism. All assembly
projects for a given organism/isolate will be contained in that REFERENCE directory. All intermediate
files generated for use in evaluation that are independent of the particular assembly attempt will be
stored here and shared by all assemblies.
You do not need to supply a reference genome; ALLPATHS is, after all, a de novo assembler. But even
in de novo assemblies, the pipeline can perform useful evaluations at various stages of the assembly
process, so you should provide a reference genome if you have one (see Import reference below for
info on how to set up this file). If you do not have a reference genome, simply create a single
REFERENCE directory for the organism you wish to assemble.
The REFERENCE directory may contain many DATA directories, each representing a particular set of
read data to assemble.
RunAllPathsLG argument: REFERENCE_NAME
DATA (project) directory
The DATA directory contains the original read data used in a particular assembly attempt. (This data is
stored in internal ALLPATHS formats: fastb, qualb, pairs.) It also contains intermediate files derived from
the original data that are independent of the particular assembly attempt, typically files used in
evaluation.
Each DATA directory may contain many RUN directories, each representing a particular attempt to
assemble the original data using a different set of parameters.
RunAllPathsLG argument: DATA_SUBDIR
RUN (assembly pre-processing) directory
The RUN directory contains all the non-localized assembly files, that is, those intermediate files
generated from the original read data in preparation for the final assembly stage (LocalizeReadsLG
and beyond). It may also contain intermediate files used in evaluation that are dependent on the
assembly parameters chosen.
RunAllPathsLG argument: RUN
ASSEMBLIES directory
The ASSEMBLIES directory contains the actual assembly (or assemblies). There is no argument for
naming this directory. It is actually named ASSEMBLIES.
SUBDIR (assembly) directory
The SUBDIR directory is where the localized assembly is generated, along with some assembly
intermediate and evaluation files.
RunAllPathsLG argument: SUBDIR

Required ALLPATHS arguments
The following command-line arguments must be supplied:
PRE: the root directory in which the ALLPATHS pipeline directory will be created.
REFERENCE_NAME: the REFERENCE (organism) directory name described previously.
DATA_SUBDIR: the DATA (project) directory name described previously.
RUN: the RUN (assembly pre-processing) directory name described previously.
SUBDIR: the SUBDIR (assembly) directory name described previously.
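To make the mapping from these five arguments to the directory structure concrete, the sketch below composes the nested paths. The function name and the example values are assumptions for illustration; the pipeline creates these directories itself.

```python
from pathlib import PurePosixPath


def pipeline_dirs(pre, reference_name, data_subdir, run, subdir):
    """Map the required arguments onto PRE/REFERENCE/DATA/RUN/ASSEMBLIES/SUBDIR."""
    reference = PurePosixPath(pre) / reference_name
    data = reference / data_subdir
    run_dir = data / run
    assemblies = run_dir / "ASSEMBLIES"  # fixed name, there is no argument for it
    return {
        "REFERENCE": reference,
        "DATA": data,
        "RUN": run_dir,
        "ASSEMBLIES": assemblies,
        "SUBDIR": assemblies / subdir,
    }


if __name__ == "__main__":
    # Hypothetical values for the five required arguments.
    for name, path in pipeline_dirs("/assemblies", "staph", "mydata",
                                    "myrun", "attempt1").items():
        print(f"{name:10s} {path}")
```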

Preparing data for ALLPATHS
Before running ALLPATHS, you must prepare your data for import into the ALLPATHS pipeline. This task
will require you to gather the read data in the appropriate formats, and then add metadata to describe
them. If you are using a reference genome for evaluation, you will need that as well. This section
describes the required data formats.

Supported library constructions
Any input data set should include at least one fragment library and one jumping library. A fragment
library is a library with a short insert separation, less than twice the read length, so that the reads may
overlap (e.g., 100 bp Illumina reads taken from 180 bp inserts). A jumping library has a longer separation,
typically in the 3 kbp to 10 kbp range, and may include sheared or EcoP15I libraries or other jumping-library
constructions; ALLPATHS can handle read chimerism in jumping libraries. Note that fragment reads
should be long enough to ensure the overlap.
Additionally, ALLPATHS supports long jumping libraries. A jumping library is considered to be long if
the insert size is larger than 20 kbp. These libraries are optional and used only to improve scaffolding in
mammalian-sized genomes. Typically, long jump coverage of less than 1x is sufficient to significantly
improve scaffolding.
ALLPATHS also accepts long unpaired reads (e.g., PacBio reads at 50x coverage), which are optional and
are used only to patch gaps in the later stages of the assembly process. Currently this is only tested for
small, bacterial-sized genomes.
No other type of library construction is supported by ALLPATHS at this point.

Read orientation
Fragment library reads are expected to be oriented towards each other (inward):

Jumping library reads are expected to be oriented away from each other (outward), as a result of the
typical jumping library construction methods:

Long jumping library reads are expected to be oriented towards each other (inward), as a result of the
typical jumping library construction methods:

ALLPATHS input files
The DATA directory must initially hold files containing the sequenced reads, their quality scores and
information concerning their pairing. In addition, a ploidy file must also be present. These files may
already exist if you are continuing or restarting an existing assembly, or may be assembled together
using tools provided with the ALLPATHS distribution.
Base, quality score, and pairing information files
The read libraries mentioned in the previous section must each be provided as a file containing the
bases, a file holding the quality scores, and a file with the pairing and library info. The specific file names
are:
<REF>/<DATA>/frag_reads_orig.fastb
<REF>/<DATA>/frag_reads_orig.qualb
<REF>/<DATA>/frag_reads_orig.pairs
<REF>/<DATA>/jump_reads_orig.fastb
<REF>/<DATA>/jump_reads_orig.qualb
<REF>/<DATA>/jump_reads_orig.pairs
The following long jump files are optional:
<REF>/<DATA>/long_jump_reads_orig.fastb
<REF>/<DATA>/long_jump_reads_orig.qualb
<REF>/<DATA>/long_jump_reads_orig.pairs
As is the following long unpaired reads file:
<REF>/<DATA>/long_reads_orig.fastb
These files can be automatically generated from a set of BAM, fastq, or fasta files as described below.
The ploidy file
The file ploidy is a single-line file containing a number. As the name suggests, this number indicates
the ploidy of the genome, with 1 for haploid genomes and 2 for diploid genomes. Polyploid genomes are
not currently supported. The specific file name is:
<REF>/<DATA>/ploidy
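A short pre-flight check along the following lines can confirm that a DATA directory holds the required files before an assembly is started. This is an illustrative sketch; the helper names are assumptions and these functions are not tools shipped with ALLPATHS.

```python
import os

# Required file names, as listed above (the long jump and long read files are optional).
REQUIRED_FILES = [
    f"{library}_reads_orig.{extension}"
    for library in ("frag", "jump")
    for extension in ("fastb", "qualb", "pairs")
] + ["ploidy"]


def missing_inputs(data_dir):
    """Return the required input file names that are absent from data_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


def write_ploidy(data_dir, ploidy):
    """Write the single-line ploidy file; only 1 (haploid) and 2 (diploid) are valid."""
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 or 2; polyploid genomes are not supported")
    with open(os.path.join(data_dir, "ploidy"), "w") as handle:
        handle.write(f"{ploidy}\n")
```

For example, calling missing_inputs on a freshly created directory reports all seven names, and the list shrinks as files are put in place.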

Preparing ALLPATHS input files
The Perl script PrepareAllPathsInputs.pl can be used to automatically convert a set of BAM,
fasta, fastq, or fastb files to ALLPATHS input files. It will also optionally create the necessary ploidy
file. This is the easiest way to prepare data for ALLPATHS given a set of files from the Illumina platform.
The user must provide as input two comma-separated values (.csv) files:
in_groups.csv
in_libs.csv
These describe, respectively, the locations and library information of the various files to be converted.
Accepted data file formats
Each data file must contain paired reads from a single library, but a library may be split over many files.
Typically a data file will represent a single lane of an Illumina flowcell.
As mentioned above, currently accepted formats are .bam, .fastq, and .fasta. The quality scores for
.fasta files are expected in corresponding .quala files.
For .fastq files you MUST check how the quality scores are encoded. By default it is assumed that the
quality scores are encoded using ASCII 33 to 126. If the quality scores are encoded using ASCII 64 to 126
you MUST specify the option PHRED_64=1 when running the conversion script (this is described
below).
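One common way to check the encoding is to look at the lowest quality character in a sample of reads: any character below ASCII 64 rules out the 64-to-126 encoding. The sketch below is a heuristic under that assumption, not part of the conversion script; a sample that happens to contain only high-quality bases can still be misclassified, so inspect enough reads.

```python
def guess_phred_64(quality_lines):
    """Suggest a PHRED_64 value (0 or 1) from sample fastq quality strings.

    Returns 0 when some character lies below ASCII 64 (the encoding must then
    start at 33), otherwise 1. Raises ValueError on empty or invalid input.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("quality characters outside ASCII 33 to 126")
    return 0 if lowest < 64 else 1


if __name__ == "__main__":
    print(guess_phred_64(["IIIIHG#5"]))  # '#' is ASCII 35, so this prints 0
    print(guess_phred_64(["hhhhgfB"]))   # lowest is 'B' (ASCII 66), so this prints 1
```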
in_groups.csv file
Each line in in_groups.csv provides, for each data file, the following information:
group_name: a UNIQUE nickname for this specific data set.
library_name: the library to which the data set belongs.
file_name: the absolute path to the data file. Wildcards * and ? are accepted (but not in the
extension) when specifying multiple files, as in the case of two paired or multiple unpaired fastq or fasta
files. Supported extensions are: .bam, .fasta, .fa, .fastq, .fq, .fastq.gz, and .fq.gz, all case
insensitive. For .fasta and .fa it is expected that corresponding .quala and .qa files exist,
respectively.

Example in_groups.csv:
group_name, library_name, file_name
       302, Illumina_011, /seq/Illumina/011/302.bam
       303, Illumina_012, /seq/Illumina/012/303.?.fasta
       100,   PacBio_007, /seq/PacBio/007/100.*.fastq.gz
in_libs.csv file
Each line in in_libs.csv describes a library. The specific fields are:
library_name: matches the same field in in_groups.csv.
project_name: a string naming the project.
organism_name: the organism.
type: fragment, jumping, EcoP15, etc. This field is only informative.
paired: 0: unpaired reads; 1: paired reads.
frag_size: average number of bases in the fragments (only defined for FRAGMENT libraries).
frag_stddev: estimated standard deviation of the fragment sizes (only defined for FRAGMENT
libraries).
insert_size: average number of bases in the inserts (only defined for JUMPING libraries; if larger
than 20 kb, the library is considered to be a LONG JUMPING library).
insert_stddev: estimated standard deviation of the insert sizes (only defined for JUMPING
libraries).
read_orientation: inward or outward. Outward-oriented reads will be reversed.
genomic_start: index of the FIRST genomic base in the reads. If non-zero, all the bases before
genomic_start will be trimmed out.
genomic_end: index of the LAST genomic base in the reads. If non-zero, all the bases after
genomic_end will be trimmed out.
Here is an example in_libs.csv (NOTE: all the fields should be on a single line; that makes the lines
too long to show here, hence the '...'):
library_name, project_name, organism_name,     type, paired, ...
Illumina_011,      Awesome,        E.coli, fragment,      1, ...
Illumina_012,      Awesome,        E.coli,  jumping,      1, ...
  PacBio_007,      Awesome,        E.coli,     long,      0, ...

... frag_size, frag_stddev, insert_size, insert_stddev, ...
...       180,          10,            ,              , ...
...          ,            ,        3000,           500, ...
...          ,            ,            ,              , ...

... read_orientation, genomic_start, genomic_end
...           inward,              ,
...          outward,              ,
...                 ,              ,
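Since the type field is only informative, the size fields are what distinguish the library kinds. The sketch below reads the example rows (joined onto single lines) and classifies each library from paired, frag_size and insert_size. How the conversion script itself makes this decision is not documented here, so treat the logic, the function names and the label strings as assumptions.

```python
import csv
import io

LONG_JUMP_MIN_INSERT = 20000  # bases; "larger than 20 kb" in the field description

IN_LIBS_TEXT = """\
library_name, project_name, organism_name, type, paired, frag_size, frag_stddev, insert_size, insert_stddev, read_orientation, genomic_start, genomic_end
Illumina_011, Awesome, E.coli, fragment, 1, 180, 10, , , inward, ,
Illumina_012, Awesome, E.coli, jumping, 1, , , 3000, 500, outward, ,
PacBio_007, Awesome, E.coli, long, 0, , , , , , ,
"""


def classify_library(row):
    """Classify one in_libs.csv row from its paired and size fields."""
    frag = (row.get("frag_size") or "").strip()
    insert = (row.get("insert_size") or "").strip()
    if (row.get("paired") or "").strip() != "1":
        return "unpaired"
    if frag:
        return "fragment"
    if insert:
        return "long jumping" if int(insert) > LONG_JUMP_MIN_INSERT else "jumping"
    raise ValueError("paired library with neither frag_size nor insert_size")


def classify_all(text):
    """Return {library_name: classification} for every row of the CSV text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return {row["library_name"]: classify_library(row) for row in reader}


if __name__ == "__main__":
    for name, kind in classify_all(IN_LIBS_TEXT).items():
        print(name, kind)
```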
Running conversion script
Simplest example of a PrepareAllPathsInputs.pl run:
PrepareAllPathsInputs.pl
DATA_DIR=<full_path to REFERENCE DIR>/mydata
PICARD_TOOLS_DIR=/opt/picard/bin
where DATA_DIR is the location of the ALLPATHS DATA directory where the converted reads will be
placed, and PICARD_TOOLS_DIR is the path to the Picard tools needed for data conversion, if your
data is in BAM format. There are other options that can be specified:
IN_GROUPS_CSV: use a file other than ./in_groups.csv.
IN_LIBS_CSV: use a file other than ./in_libs.csv.
INCLUDE_NON_PF_READS: 1: (default) include non-PF reads. 0: include only PF reads.
PHRED_64: (for fastq files only) 0: (default) provided fastqs have quality scores encoded with ASCII
33 to 126. 1: ASCII 64 to 126.
PLOIDY: generate the ploidy file. Valid values are 1 or 2.
HOSTS: list of hosts to use in parallel by forking (NOTE: forking to remote hosts requires password-less
ssh access, e.g. using ssh-agent/ssh-add). Example: 2,3.host2,4.host3 which translates to:
2 processes forked on the localhost;
3 processes forked on host2;
4 processes forked on host3.
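The HOSTS syntax can be read as a comma-separated list of "count" or "count.host" items. A minimal parser for that reading is sketched below; it is based only on the single example above, so the handling of anything beyond that example (such as host names that themselves contain dots) is an assumption.

```python
def parse_hosts(spec):
    """Parse a HOSTS value into (process_count, host) pairs.

    An item without a host part refers to the local host. Only the first '.'
    separates the count from the host, so dotted host names stay intact.
    """
    pairs = []
    for item in spec.split(","):
        count, _, host = item.strip().partition(".")
        pairs.append((int(count), host or "localhost"))
    return pairs


if __name__ == "__main__":
    for count, host in parse_hosts("2,3.host2,4.host3"):
        print(f"{count} processes forked on {host}")
```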

The following options allow the user to select, randomly, a fraction of the total number of reads:
FRAG_FRAC: fraction of fragment reads to include, e.g. 30% or 0.3.
JUMP_FRAC: fraction of jumping reads to include, e.g. 20% or 0.2.
LONG_JUMP_FRAC: fraction of long jumping reads to include, e.g. 90% or 0.9.
GENOME_SIZE: estimated genome size for the purpose of coverage estimation.
FRAG_COVERAGE: fragment library desired coverage, e.g. 45. Requires GENOME_SIZE.
JUMP_COVERAGE: jumping library desired coverage, e.g. 45. Requires GENOME_SIZE.
LONG_JUMP_COVERAGE: long jumping library desired coverage, e.g. 1. Requires GENOME_SIZE
(typically, only very low coverage is required for long jumps).
Note, however, that there are some restrictions on the above options. If you specify FRAG_FRAC or
JUMP_FRAC, you cannot also specify FRAG_COVERAGE or JUMP_COVERAGE. If you specify
FRAG_COVERAGE or JUMP_COVERAGE you must specify GENOME_SIZE, since both values are
necessary for the calculation of the read fraction to include.
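The relationship between a desired coverage, the genome size and the fraction of reads to keep can be sketched as follows. The formula (desired coverage divided by available coverage, capped at 1) is an assumption about how such a fraction would naturally be computed, and the restriction check is a literal reading of the paragraph above; neither is taken from the script's source.

```python
def read_fraction(desired_coverage, genome_size, total_bases):
    """Fraction of reads to keep so that coverage drops to desired_coverage."""
    available_coverage = total_bases / genome_size
    return min(1.0, desired_coverage / available_coverage)


def check_options(options):
    """Raise ValueError if the options break the restrictions stated above."""
    has_frac = any(key in options for key in ("FRAG_FRAC", "JUMP_FRAC"))
    has_coverage = any(key in options for key in ("FRAG_COVERAGE", "JUMP_COVERAGE"))
    if has_frac and has_coverage:
        raise ValueError("the _FRAC and _COVERAGE options cannot be combined")
    if has_coverage and "GENOME_SIZE" not in options:
        raise ValueError("the _COVERAGE options require GENOME_SIZE")


if __name__ == "__main__":
    # Hypothetical data set: 500 Mb of fragment reads for a 5 Mb genome (100x),
    # subsampled to a desired coverage of 45x.
    check_options({"FRAG_COVERAGE": 45, "GENOME_SIZE": 5_000_000})
    print(read_fraction(45, 5_000_000, 500_000_000))  # 0.45
```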
After a successful run of PrepareAllPathsInputs.pl the necessary ALLPATHS input files should
be in place and ready for an assembly run to start.

Import reference
If you plan to perform evaluations, you can import a reference genome into the pipeline directory at the
same time as the read data. The reference genome to import is specified using the argument:
REFERENCE_DIR=<directory containing reference>
The reference genome must be supplied as two files: genome.fasta and genome.fastb. The fastb
file is a binary version of the fasta file. You can convert from fasta to fastb using the ALLPATHS module
Fasta2Fastb.
This argument is ignored if a reference genome already exists in the REFERENCE directory. It will not
cause an existing reference genome in the pipeline directory to be overwritten.
Once the reference has been imported into the REFERENCE directory, you can omit the
REFERENCE_DIR argument when running RunAllPathsLG.
Instead of using the REFERENCE_DIR argument, you may simply create the REFERENCE directory
and place the reference genome files in it. The reference genome files must be named
genome.fasta and genome.fastb.

Running ALLPATHS in brief
Once the read data has been imported you may run the ALLPATHS pipeline as often as desired, each
time with different assembly parameters. Each time you run the ALLPATHS pipeline it will determine
which modules need to run (or rerun) depending on the parameters you have chosen. Unless you want
to overwrite your previous assembly, specify a new RUN directory each time.
This section briefly describes the RunAllPathsLG arguments commonly used to run the ALLPATHS
pipeline. Complete descriptions of all arguments are provided in the ALLPATHS Reference.
evaluation mode: Given a reference genome, the pipeline can perform evaluations at various
stages of the assembly process and of the assembly itself. To turn evaluation on, set
EVALUATION=STANDARD.
targets: The value of the TARGETS parameter determines the operations performed by the
pipeline:
TARGETS=full_eval   Runs a version of the pipeline that includes additional evaluation
                    modules.
TARGETS=standard    Runs a streamlined version of the pipeline that skips many of the
                    evaluation modules.
parallelization: The pipeline has two levels of parallelization. It can run two or more modules
concurrently if their dependencies are independent. Many individual modules are also capable
of being parallelized via multithreading. By default, only multithreaded parallelization is on. See
the ALLPATHS Reference for more details.

Example
The TARGETS argument of RunAllPathsLG determines whether the ALLPATHS pipeline runs to
completion or imports the data and stops. To run an assembly using previously imported data use:
TARGETS=standard
For example, for data imported using PrepareAllPathsInputs.pl with DATA_DIR=<user
pre>/staph/mydata use:
RunAllPathsLG
PRE=<user pre>
DATA_SUBDIR=mydata
RUN=myrun
REFERENCE_NAME=staph
TARGETS=standard
This will create (if it doesn't already exist) the following pipeline directory structure:
<user pre>/staph/mydata/myrun
where staph is the REFERENCE directory, mydata is the DATA directory containing the imported
data, and myrun is the RUN directory.
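When launching runs from a script, the command line can be assembled from the same NAME=value arguments. The sketch below only builds and prints the command and does not execute it; the helper name is hypothetical, /assemblies stands in for <user pre>, and SUBDIR is shown only as an optional extra because the example above omits it.

```python
import shlex


def build_command(pre, reference_name, data_subdir, run, targets="standard", **extra):
    """Return the RunAllPathsLG invocation as an argument list."""
    arguments = {
        "PRE": pre,
        "REFERENCE_NAME": reference_name,
        "DATA_SUBDIR": data_subdir,
        "RUN": run,
        "TARGETS": targets,
    }
    arguments.update(extra)  # e.g. SUBDIR="attempt1" or EVALUATION="STANDARD"
    return ["RunAllPathsLG"] + [f"{name}={value}" for name, value in arguments.items()]


if __name__ == "__main__":
    command = build_command("/assemblies", "staph", "mydata", "myrun")
    print(shlex.join(command))
    # To actually launch the run, pass the list to subprocess.run(command, check=True).
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when a path contains spaces.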

Pipeline errors
The pipeline will stop when it encounters an error. There are two types of error that can occur:
rule consistency check error: Before any modules are called, RunAllPathsLG checks to see
if it knows how to make all the output files for the given assembly parameters. If not, the
pipeline halts immediately before any modules are run, reporting the files that it does not know
how to make. Check and correct your arguments and try again.
runtime consistency check error: After each module in the pipeline has completed, the pipeline
checks to see if the correct output files were created. If any files are missing, the pipeline halts,
reporting the missing files and the module that failed to produce them. This most often occurs
when a module crashes. Check the log for an error message from the module in question.
Once the error has been identified and corrected, rerun the RunAllPathsLG command. The pipeline
restarts at the point where it previously failed.
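The runtime consistency check can be pictured with a toy example. The sketch below is not RunAllPathsLG's implementation; it is a minimal Python illustration, with hypothetical module names and file names, of checking for expected output files after each step and halting at the first module that fails to produce them.

```python
import os
import tempfile

def missing_outputs(expected_files):
    """Return the expected output files that do not exist on disk."""
    return [path for path in expected_files if not os.path.exists(path)]

def run_modules(modules):
    """Run (name, action, outputs) triples in order; stop at the first module
    whose expected outputs are missing after it has run."""
    for name, action, outputs in modules:
        action()
        missing = missing_outputs(outputs)
        if missing:
            raise RuntimeError(
                f"module {name} failed to produce: {', '.join(missing)}")

with tempfile.TemporaryDirectory() as run_dir:
    good = os.path.join(run_dir, "reads.fastb")
    bad = os.path.join(run_dir, "hyper.fasta")
    modules = [
        ("WritesOutput", lambda: open(good, "w").close(), [good]),  # succeeds
        ("Crashes", lambda: None, [bad]),                           # writes nothing
    ]
    try:
        run_modules(modules)
    except RuntimeError as error:
        print(error)  # names the failing module and the missing file
```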

The ALLPATHS assembly
Assembly as a graph
Unlike a conventional genome assembly, an ALLPATHS assembly is a graph. Edges in this graph represent
base sequences, and each path through the graph represents a possible solution to the assembly
problem. An ideal assembly would be a single edge, with occasional blips corresponding to SNPs in a
diploid genome. However, uncorrected sequencing errors, unresolved repeat structures, and assembly
algorithm inadequacies result in ambiguity. By representing the assembly as a graph we can capture this
ambiguity rather than arbitrarily choosing a solution and therefore losing information.

Graph features
A graph assembly consists of components and edges. A component is a collection of connected edges.
An assembly may consist of a number of components, scaffolded together as in a linear assembly.
In the following examples the edge lengths are not to scale. Purple represents long edges; red,
medium-sized edges; black, short edges; and grey, very short edges.
Repeats
The graph below contains a 6.2 kb repeat that occurs 3 times in the genome. The repeat is longer than
the largest insert size available and so could not be resolved. However, we do know the two possible
orderings of edges and can represent this in a graph.

Homopolymers
With short reads, long homopolymer runs can be difficult to resolve. Rather than assuming a value for
the homopolymer length, we represent the run as a loop of length 1 base.

SNPs and base errors
When the reads offer two seemingly equally possible alternatives for a base, we represent this as a small
bubble. This situation can arise from SNPs, in which case the bubble is correct, but it may also be due
to particularly hard-to-correct base substitution errors in the raw reads. In a conventional assembly,
bases of low quality would represent these ambiguities.
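The idea that each path through the graph is a candidate solution can be illustrated with a toy bubble. The Python sketch below is a simplification and not the ALLPATHS data structure: it assumes an acyclic graph whose edges carry non-overlapping sequences, and the vertex numbers and sequences are invented for the example.

```python
def all_paths(edges, source, sink):
    """Enumerate the base sequences spelled by every path from source to sink.

    edges maps a vertex to a list of (next_vertex, sequence) pairs.
    The graph is assumed to be acyclic.
    """
    if source == sink:
        return [""]
    sequences = []
    for next_vertex, sequence in edges.get(source, []):
        for tail in all_paths(edges, next_vertex, sink):
            sequences.append(sequence + tail)
    return sequences

# A single bubble: two alternative one-base edges between vertices 1 and 2.
edges = {
    0: [(1, "ATGTC")],
    1: [(2, "A"), (2, "T")],
    2: [(3, "TGTCG")],
}
print(all_paths(edges, 0, 3))  # ['ATGTCATGTCG', 'ATGTCTTGTCG']
```

Keeping both edges in the graph preserves the A/T ambiguity; choosing one path discards it.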

Flattening and Scaffolding
The graph assembly mentioned in the previous section provides the most complete description of the
genome. However, this graph is incompatible with existing annotation and analysis tools and is also
troublesome to scaffold. To overcome these problems we attempt to flatten the graph, retaining as
much ambiguity as possible, then scaffold to create a (nearly) conventional assembly consisting of
scaffolded linear contigs.
The flattening and scaffolding processes are described elsewhere, but the end result is a pair of files:
a conventional fasta file plus a related efasta file. These files are described in detail in the next
section.

Assembly Results
The results of the assembly pipeline are given in the following two files in the assembly sub_dir
directory:
final.assembly.fasta
final.assembly.efasta
Both these files contain the final flattened and scaffolded assembly. The efasta, enhanced fasta,
file is a new format used by ALLPATHS and is based on the standard fasta file format.
fasta format
The fasta file contains contigs that have been scaffolded together, separated by n, where the number
of ns represents the best estimate of the gap size between contigs. A single n represents an unresolved
negative gap. Within each contig, an ambiguous base is represented by an N. For example, an A/T SNP
would become: ATGTCNTGTCG.
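As an illustration of this convention, the Python sketch below splits one scaffold sequence into its contigs and estimated gap sizes. It assumes, as described above, that gaps are runs of lowercase n while an uppercase N inside a contig is an ambiguous base. The function name and the example sequence are made up.

```python
import re

def split_scaffold(sequence):
    """Split a scaffold into contigs and the estimated gap sizes between them.

    Gaps are runs of lowercase 'n'; an uppercase 'N' inside a contig is an
    ambiguous base and is left in place.
    """
    contigs = re.split(r"n+", sequence)
    gaps = [len(run) for run in re.findall(r"n+", sequence)]
    return contigs, gaps

contigs, gaps = split_scaffold("ATGTCNTGTCGnnnnnGGCATnACCT")
print(contigs)  # ['ATGTCNTGTCG', 'GGCAT', 'ACCT']
print(gaps)     # [5, 1]; the gap of 1 marks an unresolved negative gap
```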
efasta format
The efasta file contains contigs that have been scaffolded together, separated by N, where the
number of Ns represents the best estimate of the gap size between contigs. A single N represents an
unresolved negative gap. Within each contig, ambiguity is represented by an expression within a pair of
braces, {}. For example, an A/T SNP would become: ATGTC{A,T}TGTCG. An unresolved
homopolymer run of T, where the evidence suggested there should be 6, 7 or 8 Ts, would become:
GTCACTTTTTT{,T,TT}GCTGT. In this enhanced fasta format simple ambiguities that would
otherwise be lost are now retained. The efasta format can be easily flattened by picking the first
option for each ambiguity, resulting in the assembly given in the associated fasta file. Within the
braces the options are ordered in terms of decreasing likelihood.
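The flattening rule is simple enough to sketch in Python. The helpers below are illustrative only and are not part of ALLPATHS: flatten_efasta picks the first option in each pair of braces, and expand_efasta lists every sequence the expression allows. Both assume that braces are not nested.

```python
import itertools
import re

AMBIGUITY = re.compile(r"\{([^{}]*)\}")

def flatten_efasta(sequence):
    """Replace each {a,b,...} expression with its first (most likely) option."""
    return AMBIGUITY.sub(lambda match: match.group(1).split(",")[0], sequence)

def expand_efasta(sequence):
    """Return every sequence obtainable by choosing one option per ambiguity."""
    # Splitting on a pattern with one capture group yields alternating
    # literal text and captured option lists.
    parts = AMBIGUITY.split(sequence)
    choices = [
        part.split(",") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]

print(flatten_efasta("ATGTC{A,T}TGTCG"))          # ATGTCATGTCG
print(flatten_efasta("GTCACTTTTTT{,T,TT}GCTGT"))  # GTCACTTTTTTGCTGT
print(expand_efasta("ATGTC{A,T}TGTCG"))           # ['ATGTCATGTCG', 'ATGTCTTGTCG']
```

For the homopolymer example, expand_efasta returns three sequences containing runs of 6, 7 and 8 Ts respectively.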

ALLPATHS Reference
ALLPATHS Cache for power users
The Perl script PrepareAllPathsInputs.pl that imports data to ALLPATHS is in fact a wrapper
around a few tools. It first creates a temporary cache of fastb/qualb files in <DATA>/read_cache/
for each data file described in in_groups.csv. Then, it automatically merges all these cached files
into each of the input files expected by ALLPATHS.
Alternatively, a non-temporary cache can be created separately at a different location and it can work as
a repository of data for many different projects. The advantage of having a cache is that it separates the
time-consuming step of converting data files (especially BAM files) to the fastb and qualb format from
the merging of the fastb and qualb files into ALLPATHS input files. This is useful, for example, when a
user wants to run different assemblies based on different subsets of the original data.

Creating the ALLPATHS Cache
The ALLPATHS cache stores all the information regarding the libraries and groups in two files in the
cache directory: libraries.csv and groups.csv. To build the cache the following commands
need to be run:
CacheLibs.pl
CACHE_DIR=<CACHE_DIR>
IN_LIBS_CSV=in_libs.csv
ACTION=Add

CacheGroups.pl
CACHE_DIR=<CACHE_DIR>
PICARD_TOOLS_DIR=/opt/picard/bin
IN_GROUPS_CSV=in_groups.csv
TMP_DIR=/large-tmp
HOSTS=2,3.host2,4.host3
ACTION=Add

The CacheLibs.pl command simply adds the library information in in_libs.csv to the cache
libraries.csv. The CacheGroups.pl command converts all the data files described in
in_groups.csv to fastb and qualb files in the cache, and adds the corresponding entries to the
cache groups.csv. The common options are:
CACHE_DIR: the full path to the cache directory. Can be omitted if the environment variable
ALLPATHS_CACHE_DIR is defined.
ACTION: Add, List, or Remove entries to, in, and from the cache.

CacheLibs.pl options:
IN_LIBS_CSV: alternative file to the default ./in_libs.csv.
CacheGroups.pl options:
PICARD_TOOLS_DIR: the full path to the Picard tools needed for data conversion. Can be omitted
if the environment variable ALLPATHS_PICARD_TOOLS_DIR is defined.
IN_GROUPS_CSV: alternative file to the default ./in_groups.csv.
TMP_DIR: the full path of a local temporary directory; must be large if your data is large.
INCLUDE_NON_PF_READS: 1 (default) includes non-PF reads; 0 includes only PF reads.
HOSTS: list of hosts to use in parallel by forking. Each fork converts a single data file (NOTE: forking
to remote hosts requires passwordless ssh access, e.g. using ssh-agent/ssh-add). Example:
2,3.host2,4.host3 which translates to:

2 processes forked on the local host;
3 processes forked on host2;
4 processes forked on host3.
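To make the HOSTS syntax concrete, here is a small Python sketch that parses such a value into per-host fork counts. The parser is inferred from the single example above and is an assumption, not the script's own code. The function name parse_hosts is hypothetical, and "localhost" is simply the label chosen here for entries that name no host.

```python
def parse_hosts(spec):
    """Parse a HOSTS value such as '2,3.host2,4.host3'.

    Each comma-separated entry is '<count>' for the local host or
    '<count>.<hostname>' for a remote host.
    """
    forks = {}
    for entry in spec.split(","):
        count, _, host = entry.partition(".")
        forks[host or "localhost"] = int(count)
    return forks

print(parse_hosts("2,3.host2,4.host3"))
# {'localhost': 2, 'host2': 3, 'host3': 4}
```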

Finally, the contents of the cache can easily be listed by running:
CacheGroups.pl
CACHE_DIR=<CACHE_DIR>
ACTION=List

Using the ALLPATHS Cache
Once the cache is created it can be used to generate the ALLPATHS input files:
<DATA>/frag_reads_orig.fastb
<DATA>/frag_reads_orig.qualb
<DATA>/frag_reads_orig.pairs
<DATA>/jump_reads_orig.fastb
<DATA>/jump_reads_orig.qualb
<DATA>/jump_reads_orig.pairs
<DATA>/long_jump_reads_orig.fastb
<DATA>/long_jump_reads_orig.qualb
<DATA>/long_jump_reads_orig.pairs
<DATA>/long_reads_orig.fastb

The command to generate the ALLPATHS input files is:
CacheToAllPathsInputs.pl
CACHE_DIR=<CACHE_DIR>
GROUPS={12345AAXX.{1,2,3},67890ABXX.{6,7}}
DATA_DIR=<DATA_DIR>
FRAG_FRAC=50%
JUMP_FRAC=34%
The options are:
CACHE_DIR: the path to the cache directory. Can be omitted if the environment variable
ALLPATHS_CACHE_DIR is defined.
DATA_DIR: the full path to the ALLPATHS DATA directory where the input files will be placed.
GROUPS: a list of the groups to include as inputs.
IN_GROUPS_CSV: file including the groups description. Optional alternative to GROUPS.
FRAG_FRAC: fraction of fragment reads to include, e.g. 30% or 0.3.
JUMP_FRAC: fraction of jumping reads to include, e.g. 20% or 0.2.
LONG_JUMP_FRAC: fraction of long jumping reads to include, e.g. 90% or 0.9.
FRACTIONS: (use with GROUPS only) list of fractions, one per group, e.g. {0.5,30%,100%}.
GENOME_SIZE: estimated genome size for the purpose of coverage estimation.
FRAG_COVERAGE: (requires GENOME_SIZE) fragment library desired coverage, e.g. 45.
JUMP_COVERAGE: (requires GENOME_SIZE) jumping library desired coverage, e.g. 45.
LONG_JUMP_COVERAGE: (requires GENOME_SIZE) long jumping library desired coverage, e.g. 1
(typically, only very low coverage is required for long jumps).
COVERAGES: (use with GROUPS only, requires GENOME_SIZE) list of coverages, one per group,
e.g. {45,50,2}.
LONG_READ_MIN_LEN: (default 500) this sets the threshold for what qualifies as a long unpaired
read (e.g. PacBio reads).
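The fraction options accept either a percentage or a decimal value. A minimal Python sketch of reading both spellings is given below. The function name parse_fraction and the range check are assumptions made for illustration and do not describe how the Perl script itself parses its arguments.

```python
def parse_fraction(text):
    """Convert '30%' or '0.3' to a float in the range [0, 1]."""
    text = text.strip()
    value = float(text[:-1]) / 100.0 if text.endswith("%") else float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"fraction out of range: {text}")
    return value

# Both spellings from the option descriptions give the same value.
print(parse_fraction("30%"), parse_fraction("0.3"))
```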
As in PrepareAllPathsInputs.pl, there are some restrictions on the above options. If you
specify FRAG_FRAC, JUMP_FRAC, or FRACTIONS, you cannot also specify FRAG_COVERAGE,
JUMP_COVERAGE, or COVERAGES. If you specify FRAG_COVERAGE, JUMP_COVERAGE, or
COVERAGES you must specify GENOME_SIZE, since both values are necessary for the calculation of
the read fraction to include. If you specify one of the FRACTIONS and COVERAGES lists you must specify
a GROUPS list and supply one fraction or coverage entry for each group.
After a successful run of CacheToAllPathsInputs.pl the necessary ALLPATHS input files should
be in place and ready for an assembly run to start.
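The restrictions above can be written down as a short checklist. The Python sketch below encodes only the rules stated in this section. The function name check_options is hypothetical, and it assumes that GROUPS, FRACTIONS and COVERAGES are supplied as Python lists.

```python
FRACTION_KEYS = {"FRAG_FRAC", "JUMP_FRAC", "FRACTIONS"}
COVERAGE_KEYS = {"FRAG_COVERAGE", "JUMP_COVERAGE", "COVERAGES"}

def check_options(options):
    """Return a list of problems found in a dict of option names to values."""
    problems = []
    given = set(options)
    if given & FRACTION_KEYS and given & COVERAGE_KEYS:
        problems.append("fraction and coverage options cannot be combined")
    if given & COVERAGE_KEYS and "GENOME_SIZE" not in given:
        problems.append("coverage options require GENOME_SIZE")
    for key in ("FRACTIONS", "COVERAGES"):
        if key in given:
            groups = options.get("GROUPS")
            if groups is None:
                problems.append(f"{key} requires a GROUPS list")
            elif len(options[key]) != len(groups):
                problems.append(f"{key} needs one entry per group")
    return problems

print(check_options({"FRAG_FRAC": "50%", "JUMP_FRAC": "34%"}))  # []
print(check_options({"FRAG_COVERAGE": 45}))
# ['coverage options require GENOME_SIZE']
```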

ALLPATHS compilation options
The following command line options may be appended to make when building ALLPATHS:
-j<n>    Split the compilation into n parallel processes. If you set n equal to the number of CPUs on
         your machine, it will speed up compilation approximately n-fold. See Installation for an
         example.

ALLPATHS pipeline in detail
Key Features
The ALLPATHS pipeline incorporates the following key features:

- Runs only those modules that are required for a particular set of parameters.
- Ensures intermediate files are always consistent.
- If the parameters for a module change, reruns only the changed module and modules that
  depend on its output.
- In the event of a problem, restarts at the point the problem occurred.
- Supports easy parallelization by allowing modules that don't depend on each other's output to
  run concurrently.
- Can easily be run up to any point.
- Can initially exclude modules that are not required for the assembly process (evaluation
  modules for example), then easily run them once the assembly is complete.
- Determines if it has all the necessary input files and knows how to build all the requested output
  files before starting any modules. Stops immediately if there is a problem.

Directory structure: ALLPATHS_BASE
In addition to using the command line argument PRE to specify the location of the pipeline directory,
you may optionally also use ALLPATHS_BASE. The pipeline directory location is either:
PRE
or
PRE/ALLPATHS_BASE

Targets
The pipeline determines which output files it needs to generate by means of a list of targets. If a
particular target file is requested, then the modules required to create both it, and any intermediate
files it depends on, will be run in the correct order. Only these modules will be run. Further, if any
required intermediate files already exist and are up to date with respect to the files that they in turn
depend on, then the call to the module required to build them is skipped. This holds true for the final
target file or files: if they already exist and are up to date then nothing will be done.
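The notion of a file being up to date with respect to its dependencies can be illustrated with a make-style timestamp comparison. The Python sketch below is an assumption about how such a check might look, not a description of the pipeline's internal logic, and the file names are placeholders.

```python
import os
import tempfile

def is_up_to_date(target, dependencies):
    """Return True if target exists and is at least as new as every dependency."""
    if not os.path.exists(target):
        return False
    target_time = os.path.getmtime(target)
    return all(
        os.path.exists(dep) and os.path.getmtime(dep) <= target_time
        for dep in dependencies
    )

with tempfile.TemporaryDirectory() as work:
    dep = os.path.join(work, "reads.fastb")
    target = os.path.join(work, "hyper.fasta")
    for path in (dep, target):
        open(path, "w").close()
    os.utime(dep, (100, 100))             # input last modified at t=100
    os.utime(target, (200, 200))          # output built later, at t=200
    print(is_up_to_date(target, [dep]))   # True: nothing to rebuild
    os.utime(dep, (300, 300))             # input changes after the build
    print(is_up_to_date(target, [dep]))   # False: the module must rerun
```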
You can specify the target files to build in two ways. The simplest is to use one of the predefined pseudo
targets that represent a set of useful target files, much like pseudo targets in Make. The second is to
specify a list of individual files that the pipeline knows how to make. Both methods may be used at the
same time.
If you ask for a target file that the pipeline doesn't know how to make you will get an error message.
Pseudo targets
This is the best way to control which files the pipeline will create. The pseudo target value is passed to
RunAllPathsLG using:
TARGETS=<pseudo target name>
There are 3 possible pseudo targets:
none: no pseudo targets, only make explicitly listed target files (see below).
standard: create the assembly and selected evaluation files.
full_eval: create the assembly and additional evaluation files.

The default target is standard.
Target files
Individual files may be specified as targets instead of, or in addition to, the pseudo targets. Lists of target
files in each pipeline subdirectory are passed to RunAllPathsLG using:
TARGETS_DATA=<target files in the DATA dir>
TARGETS_RUN=<target files in the RUN dir>
TARGETS_SUBDIR=<target files in the SUBDIR dir>
Multiple target files may be passed in the following manner:
TARGETS_RUN={target1,target2,target3}
The list of valid target files changes based on the assembly parameters chosen.
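When the command line is generated by a script, the brace list can be assembled from an ordinary list of file names. The Python helper below is a hypothetical convenience. Passing a single target without braces is an assumption here and is not something this manual states.

```python
def target_argument(name, targets):
    """Format a target-file argument, using braces for more than one file."""
    if len(targets) == 1:
        return f"{name}={targets[0]}"  # assumption: one file needs no braces
    return f"{name}={{{','.join(targets)}}}"

print(target_argument("TARGETS_RUN", ["target1", "target2", "target3"]))
# TARGETS_RUN={target1,target2,target3}
```

If the command is typed into a shell such as bash rather than passed as an argument list, an unquoted {a,b} may be rewritten by the shell's brace expansion, so quoting the argument is a sensible precaution.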

Evaluation mode
Given a reference genome, the pipeline can perform evaluations at various stages of the assembly
process.

Certain evaluations have the potential to alter the assembly, as they require reference genome data to
be incorporated into data structures used by the assembly process. Any such perturbation of the
assembly should be neutral but will have a stochastic effect on the result. Such unsafe evaluations
allow much more detailed information to be gathered about the assembly process and are extremely
useful during development, but can be considered cheating from the point of view of de novo
assembly.
The evaluation mode used is controlled by:
EVALUATION=<evaluation mode>
There are five evaluation modes:
NONE: do not evaluate / no reference is available.
BASIC: basic evaluation that does not require a reference.
STANDARD: run evaluation modules using a supplied reference.
FULL: turn on in-place evaluation in certain assembly modules. Does not perturb the assembly.
CHEAT: run in-place evaluations that potentially perturb the assembly (in a neutral fashion),
but allow a more detailed analysis.
The default mode is BASIC.

K-mer size, K
The user should not adjust the k-mer size from the default value of K=96.
The relationship between k-mer size K and read size is not a direct one in ALLPATHS-LG, unlike in many
other assemblers. ALLPATHS-LG actually uses a number of different sizes of K internally, and because of
this, it is not intended that users change the K values for an assembly.

Parallelization
Given sufficient memory, it is possible to parallelize the pipeline in order to reduce runtime. Two forms
of parallelization are possible and both may be used at the same time.
Cross-module parallelization
Modules in the pipeline that do not depend on each other may be run concurrently. This functionality is
provided by make, which is used by RunAllPathsLG to execute the pipeline. It is equivalent to using
the option -j<n> when compiling the ALLPATHS source code. No checks are made to ensure that there
is enough memory to run multiple ALLPATHS modules at the same time. Set the maximum number of
modules that can run concurrently using:
MAXPAR=<n>

The majority of the pipeline now uses parallel threading, so in most cases there is little to be gained in
setting this value above 1.
Parallelization of individual modules
Many of ALLPATHS's modules have been engineered to run with parallel threading. This form of
parallelization is independent of the module parallelization described above. The level of parallelization
can be controlled using the argument to RunAllPathsLG:
THREADS=<n>
For maximum performance, set this value to the number of processors available, but be wary of
exceeding available memory as the number of threads increases. Due to hardware constraints (such as
I/O limiting and heap contention) you will find diminishing returns in runtime improvement as the
number of threads increases.
By default the pipeline will attempt to use all available processors.

Logging
In addition to standard out, the output from each ALLPATHS module is captured to file. In each pipeline
directory there exists a subdirectory named makeinfo that contains various logging files plus
metadata used by the pipeline to control and track progress. Every single file produced by the pipeline
will have two log files associated with it. For example, the file hyper.fasta will have the following log
files in SUBDIR/makeinfo:
hyper.fasta.cmd
hyper.fasta.DumpHyper.out
The .cmd file contains the command used to generate hyper.fasta. The .out file contains the
captured output of the module used to create hyper.fasta. In this case the module is called
DumpHyper, as you would see from looking at the file hyper.fasta.cmd.
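The naming pattern in this example can be expressed as a small Python helper, which may be handy when scripting a search for the log of a failed module. The function name log_files is hypothetical, and the pattern is inferred from the hyper.fasta example above.

```python
from pathlib import PurePosixPath

def log_files(directory, target, module):
    """Return the .cmd and .out log paths for a file made by the given module."""
    makeinfo = PurePosixPath(directory) / "makeinfo"
    return makeinfo / f"{target}.cmd", makeinfo / f"{target}.{module}.out"

cmd_log, out_log = log_files("SUBDIR", "hyper.fasta", "DumpHyper")
print(cmd_log)  # SUBDIR/makeinfo/hyper.fasta.cmd
print(out_log)  # SUBDIR/makeinfo/hyper.fasta.DumpHyper.out
```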

References
Gnerre S, MacCallum I, Przybylski D, Ribeiro F, Burton J, Walker B, Sharpe T, Hall G, Shea T, Sykes S,
Berlin A, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB.
High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of
the National Academy of Sciences, January 2011, vol. 108 no. 4, 1513-1518.
MacCallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J, McKernan K, Ranade S,
Shea TP, Williams L, Young S, Nusbaum C, Jaffe DB. ALLPATHS 2: small genomes assembled accurately
and with high continuity from short paired reads. Genome Biology 2009, 10(10):R103.
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB.
ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. May 2008, 18:810-820.
