Generate A3M and TGT file from a given sequence in FASTA format.
realbigws/TGT_Package
#=============================
# Sequence Feature Constructor
#   A3M_TGT_Gen.sh (v1.07)
#=============================

Abstract:

Construct sequence features in TGT format by searching for protein sequences homologous to a query sequence in FASTA format.
Also generate the Multiple Sequence Alignment (MSA) in A3M format.

Author: Sheng Wang
Email: [email protected]

#=============
# Publication:
#=============

[1] RaptorX-Property: a Web Server for Protein Structure Property Prediction
    Sheng Wang#*, Wei Li*, Shiwang Liu, Jinbo Xu#
    Nucleic Acids Research, 2016

[2] Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields
    Sheng Wang#, Jian Peng, Jianzhu Ma, Jinbo Xu#
    Scientific Reports, 2016

[3] PredMP: a web server for de novo prediction and visualization of membrane proteins
    Sheng Wang#, Shiyang Fei, Zongan Wang, Yu Li, Jinbo Xu, Feng Zhao#, Xin Gao#
    Bioinformatics, 2018

#=========
# Install:
#=========

1. download the package

git clone https://github.com/realbigws/TGT_Package
cd TGT_Package/

--------------

2. compile

cd source_code/
make
cd ../

--------------

3. jackhmmer (optional)

To re-compile the executables, run the following commands:

cd jackhmm/
./install
cd ../

--------------

4. blastpgp (optional)

To compile the executables in 'util/', run the following commands:

cd buildali2/source_code/
make
cd ../../

#==========
# Database:
#==========

1. if databases/ does not exist, create it with

mkdir -p databases/

2. download the UniProt20 database from the following link:

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniprot20_2016_02.tgz

uncompress it, move all files to databases/, and rename these files with the prefix 'uniprot20'.

3. if another version of UniProt20 (or UniClust30) is used, then pass the '-d uniprot20_XXXX_YY' option to ./A3M_TGT_Gen.sh
   for example, a newer version of UniClust30 can be downloaded from the link below:

http://wwwuser.gwdg.de/~compbiol/uniclust/2017_10/uniclust30_2017_10_hhsuite.tar.gz

   Again, you may rename these files with the prefix 'uniclust30'.

4. if users want to run other packages, such as jackhmm or buildali2, please install the databases below:

(a) jackhmm : databases/uniref90.fasta
    this database can be downloaded from the following link:
    ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
    [note]: you may rename this file (say, uniref90.fasta) after the current date (e.g., uniref90_2018_06.fasta),
            and then symbolically link it to uniref90.fasta

(b) buildali2 : databases/nr_databases (must contain nr90 and nr70)
    these databases can be downloaded from the following link:
    http://raptorx.uchicago.edu/download/

#=======
# Usage:
#=======

A3M_TGT_Gen v1.00 [Dec-06-2019]
Generate A3M and TGT file from a given sequence in FASTA format.

USAGE: ./A3M_TGT_Gen.sh <-i input_fasta> [-o out_root] [-c CPU_num] [-m memory]
                        [-h package] [-d database] [-n iteration] [-e evalue]
                        [-E neff] [-C coverage] [-M min_cut] [-N max_num]
                        [-A addi_meff] [-V addi_eval] [-D addi_db]
                        [-K remove_tmp] [-f force] [-H home]

Options:

***** required arguments *****
-i input_fasta  : Query protein sequence in FASTA format.

***** optional arguments *****
#--| misc parameters
-o out_root     : Default output would be the current directory. [default = './${input_name}_A3MTGT']

-c CPU_num      : Number of processors. [default = 4]

-m memory       : Maximal allowed memory (for hhsuite2 or hhsuite3 only). [default = 3.0 (G)]

#--| search engine and database
-h package      : The selected package to generate A3M file. [default = hhsuite2]
                  users may use other packages: hhsuite3, jackhmm, or buildali2.

-d database     : The selected database for sequence search. [default = uniprot20_2016_02]
                  users may use other uniprot20 databases to run hhsuite2 or hhsuite3,
                  or use uniref90 for jackhmm, and nr_databases for buildali2.

#--| search strategy
-n iteration    : Maximal iteration to run the selected package. [default = 2]

-e evalue       : E-value cutoff for the selected package. [default = 0.001]

-E neff         : Neff cutoff for threading purpose (i.e., -C -2). [default = 7] (for hhsuite only)

-C coverage     : Coverage for hhsuite only. [default = -2 (i.e., NOT use -cov in HHblits)]
                  if set to -1, then automatically determine coverage value.
                  if set to any other positive value, then use this -cov in HHblits.

#--| filter strategy
-M min_cut      : Minimal coverage of sequences in the generated MSA. [default = -1]
                  -1 indicates that we DON'T perform any filtering. Please set from 50 to 70.

-N max_num      : Maximal number of sequences in the generated MSA. [default = -1]
                  -1 indicates that we DON'T perform any filtering. For example, set 20000 here.

#--| additional A3M
-A addi_meff    : run additional A3M only if the previous ln(meff) is lower than this. [default = -1]
                  -1 indicates that we DON'T search for additional A3M.

-V addi_eval    : run additional A3M with a given e-value. [default = 0.001]

-D addi_db      : run additional A3M using a given database. [default = metaclust50]

#--| other options
-K remove_tmp   : Remove temporary folder or not. [default = 1 to remove]

-f force        : If specified, then FORCE overwrite existing files. [default = 0 NOT to]

***** home relevant **********
-H home         : home directory of TGT_Package. [default = .]

#=================
# Running example:
#=================

#-------------- part I: generate A3M without filtering and metagenomics ------------------#

#-> 1. by default, we run HHblits to generate A3M
./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -d uniprot20

#-> 2. we may also use JackHMM to produce A3M
./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h jackhmm -d uniref90 -n 3

#-> 3. for legacy purposes, we allow BLAST wrapped in BuildAli2
./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h buildali2 -d nr_databases -n 5

#-------------- part II: generate A3M with filtering and metagenomics --------------------#

./A3M_TGT_Gen.sh -i example/T1001-D1.fasta -d uniprot20 -N 5000 -A 6

#-> [note]: here 5000 is the maximal number of sequences in the generated A3M.
#           6 is the ln(meff) threshold to run additional A3M.

#-------------- part III: generate a variety of A3Ms with user-defined strategies --------#

./SEQ_to_A3Ms.sh -i example/1pazA.fasta -x '1e-3:3:-1|1:3:-1' -y 'null' -z 'null' -X uniprot20 -c 12

#-> [note]: here we only use the HHblits strategies on the uniprot20 database.

#============
# References:
#============

[hhsuite]: https://github.com/soedinglab/hh-suite

[jackhmmer]: http://eddylab.org/software/hmmer3/3.1b2

[blastpgp]:
a) legacy version: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/
b) current version: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.29/
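
The running examples above each process a single query. When many FASTA files need the same treatment, the invocation can be generated per file. The following is a minimal sketch, not part of TGT_Package: the function names `build_cmd` and `batch_cmds` and the per-query output layout `<out_root>/<name>_A3MTGT` are assumptions made for illustration, and only the documented `-i`, `-o`, `-c`, and `-d` options are used. The functions print the commands rather than running them, so they can be inspected first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with TGT_Package): print one
# A3M_TGT_Gen.sh command line per FASTA file found in a directory.

# build_cmd FASTA OUT_ROOT CPU DB
# Prints the command for a single query, using only the documented
# -i, -o, -c and -d options. The output folder naming is an assumption.
build_cmd() {
    local fasta="$1" out_root="$2" cpu="$3" db="$4"
    local name
    name="$(basename "$fasta")"
    name="${name%.*}"            # strip the file extension, e.g. ".fasta"
    # %q shell-quotes each argument so the printed line is safe to execute
    printf './A3M_TGT_Gen.sh -i %q -o %q -c %q -d %q\n' \
        "$fasta" "$out_root/${name}_A3MTGT" "$cpu" "$db"
}

# batch_cmds DIR OUT_ROOT CPU DB
# Prints one command per *.fasta file in DIR (in glob order).
batch_cmds() {
    local dir="$1" out_root="$2" cpu="$3" db="$4" f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue  # the glob matched nothing
        build_cmd "$f" "$out_root" "$cpu" "$db"
    done
}
```

For example, `batch_cmds example out 12 uniprot20` lists the commands for every FASTA file under example/. Once the list looks right, it can be piped into `bash` from the TGT_Package home directory to run the queries one after another.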