#=============================
# Sequence Feature Constructor 
# A3M_TGT_Gen.sh (v1.07)
#=============================

Abstract:
Construct sequence features in TGT format by searching for homologous protein sequences of a query sequence given in FASTA format.
Also generate the Multiple Sequence Alignment (MSA) in A3M format.


Author: 
Sheng Wang

Email:
[email protected]



#=============
# Publication:
#=============

[1]
RaptorX-Property: a Web Server for Protein Structure Property Prediction
       Sheng Wang#*, Wei Li*, Shiwang Liu, Jinbo Xu#
                                            Nucleic Acids Research, 2016

[2]
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields
       Sheng Wang#, Jian Peng, Jianzhu Ma, Jinbo Xu#
                                            Scientific Reports, 2016

[3]
PredMP: a web server for de novo prediction and visualization of membrane proteins
       Sheng Wang#, Shiyang Fei, Zongan Wang, Yu Li, Jinbo Xu, Feng Zhao#, Xin Gao#
                                            Bioinformatics, 2018



#=========
# Install:
#=========

1. download the package

git clone https://github.com/realbigws/TGT_Package
cd TGT_Package/

--------------

2. compile

cd source_code/
	make
cd ../

--------------

3. jackhmmer (optional)

To re-compile the executables, run the following commands:

cd jackhmm/
	./install
cd ../


--------------

4. blastpgp (optional)

To compile the executables in 'util/', run the following commands:

cd buildali2/source_code/
	make
cd ../../




#==========
# Database:
#==========

1. if databases/ does not exist, create it with

mkdir -p databases/


2. download the UniProt20 database from the following link:

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniprot20_2016_02.tgz

uncompress it, move all files to databases/, and rename these files with the prefix 'uniprot20'.


3. if another version of UniProt20 (or UniClust30) is used, pass the '-d uniprot20_XXXX_YY' option to ./A3M_TGT_Gen.sh

   for example, a newer UniClust30 release can be downloaded from the link below:
       http://wwwuser.gwdg.de/~compbiol/uniclust/2017_10/uniclust30_2017_10_hhsuite.tar.gz

   again, you may rename these files with the prefix 'uniclust30'.


4. to run other packages, such as jackhmm or buildali2, install the databases below:

(a) jackhmm :
    databases/uniref90.fasta

    this database can be downloaded from the following link:
        ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
    [note]: you may rename this file (uniref90.fasta) to include the download date (e.g., uniref90_2018_06.fasta),
            and then symlink it to uniref90.fasta

(b) buildali2 :
    databases/nr_databases (must contain nr90 and nr70) 

    these databases can be downloaded from the following link:
        http://raptorx.uchicago.edu/download/
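
The rename steps in this section (a common prefix for the hhsuite database files, and a dated copy plus symbolic link for uniref90) can be scripted. The sketch below is one reading of those steps, not part of TGT_Package: the helper names rename_db_prefix and link_dated_db are hypothetical, and it assumes the archives have already been downloaded and unpacked into the database folder, with the hhsuite files sharing a prefix such as 'uniprot20_2016_02'.

```shell
# Hypothetical helpers, not shipped with TGT_Package.

# Replace a leading prefix on every matching file in a folder,
# e.g. uniprot20_2016_02_a3m.ffdata -> uniprot20_a3m.ffdata (assumed file names).
rename_db_prefix() {
    db_dir="$1"; old="$2"; new="$3"
    for f in "$db_dir/$old"*; do
        [ -e "$f" ] || continue                  # glob matched nothing
        base=${f##*/}                            # strip the directory part
        target="$db_dir/$new${base#"$old"}"      # swap the old prefix for the new one
        [ "$f" = "$target" ] && continue         # already has the wanted name
        mv "$f" "$target"
    done
}

# Keep a dated copy of uniref90.fasta and link the expected name to it.
link_dated_db() {
    db_dir="$1"; stamp="$2"                      # stamp such as 2018_06
    mv "$db_dir/uniref90.fasta" "$db_dir/uniref90_${stamp}.fasta" || return 1
    # relative link, so the whole folder can be moved later
    ln -s "uniref90_${stamp}.fasta" "$db_dir/uniref90.fasta"
}
```

For instance, 'rename_db_prefix databases uniprot20_2016_02 uniprot20' followed by 'link_dated_db databases 2018_06' would apply both steps to databases/.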
    



#=======
# Usage:
#=======


A3M_TGT_Gen v1.00 [Dec-06-2019] 
    Generate A3M and TGT file from a given sequence in FASTA format. 

USAGE:  ./A3M_TGT_Gen.sh <-i input_fasta> [-o out_root] [-c CPU_num] [-m memory] [-h package] [-d database] 
                         [-n iteration] [-e evalue] [-E neff] [-C coverage] [-M min_cut] [-N max_num] 
                         [-A addi_meff] [-V addi_eval] [-D addi_db] [-K remove_tmp] [-f force] [-H home] 
Options:

***** required arguments *****
-i input_fasta  : Query protein sequence in FASTA format. 

***** optional arguments *****
#--| misc parameters
-o out_root     : Output directory, created under the current directory by default. [default = './${input_name}_A3MTGT'] 

-c CPU_num      : Number of processors. [default = 4] 

-m memory       : Maximal allowed memory (for hhsuite2 or hhsuite3 only). [default = 3.0 (G)] 

#--| search engine and database
-h package      : The selected package to generate A3M file. [default = hhsuite2] 
                  users may use other packages: hhsuite3, jackhmm, or buildali2.

-d database     : The selected database for sequence search. [default = uniprot20_2016_02] 
                  users may use other uniprot20 databases to run hhsuite2 or hhsuite3,
                  or use uniref90 for jackhmm, and nr_databases for buildali2.

#--| search strategy
-n iteration    : Maximal number of iterations to run the selected package. [default = 2] 

-e evalue       : E-value cutoff for the selected package. [default = 0.001] 

-E neff         : Neff cutoff for threading purpose (i.e., -C -2). [default = 7] (for hhsuite only) 

-C coverage     : Coverage for hhsuite only. [default = -2 (i.e., NOT use -cov in HHblits)] 
                  if set to -1, then automatically determine coverage value. 
                  if set to any other positive value, then use this -cov in HHblits. 

#--| filter strategy
-M min_cut      : Minimal coverage of sequences in the generated MSA. [default = -1] 
                  -1 indicates that we DON'T perform any filtering. Please set from 50 to 70. 

-N max_num      : Maximal number of sequences in the generated MSA. [default = -1] 
                  -1 indicates that we DON'T perform any filtering. For example, set 20000 here. 

#--| additional A3M
-A addi_meff    : run additional A3M only if the previous ln(meff) is lower than this. [default = -1] 
                  -1 indicates that we DON'T search for additional A3M. 

-V addi_eval    : run additional A3M with a given e-value. [default = 0.001] 

-D addi_db      : run additional A3M using a given database. [default = metaclust50] 

#--| other options
-K remove_tmp   : Remove temporary folder or not. [default = 1 to remove] 

-f force        : If specified, then FORCE overwrite existing files. [default = 0 NOT to] 

***** home relevant **********
-H home         : home directory of TGT_Package.
                  [default = .]
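
The -h and -d options go together: the usage text pairs hhsuite2/hhsuite3 with a uniprot20 database, jackhmm with uniref90, and buildali2 with nr_databases. A small lookup like the one below could keep the two in step when scripting runs. It is only a sketch; the function name default_db_for_package is hypothetical and not part of TGT_Package.

```shell
# Hypothetical helper: print the -d value that the usage text pairs with a -h package.
default_db_for_package() {
    case "$1" in
        hhsuite2|hhsuite3) echo "uniprot20_2016_02" ;;  # documented default for -d
        jackhmm)           echo "uniref90" ;;
        buildali2)         echo "nr_databases" ;;
        *) echo "unknown package: $1" >&2; return 1 ;;
    esac
}
```

It might be used as: pkg=jackhmm; ./A3M_TGT_Gen.sh -i example/1pazA.fasta -h "$pkg" -d "$(default_db_for_package "$pkg")"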



#=================
# Running example:
#=================


#-------------- part I: generate A3M without filtering and metagenomics ------------------#

#-> 1. by default, we run HHblits to generate A3M
./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -d uniprot20

#-> 2. we may also use JackHMM to produce A3M
./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h jackhmm -d uniref90 -n 3

#-> 3. for legacy purposes, we also allow BLAST wrapped in BuildAli2
./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h buildali2 -d nr_databases -n 5
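
To process several queries with the default HHblits setup, the first command above can be wrapped in a loop. This is a minimal sketch under stated assumptions: run_batch and the TGT_GEN variable are hypothetical names (not part of TGT_Package), each query is assumed to be a separate '*.fasta' file in one folder, and the database is assumed to be installed as 'uniprot20'.

```shell
# Hypothetical wrapper, not shipped with TGT_Package.
# TGT_GEN can point at the script when running from another folder.
TGT_GEN="${TGT_GEN:-./A3M_TGT_Gen.sh}"

run_batch() {
    fasta_dir="$1"; out_dir="$2"; cpu="${3:-4}"   # 4 is the documented -c default
    mkdir -p "$out_dir"
    failed=0
    for fasta in "$fasta_dir"/*.fasta; do
        [ -e "$fasta" ] || continue               # no FASTA files found
        name=$(basename "$fasta" .fasta)
        # one output root per query, mirroring '-o 1pazA_out' in the example
        if ! "$TGT_GEN" -c "$cpu" -i "$fasta" -o "$out_dir/${name}_out" -d uniprot20; then
            echo "failed: $name" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                           # non-zero status if any query failed
}
```

For example, 'run_batch example results 12' would run every FASTA file under example/ with 12 CPUs and write the outputs under results/.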


#-------------- part II: generate A3M with filtering and metagenomics --------------------#

./A3M_TGT_Gen.sh -i example/T1001-D1.fasta -d uniprot20 -N 5000 -A 6
#-> [note]: here 5000 is the maximal number of sequences in the generated A3M.
#           6 is ln(meff) threshold to run additional A3M.
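
Because the -A threshold is compared against ln(meff), '-A 6' corresponds to a meff below roughly e^6, i.e. about 403. The sketch below only illustrates that comparison as it is described in the option text. It does not reproduce how the package computes meff, and the function name needs_additional_a3m is hypothetical.

```shell
# Hypothetical illustration of the -A rule: succeed (status 0) when an
# additional A3M search would be triggered for a given meff and threshold.
needs_additional_a3m() {
    meff="$1"; threshold="$2"
    # -1 is the documented "do not search for additional A3M" value
    [ "$threshold" = "-1" ] && return 1
    # awk's log() is the natural logarithm
    awk -v m="$meff" -v t="$threshold" 'BEGIN { exit !(log(m) < t) }'
}
```

With a threshold of 6, a meff of 100 (ln is about 4.6) would trigger the additional search, while a meff of 1000 (ln is about 6.9) would not.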


#-------------- part III: generate a variety of A3Ms with user-defined strategies --------#

./SEQ_to_A3Ms.sh -i example/1pazA.fasta -x '1e-3:3:-1|1:3:-1' -y 'null' -z 'null' -X uniprot20 -c 12
#-> [note]: here we only use the HHblits strategies on uniprot20 database.


#============
# References:
#============

[hhsuite]:
    https://github.com/soedinglab/hh-suite

[jackhmmer]:
    http://eddylab.org/software/hmmer3/3.1b2

[blastpgp]:
    a) legacy version:
        ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/

    b) current version:
        ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.29/
