Based on the PhiSite Promoter Hunter that use both PSSMs and Gibbs Free Energy to find putative promoter sites.
PSSMs are used to find the -35 and -10 sequences where the RNA polymerase holoenzyme binds in order to translate a gene. Gibbs Free Energy is used to find sections in the genome that are thermodynamically predisposed to unzipping (sites where the holoenzyme likely binds to in order to initiate transcription).
The general process by which the program finds promoters is as follows:
-
Generates a motif object for the -35 and -10 sequences using the
Biopython
Library -
Scans a sequence inputted by the user for promoter hunting
-
Computes -35 and -10 PSSM scores across the inputted sequence, applying strict spacer limits between potential -35 and -10 sequences.
- Spacer = # of Base Pairs between the putative -35 and -10 sequence.
-
Proceeds with potential promoters with -35 and -10 scores that are above the left and right thresholds.
- Only hits that pass the threshold barrier will be reported on the output file
-
Computes the Gibbs Free Energy of the area around the -10 sequence
-
Calculates a Final Score by summing the contributions of the PSSMs and Gibbs Free Energy
-
Generates a file of hits for an input sequence, sorted by Final Score
The original PhiSite Promoter Hunter sums the contributions of the PSSM scores and Gibbs Free Energy
by normalizing each value and applying arbitrary coefficients. Although the program still has functionality to permit the user to compute Final Score as the sum of normalized values, computing the Final Score as the summation of Log-likelihood Ratios is suggested. The program provides this functionality by converting Gibbs Free Energy into a Log-Likelihood Ratio using Free Energy distributions. This method of computing final scores is non-arbitrary and as shown in benchmarking.md
, superior in performance.
To read more about the steps above, please take a look at user_manual.md
and settings.md
in the documentation folder
pHunt runs on python 3.9.12 and depends on packages listed in pHunt_env.yaml
. All dependencies can be installed by setting up a conda environment using the following commands:
$ conda env create -f pHunt_env.yaml
To activate the environment:
$ conda activate pHunt_env
To deactivate the environment:
$ conda deactivate
numpy
Biopython
In the anaconda terminal, please change the path to the location of the src
folder in your machine:
cd (absolute path of src folder)
Run python, import the main pHunt
file, and run the go function to run the program:
python
>>>import pHunt
>>>pHunt.go()
pHunt expects a local directory set up the same as this Github directory. This is essential to the program's functioning since the program expects to find certain files in certain subdirectories.
-
data folder: Contains the files to be used and accessed by the program. Site where the
output file
will be created-
Must be in the data folder:
- left motif file
- right motif file
- Input sequences file
-
Must be in the data folder if
mode_fscr
is set to LLR:- promoter_set file
- background_set file
-
Must be in the data folder if
use_GibbsFE
is set to True:- GibbsFE_file file
-
For more information about these files, please check out
settings.md
-
-
src folder: Contains all the python code (as well as the json file) necessary for the program's functionality
-
Must be in the src folder:
- pHunt python file
- GibbsFE python file
- settings.json
-
The pHunt program uses a JSON file format (settings.json
) to collect user input. Below is a sample json file
"mode_fscr": "LLR",
"LLR_specific_parameters": {
"promoter_set": "PromEC_seqs_filtered.fas",
"background_set": "negativeset.csv",
"step_size_genome": 101,
"GibbsFE_n_bins": 30,
"GibbsFE_pseudocounts": 1
},
"use_GibbsFE": true,
"GibbsFE_related_parameters": {
"GibbsFE_file": "GibbsFE.csv",
"GibbsFE_windowsize": 50,
"lerg": 100,
"rerg": 30
},
"input_sequences": "PromEC_seqs_filterededitextended.fas",
"use_GCcont_background": false,
"left_motif": "Eco-35.fas",
"right_motif": "Eco-10.fas",
"use_GCcont_pseudocnt": false,
"non-GCcont_pseudocount": {
"left_motif_pseudocounts": 0.25,
"right_motif_pseudocounts": 0.25
},
"motif_threshold": "patser",
"min_spacer_value": 15,
"max_spacer_value": 18,
"output_information": "Pseudotestgenomefitted2_Diagnostic.csv",
"output_type":"all hits"
To use parameters other than the recommended ones (default at settings.json), edit the settings.json file or create a new json file with the same format. Specifics about what each parameter does and permitted values can be found in settings.md
pHunt produces a csv file in the data folder with hits sorted by Final Score for each sequence inputted by the user. The file can either contain all hits above the threshold for each sequence or only the top hits for each sequence (adjustable in the json file).
In the documentation subfolder you will find:
user_manual.md
: A file explaining what pHunt does and why it does itbenchmarking.md
: A file substantiating the default parameters and exploring the performance differences between certain techniquessettings.md
: A file explaining each of the json's parameters