
Published online 8 February 2019
Nucleic Acids Research, 2019, Vol. 47, No. 6 e36
doi: 10.1093/nar/gkz061

DeepRibo: a neural network for precise gene annotation of prokaryotes by combining ribosome profiling signal and binding site patterns
Jim Clauwaert(1,*), Gerben Menschaert(2,*) and Willem Waegeman(1)

(1) KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium and (2) Biobix, Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium

Downloaded from https://academic.oup.com/nar/article-abstract/47/6/e36/5310036 by Ghent University user on 03 June 2019

Received September 26, 2018; Revised January 02, 2019; Editorial Decision January 23, 2019; Accepted January 30, 2019

ABSTRACT

Annotation of gene expression in prokaryotes often finds itself corrected due to small variations of the annotated gene regions observed between different (sub)-species. It has become apparent that traditional sequence alignment algorithms, used for the curation of genomes, are not able to map the full complexity of the genomic landscape. We present DeepRibo, a novel neural network utilizing features extracted from ribosome profiling information and binding site sequence patterns that proves to be a precise tool for the delineation and annotation of expressed genes in prokaryotes. The neural network combines recurrent memory cells and convolutional layers, adapting the information gained from both the high-throughput ribosome profiling data and ribosome binding translation initiation sequence region into one model. DeepRibo is designed as a single model trained on a variety of ribosome profiling experiments, used for the identification of open reading frames in prokaryotes without a priori knowledge of the translational landscape. Through extensive validation of the model trained on various sets of data, multiple species sequence similarity, mass spectrometry and Edman degradation verified proteins, the effectiveness of DeepRibo is highlighted.

INTRODUCTION

After >20 years of genome sequencing, it has become clear that the genomic diversity in bacteria is much larger than expected, not only between species but also within (1). GenBank for example currently holds over 10 000 genome assemblies for Escherichia coli, one of the prokaryotic model organisms, displaying stunning diversity. The vast number of sequenced prokaryotes across all different phyla makes it impractical to perform genome comparison based on sequence alignments to unravel the genomic complexity (2). Even though sequence alignment is conventionally used, annotation of de novo genes by similarity of properties (i.e. DNA sequence) between previously annotated genes is biased and has been shown to propagate errors from anteceding misannotations (3). The novel prediction tool presented in this article is based only on features extracted from the short DNA sequence covering the ribosome binding site and expression data.

The delineation of the open reading frame (ORF) is an essential element in gene annotation and is mostly performed in silico (4,5). Ribosome profiling (also called ribo-seq) measures mRNA that is associated with ribosomes by sequencing ribosome-protected fragments (6,7). Ribo-seq experimentally enables the ORF delineation, and the technique has already been successfully adopted for prokaryotes (8,9). An important aspect of the ORF delineation is the determination of the Translation Initiation Site (TIS). Here also, specific prediction tools are in place to perform this task (10–12), but these TISs can also be detected by applying a specific antibiotic treatment (e.g. chloramphenicol or tetracycline) preceding the ribo-seq protocol enriching for initiating ribosomes (13). Recently, prediction methods based on machine learning algorithms have been devised to either delineate the ORF (14) or predict the TIS (15) based on a combination of ribosome profiling and sequence features for prokaryotic genomes. A multitude of tools are available for eukaryotic organisms (16–21).

Alternative proteoform usage can also be investigated by specific mass spectrometry protocols measuring N-terminal peptides (22,23). Although the technology is recognized, it suffers from drawbacks (e.g. peptide physical properties and modifications, mass spectrometry measurement range, etc.), limiting the number of detectable N-termini. In order to attain a more comprehensive map of proteoform usage, proteogenomics studies have combined the aforementioned high-throughput sequencing and mass spectrometry infor-

* To whom correspondence should be addressed. Tel: +32 926 45931; Email: [email protected]
Correspondence may also be addressed to Gerben Menschaert. Tel: +32 926 49922; Email: [email protected]

© The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
mation, resulting in more precise ORF and TIS validation and thus genome annotation (24,25).

In this article, we present DeepRibo, a novel neural network implementation applying ribosome profiling data and binding site patterns for the precise annotation of TISs in prokaryotes. The use of artificial neural networks, which have proven to be highly effective in solving complex problems given the availability of sufficient data, is still confined to few applications in the field of bioinformatics. Examples are the use of convolutional neural networks for the prediction of DNA- or RNA-binding with a target protein (26) or precise variant calling on next-generation sequencing (BioRxiv: https://doi.org/10.1101/092890). DeepRibo is an artificial neural network that applies both convolutional neural network (CNN) and recurrent neural network (RNN) architectures in order to process information from the DNA sequence and ribosome profiling signal. Only a short DNA sequence of 30 nucleotides covering the ribosome binding region is processed by the neural network. Predictions are based on features extracted from this region, selected through prior knowledge, and enhanced with features extracted from the ribosome profiling signal.

DeepRibo is trained on a combination of available experiments for different bacteria and has been tested to work equally well on de novo ribo-seq data of bacterial genomes. We managed to successfully train a highly precise model that is able to process ribo-seq data without loss of resolution. We further validated our results with multiple species sequence similarity comparison (27), available mass spectrometry data and translation initiation site annotations (28).

MATERIALS AND METHODS

DeepRibo is trained on data collected from ribosome profiling data. Ribo-seq data has the advantage that it does not map the untranslated regions of the transcribed mRNA. It upholds a high resolution and low background noise, making precise gene annotation possible. In prokaryotes, no splicing of the mRNA occurs, giving rise to more straightforward patterns of the signal along the coding regions as compared to eukaryotes. Conversely, bacterial genes are tightly packed and are frequently overlapping, which impedes a straightforward annotation. In order to detect genomic features, the model is designed to evaluate a set of possible ORFs containing ribo-seq signal, from which the top k ranked probability scores are selected to be expressed genes. The model is furthermore trained on a short DNA sequence covering a 30nt region overlapping with both the Shine-Dalgarno (SD) motif in SD-led genes (up to 20nt upstream of the TIS) and ribosome binding region in leaderless genes (up to 10nt downstream of the TIS). The ribosome binding site has proven to be of major importance in predicting the presence of a TIS (10,29). Sequences no longer than 30nt are considered to prevent DeepRibo from training on intragenic DNA patterns.

Sample selection using the four parameter S-curve

The input (candidate ORF) samples, labeled using the latest genome assemblies of the species, are the collection of all possible ORFs meeting a minimum signal strength. As ribosome profiling changes according to the expression profile of the organism at the time of the experiment, no signal is present along several parts of the genome. Practically, it is not possible to make any predictions about these regions based on the expression data. Before selection of the positive and negative data, all candidate ORFs containing low ribosome profiling signal are therefore not considered when training/evaluating the model. The remaining data is afterwards labeled using the annotations retrieved from the NCBI Reference Sequence (RefSeq) database. The selection of data is based upon two properties of the samples, the coverage and signal read count. The coverage indicates the nucleotide fraction of the candidate ORF at which signal is present. The signal read count, expressed as Reads Per Kilobase Million (RPKM), expresses the amount of reads within the sample as compared to the dataset read count. Since the biggest partition of the considered dataset has zero to low coverage, a more balanced distribution of read count, coverage, and label values is obtained from the filtered input samples. Moreover, the final dataset contains about one-fifth of the input samples as compared to all candidate ORFs present in the collected data.

To determine the minimum cut-off values for coverage and RPKM, a method introduced by Ndah et al. (14) has been applied. The method is based upon threshold dose-response estimation done by Lutz et al. (30). For this, a four parameter S-curve is fitted on the coverage as a function of the RPKM. Only the positive samples are considered when fitting the S-curve. By predicting the lower bend of the fitted S-curve, minimum cut-off values of the signal coverage and RPKM for each dataset are obtained. This point is of importance as it separates the positive samples that can be distinguished from the background noise. This point is defined as the point from which an increase in RPKM within the positively labeled candidate ORFs is correlated to the coverage of the ribo-seq signal in said dataset. Using this technique, it is possible to pool the data from several individual experiments, as the S-curve is fitted on each dataset individually.

To label the samples, the public genome annotations of the referred species are used. Indeed, the assumption is made that DeepRibo, trained on data labeled via sequence alignment, can offer precise predictions by learning from the ribo-seq signal instead of using the full DNA sequences as an input. Although it is expected that the annotated genomes contain errors because of the shortcomings of prevalent but more conservative DNA sequence alignment methods, this behaviour is not mimicked as the model does not learn the DNA sequences of the coding sequences.

Neural network architecture

DeepRibo is a neural network built in PyTorch (31), of which the architecture is presented in Figure 1. It is specifically designed to process two types of data: strings (i.e. DNA sequences) and floats (i.e. ribo-seq signal). The model first processes each type of data in parallel before combining the features created from both inputs into a set of fully-connected layers. The DNA sequence is transformed into a binary image with four channels, a method proposed by Ali-
panahi (26). This image is subsequently processed by two convolutional layers. The first layer transforms the sparse matrix into a dense matrix using four 1 × 1 convolutional kernels. Afterwards, 32 kernels of 1 × 12 convolutions process the data in the second and last convolutional layer. The ribosome profiling data is fed into a double-layered, bidirectional Gated Recurrent Unit (GRU). The gated recurrent unit was selected instead of the long short-term memory cell as it yielded better models and was overall faster to train. Only the final hidden states of the memory cell are retrieved for further processing, making the use of varied length inputs (i.e. candidate ORFs) possible. After each type of data is processed, the output nodes of both networks are concatenated and fed into a fully-connected layer. The final layers of the network consist of three fully connected layers that combine the features of both the Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) to obtain a final prediction. The rectified linear unit is applied as the activation function for each layer but the last. The binary cross entropy is used as the loss function during training.

Figure 1. The architecture of the neural network DeepRibo. For each candidate ORF two types of data are processed and fed into their respective parts of the neural network. The convolutional layers train on a 30 nucleotide DNA sequence ranging from 20 nucleotides upstream to 10 nucleotides downstream of the TIS. The recurrent neural network covers the complete ORF from 50 nucleotides upstream of the start codon, including the SD region, and extending 20 nucleotides downstream of the stop codon. The DNA sequence is first translated into a binary image before being processed by four 1 × 1 and 32 1 × 12 convolutional kernels, respectively. The ribosome profiling data is processed by a double layered bidirectional GRU of 128 hidden nodes. The outputs of both neural networks are flattened and concatenated and fed into three consecutive fully-connected layers of length 1024, 512 and 2.

Dataset construction

Several databases have been included for training, consisting of experiments performed on prokaryotes grown under standard conditions. The experiments cover both Gram-negative (Salmonella typhimurium (14), Escherichia coli (32), Caulobacter crescentus (33)) and Gram-positive bacteria (Bacillus subtilis (34), Mycobacterium smegmatis (35), Staphylococcus aureus (36), Streptomyces coelicolor (37)). The model is trained with the ribosome profiling coverage signal. The S-curve is fitted on each dataset to obtain the minimum required coverage and RPKM signal of the ribosome profiling signal of the samples within each dataset. Table 1 gives an overview of the used datasets, and the amount of samples each contributes to the positive/negative dataset.

To make sure no bias is introduced during the creation of the input data, the first step selects all candidate ORFs of the genome for each of the included ribo-seq datasets. It has been shown that ATG, GTG and TTG are the three nucleotide combinations that almost exclusively make up all start codons in a wide variety of bacteria (38). Therefore, all DNA sequences within the genomes starting with either ATG, GTG or TTG up until a stop codon (TAA, TGA or TAG) are considered candidate ORFs in this study. Since a large number of ORFs exists with lengths too short to be translated into a functional protein, a pseudo-arbitrary cut-off of 30 nucleotides is chosen to be the minimum length of the samples.

The study is built up as follows: the training data is created from six out of the seven available datasets, using the remaining dataset as the test set. A total of seven models have been trained and evaluated for this study, using each of the available datasets as a test set. Furthermore, the performance of two models has been highlighted in this study. In the first set-up we exclude data of S. aureus from the training set. In the second set-up, data from E. coli is excluded from the training set. Both set-ups cover the datasets with both the lowest and highest correlation between RPKM and coverage of the annotated genes. All experiments evaluate the performance of DeepRibo on de novo data (i.e. transfer learning), in accordance with the design goals discussed.

The training data, constructed out of six datasets, is split up in a training set (95%) and validation set (5%). The loss of the validation set is used to determine the point at which training is stopped. Supplementary Figures S3 and S4 visualize the loss of the model on the training, validation and test sets for all evaluated models.

Evaluation and post-processing

To evaluate the model, the Area Under the Precision–Recall Curve (PR AUC) performance measure is used. The labels of the input samples are highly imbalanced due to the experimental set-up. Therefore, a large change in false positives leads to only a small change in the false positive rate. As the eventual use of the model is focused on the prediction of the top k ranked genes, PR AUC is known to be a more informative measure (39). Indeed, measured Area Under Receiver operating characteristic Curve (ROC AUC) values can be high even in cases in which the absolute amount of false positives (heavily) outweighs the absolute amount of true positives.

An important post-processing step of the annotations given by the model is the decision whether or not only one TIS can be annotated for each stop codon. The sequencing depth, reflected by the translation rates of the RNA, varies strongly between different gene regions. Differences in the distribution of probability scores exist between gene regions due to varying RPKM rates. Hence, it occurs that multiple
Table 1. The ribosome profiling datasets used to train and validate DeepRibo

                      Original data                 S-curve selection
Dataset               Negative set   Positive set   Negative set   Positive set
S. typhimurium (14)   432 983        4938           117 301        3586
E. coli (32)          439 895        4144           148 921        3544
C. crescentus (33)    274 390        3855           52 637         2179
M. smegmatis (35)     576 574        6716           148 909        4607
B. subtilis (34)      417 850        4154           91 010         2798
S. coelicolor (37)    547 814        7766           27 421         1342
S. aureus (36)        311 296        2767           21 601         852
Total                 3 000 802      34 340         607 800        18 908

To obtain a more balanced distribution of the labels and RPKM, each dataset has been filtered by applying a minimum threshold on coverage and RPKM. Cut-off values have been determined by estimating the lower bend point of the fitted S-curve.
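
The filtering behind the "S-curve selection" columns of Table 1 can be illustrated with a minimal Python sketch. It follows the definitions of coverage (fraction of ORF positions with signal) and RPKM given in the text, and it assumes that each read is counted at exactly one nucleotide position. The function names, the example read counts and the cut-off values are hypothetical. The per-dataset cut-offs of the study come from the lower bend point of the fitted S-curve, which is not reproduced here.

```python
def coverage(read_counts):
    """Fraction of nucleotide positions of a candidate ORF that carry signal."""
    if not read_counts:
        return 0.0
    return sum(1 for count in read_counts if count > 0) / len(read_counts)


def rpkm(read_counts, total_mapped_reads):
    """Reads Per Kilobase Million, assuming one count per read (see lead-in)."""
    length_kb = len(read_counts) / 1_000
    per_million = total_mapped_reads / 1_000_000
    if length_kb == 0 or per_million == 0:
        return 0.0
    return sum(read_counts) / (length_kb * per_million)


def passes_cutoffs(read_counts, total_mapped_reads, min_coverage, min_rpkm):
    """Keep a candidate ORF only if both minimum thresholds are met."""
    return (coverage(read_counts) >= min_coverage
            and rpkm(read_counts, total_mapped_reads) >= min_rpkm)


# Hypothetical candidate ORFs of 60 nt and a hypothetical dataset size.
well_covered = [0, 2, 1, 0, 3, 4] * 10
sparse = [0] * 55 + [1] * 5
TOTAL_READS = 2_000_000

# Hypothetical cut-offs; the study estimates them separately per dataset.
for name, orf in (("well_covered", well_covered), ("sparse", sparse)):
    kept = passes_cutoffs(orf, TOTAL_READS, min_coverage=0.1, min_rpkm=50.0)
    print(name, round(coverage(orf), 3), round(rpkm(orf, TOTAL_READS), 1), kept)
```

In this toy case the sparsely covered candidate is discarded, which mirrors how the negative sets in Table 1 shrink much more strongly than the positive sets.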

start sites are annotated within one region while not obtain-
ing TISs in another region. To compare the model with the
annotations retrieved from NCBI (that do not support mul-
tiple start sites), focus is given to only the highest predic-
tion probability between two stop codons (single start site
setting). In order to obtain a set of positive predictions, a
threshold on the probability scores has to be set, determin-
ing the annotation of the top k ranked predicted ORFs. In
this study, the threshold for each organism is set in order to
obtain an equal amount of positive predictions as positively
labeled ORFs.
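
The post-processing described above can be sketched in a few lines of Python. This is a minimal illustration, not the DeepRibo implementation: the `Candidate` record, the function names and the example scores are hypothetical. The sketch assumes that candidate ORFs sharing a strand and a stop position compete for a single start site, and that k equals the number of positively labeled ORFs.

```python
from collections import namedtuple

# Hypothetical record for a scored candidate ORF.
Candidate = namedtuple("Candidate", "start stop strand score")


def single_start_site(candidates):
    """Keep only the highest-scoring candidate for each (strand, stop) pair."""
    best = {}
    for cand in candidates:
        key = (cand.strand, cand.stop)
        if key not in best or cand.score > best[key].score:
            best[key] = cand
    return list(best.values())


def top_k(candidates, k):
    """Return the k candidates with the highest probability scores."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]


# Hypothetical model output: two start sites share the stop codon at 400.
scored = [
    Candidate(100, 400, "+", 0.91),
    Candidate(130, 400, "+", 0.40),
    Candidate(500, 800, "+", 0.75),
    Candidate(900, 650, "-", 0.20),
]

n_labeled_positives = 2  # k is set to the number of positively labeled ORFs
predicted = top_k(single_start_site(scored), n_labeled_positives)
for cand in predicted:
    print(cand)
```

Dropping the call to `single_start_site` corresponds to the multiple start site setting, in which several candidates sharing a stop codon may all end up among the top k predictions.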

Multiple sequence comparison based on local alignment

Given the performance measures for each of the models, a more in-depth exploration of the results is made. Assuming the existence of incompletions and mistakes in the annotation files, discrepancies between the annotations made by DeepRibo and the assembly have been investigated using the Basic Local Alignment Search Tool (BLAST) (27). The false positive predictions of the model are compared to a database containing a collection of proteins that have been previously discussed in literature, forming a good criterion to evaluate the existence of the predicted ORF. A query of the false positive predictions on ‘the non-redundant protein sequences’ (containing non-redundant sequences from GenBank translations together with sequences from Refseq, PDB, SwissProt, PIR and PRF (40)) has been performed using protein-protein BLAST (pBLAST). A maximum cut-off value of 0.1 for the expect (E) value is taken. The E value gives the expected amount of hits covering a similar alignment given the size of the database. For the sake of clarity, false positive predictions are considered as possible proteoforms or novel proteins, and are thus labeled as such. Specifically, proteoforms constitute false positive predictions with a varying start site compared to the positively labeled ORFs. Novel proteins cover any predicted ORF for which no previous annotation was present.

Figure 2. Bend point estimation on the fitted S-curves of the coverage as a function of the log RPKM for both the E. coli (left) and S. aureus (right) datasets. The positive samples for each dataset (red) are plotted with the predicted (blue) ones for the fitted S-curve. For each dataset, the lower bend point of the fitted curve is estimated using the bent-cable function to obtain the minimum cut-off values.

RESULTS

S-curve estimation for cut-off values filters high-quality from low-quality data

To normalize the total signal counts between multiple datasets, the expression rates of the different experiments are assumed to be equal. Since we are not working with repetitions of the same experiment, no normalization is performed before merging the datasets. However, differences in overall signal strengths between different experiments can be caused either by differences in expression profiles of the organisms, varying growth conditions, or technical variance introduced when performing the study. To filter candidate ORFs with signal strengths indistinguishable from the background noise, minimum cut-off values are estimated for each dataset using the S-curve methodology (30) (Supplementary Figure S1). Interestingly, datasets containing a high amount of low expression values give rise to more stringent cut-off values (e.g. S. aureus). In case of a clear distinction between expressed and non-expressed genes, a relatively low cut-off value is obtained (e.g. E. coli). Therefore, depending on the quality of the data, the amount of samples selected from each dataset can vary greatly. The positive samples and the fitted S-curves for the E. coli and S. aureus datasets are plotted in Figure 2. In the case of an incorrectly annotated dataset, a decreased correlation between the coverage and RPKM of the positive samples is expected, with a shift of the data points towards lower RPKM and coverage values. As these elements create a more gradual fit of the
lower bend point of the S-curve on the data, these estimated cut-off values will be higher.

High performance values for predictions in the context of both single and multiple start codons

For the purpose of evaluating the performance, the test set is filtered to exclude any positively labeled data with low expression rates. As these genes are not being expressed, positive samples with non-existent or low ribo-seq data are filtered out (see Table 1). In parallel with the selection of the training set, minimum cut-off values have been determined using the fitted S-curve. Table 2 shows the performances of all the models on the independent dataset. Even though DeepRibo is trained on a dataset for which a maximum of one positively labeled ORF within two stop codons is present, this is not reflected in the predictions of the model. As genome assemblies are annotated using a maximum of one start codon for each stop codon, ROC AUC and PR AUC scores are overall better including only the highest ranked start site for each stop codon. The performance of the model varies only slightly between the different experimental set-ups. A PR AUC as high as 0.965 and 0.943 on the test set is obtained for S. aureus and E. coli, respectively. Although the existence of multiple start sites within prokaryotes has been confirmed (13), it can be expected that the predictions have shifted distributions between different regions due to a varying ribo-seq signal. However, even when considering the predictions which allow multiple ORFs sharing a stop site, PR AUC scores are as high as 0.874.

DeepRibo combines sequence information and ribosome profiling data

To confirm the ability of the neural network to apply the ribosome profiling signal for its predictions, custom models have been trained on either the sequence of the ribosome binding region (based on the CNN) or the ribosome profiling data (based on the RNN). The architectures of the models are kept similar, except for the loss of the recurrent or convolutional section, in case of the model trained on the DNA sequence and ribo-seq data, respectively. Table 2 lists the performances of both models for each set-up. Figure 3 displays the precision-recall curve for the model using E. coli as a test set. Related plots for each of the other models, each showing similar results, are given by Supplementary Figure S5 through S11. Both approaches prove to be effective at training from their specific data, with ROC AUC values of 0.965 and 0.987 for the RNN and CNN for S. aureus. Overall, the CNN performs better than the RNN, shown by the PR AUC scores between the two architectures. The combination of both neural network partitions brings an improvement to the performances as compared to the individual parts. An increase for the PR AUC score of about seven percent compared to the CNN and 23 percent compared to the RNN shows that the model is able to combine both types of information in a meaningful way.

Figure 3. The precision-recall curves of the different networks on the E. coli dataset. The precision-recall curves are given in case of the multiple start site and the single start site set-up. The full model (full line), combining the RNN and CNN, outperforms both the single CNN (dashed) and RNN (dotted) architecture.

Evaluation of leaderless and SD-led genes

The fraction of genes carrying a Shine-Dalgarno region varies within each phylum. Actinobacteria (M. smegmatis, S. coelicolor) have on average 19.2%, α-Proteobacteria (C. crescentus) 6.3%, γ-Proteobacteria (E. coli, S. typhimurium) 4.5% and Firmicutes (B. subtilis, S. aureus) 4.2% leaderless genes (41). Unlike leaderless genes, SD-led genes are defined by the consensus sequence “AGGAGG”, present 0–20 nt upstream of the TIS. Previous studies revealed no pattern downstream of the TIS of leaderless genes (35). The overall lower performances of M. smegmatis and S. coelicolor suggest a correlation with the fraction of leaderless genes in the genome of the evaluated organisms. In contrast, no correlation between the performances of the CNN on Actinobacteria is observed, showing results competitive with those of other set-ups. A high discrepancy of performances is observed however between the performances of the RNNs, with a PR AUC as low as 0.175 for M. smegmatis. Investigation on the ribo-seq data showed a high fraction of duplicated reads (92%), resulting in the lowest count of unique reads per positively labeled ORF (459.9) of all used datasets. This is more than four times lower than S. coelicolor (1952.0) and C. crescentus (2109.7), and well below S. aureus (8114.3), B. subtilis (8328.6), S. typhimurium (23268.5) and E. coli (26908.0). The high correlation between these counts and the performance of the RNNs underlines the importance of high-quality data. As a result, ribosome profiling from M. abscessus (42) has been evaluated to verify the applicability of DeepRibo on organisms with a higher fraction of leaderless genes. A PR AUC of 0.865 and 0.577 for both the complete and RNN model used to evaluate M. smegmatis was obtained, a score in line with the results of the model on the other organisms. The performance using the full model increases slightly when trained on all seven datasets (PR AUC: 0.898). The performance is slightly reduced for the RNN model (0.569), indicating the negative impact of the lower quality ribosome profiling data of M. smegmatis. Shell et al. (35) proposed a re-annotation of 150 genes for M. smegmatis. 30 out of 116 re-annotated ORFs present in the test set are present in the annotations given by DeepRibo (top 4607 predictions).

Comparison of DeepRibo with REPARATION

REPARATION (14) is the only existing tool that performs a similar task for prokaryotes. However, REPARA-
Table 2. The ROC AUC and PR AUC performance values for the different experimental set-ups in which the listed dataset is used as the test set

                Gram-negative                                   Gram-positive
Metric   Model  S. typhimurium  E. coli         C. crescentus   M. smegmatis    B. subtilis     S. coelicolor   S. aureus
                MS      SS      MS      SS      MS      SS      MS      SS      MS      SS      MS      SS      MS      SS
ROC AUC  Full   0.983   0.991   0.991   0.995   0.971   0.973   0.930   0.956   0.985   0.993   0.973   0.966   0.983   0.995
         CNN    0.943   0.962   0.969   0.976   0.918   0.946   0.877   0.929   0.956   0.974   0.935   0.949   0.969   0.987
         RNN    0.939   0.980   0.934   0.980   0.923   0.958   0.809   0.854   0.942   0.982   0.907   0.913   0.933   0.965
PR AUC   Full   0.804   0.910   0.860   0.943   0.710   0.842   0.522   0.717   0.796   0.922   0.777   0.863   0.874   0.965
         CNN    0.574   0.706   0.640   0.763   0.562   0.730   0.419   0.627   0.639   0.779   0.622   0.760   0.812   0.910
         RNN    0.533   0.777   0.531   0.812   0.576   0.781   0.114   0.175   0.508   0.768   0.478   0.637   0.485   0.707
ROC AUC  REP    -       0.916   -       0.916   -       0.838   -       0.821   -       0.933   -       0.838   -       0.944
PR AUC   REP    -       0.735   -       0.799   -       0.344   -       0.285   -       0.889   -       0.272   -       0.910

The performance metrics are given in case multiple start sites are considered possible (MS) and in case each stop codon can only have a single predicted start site (SS). Performances of DeepRibo using either the DNA sequences as input (CNN) or ribo-seq data (RNN) highlight the improved performance if both features are combined in one model (Full). The performances of REPARATION (REP) are furthermore given. Note that these models are both trained and evaluated on the listed dataset using cross-validation.
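
The gap between the ROC AUC and PR AUC rows of Table 2 follows from the label imbalance discussed in the evaluation section. The Python sketch below reproduces that effect on synthetic data; it is an illustration only. ROC AUC is computed as the fraction of positive/negative pairs ranked correctly, and average precision is used as a step-wise estimate of PR AUC. The choice of estimator, the function names and the toy scores are assumptions, as the text does not state how the areas were computed.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic, heavily imbalanced toy set: 10 positives and 990 negatives.
# Only 20 negatives outrank the positives, yet they outnumber them 2 to 1.
labels = [0] * 20 + [1] * 10 + [0] * 970
scores = [0.9] * 20 + [0.8] * 10 + [0.1] * 970

print("ROC AUC:", round(roc_auc(labels, scores), 3))
print("Average precision:", round(average_precision(labels, scores), 3))
```

On this toy set the ROC AUC stays near 0.98 while the average precision drops to roughly 0.21, because 20 false positives are a small fraction of the negatives but a large fraction of the top-ranked predictions.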

TION follows a different approach on certain key aspects. A positive set is created by comparative genomics using all candidate ORFs (given the start codons ATG, GTG or TTG) in the target genome. The negative set is assembled out of the set of all possible ORFs with the start codon CTG. Specifically, for each set of ORFs sharing the same in-frame stop codon, the longest sequence is taken. REPARATION applies Random Forests to distinguish the set of ORFs matched through comparative genomics (ATG, GTG, TTG) from the subset of all ORFs with the start codon CTG (negative set). In comparison, the negative set in our approach is assembled out of all possible ORFs not positively labeled by the assembly file, ignoring ORFs with the start codon CTG for both the positive and negative set. Therefore, DeepRibo handles a higher fraction of negatively labeled data, with no bias (start codon, length) existing between the positive and negative set. It can therefore be stated that DeepRibo handles a more complex problem. DeepRibo outperforms REPARATION on all seven datasets (Table 2), showing more robust performances as compared to REPARATION. However, the comparison should be interpreted with the knowledge that both tools perform a different function. It should furthermore be noted that performances evaluated by REPARATION are also correlated to the quality of the different experiments, with performances returned on M. smegmatis, S. coelicolor and C. crescentus being unexpectedly low. REPARATION appears to be more sensitive to the quality of the ribo-seq data as compared to DeepRibo. DeepRibo offers several more advantages: (i) no resolution loss of the input experimental data, (ii) no limit on the number of datasets a single model can be trained on and (iii) applicability of a pre-trained model by the user. Also, (iv) performances have been evaluated on independent test sets (as compared to using cross-validation for each experiment).

Edman degradation assisted validation of predictions

Through sequencing of the N-terminal residues of the matured proteome using Edman degradation, the creation of certain proteins within a cell can be verified. A collection of 922 proteins within E. coli K-12, featuring all the verified proteins discussed in literature, is featured by Ecogene (28). Of the 922 proteins, a total of 838 ORFs are expressed within the E. coli dataset, determined using the S-curve methodology. The positive predictions are composed of the top 3544 predictions, using the single start site setting, in accordance with previous methods. 744 (88.8%) of the genes have been predicted correctly by the model. 23 (2.7%) verified proteins have TISs differing from the annotation, resulting in 815 proteins for which the annotation and verified protein set agree. None of the predicted TISs in agreement with the verified proteins were in disagreement with the labeled dataset. 71 out of 815 (8.7%) TISs present in the annotations and Ecogene dataset are not picked up by the model. More importantly, 28 out of the 71 (39.4%) false negatives are actually present in the top 4400 ranked predictions. Due to the annotation of novel ORFs by DeepRibo, some of the positively labeled input samples are bound to be excluded from the pool of 3544 positive predictions. This means that only 43 out of 815 (5.27%) are false negatives with a predicted TIS up- or downstream of the labeled gene.

N-terminal proteomics based validation of predictions

Next to the Edman sequencing (Ecogene dataset), mass spectrometry based proteomics can serve to validate annotations made by DeepRibo. N-terminal proteomics, more specifically, is a technology that enables us to detect N-terminal peptides compliant with the rules of initiator methionine processing. 781 such N-termini were previously determined for E. coli (14). 721 N-terminal peptide sequences that are aligned with coding sequences are expressed and are therefore present in the test set. 659 out of 721 samples (91.4%) are in accordance with the annotation. 64 (9.7%) of these are not predicted by the model, of which 34 have differing TISs and 30 fell out of the top 3544 predictions. Interestingly, of the 62 peptide sequences that indicate a TIS in disagreement with the RefSeq annotation, 11 have been predicted by DeepRibo. Although the presence of a TIS at a site differing from the annotation can be suggested as indicated by the ribosome profiling data, this is tangible proof that the annotation is not watertight, negatively influencing the performance measure of the model. Figure 4 gives an overview of the overlap between the two validation datasets with the NCBI annotations and predictions.

Figure 4. Venn diagram displaying the distributions of the proteins verified by Edman sequencing (left) and mass spectrometry (right) within the annotations provided by DeepRibo and the NCBI RefSeq database (labels). Distributions only include expressed ORFs, determined using the S-curve methodology.

Figure 5. E value distributions for the pBLAST results on newly predicted proteins (left) and proteoforms (right) for the different datasets. The E values are given for the best hit (if existent) for each of the false positives. The dashed line indicates the E value of 1.

Multiple sequence alignment of the false positive predictions

Multiple proteoforms exist for a large number of the annotated proteins. Yet, only one variety of each protein has been annotated in the genome assembly. Biological variation, growth condition or growth phase are some of the factors influencing protein expression rates. Accordingly, variety in protein expression between different experiments creates variation from the annotated genome. pBLAST searches have been performed to investigate whether false positive predictions could be caused by expressed proteoforms not present in the annotation. A summary is created by simply taking the best aligned protein for each of the false positive predictions. pBLAST searches have been performed on the complete set of false positives for S. aureus and E. coli. Comparing the annotations curated by DeepRibo with the mass spectrometry and Edman sequencing datasets resulted in 34 and 42 ORFs with differing TISs as proposed by the model. These two sets of alternative annotations by the model have furthermore been included for sequence similarity comparison. Table 3 gives an overview of the results. As expected, all proteoforms have been successfully aligned, given they are partly identical to the annotated gene. As many as 73 out of 79 (92.4%) and 198 out of 232 (85.3%) annotated proteoforms for S. aureus and E. coli have been fully aligned with a protein in the databank, having a shared TIS and stop site. Of all novel proteins annotated by DeepRibo, more than half have a match that is fully aligned, summing up to a total of 15 out of 25 (60%) and 137 out of 258 (53.1%) protein sequences for S. aureus and E. coli. Interestingly, a considerable percentage of the novel proteins are described as ‘hypothetical’. The model predictions that annotated a differing TIS as compared to the MS and Ecogene dataset mostly indicate perfect alignment with proteins present in the non-redundant database, with 28 out of 34 (82.4%) and 36 out of 43 (83.7%) matches, respectively. Figure 5 gives the spread of the E values for each of the aligned proteins. A complete list of the false positive and false negative predictions for E. coli and S. aureus, including the two validation sets and the BLAST results, is provided in Supplementary File 2.

DISCUSSION

The success of deep learning methods on popular topics involving big data is slowly finding its way to the field of bioinformatics involving multi-omics. Although big data created by high-throughput methods has been available since the arrival of second generation sequencing, it has so far mainly been explored using statistical methods, excluding machine learning. Deep learning has proven to be considerably successful, allowing the use of a black box approach when the interpretation of the features is not desirable or feasible. In this study, we present a deep neural network for the precise annotation of expressed proteins on the genome using ribosome profiling data. This tool uses data from in vivo expression profiles to annotate the genome without the use of comparative sequence alignment. DeepRibo learns from information contained in both DNA sequences and ribo-seq, using a novel architecture that combines both convolutional layers and recurrent memory cells. Results obtained from machine learning models, which are trained and evaluated on the same dataset, can be overestimates of their performance on new data due to overfitting. The use of a single model trained on a variety of existing datasets and evaluated on independent test sets does away with this problem. Moreover, building the model on a combination of datasets trains it to differentiate between useful features present over all the datasets and dataset-dependent variations, making the need for normalization steps redundant. DeepRibo is the first tool for the precise delineation of ORFs in prokaryotes trained and validated on multiple datasets. It furthermore outperforms REPARATION on all datasets tested.

When evaluating the results of DeepRibo, a certain cut-off has to be determined to separate the positive predictions from the negative. To evaluate the model, the number of positive ORFs has been set equal to the number of ORFs present in the annotations. However, due to novel predictions being made, a fraction of the annotated samples are bound to have a rank lower than the top k predictions (especially in a multiple start site setting). This is furthermore reflected by the fraction of proteins in the validation sets not picked up by the top k predictions of the model. No cut-off is optimal for every instance and has to be determined in line with the application of the tool, which dictates the desired precision/recall.
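The top k cut-off discussed above can be made concrete with a small sketch. The scores and labels below are invented toy values, and `top_k_metrics` is a hypothetical helper written for this illustration; it is not part of DeepRibo. Setting k equal to the number of annotated (positive) ORFs, as done for the evaluation, makes precision and recall coincide, while a larger k trades precision for recall.

```python
def top_k_metrics(scores, labels, k):
    """Precision and recall when the k highest-scoring ORFs are called positive.

    scores: model outputs, higher means more likely translated.
    labels: 1 for annotated (positive) ORFs, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:k])
    precision = true_positives / k
    recall = true_positives / sum(labels)
    return precision, recall


# Toy values, purely illustrative.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = [1, 0, 1, 1, 0, 0]

# Cut-off used for evaluation: k equals the number of annotated ORFs.
k = sum(labels)
precision, recall = top_k_metrics(scores, labels, k)

# A more permissive cut-off recovers the annotated ORF ranked just below k.
precision_wide, recall_wide = top_k_metrics(scores, labels, k + 1)
```

In this toy case the second-ranked sample plays the role of a novel prediction that pushes one annotated ORF out of the top k.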

Table 3. Results of the BLAST search on the false positive set of E. coli and S. aureus, and specifically on the false positives in disagreement with the
annotation of the Mass Spectrometry (MS) and Edman sequencing (Ecogene) dataset
Set-up     Type            #    Aligned (total)  Aligned (TIS)  Aligned (TIS + stop)  Described as hypothetical
S. aureus  Proteoforms     79   79               77             73                    12
           Novel proteins  25   19               17             15                    6
E. coli    Proteoforms     232  232              217            198                   39
           Novel proteins  258  204              157            137                   106
MS         Proteoforms     34   34               22             28                    1
Ecogene    Proteoforms     43   43               40             36                    1

These predictions can be divided into proteoforms, which have a TIS that is either up- or downstream of the annotated ORF, and novel proteins, which are ORFs with a non-annotated stop site. A BLAST search of these proteins was performed against the non-redundant protein database, with a maximum cut-off of 0.1 for the E value. The total number of false positives is given for each type. Taking only the best aligned protein (i.e. the lowest E value) for each false positive, the total number of matches aligned by start site, or by both start and stop site, is given. Finally, the total number of proteins described as ‘hypothetical’ is given.
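The hit selection described in the note above (an E value cut-off of 0.1, then one best hit per false positive) can be sketched as follows. The sketch assumes that "best aligned" means the smallest E value, since smaller E values indicate more significant BLAST hits. The `(query, subject, e_value)` tuple layout and all identifiers are hypothetical and do not reflect the authors' actual pipeline.

```python
def best_hits(hits, max_e_value=0.1):
    """Return the most significant hit per query among hits passing the cut-off.

    hits: iterable of (query_id, subject_id, e_value) tuples
    (a hypothetical layout chosen for this sketch).
    Returns a dict mapping query_id -> (subject_id, e_value).
    """
    best = {}
    for query, subject, e_value in hits:
        if e_value > max_e_value:
            continue  # hit fails the E value cut-off
        # Keep the hit with the smallest E value seen so far for this query.
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return best


# Toy hits: orf_2 has no hit below the cut-off and is therefore left unaligned.
hits = [
    ("orf_1", "prot_A", 1e-30),
    ("orf_1", "prot_B", 1e-5),
    ("orf_2", "prot_C", 0.5),
    ("orf_3", "prot_D", 0.05),
]
selected = best_hits(hits)
```

Queries without any hit below the cut-off simply do not appear in the result, which corresponds to false positives counted as not aligned.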

The performance of DeepRibo is consistent on all seven test sets, reaching PR AUC scores of >0.90 for four datasets. No difference is observed on the performance between gram-positive and gram-negative bacteria. Even though a relatively low performance was returned for M. smegmatis, evaluation of the dataset and performances returned for M. abscessus, another member of the Mycobacterium family, showed no relation with the fraction of leaderless genes present. Instead, the importance of the quality of the ribosome profiling experiment is highlighted, with unique reads per positively labeled ORFs showing correlation to the performance of the individual RNNs. Although the absolute number of reads mapped to the genome is sufficient, a high level of duplication, and therefore lower number of unique reads, results in a lower resolution of the ribosome profiling. To guarantee the quality of the ribosome profiling experiment several tools are available: FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) for the evaluation of reads and mQC (https://github.com/Biobix/mQC) for the evaluation of mapped reads.

Since the majority of the candidate ORFs share their stop sites with other samples, selecting only the ORF with the highest predicted probability within each group gives consistently better performances. Even though an increase in performance is observable when comparing the single start site with the multiple start site setting, the performance of the latter is still noteworthy. Specifically, 105 595 out of the 113 228 (89.5%) candidate ORFs share stop sites with other samples in the E. coli dataset. Some regions have as many as one hundred possible TISs. Although the model has no way of processing this information, making a prediction on every sample individually, remarkable PR AUC scores are achieved on the test sets (MS setting), ranging from 0.710 to 0.874 (excluding M. smegmatis). Part of this error is expected to be caused by differences in RPKM values existent between different genome regions. Yet, the models’ performances indicate this effect to be minimal. Moreover, recent studies have discovered genes with multiple translation initiation sites (13,15,44). As this feature is not supported by the annotations, correct evaluation of the model in a multiple start site setting is currently infeasible.

Many prokaryotic systems have closely knit operon structures (45), creating a ribo-seq signal that can overlap different regions of interest. Several false positive annotations made by DeepRibo are situated in these regions (Figure 6D). The inclusion of a padded region around the ribosome profiling signal processed by the RNN has previously increased the resulting performance.

Figure 6. DeepRibo example annotations displayed alongside the ribo-seq input signal and RefSeq annotations. The data is formatted using the GWIPS-viz browser (43) and is hosted publicly (see Supplementary Data). On every track is displayed (from top to bottom): ribo-seq signal (sense: orange, antisense: blue), TISs of all ORF samples present in the test set, annotations predicted by DeepRibo not in agreement with the RefSeq assembly (Predicted ORF) and the RefSeq genome annotations used to label the data (Labeled ORF). (A) The highest ranking proteoform prediction (gene: PqqL, rank: 231) for E. coli. (B) The highest ranking proteoform prediction (gene: UbiE, rank: 131) for S. aureus. (C) The highest ranking novel protein for E. coli with no pBLAST alignments (rank: 1302). (D) An example of a predicted proteoform in a region with overlapping genes (gene: ybhF, rank: 941).

The applied antibiotic in the ribosome profiling experiment is known to influence the resulting signal. In this study, all experiments apply chloramphenicol treatment, with the exception of S. coelicolor, which applies thiostrepton. As the overall lower score of S. coelicolor might be related to its lower count of unique reads, it is uncertain whether the use of thiostrepton influences the performance of DeepRibo. Although this effect seems to be minimal, the effect of different antibiotic treatments on DeepRibo needs to be further investigated. New antibiotic treatments can also offer improvements to the model’s performance. Meydan et al. (46) discuss the use of the antibiotic retapamulin that increases the resolution of the ribo-seq signal. The increased resolution offered by retapamulin might thereby improve the resulting annotations, especially for regions containing overlapping genes.

In case of the E. coli model, many of the novel predictions are situated within a pseudogene. Typically, no candidate ORFs overlapping the complete pseudogene regions were present in the training/testing samples, as these annotated features cover regions with multiple stop codons. Therefore, no positively labeled samples are present. However, ribo-seq signal is often measured at these sites, creating a hot-spot for novel (false positive) predictions.

The identification of a high number of novel small open reading frames (sORFs) by the model presents another contrast with the annotation. The novel ORF predictions given by the models have a median length of 270 and 63 nucleotides for E. coli and S. aureus. These are well below the median length of the annotated genes within each species (827 and 723). The size of the ORFs influences the power of the statistical methods used for the identification of the sORFs by in silico methods (47) applying sequence alignment. It is hypothesized that a higher number of sORFs is present in the genome than are currently present in the annotation. DeepRibo annotates a higher number of sORFs in comparison to the number present in the annotation. Specifically, VanOrsdel et al. recently proposed 32 new sORFs for E. coli (48). Of the 21 sORFs present in the dataset, 5 were included in the annotation (top 3544) presented by DeepRibo. In comparison, only one of the proposed sORFs was actually present in the annotations from RefSeq. An example of a novel sORF for E. coli is given in Figure 6C. A distribution of the lengths of the ORFs for both the positive annotations and novel predictions is shown in Supplementary Figure S12.

Corroborated by the results obtained from the pBLAST searches, it is likely that a fraction of the false positives observed when evaluating the predictions of the single start site setting are due to an annotation that does not fully map the translational complexity of the organisms, subsequently negatively influencing the performance of DeepRibo. This is especially expected for prokaryotes which are less well known, such as S. coelicolor, C. crescentus, M. smegmatis and M. abscessus. In further support, detailed evaluation of the predictions with the ribo-seq signal shows that many false positive results are backed by manual evaluation (Figure 6 A/B/C).

DeepRibo has proven to be a tool with a novel approach and high performance. The model enables the discovery of multiple ORFs sharing a single stop codon, small ORFs or ORFs situated in pseudogenic regions. Training DeepRibo is not bound by any number of datasets; it distinguishes useful features shared between datasets and can be further improved as more data is added. In contrast with sequence alignment methods, the ribosome profiling signal offers actual proof for the annotation of ORFs. The exclusion of DNA sequences from genes ensures the model is not biased towards gene patterns it is trained on. In conclusion, DeepRibo has been shown to be a viable tool for the annotation of the genome without the use of gene similarity algorithms, and can be applied to aid the discovery of translated proteins in prokaryotes.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors acknowledge the support of Ghent University. Special thanks to Dr Audrey Mannion-Michel for her work on GWIPS-viz and her help with the ribosome profiling data.

FUNDING

Special Research Fund [BOF24j2016001002 to P.R.] from Ghent University; Research Foundation-Flanders (FWO-Vlaanderen) Postdoctoral Fellowship (to G.M.). Funding for open access charge: Bijzonder Onderzoeksfonds [BOF24j2016001002].
Conflict of interest statement. None declared.

REFERENCES

1. Land,M., Hauser,L., Jun,S.-R., Nookaew,I., Leuze,M.R., Ahn,T.-H., Karpinets,T., Lund,O., Kora,G., Wassenaar,T. et al. (2015) Insights from 20 years of bacterial genome sequencing. Funct. Integrative Genomics, 15, 141–161.
2. Richardson,E.J. and Watson,M. (2013) The automatic annotation of bacterial genomes. Brief. Bioinformatics, 14, 1–12.
3. Fields,A.P., Rodriguez,E.H., Jovanovic,M., Stern-Ginossar,N., Haas,B.J., Mertins,P., Raychowdhury,R., Hacohen,N., Carr,S.A., Ingolia,N.T. et al. (2015) A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation. Mol. Cell, 60, 816–827.
4. Delcher,A. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.
5. Hyatt,D., Chen,G.L., LoCascio,P.F., Land,M.L., Larimer,F.W. and Hauser,L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.
6. Ingolia,N.T., Ghaemmaghami,S., Newman,J.R. and Weissman,J.S. (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science, 324, 218–223.
7. Ingolia,N.T., Lareau,L.F. and Weissman,J.S. (2011) Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell, 147, 789–802.
8. O’Connor,P.B.F., Li,G.W., Weissman,J.S., Atkins,J.F. and Baranov,P.V. (2013) RRNA:mRNA pairing alters the length and the symmetry of mRNA-protected fragments in ribosome profiling experiments. Bioinformatics, 29, 1488–1491.
9. Mohammad,F., Woolstenhulme,C.J., Green,R. and Buskirk,A.R. (2016) Clarifying the translational pausing landscape in bacteria by ribosome profiling. Cell Rep., 14, 686–694.
10. Tech,M., Morgenstern,B. and Meinicke,P. (2006) TICO: a tool for postprocessing the predictions of prokaryotic translation initiation sites. Nucleic Acids Res., 34, W588–W590.
11. Ou,H.Y., Guo,F.B. and Zhang,C.T. (2004) GS-Finder: a program to find bacterial gene start sites with a self-training method. Int. J. Biochem. Cell Biol., 36, 535–544.
12. Zhu,H.-Q., Hu,G.-Q., Ouyang,Z.-Q., Wang,J. and She,Z.-S. (2004) Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics, 20, 3308–3317.
13. Nakahigashi,K., Takai,Y., Kimura,M., Abe,N., Nakayashiki,T., Shiwa,Y., Yoshikawa,H., Wanner,B.L., Ishihama,Y. and Mori,H. (2016) Comprehensive identification of translation start sites by tetracycline-inhibited ribosome profiling. DNA Res., 23, 193–201.
14. Ndah,E., Jonckheere,V., Giess,A., Valen,E., Menschaert,G. and Van Damme,P. (2017) REPARATION: ribosome profiling assisted (re-)annotation of bacterial genomes. Nucleic Acids Res., 45, e168.
15. Giess,A., Jonckheere,V., Ndah,E., Chyzyńska,K., Van Damme,P. and Valen,E. (2017) Ribosome signatures aid bacterial translation initiation site identification. BMC Biol., 15, e76.

16. Lee,S., Liu,B., Lee,S., Huang,S.-X., Shen,B. and Qian,S.-B. (2012) Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl. Acad. Sci. U.S.A., 109, E2424–E2432.
17. Crappé,J., Ndah,E., Koch,A., Steyaert,S., Gawron,D., De Keulenaer,S., De Meester,E., De Meyer,T., Van Criekinge,W., Van Damme,P. et al. (2015) PROTEOFORMER: Deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res., 43, e29.
18. Bazzini,A.A., Johnstone,T.G., Christiano,R., MacKowiak,S.D., Obermayer,B., Fleming,E.S., Vejnar,C.E., Lee,M.T., Rajewsky,N., Walther,T.C. et al. (2014) Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J., 33, 981–993.
19. Chew,G.-L., Pauli,A., Rinn,J.L., Regev,A., Schier,A.F. and Valen,E. (2013) Ribosome profiling reveals resemblance between long non-coding RNAs and 5’ leaders of coding RNAs. Development, 140, 2828–2834.
20. Xiao,Z., Huang,R., Xing,X., Chen,Y., Deng,H. and Yang,X. (2018) De novo annotation and characterization of the translatome with ribosome profiling data. Nucleic Acids Res., 46, e61.
21. Erhard,F., Halenius,A., Zimmermann,C., L’Hernault,A., Kowalewski,D.J., Weekes,M.P., Stevanovic,S., Zimmer,R. and Dölken,L. (2018) Improved Ribo-seq enables identification of cryptic translation events. Nat. Methods, 15, 363–366.
22. Staes,A., Impens,F., Damme,P.V., Ruttens,B., Goethals,M., Demol,H., Timmerman,E., Vandekerckhove,J. and Gevaert,K. (2011) Selecting protein n-terminal peptides by combined fractional diagonal chromatography. Nat. Protocols, 6, 1130–1141.
23. Berry,I.J., Steele,J.R., Padula,M.P. and Djordjevic,S.P. (2016) The application of terminomics for the identification of protein start sites and proteoforms in bacteria. PROTEOMICS, 16, 257–272.
24. Hartmann,E.M. and Armengaud,J. (2014) N-terminomics and proteogenomics, getting off to a good start. PROTEOMICS, 14, 2637–2646.
25. Van Damme,P., Gawron,D., Van Criekinge,W. and Menschaert,G. (2014) N-terminal proteomics and ribosome profiling provide a comprehensive view of the alternative translation initiation landscape in mice and men. Mol. Cell. Proteomics, 13, 1245–1261.
26. Alipanahi,B., Delong,A., Weirauch,M.T. and Frey,B.J. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33, 831–838.
27. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
28. Zhou,J. and Rudd,K.E. (2013) EcoGene 3.0. Nucleic Acids Res., 41, D613–D624.
29. Zhu,H., Hu,G.-Q., Yang,Y.-F., Wang,J. and She,Z.-S. (2007) MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinformatics, 8, 97.
30. Lutz,R.W., Stahel,W.A. and Lutz,W.K. (2002) Statistical procedures to test for linearity and estimate threshold doses for tumor induction with nonlinear dose-response relationships in bioassays for carcinogenicity. Regul. Toxicol. Pharmacol., 36, 331–337.
31. Paszke,A., Gross,S., Chintala,S., Chanan,G., Yang,E., DeVito,Z., Lin,Z., Desmaison,A., Antiga,L. and Lerer,A. (2017) Automatic differentiation in PyTorch. In NIPS-W.
32. Li,G.W., Burkhardt,D., Gross,C. and Weissman,J.S. (2014) Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources. Cell, 157, 624–635.
33. Schrader,J.M., Zhou,B., Li,G.W., Lasker,K., Childers,W.S., Williams,B., Long,T., Crosson,S., McAdams,H.H., Weissman,J.S. et al. (2014) The coding and noncoding architecture of the Caulobacter crescentus genome. PLoS Genet., 10, e1004463.
34. Li,G.W., Oh,E. and Weissman,J.S. (2012) The anti-Shine-Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature, 484, 538–541.
35. Shell,S.S., Wang,J., Lapierre,P., Mir,M., Chase,M.R., Pyle,M.M., Gawande,R., Ahmad,R., Sarracino,D.A., Ioerger,T.R. et al. (2015) Leaderless Transcripts and Small Proteins Are Common Features of the Mycobacterial Translational Landscape. PLOS Genet., 11, e1005641.
36. Davis,A.R., Gohara,D.W. and Yap,M.-N.F. (2014) Sequence selectivity of macrolide-induced translational attenuation. Proc. Natl. Acad. Sci. U.S.A., 111, 15379–15384.
37. Jeong,Y., Kim,J.N., Kim,M.W., Bucca,G., Cho,S., Yoon,Y.J., Kim,B.G., Roe,J.H., Kim,S.C., Smith,C.P. et al. (2016) The dynamic transcriptional and translational landscape of the model antibiotic producer Streptomyces coelicolor A3(2). Nat. Commun., 7, 11605.
38. Panicker,I.S., Browning,G.F. and Markham,P.F. (2015) The effect of an alternate start codon on heterologous expression of a PhoA fusion protein in mycoplasma gallisepticum. PLoS ONE, 10, e0127911.
39. Davis,J. and Goadrich,M. (2006) The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. ACM ICML ’06, NY, pp. 233–240.
40. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35(Suppl. 1), D61–D65.
41. Zheng,X., Hu,G.-Q., She,Z.-S. and Zhu,H. (2011) Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes. BMC Genomics, 12, 361.
42. Miranda-CasoLuengo,A.A., Staunton,P.M., Dinan,A.M., Lohan,A.J. and Loftus,B.J. (2016) Functional characterization of the Mycobacterium abscessus genome coupled with condition specific transcriptomics reveals conserved molecular strategies for host adaptation and persistence. BMC Genomics, 17, 553.
43. Michel,A.M., Fox,G.M., Kiran,A., De Bo,C., O’Connor,P.B., Heaphy,S.M., Mullan,J.P., Donohue,C.A., Higgins,D.G. and Baranov,P.V. (2014) GWIPS-viz: Development of a ribo-seq genome browser. Nucleic Acids Res., 42, D859–D864.
44. Dai,Y., Shortreed,M.R., Scalf,M., Frey,B.L., Cesnik,A.J., Solntsev,S., Schaffer,L.V. and Smith,L.M. (2017) Elucidating Escherichia coli proteoform families using intact-mass proteomics and a global PTM discovery database. J. Proteome Res., 16, 4156–4165.
45. Pallejà,A., Harrington,E.D. and Bork,P. (2008) Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? BMC Genomics, 9, 335.
46. Meydan,S., Vázquez-Laslop,N. and Mankin,A.S. (2018) Genes within genes in bacterial genomes. Microbiol. Spectrum, 6, doi:10.1128/microbiolspec.RWR-0020-2018.
47. Pauli,A., Valen,E. and Schier,A.F. (2015) Identifying (non-)coding RNAs and small peptides: Challenges and opportunities. BioEssays, 37, 103–112.
48. VanOrsdel,C.E., Kelly,J.P., Burke,B.N., Lein,C.D., Oufiero,C.E., Sanchez,J.F., Wimmers,L.E., Hearn,D.J., Abuikhdair,F.J., Barnhart,K.R. et al. (2018) Identifying new small proteins in Escherichia coli. Proteomics, 18, 1700064.
