doi: 10.1093/nar/gkz061
Received September 26, 2018; Revised January 02, 2019; Editorial Decision January 23, 2019; Accepted January 30, 2019
* To whom correspondence should be addressed. Tel: +32 926 45931; Email: [email protected]
Correspondence may also be addressed to Gerben Menschaert. Tel: +32 926 49922; Email: [email protected]
© The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work
is properly cited. For commercial re-use, please contact [email protected]
Nucleic Acids Research, 2019, Vol. 47, No. 6, e36
mation, resulting in more precise ORF and TIS validation and thus genome annotation (24,25).

In this article, we present DeepRibo, a novel neural network implementation applying ribosome profiling data and binding site patterns for the precise annotation of TISs in prokaryotes. The use of artificial neural networks, which have proven to be highly effective in solving complex problems given the availability of sufficient data, is still confined to few applications in the field of bioinformatics. Examples are the use of convolutional neural networks for the

possible ORFs meeting a minimum signal strength. As ribosome profiling changes according to the expression profile of the organism at the time of the experiment, no signal is present along several parts of the genome. Practically, it is not possible to make any predictions about these regions based on the expression data. Before selection of the positive and negative data, all candidate ORFs containing low ribosome profiling signal are therefore not considered when training/evaluating the model. The remaining data is afterwards labeled using the annotations retrieved from
Table 1. The ribosome profiling datasets used to train and validate DeepRibo
                      Original data                  S-curve selection
Dataset               Negative set   Positive set    Negative set   Positive set
S. typhimurium (14)   432 983        4938            117 301        3586
E. coli (32)          439 895        4144            148 921        3544
C. crescentus (33)    274 390        3855            52 637         2179
M. smegmatis (35)     576 574        6716            148 909        4607
B. subtilis (34)      417 850        4154            91 010         2798
S. coelicolor (37)    547 814        7766            27 421         1342
S. aureus (36)        311 296        2767            21 601         852
To obtain a more balanced distribution of the labels and RPKM, each dataset has been filtered by applying a minimum threshold on coverage and RPKM.
Cut-off values have been determined by estimating the lower bend point of the fitted S-curve.
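
The filtering step described in the note to Table 1 can be illustrated with a minimal sketch. It assumes the standard RPKM definition (reads per kilobase per million mapped reads) and takes coverage to be the fraction of ORF positions with at least one mapped read, which is an assumption, as the text does not define it here. The names `CandidateORF` and `filter_candidates` and the threshold values are hypothetical, and the estimation of the cut-offs from the fitted S-curve is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class CandidateORF:
    name: str
    length_nt: int    # ORF length in nucleotides
    read_count: int   # ribosome profiling reads mapped to the ORF
    covered_nt: int   # positions with at least one mapped read (assumed definition)


def rpkm(orf: CandidateORF, total_mapped_reads: int) -> float:
    """Reads per kilobase of ORF per million mapped reads."""
    return orf.read_count * 1e9 / (orf.length_nt * total_mapped_reads)


def coverage(orf: CandidateORF) -> float:
    """Fraction of the ORF covered by at least one read (assumed definition)."""
    return orf.covered_nt / orf.length_nt


def filter_candidates(orfs, total_mapped_reads, min_rpkm, min_coverage):
    """Keep only candidate ORFs that meet both minimum thresholds."""
    return [
        orf for orf in orfs
        if rpkm(orf, total_mapped_reads) >= min_rpkm
        and coverage(orf) >= min_coverage
    ]


# Toy data; the cut-offs below are placeholders, not values from the study.
candidates = [
    CandidateORF("orf_a", length_nt=1000, read_count=100, covered_nt=800),
    CandidateORF("orf_b", length_nt=300, read_count=1, covered_nt=30),
]
kept = filter_candidates(candidates, total_mapped_reads=1_000_000,
                         min_rpkm=10.0, min_coverage=0.5)
```

In this toy example only `orf_a` passes both cut-offs; candidates removed this way are excluded from both the positive and the negative set.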
start sites are annotated within one region while not obtaining TISs in another region. To compare the model with the annotations retrieved from NCBI (which do not support multiple start sites), only the highest prediction probability between two stop codons is considered (single start site setting). To obtain a set of positive predictions, a threshold on the probability scores has to be set, which determines the top k ranked predicted ORFs that are annotated. In this study, the threshold for each organism is set such that the number of positive predictions equals the number of positively labeled ORFs.
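
The selection procedure described above can be sketched as follows. This is a minimal illustration, not the DeepRibo implementation: predictions are assumed to be available as `(stop_position, start_position, probability)` tuples, and the function names are hypothetical.

```python
def single_start_site(predictions):
    """Keep, for every stop codon, only the start site with the highest probability.

    predictions: iterable of (stop_position, start_position, probability).
    """
    best = {}
    for stop, start, prob in predictions:
        if stop not in best or prob > best[stop][1]:
            best[stop] = (start, prob)
    return [(stop, start, prob) for stop, (start, prob) in best.items()]


def top_k_positive(predictions, k):
    """Rank predictions by probability and label the top k as positive."""
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return ranked[:k]


# Toy example with three stop codons and competing start sites.
raw = [
    (900, 100, 0.95), (900, 160, 0.40),   # two candidate TISs, same stop codon
    (2400, 1800, 0.20),
    (5100, 4200, 0.85), (5100, 4230, 0.90),
]
n_positive_labels = 2  # k equals the number of positively labeled ORFs
positives = top_k_positive(single_start_site(raw), n_positive_labels)
```

Setting k to the number of positively labeled ORFs means that the probability threshold is implied by the ranking rather than chosen directly.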
Table 2. The ROC AUC and PR AUC performance values for the different experimental set-ups in which the listed dataset is used as the test set
                Gram-negative  Gram-positive
                MS     SS     MS     SS     MS     SS     MS     SS     MS     SS     MS     SS     MS     SS
ROC AUC  Full   0.983  0.991  0.991  0.995  0.971  0.973  0.930  0.956  0.985  0.993  0.973  0.966  0.983  0.995
         CNN    0.943  0.962  0.969  0.976  0.918  0.946  0.877  0.929  0.956  0.974  0.935  0.949  0.969  0.987
         RNN    0.939  0.980  0.934  0.980  0.923  0.958  0.809  0.854  0.942  0.982  0.907  0.913  0.933  0.965
PR AUC   Full   0.804  0.910  0.860  0.943  0.710  0.842  0.522  0.717  0.796  0.922  0.777  0.863  0.874  0.965
         CNN    0.574  0.706  0.640  0.763  0.562  0.730  0.419  0.627  0.639  0.779  0.622  0.760  0.812  0.910
The performance metrics are given for the case in which multiple start sites are considered possible (MS) and for the case in which each stop codon can have only a single predicted start site (SS). The performance of DeepRibo using either only the DNA sequences (CNN) or only the ribo-seq data (RNN) as input highlights the improvement obtained when both features are combined in one model (Full). The performances of REPARATION (REP) are furthermore given. Note that these REPARATION models are both trained and evaluated on the listed dataset using cross-validation.
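
For readers less familiar with the two metrics reported in Table 2, the sketch below computes them from labels and prediction scores. It is an illustration only: the ROC AUC is computed through its pairwise-ranking interpretation, and the PR AUC is approximated by average precision, which is one common estimator and an assumption here, since the estimator used in the study is not stated in this section. The pairwise loop is quadratic and intended for small examples only.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of every positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


# Toy example: two positives and two negatives.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_roc = roc_auc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

Because the negative sets in Table 1 are far larger than the positive sets, precision-based measures respond much more strongly to false positives than the ROC AUC does, which is consistent with the larger spread of PR AUC values in Table 2.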
TION follows a different approach on certain key aspects. A positive set is created by comparative genomics using all candidate ORFs (given the start codons ATG, GTG or TTG) in the target genome. The negative set is assembled out of the set of all possible ORFs with the start codon CTG. Specifically, for each set of ORFs sharing the in-frame stop codon, the longest sequence is taken. REPARATION applies Random Forests to distinguish the set of ORFs matched through comparative genomics (ATG, GTG, TTG) from the subset of all ORFs with the start codon CTG (negative set). In comparison, the negative set in our approach is assembled out of all possible ORFs not positively labeled by the assembly file, ignoring ORFs with the start codon CTG for both the positive and negative set. Therefore, DeepRibo handles a higher fraction of negatively labeled data, with no bias (start codon, length) existing between the positive and negative set. It can therefore be stated that DeepRibo handles a more complex problem. DeepRibo outperforms REPARATION on all seven datasets (Table 2), showing more robust performances. However, the comparison should be interpreted with the knowledge that the two tools perform different functions. It should furthermore be noted that the performances obtained by REPARATION are also correlated to the quality of the different experiments, with performances on M. smegmatis, S. coelicolor and C. crescentus being unexpectedly low. REPARATION appears to be more sensitive to the quality of the ribo-seq data than DeepRibo. DeepRibo offers several more advantages: (i) no resolution loss of the input experimental data, (ii) no limit on the number of datasets a single model can be trained on and (iii) applicability of a pre-trained model by the user. Also, (iv) performances have been evaluated on independent test sets (as compared to using cross-validation for each experiment).

Edman degradation assisted validation of predictions

Through sequencing of the N-terminal residues of the matured proteome using Edman degradation, the creation of certain proteins within a cell can be verified. A collection of 922 proteins within E. coli K-12, featuring all the verified proteins discussed in literature, is featured by Ecogene (28). Of the 922 proteins, 838 ORFs are expressed within the E. coli dataset, as determined using the S-curve methodology. The positive predictions are composed of the top 3544 predictions, using the single start site setting, in accordance with previous methods. 744 (88.8%) of the genes have been predicted correctly by the model. 23 (2.7%) verified proteins have TISs differing from the annotation, resulting in 815 proteins for which the annotation and verified protein set agree. None of the predicted TISs in agreement with the verified proteins were in disagreement with the labeled dataset. 71 out of 815 (8.7%) TISs present in the annotations and Ecogene dataset are not picked up by the model. More importantly, 28 out of the 71 (39.4%) false negatives are actually present in the top 4400 ranked predictions. Due to the annotation of novel ORFs by DeepRibo, some of the positively labeled input samples are bound to be excluded from the pool of 3544 positive predictions. This means that only 43 out of 815 (5.27%) TISs are false negatives for which the model predicts a TIS up- or downstream of the labeled start site.

N-terminal proteomics based validation of predictions

Next to Edman sequencing (Ecogene dataset), mass spectrometry based proteomics can serve to validate annotations made by DeepRibo. N-terminal proteomics, more specifically, is a technology that enables us to detect N-terminal peptides compliant with the rules of initiator methionine processing. 781 such N-termini were previously determined for E. coli (14). 721 N-terminal peptide sequences that are aligned with coding sequences are expressed and are therefore present in the test set. 659 out of 721 samples (91.4%) are in accordance with the annotation. 64 (9.7%) of these are not predicted by the model, of which 34 have differing TISs and 30 fall outside the top 3544 predictions. Interestingly, of the 62 peptide sequences that indicate a TIS in disagreement with the RefSeq annotation, 11 have been predicted by DeepRibo. Although ribosome profiling data can only suggest the presence of a TIS at a site differing from the annotation, this is tangible proof that the annotation is not flawless, which negatively influences the performance measure of the model. Figure 4 gives an
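
The bookkeeping behind the validation counts reported for the Edman and N-terminal proteomics datasets can be sketched as follows. This is a hypothetical illustration rather than the analysis code of the study: each ORF is assumed to be identified by its stop codon position, `predicted` is assumed to contain only the top k positive predictions in the single start site setting, and all names are invented for the example.

```python
from collections import Counter


def tally_validation(verified, annotated, predicted):
    """Compare experimentally verified TISs with the annotation and the predictions.

    Each argument maps a stop codon position to a TIS position.
    Returns a Counter keyed by (annotation status, prediction outcome).
    """
    counts = Counter()
    for stop, verified_tis in verified.items():
        if annotated.get(stop) == verified_tis:
            status = "annotation_agrees"
        else:
            status = "annotation_differs"
        if stop not in predicted:
            outcome = "not_in_top_k"            # ORF missing from the positive predictions
        elif predicted[stop] == verified_tis:
            outcome = "predicted_verified_tis"  # prediction matches the verified TIS
        else:
            outcome = "predicted_other_tis"     # TIS predicted up- or downstream
        counts[(status, outcome)] += 1
    return counts


# Toy example with four verified proteins.
verified = {100: 10, 200: 150, 300: 250, 400: 330}
annotated = {100: 10, 200: 150, 300: 240, 400: 330}
predicted = {100: 10, 200: 160, 300: 250}
result = tally_validation(verified, annotated, predicted)
```

Separating the two kinds of false negatives (a different TIS predicted for the same stop codon versus an ORF that falls outside the top k predictions) mirrors the distinction drawn in the text.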
Table 3. Results of the BLAST search on the false positive set of E. coli and S. aureus, and specifically on the false positives in disagreement with the annotation of the Mass Spectrometry (MS) and Edman sequencing (Ecogene) dataset
                                  Aligned                       Description
Set-up      Type            #     Total   TIS   TIS + stop      Hypothetical
S. aureus   Proteoforms     79    79      77    73              12
            Novel protein   25    19      17    15              6
E. coli     Proteoforms     232   232     217   198             39
            Novel protein   258   204     157   137             106
MS          Proteoforms     34    34      22    28              1
Ecogene     Proteoforms     43    43      40    36              1
16. Lee,S., Liu,B., Lee,S., Huang,S.-X., Shen,B. and Qian,S.-B. (2012) Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl. Acad. Sci. U.S.A., 109, E2424–E2432.
17. Crappé,J., Ndah,E., Koch,A., Steyaert,S., Gawron,D., De Keulenaer,S., De Meester,E., De Meyer,T., Van Criekinge,W., Van Damme,P. et al. (2015) PROTEOFORMER: Deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res., 43, e29.
18. Bazzini,A.A., Johnstone,T.G., Christiano,R., MacKowiak,S.D., Obermayer,B., Fleming,E.S., Vejnar,C.E., Lee,M.T., Rajewsky,N., Walther,T.C. et al. (2014) Identification of small ORFs in vertebrates
33. Schrader,J.M., Zhou,B., Li,G.W., Lasker,K., Childers,W.S., Williams,B., Long,T., Crosson,S., McAdams,H.H., Weissman,J.S. et al. (2014) The coding and noncoding architecture of the Caulobacter crescentus genome. PLoS Genet., 10, e1004463.
34. Li,G.W., Oh,E. and Weissman,J.S. (2012) The anti-Shine-Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature, 484, 538–541.
35. Shell,S.S., Wang,J., Lapierre,P., Mir,M., Chase,M.R., Pyle,M.M., Gawande,R., Ahmad,R., Sarracino,D.A., Ioerger,T.R. et al. (2015) Leaderless transcripts and small proteins are common features of the mycobacterial translational landscape. PLoS Genet., 11, e1005641.