Published online February 27, 2004
1414±1421 Nucleic Acids Research, 2004, Vol. 32, No. 4
DOI: 10.1093/nar/gkh286
Identifying secretomes in people, puffer®sh and pigs
Eric W. Klee1,3, Daniel F. Carlson4, Scott C. Fahrenkrug3,4, Stephen C. Ekker2,3 and
Lynda B. M. Ellis1,3,*
1
Laboratory Medicine and Pathology, 2Genetics, Cell Biology and Development, 3Arnold and Mabel Beckman
Center for Transposon Research, University of Minnesota, Minneapolis, MN 55455, USA and 4Animal Science,
University of Minnesota, St Paul, MN 55108, USA
ABSTRACT
The proteins processed by the secretory pathway
(secretome) are critical players in the development
of multi-cellular eukaryotic organisms but have yet
to be comprehensively studied at the genomic level.
In this study, we use the Target P algorithm to
predict human (13±20% of proteins found in
individual datasets) and Fugu (14%) secretomes
based on analysis of their nearly complete
proteomes. We combine internal processing with
prediction software to automate secreted protein
identi®cation and overcome one of the major
challenges associated with EST data: identi®cation
of the minority of clones that encode N-terminallycomplete proteins. We discuss the use of these
methods to predict secreted proteins in EST-based
consensus sequence sets, and we validate these
predictions
using
an
assay
for
cell-free
cotranslational translocation. Analysis of TIGR
Porcine Gene Index 4.0 as a test dataset resulted in
the identi®cation of 352 N-terminally-complete,
putative secreted proteins. In functional agreement
with our predictions, 34 of 40 (85%) of these cDNAs
were veri®ed to be cotranslationally translocated in
an in vitro translation system. The methods
developed here are speci®cally designed to accept
partial open reading frames and improve secreted
protein predictions in eukaryotic transcriptomes,
and are valuable for the analysis and annotation of
eukaryotic EST databases.
INTRODUCTION
Secreted proteins, including ligands and receptors, are critical
to both short- and long-range intercellular signaling during the
development and growth of multi-cellular organisms.
Additionally, membrane proteins mediate cellular responses
to a myriad of environmental cues. The development of
embryos, and the differentiation of tissues important to animal
production and vertebrate reproduction, respond to both
intrinsic and external signals, a response likely regulated by
secreted proteins. Functional understanding of these types of
proteins could provide insight into diverse sets of biological
processes critical to agricultural animal performance and
human disease; these proteins are a high-priority target for
functional annotation.
As de®ned by Tjalsma (1), the term `secretome' applies to
all proteins that are synthesized and processed through the
secretory pathway, along with the protein secretion machinery. Many proteins are secreted by targeting the endoplasmic
reticulum (ER) membrane by signal peptides, which, if Type1, are on average 20 amino acids long in eukaryotes and are
located at the amino terminus of nascent polypeptides (2). The
signal peptide of the nascent polypeptide is recognized by the
signal recognition particle (SRP), a cytoplasmic ribonucleoprotein consisting of six different subunits and 7SL RNA. The
nascent chain±ribosome±SRP complex associates with the
SRP receptor on the ER membrane and SRP is released from
the complex. At the ER membrane, the nascent chain±
ribosome complex associates with the protein translocation
channel. This channel provides a closed and aqueous environment through which the hydrophilic nascent peptide chain
can be cotranslationally translocated. Following transport to
the ER lumen, the signal peptide is cleaved from the protein at
a peptide bond preceded by two small neutral residues (3). The
protein then undergoes folding, modi®cation and transport
from the ER to locations such as the plasma membrane, extra
cellular space or organelles.
A publicly available secreted protein database would
provide a source of protein targets for use in research on
agricultural animal performance, embryonic development,
and human disease. Three projects identifying secreted
proteins in Candida albicans, mouse and human have recently
been published. The C.albicans project computationally
identi®ed soluble proteins that possessed N-terminal signal
sequences and lacked transmembrane domains, GPI anchor
sites and mitochondrial targeting sequences, from open
reading frames (ORFs) obtained from the yeast genome (4).
Unfortunately, many eukaryotes genomes have not been
sequenced and higher eukaryote genomes contain signi®cant
intron splicing, causing problems in identi®cation of translation initiation sites, and creation of partial ORFs.
Grimmond et al. studied a subset of the mouse genome
representing the portion of the secretome found in an EST
database, which encodes proteins with signal sequences and
*To whom correspondence should be addressed at: Mayo Mail Code 609, 420 SE Delaware Street, Minneapolis, MN 55455, USA. Tel: +1 612 625 9122;
Fax: +1 612 624 6404; Email:
[email protected]
Nucleic Acids Research, Vol. 32 No. 4 ã Oxford University Press 2004; all rights reserved
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
Received October 29, 2003; Revised January 7, 2004; Accepted January 26, 2004
Nucleic Acids Research, 2004, Vol. 32, No. 4
MATERIALS AND METHODS
Sequence data
Four secreted protein sequence sets were constructed using
Homo sapiens and Takifugu rubripes (Fugu) proteomes.
Human sequences were obtained from the International
Protein Index (IPI) database, 03/03/02 download (URL:
http://www.ebi.ac.uk/IPI/); NCBI RefSeq database, 01/02/02
download (14); and NCBI GenScan database, 02/04/02
download (15). Fugu sequences were obtained from the
Joint Genome Institute, assembly 2.0, 12/06/01 download
(16).
15 616 tentative consensus swine sequences were obtained
from TIGR Porcine Gene Index, release 4.0, 02/01/02. The
ENSEMBL known human proteome dataset (17), version
15.33 (NCBI 33 assembly), downloaded 7/02/03, was used to
further annotate porcine±human homologous sequences.
N-terminal subsequences (125 amino acids) of the human
and Fugu protein datasets were submitted to the Center for
Biological Computation's TargetP v1.01 (18). Protein sequences predicted to be secreted by TargetP, con®gured for
non-plant analysis, with cleavage site prediction and winnertake-all selection, were assigned to their respective secreted
protein sequence set. Sequences unique to each human
secreted protein sequence set were estimated by comparing
a cumulative sequence set (three human secreted protein
sequence sets appended together) to itself using BLAST.
Sequences matching only with themselves, at a threshold of E
< 1e±10 (19), were classi®ed as unique.
Secreted protein identi®cation
Secreted protein identi®cation is carried out in a multi-step
process involving sequence comparison to reference secreted
protein sequence sets, prediction of signal peptides, identi®cation of putative start codons and N-terminal alignments.
First, the target sequence set is compared to each reference
secreted protein set using NCBI BLAST v2.1.2 (20), with a
selection threshold of 1e±10. All other parameters have default
values. Target sequences possessing at least one homolog
meeting the selection threshold are placed in a homolog
sequence set; one homolog set is created for each reference
set. The nucleotide sequences in the homolog sets are
translated to protein sequences using BioPerl's
CodonTable module (21). The frame used for this translation
corresponds to that used to align with the highest scoring
homolog of each reference set.
Each homologous sequence set was independently subjected to signal peptide prediction, translation start codon
identi®cation and N-terminal alignment. Signal peptide prediction was performed by TargetP 1.01 Server, using default
parameters (18). Sequences predicted to contain a signal
peptide in the ®rst 125 N-terminal amino acids of each target
protein sequence made up the signal peptide positive sequence
set. Target sequences were also analyzed for the presence of at
least one `ATG' in the ®rst 150 5¢ base pairs, without reference
to ATG context. Those sequences containing a putative start
codon made up the ATG positive sequence set. N-terminal
alignments of target protein sequences with their homologous
reference secreted proteins were carried out using an index
residue pair obtained from the BLAST output (Fig. 1). The
index residue pair is the ®rst reported sequence positions in the
BLAST high-scoring pair alignment. Relative to the index
residue pair, an N-terminal offset was calculated for the
sequence pair. Target sequences with offsets less than or equal
to a designated threshold (50 amino acids) made up the
N-terminally aligned sequence set.
For each reference set, target sequences belonging to the
signal peptide positive sequence set, the ATG positive
sequence set and the N-terminally aligned sequence set,
comprise the putative secreted protein sequence set. All
putative secreted protein sequence sets were combined, and
redundant sequences were removed, to create a non-redundant, putative secreted, protein set.
Test sequence selection
We selected putative secreted protein sequences containing at
least one USDA Meat Animal Research Center (MARC) clone
in the ®rst 35 nucleotides of the parent TIGR consensus
sequence, since these clones were available to us to use in the
CTT assay. To increase the utility of the test sequences for
further comparative and functional studies, they were enriched
in proteins with unknown function, based on homology
(BLAST threshold of E < 1e±10) to proteins in the ENSEMBL
human proteome dataset.
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
lacking transmembrane domains (i.e. candidate ligands and
related molecules) (5). This study avoided complications
arising from partial ORFs by using the RIKEN RPS, fully
sequenced, full-length cDNAs, a type of data not often
available for other vertebrates. The human secretome project
identi®ed a set of `novel' transcripts possessing signal
peptides or transmembrane domains (6). Neither study represented a comprehensive genome-wide scan for annotating the
vertebrate secretome. These projects nevertheless demonstrate
a broad interest in secretome databases and illustrate the need
for methods that analyze ESTs and identify those encoding
full-length proteins.
Prediction of secretory proteins in mammalian genomic and
EST sequences has been reviewed (7). It has been shown that
secreted protein prediction programs have been designed to
effectively identify signal peptides in full-length protein
sequences (8±10). However, these programs were not
designed to analyze partial ESTs and it is expected the
accuracy of predictions would deteriorate. ESTs are dif®cult
targets for signal sequence prediction because they have a high
inaccuracy rate (»2%) and are intrinsically 3¢ biased (11).
Consensus sequence clusters such as NCBI UniGenes (12) and
TIGR Gene Indices (13) provide increased sequence quality
and length, but these sequences may still lack the correct 5¢
end.
In this study, we describe our computation of human and
Fugu secretomes based on public proteomes. We combine
internal processing steps with public prediction software to
more fully-automate secreted protein identi®cation. We use
these methods to predict secreted proteins in TIGR Porcine
Gene Index. This large mammal model organism is used since
we have access to porcine clones to validate our predictions
using an assay for cell-free cotranslational translocation
(CTT). The methods described in this study are speci®cally
designed to accept partial ORFs and improve secreted protein
predictions in eukaryotic transcriptomes.
1415
1416
Nucleic Acids Research, 2004, Vol. 32, No. 4
ENSEMBL-based annotation
Putative secreted protein sequences were compared to the
ENSEMBL known human peptide dataset version 15.33, to
further annotate the sequences and estimate the number of
novel sequences identi®ed. The two datasets were compared
using BLAST (E < 1e±10). Protein annotation for their best
human homolog was obtained from the ENSEMBL
Description Field. Sequences were considered to have
unknown function when the Description Field contained `No
Description'.
Cotranslational translocation (CTT)
Clones from MARC 1PIG and 2PIG cDNA libraries (22) that
were predicted to encode secreted proteins were grown
overnight at 37°C in 13 LB broth, 50 mg/ml carbenicillin.
Plasmid DNA was isolated using standard alkaline lysis. To
verify identity, each clone was 5¢-end sequenced and
compared using pairwise BLAST to its GenBank EST entry.
Transcription and translation reactions were carried out using
Sp6 TNTâ Quick Coupled Transcription/Translation Systems
(Promega, Madison, WI). Each reaction contained 20 ml of
TNTâ Quick Master Mix, 2.0 ml of [35S]methionine (1000 Ci/
mmol at 10 Ci/ml) (Amersham, Little Chalfont, UK), 0.5 mg of
plasmid DNA, 0.5 ml of Promega Canine Pancreatic
Microsomal Membranes (Promega) and nuclease-free water
to a ®nal volume of 25 ml. Reactions were incubated for
90 min at 30°C. After the incubation, a 2.5 ml aliquot was
removed (pre-protease total) and prepared for SDS±PAGE
analysis to assess whether the cDNA produced a protein
product.
The remainder of the reaction was incubated for 30 min on
ice at 1 mg/ml proteinase K, and then quenched with 1 ml of
Complete Proteinase Inhibitor EDTA free (Roche,
Indianapolis, IN) at a concentration of one tablet per 300 ml
of water. After a 15 min incubation on ice in the presence of
the inhibitor, the entire reaction was diluted to 150 ml at a ®nal
concentration of 110 mM KOAc, 20 mM K±Hepes, and 2 mM
Mg(OAc)2 [KHM]. The entire reaction mixture was then
placed on a 0.5 M sucrose/KHM cushion and centrifuged at 40
000 r.p.m. (68 000 RCF) for 5 min at 4°C in a Beckman
Optima TLX Ultracentrifuge in a TLA 100.3 rotor.
Three different fractions were processed for SDS±PAGE
analysis from each sample; pre-protease total, pellet and
supernatant. The presence of protein in the pre-protease
sample indicated the cDNA was effectively transcribed and
translated. Secreted proteins would be expected in the pellet,
since they are protected from Proteinase K in the lumen of
microsomes. Proteinase K treatment of the supernatant
fraction was expected to degrade any non-secreted proteins
or secreted proteins not cotranslationally translocated, possibly synthesized by the surplus of translation components in
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
Figure 1. Construction of homologous sequence pair alignments. (A) Homologous proteins are aligned using an index residue pair obtained from the BLAST
output. (B) The alignment is expanded if necessary to include the N-termini of both sequences. Relative to the index residue pair, an N-terminal offset is calculated for the sequences pair and offsets less than or equal to a threshold (50 amino acids) are selected.
Nucleic Acids Research, 2004, Vol. 32, No. 4
Table 1. Number of human and Fugu total protein, and derived secreted
protein, reference sequences.
Table 2. Analysis of TIGR porcine index version 4.0
Sequence set
Total proteins
Secreted proteins
Reference
secreted sets
H.sapiens
IPI
RefSeq
GenScan
T.rubripes
54
33
54
38
8752
6716
7302
5310
687
524
113
633
Homologs
ATG
Aligned
Signal
peptide
ATG, aligned,
signal peptide
H.sapiens
IPI
RefSeq
GenScan
T.rubripes
2792
2379
2646
2121
2442
2077
2291
1824
1007
820
888
570
522
450
462
333
294
228
247
148
Non-redundant
3934
3422
1487
626
352
Number of sequences in the index that have homologs with one or more
sequences in the reference secreted protein sequence sets. Of the index
sequences with homologs, the number that: have an ATG; are N-terminally
aligned with their homolog; have a signal peptide; and meet all three of
these criteria. The number of non-redundant index sequences for each
column is given, with the predicted porcine secreted protein set in bold.
putative, N-terminally complete, secreted proteins, make up
2.3% of the total porcine gene index and are available in the
Supplementary Material. While most of the 352 sequences had
homologs in more than one reference secreted protein
sequence set, 38 had a homolog in only one set: 21 in IPI,
eight in GenScan, two in RefSeq and seven in Fugu.
Annotation using human homologs
RESULTS
Reference secretome
Secreted protein sequence sets were constructed from full
length human and Fugu protein sequence sets. These protein
sets composed between 13 and 20% of the respective protein
sequence sets (Table 1) and are available as Supplementary
Material. Three human protein sequence sets were analyzed to
maximize proteome coverage. Each set introduced unique (as
de®ned in Materials and Methods) sequences into the human
secretome. The IPI Human secreted protein sequence set
contained 14.9% unique sequences, NCBI GenScan, 23.2%
unique sequences, and NCBI RefSeq, 3.6% unique sequences.
These were combined to create a human secretome containing
10 688 non-redundant protein sequences. 49% of the 5310
sequences in the Fugu secretome had no homolog in the
human secretome.
Porcine secretome analysis
TIGR Porcine Gene Index, release 4.0 was analyzed using
methods presented in Secreted Protein Identi®cation. 3934
(25.2%) of the porcine nucleotide sequences are homologous
to at least one secreted protein in the reference sequence sets,
22.6% homologous to a human protein and 13.6% homologous to a Fugu protein (Table 2). Inspection of the
homologous porcine sequences shows that 3422 (87.0%)
possess an ATG (putative start site) within the ®rst 150 5¢
bases. Following best-hit frame translation, 1487 (37.6%) of
the porcine proteins aligned with their homolog at the Nterminus, within the designated threshold. 626 (16.2%) of the
homologous protein sequences were predicted to possess
N-terminal signal peptides. Collapsing across all three criteria
352 (9%) of the porcine sequences possessed a 5¢ ATG,
acceptably aligned near the N-terminus, and possessed a
predicted signal peptide. These porcine sequences, encoding
The 352 putative secreted porcine protein sequences were
further annotated by BLAST comparison to the ENSEMBL
known human peptide sequence set. ENSEMBL human
homologs were identi®ed for 344 sequences (98%). Fortysix of the 344 were considered to have unknown function since
the Description Field of their best ENSEMBL homolog
contained `No Description'. The remaining eight did not have
an ENSEMBL human homolog. One had a Fugu reference
homolog, and the remaining seven had homologs from our,
more inclusive, human reference secreted protein sequence
sets.
We examined ENSEMBL homologs for the 352 sequences
we identi®ed, speci®cally looking for annotation of TM
domains and signal peptides. Eighty-six homologs contained
annotated signal peptides. The ENSEMBL annotation for
human homologs of 40 sequences selected for assessment by
cell-free CTT is shown in Table 3. Three of the 40 sequences
did not possess a human homolog, while 17 were homologous
with proteins lacking description and the remaining 20 (50%)
were homologous with annotated human proteins. The
sequences selected for cell-free testing were enriched for
unknown function, to maximize the information gained by
further study.
Assessment for cell-free CTT
A cell-free test for CTT was performed on 46 of the putative
secreted proteins, selected from 190 of the 352 putative
secreted proteins which contained a clone in MARC1PIG and
MARC2PIG libraries within 35 bases of the consensus
sequence 5¢ end. In order to provide insight as to the
performance of our algorithm in de novo secreted protein
identi®cation, our selection of 46 clones for CTT analysis was
skewed towards proteins of unknown function. Six sequences
failed to yield signi®cant translation product. Even under this
rigorous challenge 34 of the 40 translated sequences (85%)
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
the reactions. SDS sample buffer was added to all fractions to
a concentration of 2% SDS, 66 mM Tris, 4 M urea, 0.01%
bromophenol blue and 5% BME. Pre-protease, supernatant
and pellet fractions (2, 1 and 11%) along with 6 ml of
Kaleidoscope Protein Standards (Bio-Rad), were resolved on a
4±10% gradient Ready Gel (Bio-Rad) at 150 V for ~45 min.
Gel running buffer was composed of 27 mM Tris, 190 mM
glycine, 5.4 mM NaN3 and 0.1% SDS. Gels were ®xed in
90:5:5 water, isopropyl alcohol and glacial acetic acid for 10
min followed by a 1 h treatment with Auto¯uor (National
Diagnostics, Atlanta, GA), dried, exposed to X-ray ®lm
overnight, and developed.
1417
1418
Nucleic Acids Research, 2004, Vol. 32, No. 4
Table 3. Annotation of putative secreted proteins
CTT results
ENSEMBL annotation of human homologs
TC32232
Secreted
TC41141
TC34159
TC40379
Secreted
Secreted
Secreted
TC46174
TC46246
TC34642
TC39921
TC39343
TC36368
TC37131
TC42510
TC39473
TC46432
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
TC46084
Secreted
TC42760
TC45618
TC39859
TC33636
TC41349
TC43242
TC37052
TC42487
TC44870
TC40355
TC34663
TC41473
TC41691
TC32518
TC38861
TC36737
TC45770
TC37624
TC39439
TC46416
TC36933
TC38793
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Secreted
Not secreted
Not secreted
Not secreted
TC35935
TC32823
TC39862
Not secreted
Not secreted
Not secreted
Major epididymis-speci®c protein E4 precursor (HE4) (epididymal secreted protein e4)
(wap four-disul®de core domain protein 2)
Apolipoprotein c-II precursor (Apo-CII)
Presenilin-like protein 1 (EC 3.4.99.-) (SPPL2B protein)
Small inducible cytokine A21 precursor (CCL21) (beta chemokine exodus-2) (6ckine)
(secondary lymphoid-tissue chemokine) (SLC)
IG superfamily protein
Acrosomal protein SP-10 precursor (acrosomal vesicle protein-1)
Dolichol phosphate-mannose biosynthesis regulatory protein
Vacuolar proton-ATPase subunit
Cathepsin Z precursor (EC 3.4.22.-) (cathepsin X) (cathepsin P)
Cathepsin W precursor (EC 3.4.22.-) (lymphopain)
Calreticulin 3 precursor (calreticulin 2)
Liver-expressed antimicrobial peptide 2 precursor (LEAP-2)
Cathepsin H precursor (EC 3.4.22.16)
UDP-galnac:polypeptide
N-acetylgalactosaminyltransferase T10;
UDP-galnac:polypeptide
N-acetylgalactosaminyltransferase T14
Solute carrier family 22 (organic anion/cation transporter), member 9; organic
anion transporter 4
Glycoprotein VI (platelet); platelet glycoprotein VI
MCM10 minichromosome maintenance de®cient 10; homolog of yeast MCM10
No description
No description
No description
No description
No description
No description
No description
No description
No description
No description
No description
No description
No description
No description
No homolog
No homolog
No homolog
Cathepsin S precursor (EC 3.4.22.27)
Roundabout homolog 4, magic roundabout
Tumor necrosis factor-inducible protein TSG-6 precursor (TNF-stimulated gene 6 protein)
(hyaluronate-binding protein)
No description
No description
No description
The TC identi®er for 40 porcine sequences subjected to CTT validation, CTT results and ENSEMBL description ®eld annotation of their highest scoring
human homolog, if present.
were secreted according to cell-free CTT (Table 3). Examples
of positive and negative results are shown in Figure 2.
DISCUSSION
Human and Fugu secretomes
We computed human and Fugu secretomes, containing
ligands and receptors, through the analysis of publicly
available proteomes by a signal peptide prediction program.
The resulting human secreted protein sequence sets varied
from 13 to 20% of their parent protein set (Table 1). These
were combined to create a total human secretome of 10 688
unique protein sequences. The Fugu secretome contained
5310 (14%) proteins. The human secretome had a 51%
overlap with the Fugu secretome, less than the 61% overlap reported between mouse and human (5). A greater
disparity between human and Fugu is not surprising, since
these are more evolutionarily distant vertebrates and
indeed bracket both ends of the vertebrate lineage. Mouse
sequences were not included in our study as they were not
available when we began our experimental analysis of the
porcine EST clones. However, our preliminary analysis of
mouse data shows very close agreement in size for the
computed mouse, human and Fugu secretomes. We conjecture
that the secretomes for a majority of other vertebrates are
composed of secreted proteins found in the logical union of the
human and Fugu.
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
TC 4.0
Nucleic Acids Research, 2004, Vol. 32, No. 4
We selected three human protein sets for analysis in this
study to maximize the coverage of the proteome since no `gold
standard' human protein set exists. When the secreted protein
sequence sets derived from each human protein set were
compared, hundreds of unique sequences were found in each.
Since sequence comparisons are performed against each
reference secreted protein sequence set independently and
equivalent value is given for homology with one or more
proteins in any of the secretomes, there is negligible bias
associated with this redundancy between the reference sets.
Signal peptide prediction on incomplete protein
sequences
Nielsen et al. (23) state that signal peptide prediction on EST
sequences when the start codon may not be present, and on
sequences with N-terminal TM domains, can result in false
positive signal peptide predictions. We developed a test case
for this, truncating several known non-secreted proteins, and
known secreted with and without internal TM domains. Our
results con®rm Nielsen and coworkers' statement
(Supplementary Material). Ab initio signal peptide prediction
programs are designed to analyze full-length protein sequences and suffer reduced performance when used to analyze
truncated, incomplete protein sequences, such as those derived
from ESTs. Even though most proteins containing one or more
TM domains also contain a SP, these misidenti®cations may
lead to the incorrect assumption that the analyzed sequence is
N-terminally complete. Since EST data are inherently fragmented, and even consensus sequences built from EST data
are not necessarily full-length, SP predictions on these data are
often not valid. We demonstrated the methods employed in
this study are useful since they correctly predicted only those
secreted proteins with N-terminally complete sequences.
Cell-free CTT
Forty-six sequences were tested for translation and translocation into pancreatic microsomes. Detectable protein was
translated from 40 of the sequences; 34 of these were
translocated into microsomes. This is a conservative validation method. It has a low false-positive rate, since proteins are
protected from digestion only when completely translocated
into microsomes. However, a receptor or other surfaceanchored protein could be translocated and still digested if a
majority of the protein remains outside the microsome.
Six cDNAs were predicted to be secreted, but failed CTT.
Two are highly homologous to known secreted proteins. The
reason these proteins were not protected from proteolysis is
not known, but the observation reveals the importance of
annotating protein function by multiple techniques, and
suggests possible false-negative results from the cell-free
CTT assay. Consequently, this assay should be seen as a
method for con®rmation, not rejection, of our predictions.
Alternative methods would be needed to de®nitively determine whether those cDNAs failing cell-free CTT are in fact
secreted (6,24).
Porcine secretome
Our analysis identi®ed 352 putative secreted sequences in the
porcine gene index, representing the ®rst attempt to identify
the porcine secretome. Extrapolating from the CTT results,
299 6 6 (95% con®dence interval) of the 352 sequences
identi®ed are cotranslationally translocated. This extrapolation ignores any bias incurred from selection of sequences for
validation, including the requirement for MARC clones in the
5¢ end of the consensus sequence assembly and enrichment
with proteins of unknown function.
Analysis of the porcine data contains several examples
where only one reference set contained a homolog to a
putative secreted protein. Since each reference set was created
based on different criteria, one may conjecture that putative
secreted proteins with homologs in multiple sets are more
likely to be true proteins. However, less well-characterized
secreted proteins may have fewer homologs, and homologs in
fewer reference protein sets.
Computationally derived protein sets such as GenScan,
which do not use homology as a criterion for inclusion, may
contain a higher percentage of conserved hypothetical or novel
proteins. Two of the eight putative secreted porcine sequences
only homologous to GenScan proteins were selected for CTT
analysis. These sequences both exhibited CTT, and possessed
ENSEMBL human protein homologs with unknown function.
This supports the above possibility.
Of the 352 putative secreted porcine sequences, 98%
possess human ENSEMBL homologs, con®rming the close
homology of these two species. Only 86 of these ENSEMBL
homologs were annotated to contain signal peptides, fewer
than expected. Our methods to identify secreted proteins thus
add value to protein annotation.
Assumptions
In our analysis, a protein identi®ed as secreted is required to
have a homolog in the reference secreted protein sequence
sets. Consequently, our methods do not identify putative
secreted proteins lacking such homologs. This may occur
because the protein sequence sets and derived reference
secreted protein sets are incomplete, in part due to lack of
quality 5¢ annotation of eukaryotic genomes. We miss proteins
that have distant homologs or proteins that are unique to the
query organism. Additionally, our approach does not identify
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
Figure 2. Cell-free assay of CTT. Samples were run on a 4±20% SDS polyacrylamide gradient gel. L, P and S represent pre-proteinase lysate, pellet
and supernatant fractions, respectively. (A) Gel of signal peptide control
RNA (b-lactamase). This demonstrates the ability of the system to differentiate secreted from non-secreted proteins. Bands are present in the preproteinase lysate fraction as well as in the pellet, an indication that the
product was protected from cleavage by the microsomes. Further evidence
that CTT has occurred is a decrease in peptide size going from the pre-protease lysate to the pellet, due to signal peptide cleavage. (B) Gel of cDNAs.
TC43242 and TC45770 code for secreted proteins; TC38793 does not.
1419
1420
Nucleic Acids Research, 2004, Vol. 32, No. 4
proteins secreted by other mechanisms (25,26). Since the
functional genomic studies that are being carried out on these
putative secreted proteins are expensive and time-consuming,
these studies should bene®t from N-terminally complete
sequences and the minimization of false positives.
Comparison with other secretome projects
Future directions
Further development of our system offers the possibility of a
more re®ned annotation of the vertebrate secretome.
Development may include implementation of improved
translational start site identi®cation, differentiation between
secreted ligands and receptors, and more detailed homology
selection criteria. These improvements will allow us to
identify a larger percentage of a target secretome and better
discriminate between proteins within this complex dataset.
The validated putative secreted proteins identi®ed by us in
this study are well suited for analysis by comparative and
functional genomic techniques. For example, eight of the 34
validated porcine sequences have likely zebra®sh homologs in
the current EST dataset for this organism (data not shown),
whose sequence information is suitable for morpholino-based
`knockdown' studies using the zebra®sh embryo (27). The
members of the secretome will continue to receive scienti®c
emphasis in part because of the key roles these genes play in
development and disease.
CONCLUSIONS
We have developed data freely available to the greater
scienti®c community, including human, Fugu and porcine
secretome databases. We have also developed methods for the
analysis of eukaryotic EST sequences that reliably identify
N-terminally-complete, secreted proteins, suitable for functional genomic studies. Our methods are useful for the analysis
and annotation of ESTs, especially for organisms that do not
have fully sequenced genomes.
Supplementary Material is available at NAR Online.
ACKNOWLEDGEMENTS
This work was supported by the National Institutes of Health
(RO1-GM63904), an NLM Predoctoral Training Fellowship
(NLM-0704) to E.K., and the Arnold and Mabel Beckman
Center for Transposon Research at the University of
Minnesota.
REFERENCES
1. Tjalsma,H., Bolhuis,A., Jongbloed,J.D., Bron,S. and van Dijl,J.M. (2000)
Signal peptide-dependent protein transport in Bacillus subtilis: a genomebased survey of the secretome. Microbiol. Mol. Biol. Rev., 64, 515±547.
2. Plath,K., Mothes,W., Wilkinson,B.M., Stirling,C.J. and Rapoport,T.A.
(1998) Signal sequence recognition in posttranslational protein transport
across the yeast ER membrane. Cell, 94, 795±807.
3. von Heijne,G. (1986) A new method for predicting signal sequence
cleavage sites. Nucleic Acids Res., 14, 4683±4690.
4. Lee,S.A., Wormsley,S., Kamoun,S., Lee,A.F., Joiner,K. and Wong,B.
(2003) An analysis of the Candida albicans genome database for soluble
secreted proteins using computer-based prediction algorithms. Yeast, 20,
595±610.
5. Grimmond,S.M., Miranda,K.C., Yuan,Z., Davis,M.J., Hume,D.A.,
Yagi,K., Tominaga,N., Bono,H., Hayashizaki,Y., Okazaki,Y. et al.
(2003) The mouse secretome: funcational classi®cation of the proteins
secreted into the extracellular environment. Genome Res., 13,
1350±1359.
6. Clark,H.F., Gurney,A.L., Abaya,E., Baker,K., Baldwin,D., Brush,J.,
Chen,J., Chow,B., Chui,C., Crowley,C. et al. (2003) The secreted protein
discovery initiative (SPDI), a large-scale effort to identify novel human
secreted and transmembrane proteins: a bioinformatics assessment.
Genome Res., 13, 2265±2270.
7. Ladunga,I. (2000) Large-scale predictions of secretory proteins from
mammalian genomic and EST sequences. Curr. Opin. Biotechnol., 11,
13±18.
8. Menne,K., Hermjakob,H. and Apweiler,R. (2000) A comparison of
signal sequence prediction methods using a test set of signal peptides.
Bioinformatics, 16, 741±742.
9. Chou,K. (2002) Prediction of protein signal sequences. Curr. Protein
Pept. Sci., 3, 615±622.
10. Nair,R. and Rost,B. (2002) Sequence conserved for subcellular
localization. Protein Sci., 11, 2836±2847.
11. Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M.,
Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B.,
Moreno,R.F. et al. (1991) Complementary DNA sequencing: expressed
sequence tag and human genome project. Science, 252, 1651±1656.
12. Pontius,J.U., Wagner,L. and Schuler,G.D. (2003) UniGene: a uni®ed
view of the transcriptome. The NCBI Handbook. National Library of
Medicine, Bethesda, MD, USA.
13. Quackenbush,J., Cho,J., Lee,D., Liang,F., Holt,I., Karamycheva,S.,
Parvizi,B., Pertea,G., Sultana,R. and White,J. (2001) The TIGR Gene
Indices: Analysis of gene transcript sequences in highly sampled
eukaryotic species. Nucleic Acids Res., 29, 159±164.
14. Kim,D. Pruitt,K., Tatusova,T. and Maglott,D. (2003) NCBI Reference
Sequence Project: update and current status. Nucleic Acids Res., 31,
34±37.
15. Yeh,R.-F., Lim,L.P. and Burge,C.B. (2001) Computational inference of
homologous gene structures in the human genome. Genome Res., 11,
803±816.
16. Aparicio,S., Chapman,J., Stupka,E., Putnam,N., Chia,J.M., Dehal,P.,
Christoffels,A., Rash,S., Hoon,S., Smit,A. et al. (2002) Whole-genome
shotgun assembly and analysis of the genome of Fugu rubripes. Science,
297, 1301±1310.
17. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y.,
Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002:
accommodating comparative genomics. Nucleic Acids Res., 31, 38±42.
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
Our project was designed to develop a method for the analysis
of all proteins targeted to the secretory pathway using type I
signal sequences, including ligand and receptor molecules.
We used the current, publicly available proteomes from Fugu
and humans as they represent two opposite extremes of the
vertebrate lineage. Depending on the protein sequence set
used for our analysis, we identi®ed 13±20% of the human and
~14% of the Fugu proteins as secreted.
Although no similar comprehensive analysis using as
template an entire vertebrate genome has been previously
published for analysis of the secretome, two recent studies
from mouse and human are noteworthy. Grimmond et al. (5)
used a proprietary and full-length cDNA library for the
identi®cation of candidate ligands encoded by the mouse
secretome. Clark et al. (6) used human genomic and EST
public and private data for the assessment of novel members of
the human secretome. Neither example included a genomewide survey, but instead focused on the identi®cation of
unannotated members of the secretome. Indeed, these datasets
overlap and nicely complement the study described herein,
offering an opportunity for the expansion of the reference
protein dataset secretomes using distinct methodological
approaches.
SUPPLEMENTARY MATERIAL
Nucleic Acids Research, 2004, Vol. 32, No. 4
23. Fahrenkrug,S.C., Smith,T.P., Freking,B.A., Cho,J., White,J., Vallet,J.,
Wise,T., Rohrer,G., Pertea,G., Sultana,R. et al. (2002) Porcine gene
discovery by normalized cDNA-library sequencing and EST cluster
assembly. Mamm. Genome, 13, 475±478.
24. Moffatt,P., Salois,P., Gaumond,M.H., St-Amant,N., Godin,E. and
Lanctot,C. (2002) Engineered viruses to select genes encoding secreted
and membrane-bound proteins in mammalian cells. Nucleic Acids Res.,
30, 4285±4294.
25. Hughes,R.C. (1999) Secretion of galectin family of mammalian
carbohydrate-binding proteins. Biochim. Biophys. Acta, 1473, 172±185.
26. Shurety,W., Merino-Trigo,A., Brown,D., Hume,D.A. and Stow,J.L.
(2000) Localization and post-Golgi traf®cking of tumor necrosis factor-a
in macrophages. J. Interferon Cytokine Res., 20, 427±438.
27. Nasevicius,A. and Ekker,S. (2000) Effective targeted gene `knockdown'
in zebra®sh. Nature Genet., 26, 216±220.
Downloaded from https://academic.oup.com/nar/article-abstract/32/4/1414/1038442 by guest on 01 June 2020
18. Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000)
Predicting subcellular localization of proteins based on their N-terminal
amino acid sequence. J. Mol. Biol., 300, 1005±1016.
19. Klee,E.W., Ekker,S.C. and Ellis,L.B.M. (2001) Target selection for
Danio rerio functional genomics. Genesis, 30, 123±125.
20. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990)
Basic local alignment search tool. J. Mol. Biol., 215, 403±410.
21. Stajich,J.E., Block,D., Boulez,K., Brenner,S.E., Chervitz,S.A.,
Dagdigian,C., Fuellen,G., Gilbert,J.G., Korf,I., Lapp,H. et al. (2002) The
Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12,
1611±1618.
22. Nielsen,H., Brunak,S. and von Heijne,G. (1999) Machine learning
approaches for the prediction of signal peptides and other protein sorting
signals. Protein Eng., 12, 3±9.
1421