Papers by Frank Eisenhaber

Molecular Biology Intelligence Unit, 2006
Mathematical interpretation and integration of experimental data for the goal of biological theory development has had little, if any, impact on previous progress in life sciences compared with the sophistication of experimental approaches themselves. The genesis of recent spectacular breakthroughs in molecular biology that led to the discovery of the enzymatic function of several nonmetabolic enzymes illustrates that this relationship is beginning to change. The development of high-throughput technologies, for example complete genome sequencing, leads to large amounts of quantified data on biological systems without a direct link to biological function; these data require formalized and complex mathematical approaches for their interpretation. Research success in life sciences depends increasingly on the ability of researchers in experimental and theoretical biology to jointly focus on important questions. Currently, theoretical methods have the best chances to contribute to new biological insight independently of experiments in the area of genome text interpretation and especially in gene function prediction. Experimental studies can help progress in the development of theoretical methods by providing verified, sufficiently large and variable sequence datasets for the exploration of sequence-function relationships.
Journal of biological engineering, Jan 3, 2009
The current knowledge of genes and proteins comes from 'naturally designed' coding and non-coding regions. It would be interesting to move beyond natural boundaries and make user-defined parts. To explore this possibility we made six non-natural proteins in E. coli. We also studied their potential tertiary structure and phenotypic outcomes.
Protein Science, 2009
Nuclear mitotic apparatus protein (NuMA) is an essential vertebrate component in organizing microtubule ends at spindle poles. The NuMA-dynactin/dynein motor multiprotein complex not only explains the transport of NuMA along spindle fibers but also is linked to the process of microtubule focusing. The interaction sites of NuMA to dynein/dynactin have not been mapped. In the yet functionally uncharacterized N terminus of NuMA, we predict a calponin-homology (CH) domain, a motif with binding activity for actin-like molecules. We substantiate the primary sequence analysis-based prediction with secondary structure and fold recognition analysis, and we propose the N-terminal CH domain of NuMA as a likely interaction site for actin-related protein 1 (Arp1) protein of the dynactin/dynein complex.

Journal of Molecular Biology, 1998
Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description uses, besides the information from completely sequenced genomes, the additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.

Journal of Biological Chemistry, 2011
Background: Type 2 peroxisomal targeting signals (PTS2) tag proteins for import into peroxisomes. Results: Characterization of structural properties of PTS2 allows the prediction of novel PTS2 and identification of the binding site on the receptor PEX7. Conclusion: PTS2 forms helical structures that bind to a groove on PEX7. Significance: Understanding the recognition of PTS2 by its receptor is a critical step in peroxisomal protein transport. The import of a subset of peroxisomal matrix proteins is mediated by the peroxisomal targeting signal 2 (PTS2). The results of our sequence and physical property analysis of known PTS2 signals and of a mutational study of the least characterized amino acids of a canonical PTS2 motif indicate that PTS2 forms an amphipathic helix accumulating all conserved residues on one side. Three-dimensional structural modeling of the PTS2 receptor PEX7 reveals a groove with an evolutionarily conserved charge distribution complementary to PTS2 signals. Mammalian two-hybrid assays and cross-complementation of a mutation in PTS2 by a compensatory mutation in PEX7 confirm the interaction site. An unstructured linker region separates the PTS2 signal from the core protein. This additional information on PTS2 signals was used to generate a PTS2 prediction algorithm that enabled us to identify novel PTS2 signals within human proteins and to describe KChIP4 as a novel peroxisomal protein.
Current Protein & Peptide Science, 2010
SBASE is a project initiated to detect known domain types and predict domain architectures using sequence similarity searching (

BMC Genomics, 2010
Background: Algorithms designed to predict protein disorder play an important role in structural and functional genomics, as disordered regions have been reported to participate in important cellular processes. Consequently, several methods with different underlying principles for disorder prediction have been independently developed by various groups. For assessing their usability in automated workflows, we are interested in identifying parameter settings and threshold selections under which the performance of these predictors becomes directly comparable. Results: First, we derived a new benchmark set that accounts for different flavours of disorder complemented with a similar amount of order annotation derived for the same protein set. We show that, using the recommended default parameters, the programs tested produce a wide range of predictions at different levels of specificity and sensitivity. We identify settings in which the different predictors have the same false positive rate. We assess conditions under which sets of predictors can be run together to derive consensus or complementary predictions. This is useful in the framework of proteome-wide applications where high specificity is required, such as in our in-house sequence analysis pipeline and the ANNIE webserver. Conclusions: This work identifies parameter settings and thresholds for a selection of disorder predictors to produce comparable results at a desired level of specificity over a newly derived benchmark dataset that accounts equally for ordered and disordered regions of different lengths.
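
The idea of tuning different predictors to the same false positive rate can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function names, the per-residue score lists and the rule "predict disorder when the score exceeds the cutoff" are all assumptions made for illustration.

```python
def threshold_for_fpr(ordered_scores, target_fpr):
    """Pick a score cutoff for one predictor (hypothetical helper).

    ordered_scores: scores the predictor assigns to residues annotated as
    ordered in a benchmark. A residue is called disordered when its score is
    strictly greater than the cutoff, so every ordered residue above the
    cutoff counts as a false positive.
    """
    ranked = sorted(ordered_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we tolerate
    if allowed >= len(ranked):
        return float("-inf")  # any cutoff satisfies the target
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def false_positive_rate(ordered_scores, cutoff):
    """Fraction of ordered residues wrongly called disordered."""
    hits = sum(1 for s in ordered_scores if s > cutoff)
    return hits / len(ordered_scores)


# Two made-up predictors that score the same ordered residues on different scales.
predictor_a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
predictor_b = [-3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 1.0, 2.5]

cut_a = threshold_for_fpr(predictor_a, 0.25)
cut_b = threshold_for_fpr(predictor_b, 0.25)
```

With cutoffs chosen this way, the two score scales differ but both predictors make the same share of false calls on ordered residues, which is one way their outputs could be made comparable.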

BMC Biochemistry, 2010
Background: The LGI2 (leucine-rich, glioma inactivated 2) gene, a prime candidate for partial epilepsy with pericentral spikes, belongs to a family encoding secreted, beta-propeller domain proteins with EPTP/EAR epilepsy-associated repeats. In another family member, LGI1 (leucine-rich, glioma inactivated 1), mutations are responsible for autosomal dominant lateral temporal epilepsy (ADLTE). Because a few LGI1 disease mutations described in the literature cause secretion failure, we experimentally analyzed the secretion efficiency and subcellular localization of several LGI1 and LGI2 mutant proteins corresponding to observed non-synonymous single nucleotide polymorphisms (nsSNPs) affecting the signal peptide, the leucine-rich repeats and the EAR propeller. Results: Mapping of disease-causing mutations in the EAR domain region onto a 3D-structure model shows that many of these mutations co-localize at an evolutionarily conserved surface region of the propeller. We find that wild-type LGI2 ...

Bioinformatics, 1999
MOTIVATION: Computer-based selection of entries from sequence databases with respect to a related functional description, e.g. with respect to a common cellular localization or contributing to the same phenotypic function, is a difficult task. Automatic semantic analysis of annotations is not only hampered by incomplete functional assignments. A major problem is that annotations are written in a rich, non-formalized language and are meant for reading by a human expert. This person can extract from the text considerably more information than is immediately apparent due to his extended biological background knowledge and logical reasoning. APPROACH: A technique of automated annotation evaluation based on a combination of lexical analysis and the usage of biological rule libraries has been developed. The proposed algorithm generates new functional descriptors from the annotation of a given entry using the semantic units of the annotation as prepositions for implications executed in acc...

Journal of Bioinformatics and Computational Biology, 2011
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined thres...

Biology Direct
Background: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by the scientific literature and how close we are to the goal of a complete list of E. coli gene functions. Results: The scientific literature about E. coli protein-coding genes has been mapped onto the genome via the mentioning of names for genomic regions in scientific articles, both for the case of the strain K-12 MG1655 and for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genomes. The article match was quantified with the ratio of a given gene name’s occurrence to the mentioning of any gene names in the paper. The various genome regions have an extremely uneven literature coverage. A group of elite genes with ≥ 100 full publication equivalents (FPEs...

BMC Biology
Background: Escherichia coli (E. coli) has been one of the most studied model organisms in the history of life sciences. Initially thought just to be commensal bacteria, E. coli has shown wide phenotypic diversity including pathogenic isolates with great relevance to public health. Though pangenome analysis has been attempted several times, there is no systematic functional characterization of the E. coli subgroups according to the gene profile. Results: Systematically scanning for optimal parametrization, we have built the E. coli pangenome from 1324 complete genomes. The pangenome size is estimated to be ~25,000 gene families (GFs). Whereas the core genome diminishes as more genomes are added, the softcore genome (≥95% of strains) is stable with ~3000 GFs regardless of the total number of genomes. Apparently, the softcore genome (with a 92% or 95% generation threshold) can define the genome of a bacterial species listing the critically relevant, evolutionarily most conserved or important clas...

Genome Biology, 2003
Three different protein prenyltransferases (farnesyltransferase and geranylgeranyltransferases I and II) catalyze the attachment of prenyl lipid anchors 15 or 20 carbons long to the carboxyl termini of a variety of eukaryotic proteins. Farnesyltransferase and geranylgeranyltransferase I both recognize a 'Ca₁a₂X' motif on their protein substrates; geranylgeranyltransferase II recognizes a different, non-CaaX motif. Each enzyme has two subunits. The genes encoding CaaX protein prenyltransferases are considerably longer than those encoding non-CaaX subunits, as a result of longer introns. Alternative splice forms are predicted to occur, but the extent to which each splice form is translated and the functions of the different resulting isoforms remain to be established. Farnesyltransferase-inhibitor drugs have been developed as anti-cancer agents and may also be able to treat several other diseases. The effects of these inhibitors are complicated, however, by the overlapping substrate specificities of geranylgeranyltransferase I and farnesyltransferase.
Nature Biotechnology, 2018

PROTEOMICS, 2018
The mentioning of gene names in the body of the scientific literature 1901-2017 and their fractional counting is used as a proxy to assess the level of biological function discovery. A literature score of one has been defined as full publication equivalent (FPE), the amount of literature necessary to achieve one publication solely dedicated to a gene. It has been found that less than 5000 human genes have each at least 100 FPEs in the available literature corpus. This group of elite genes (4817 protein-coding genes, 119 non-coding RNAs) attracts the overwhelming majority of the scientific literature about genes. Yet, thousands of proteins have never been mentioned at all, 2000 further proteins have not even one FPE of literature and, for 4600 additional proteins, the FPE count is below 10. The protein function discovery rate measured as numbers of proteins first mentioned or crossing a threshold of accumulated FPEs in a given year has grown until 2000 but is in decline thereafter. This drop is partially offset by function discoveries for non-coding RNAs. The full human genome sequencing does not boost the function discovery rate. Since 2000, the fastest growing group in the literature is that with at least 500 FPEs per gene.
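
The fractional counting behind the FPE score can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes that each paper contributes to a gene the share of that gene's mentions among all gene-name mentions in the paper, so that a paper dedicated solely to one gene adds exactly one FPE. The function name and the toy gene lists are hypothetical.

```python
from collections import Counter, defaultdict


def fpe_scores(papers):
    """Accumulate fractional literature scores per gene (illustrative only).

    papers: iterable of lists, each list holding every gene-name mention
    found in one paper (repeats included).
    """
    totals = defaultdict(float)
    for mentions in papers:
        if not mentions:
            continue  # a paper without gene names contributes nothing
        counts = Counter(mentions)
        n_mentions = len(mentions)
        for gene, count in counts.items():
            # Assumed rule: the paper is split among genes by mention share.
            totals[gene] += count / n_mentions
    return dict(totals)


# Toy corpus of three papers (made-up mention lists).
corpus = [
    ["TP53", "TP53", "TP53", "TP53"],   # fully dedicated to one gene
    ["TP53", "BRCA1"],                  # split evenly between two genes
    ["BRCA1", "BRCA1", "MDM2", "TP53"],
]
scores = fpe_scores(corpus)
```

Under this assumed rule, the scores of all genes sum to the number of papers that mention at least one gene, which is what makes the unit read as a "publication equivalent".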

Biology Direct, 2016
Background: While the local-mode HMMER3 is notable for its massive speed improvement, the slower glocal-mode HMMER2 is more exact for domain annotation by enforcing full domain-to-sequence alignments. Since a unit of domain necessarily implies a unit of function, local-mode HMMER3 alone remains insufficient for precise function annotation tasks. In addition, the incomparable E-values for the same domain model by different HMMER builds create difficulty when checking for domain annotation consistency on a large-scale basis. Results: In this work, both the speed of HMMER3 and glocal-mode alignment of HMMER2 are combined within the xHMMER3x2 framework for tackling the large-scale domain annotation task. Briefly, HMMER3 is utilized for initial domain detection so that HMMER2 can subsequently perform the glocal-mode, sequence-to-full-domain alignments for the detected HMMER3 hits. An E-value calibration procedure is required to ensure that the search space by HMMER2 is sufficiently replicated by HMMER3. We find that the latter is straightforwardly possible for 80% of the models in the Pfam domain library (release 29). However, in the case of the remaining ~20% of HMMER3 domain models, the respective HMMER2 counterparts are more sensitive. Thus, HMMER3 searches alone are insufficient to ensure sensitivity and a HMMER2-based search needs to be initiated. When tested on the set of UniProt human sequences, xHMMER3x2 can be configured to be between 7× and 201× faster than HMMER2, but with descending domain detection sensitivity from 99.8 to 95.7% with respect to HMMER2 alone; HMMER3's sensitivity was 95.7%. At extremes, xHMMER3x2 is either the slow glocal-mode HMMER2 or the fast HMMER3 with glocal-mode. Finally, the E-values to false-positive rates (FPR) mapping by xHMMER3x2 allows E-values of different model builds to be compared, so that any annotation discrepancies in a large-scale annotation exercise can be flagged for further examination by dissectHMMER. Conclusion: The xHMMER3x2 workflow allows large-scale domain annotation speed to be drastically improved over HMMER2 without compromising for domain-detection with regard to sensitivity and sequence-to-domain alignment incompleteness. The xHMMER3x2 code and its webserver (for Pfam release 27, 28 and 29) are freely available at http://xhmmer3x2.bii.a-star.edu.sg/. Reviewers: Reviewed by Thomas Dandekar, L. Aravind, Oliviero Carugo and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.

Current Opinion in Structural Biology, 2017
Contemporary protein structure is a result of the trade-off between the laws of physics and evolutionary selection. The polymer nature of proteins played a decisive role in establishing the basic structural and functional units of soluble proteins. We discuss how these elementary building blocks work in the hierarchy of protein domain structure, co-translational folding, as well as in enzymatic activity and molecular interactions. Next, we consider modulators of protein function, such as intermolecular interactions, disorder-to-order transitions, and allosteric signaling, acting via interference with the protein's structural dynamics. We also discuss post-translational modifications, a complementary, intricate mechanism evolved for regulation of protein functions and interactions. In conclusion, we assess the anticipated contribution of the discussed topics to future advancements in the field.

Biology Direct, 2015
Background: Annotation transfer for function and structure within the sequence homology concept essentially requires protein sequence similarity for the secondary structural blocks forming the fold of a protein. A simplistic similarity approach in the case of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc.) is not justified and is a pertinent source of mistaken homologies. The latter is either due to positional sequence conservation as a result of a very simple, physically induced pattern or integral sequence properties that are critical for function. Furthermore, because the number of well-studied proteins continues to grow at a slow rate, a search methodology is needed that dives deeper into the sequence similarity space to connect the unknown sequences to the well-studied ones, albeit more distant, for biological function postulations. Results: Based on our previous work of dissecting the hidden Markov model (HMMER) based similarity score into fold-critical and non-globular contributions to improve homology inference, we propose a framework, dissectHMMER, that identifies more fold-related domain hits from standard HMMER searches. Subsequent statistical stratification of the fold-related hits into cohorts of functionally related domains allows for the function postulation of the query sequence. Briefly, the technical problems of how to recognize non-globular parts in the domain model, resolve contradictory HMMER2/HMMER3 results and evaluate fold-related domain hits for homology are addressed in this work. The framework is benchmarked against a set of SCOP-to-Pfam domain models. Despite being a sequence-to-profile method, dissectHMMER performs favorably against a profile-to-profile based method, HHsuite/HHsearch. Examples of function annotation using dissectHMMER, including the function discovery of an uncharacterized membrane protein Q9K8K1_BACHD (WP_010899149.1) as a lactose/H+ symporter, are presented. Finally, the dissectHMMER webserver is made publicly available at http://dissecthmmer.bii.a-star.edu.sg. Conclusions: The proposed framework, dissectHMMER, is faithful to the original inception of the sequence homology concept while improving upon the existing HMMER search tool through the rescue of statistically evaluated false-negative yet fold-related domain hits to the query sequence. Overall, this translates into an opportunity for any novel protein sequence to be functionally characterized. Reviewers: This article was reviewed by Masanori Arita, Shamil Sunyaev and L. Aravind.