Highlights
• The I-TASSER gateway is a server for protein structure and function prediction.
• The XSEDE-Comet supercomputer provides the computational backend for the gateway.
• The gateway has become a popular service for the biological community.
bioRxiv (Cold Spring Harbor Laboratory), Nov 27, 2021
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. We proposed a new method (TripletGO) to deduce GO terms of protein-coding and non-coding genes through the integration of four complementary pipelines built on transcript expression profiling, genetic sequence alignment, protein sequence alignment and naïve probability, respectively. TripletGO was tested on a large set of 5,754 genes from 8 species (human, mouse, arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2,433 proteins with available expression data from the CAFA3 experiment, and achieved function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet-network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expressions. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanglab.ccmb.med.umich.edu/TripletGO/.
Motivation: The full description of nucleic acid conformation involves eight torsion angles per nucleotide. To simplify this description, we previously developed a representation of the nucleic acid backbone that assigns each nucleotide a pair of pseudo-torsion angles (η and θ, defined by P and C4′ atoms; or η′ and θ′, defined by P and C1′ atoms). A Java program, AMIGOS II, is currently available for calculating η and θ angles for RNA and for performing motif searches based on these angles. However, AMIGOS II lacks the ability to parse DNA structures and to calculate η′ and θ′ angles. It also has little visualization capacity for 3D structure, making it difficult for users to interpret the computational results. Results: We present AMIGOS III, a PyMOL plugin that calculates the pseudo-torsion angles η, θ, η′ and θ′ for both DNA and RNA structures and performs motif searching based on these angles. Compared to AMIGOS II, AMIGOS III offers improved pseudo-torsion angle visualization for RNA and faster nucleic acid worm database generation; it also introduces pseudo-torsion angle visualization for DNA and nucleic acid worm visualization. Its integration into PyMOL enables easy preparation of tertiary structure inputs and intuitive visualization of involved structures.
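To make the pseudo-torsion description concrete, the following is a minimal Python sketch of how such angles could be computed from atomic coordinates. It is not code from AMIGOS III. The atom ordering (η from C4′ of the previous nucleotide, P, C4′, and P of the next nucleotide; θ from P, C4′, P of the next nucleotide, and C4′ of the next nucleotide) follows the commonly cited definition and is an assumption here, since the abstract only names the atom types. The function names `dihedral` and `pseudo_torsions` are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four points."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = _dot(b2, _cross(n1, n2)) / b2_len
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def pseudo_torsions(p_atoms, c_atoms, i):
    """Return (eta, theta) for nucleotide i (0-based index).

    p_atoms[k] and c_atoms[k] are the P and C4' coordinates of nucleotide k
    (assumed ordering; pass C1' coordinates instead to get eta', theta').
    Nucleotides i-1 and i+1 must exist.
    """
    eta = dihedral(c_atoms[i - 1], p_atoms[i], c_atoms[i], p_atoms[i + 1])
    theta = dihedral(p_atoms[i], c_atoms[i], p_atoms[i + 1], c_atoms[i + 1])
    return eta, theta
```

Under these assumptions, the first and last nucleotides of a chain have no complete (η, θ) pair, because each angle needs an atom from a neighbouring nucleotide.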
Deep learning-based contact prediction in C-I-TASSER; violin plots for portions of residues predicted by TMHMM2.0 to be within transmembrane helices for JCVI-syn3.0 proteins that are annotated versus unannotated by C-I-TASSER/COFACTOR with C-score > 0.5 for specific GO terms in the MF, BP, and CC aspects; structural alignment of the predicted structure of MMSYN1_0877 with the nine closest structural homologues identified in the Protein Data Bank; substrate binding domains for ECF systems targeting riboflavin; and a random PPI network for syn3.0, where 2483 of all 95,703 protein pairs are randomly selected as the positive PPI pairs (PDF)
Recent improvements in computational and experimental techniques for obtaining protein structures have resulted in an explosion of 3D coordinate data. To cope with the ever-increasing sizes of structure databases, this work proposes the Protein Data Compression (PDC) format, which compresses coordinates and temperature factors of full-atomic and Cα-only protein structures. Without loss of precision, PDC results in 69% to 78% smaller file sizes than Protein Data Bank (PDB) and macromolecular Crystallographic Information File (mmCIF) files with standard GZIP compression. It uses ∼60% less space than existing compression algorithms specific to macromolecular structures. PDC optionally performs lossy compression with minimal sacrifice of precision, which allows reduction of file sizes by another 79%. Conversion between PDC, mmCIF and PDB formats is typically achieved within 0.02 s. The compactness and fast reading/writing speed of PDC make it valuable for storage and analysis of large quantities of tertiary structural data.
Background: Although mmCIF is the current official format for deposition of protein and nucleic acid structures to the Protein Data Bank (PDB), the legacy PDB format is still the primary supported format for many structural bioinformatics tools. Therefore, reliable software to convert mmCIF structure files to PDB files is needed. Unfortunately, existing conversion programs fail to correctly convert many mmCIF files, especially those with many atoms and/or long chain identifiers. Results: This study proposed BeEM, which converts any mmCIF format structure file to PDB format. BeEM conversion faithfully retains all atomic and chain information, including chain IDs with more than 2 characters, which are not supported by any existing mmCIF-to-PDB converters. BeEM is at least ten times faster than existing converters such as MAXIT and Phenix. Part of the reason for the speed improvement is the avoidance of conversion between numerical values and text strings. Conclusion: BeEM is a fast and accurate tool for mmCIF-to-PDB format conversion, which is a common procedure in structural biology. The source code is available under the BSD licence at https://github.com/kad-ecoli/BeEM/.
Despite considerable research progress on SARS-CoV-2, the direct zoonotic origin (intermediate host) of the virus remains ambiguous. The most definitive approach to identify the intermediate host would be the detection of SARS-CoV-2-like coronaviruses in wild animals. However, due to the high number of animal species, it is not feasible to screen all the species in the laboratory. Given that binding to ACE2 proteins is the first step for the coronaviruses to invade host cells, we propose a computational pipeline to identify potential intermediate hosts of SARS-CoV-2 by modeling the binding affinity between the Spike receptor-binding domain (RBD) and host ACE2. Using this pipeline, we systematically examined 285 ACE2 variants from mammals, birds, fish, reptiles, and amphibians, and found that the binding energies calculated for the modeled Spike-RBD/ACE2 complex structures correlated closely with the effectiveness of animal infection as determined by multiple experimental data sets. Built on the optimized binding affinity cutoff, we suggest a set of 96 mammals, including 48 experimentally investigated ones, which are permissive to SARS-CoV-2, with candidates from primates, rodents, and carnivores at the highest risk of infection. Overall, this work not only suggests a limited range of potential intermediate SARS-CoV-2 hosts for further experimental investigation, but also, more importantly, it proposes a new structure-based approach to general zoonotic origin and susceptibility analyses that are critical for human infectious disease control and wildlife protection.
In this article, we report 3D structure prediction results by two of our best server groups ("Zhang-Server" and "QUARK") in CASP14. These two servers were built based on the D-I-TASSER and D-QUARK algorithms, which integrated four newly developed components into the classical protein folding pipelines, I-TASSER and QUARK, respectively. The new components include: (i) a new multiple sequence alignment (MSA) collection tool, DeepMSA2, which is extended from the DeepMSA program; (ii) a contact-based domain boundary prediction algorithm, FUpred, to detect protein domain boundaries; (iii) a residual convolutional neural network-based method, DeepPotential, to predict multiple spatial restraints by co-evolutionary features derived from the MSA; and (iv) optimized spatial restraint energy potentials to guide the structure assembly simulations. For 37 FM targets, the average TM-scores of the first models produced by D-I-TASSER and D-QUARK were 96% and 112% higher than those constructed by I-TASSER and QUARK, respectively. The data analysis indicates noticeable improvements produced by each of the four new components, especially for the newly added spatial restraints from DeepPotential and the well-tuned force field that combines spatial restraints, threading templates, and generic knowledge-based potentials. However, challenges still exist in the current pipelines. These include difficulties in modeling multi-domain proteins due to low accuracy in interdomain distance prediction and modeling protein domains from oligomer complexes, as the coevolutionary analysis cannot distinguish inter-chain and intra-chain distances. Specifically tuning the deep learning-based predictors for multi-domain targets and protein complexes may be helpful to address these issues.
The LOMETS2 server (https://zhanglab.ccmb.med.umich.edu/LOMETS/) is an online meta-threading server system for template-based protein structure prediction. Although the server has been widely used by the community over the last decade, the previous LOMETS server no longer represents the state-of-the-art due to aging of the algorithms and unsatisfactory performance on distant-homology template identification. An extension of the server built on cutting-edge methods, especially techniques developed since the recent CASP experiments, is urgently needed. In this work, we report the recent advancements of the LOMETS2 server, which comprise a number of major new developments, including (i) new state-of-the-art threading programs, including contact-map-based threading approaches, (ii) deep sequence search-based sequence profile construction and (iii) a new web interface design that incorporates structure-based function annotations. Large-scale benchmark tests demonstrated that the integration of the deep profiles and new threading approaches into LOMETS2 significantly improves its structure modeling quality and template detection: LOMETS2 detected 176% more templates with TM-scores >0.5 than the previous LOMETS server for Hard targets that lacked homologous templates. Meanwhile, the newly incorporated structure-based function prediction helps extend the usefulness of the online server to the broader biological community.
Motivation: The recently released PyMod GUI integrates many of the individual steps required for protein sequence-structure analysis and homology modeling within the interactive visualization capabilities of PyMOL. Here we describe the improvements introduced in version 2.0 of PyMod. Results: The original code of PyMod has been completely rewritten and improved in version 2.0 to extend PyMOL with packages such as Clustal Omega, PSIPRED and CAMPO. Integration with the popular web services ESPript and WebLogo is also provided. Finally, a number of new MODELLER functionalities have been implemented, including SALIGN, modeling of quaternary structures, DOPE scores, disulfide bond modeling and choice of heteroatoms to be included in the final model.
We report the results of two fully automated structure prediction pipelines, "Zhang-Server" and "QUARK", in CASP13. The pipelines were built upon the C-I-TASSER and C-QUARK programs, which in turn are based on I-TASSER and QUARK but with three new modules: (a) a novel multiple sequence alignment (MSA) generation protocol to construct deep sequence-profiles for contact prediction; (b) an improved meta-method, NeBcon, which combines multiple contact predictors, including ResPRE that predicts contact-maps by coupling precision-matrices with deep residual convolutional neural-networks; and (c) an optimized contact potential to guide structure assembly simulations. For 50 CASP13 FM domains that lacked homologous templates, average TM-scores of the first models produced by C-I-TASSER and C-QUARK were 28% and 56% higher than those constructed by I-TASSER and QUARK, respectively. For the first time, contact-map predictions demonstrated usefulness on TBM domains with close homologous templates, where TM-scores of C-I-TASSER models were significantly higher than those of I-TASSER models with a P-value < 0.05. Detailed data analyses showed that the success of C-I-TASSER and C-QUARK was mainly due to the increased accuracy of deep-learning-based contact-maps, as well as the careful balance between sequence-based contact restraints, threading templates, and generic knowledge-based potentials. Nevertheless, challenges remain for predicting quaternary structure of multi-domain proteins, due to the difficulties in domain partitioning and domain reassembly. In addition, contact prediction in terminal regions was often unsatisfactory due to the sparsity of MSAs. Development of new contact-based domain partitioning and assembly methods and training contact models on sparse MSAs may help address these issues.
International Journal of Molecular Sciences, Mar 9, 2021
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY
Acta Crystallographica Section D: Structural Biology, Nov 22, 2017
Helical transmembrane proteins are a ubiquitous and important class of proteins, but present difficulties for crystallographic structure solution. Here, the effectiveness of the AMPLE molecular replacement pipeline in solving α-helical transmembrane-protein structures is assessed using a small library of eight ideal helices, as well as search models derived from ab initio models generated both with and without evolutionary contact information. The ideal helices prove to be surprisingly effective at solving higher resolution structures, but ab initio-derived search models are able to solve structures that could not be solved with the ideal helices. The addition of evolutionary contact information results in a marked improvement in the modelling and makes additional solutions possible.
Motivation: Comparison of RNA 3D structures can be used to infer functional relationships of RNA molecules. Most of the current RNA structure alignment programs are built on size-dependent scales, which complicate the interpretation of structure and functional relations. Meanwhile, their low speed prevents the programs from being applied to large-scale RNA structural database search. Results: We developed an open-source algorithm, RNA-align, for RNA 3D structure alignment, which has the structure similarity scaled by a size-independent and statistically interpretable scoring metric. Large-scale benchmark tests show that RNA-align significantly outperforms other state-of-the-art programs in both alignment accuracy and running speed. The major advantage of RNA-align lies in the quick convergence of the heuristic alignment iterations and the coarse-grained secondary structure assignment, both of which are crucial to the speed and accuracy of RNA structure alignments. Availability and implementation: https://zhanglab.ccmb.med.umich.edu/RNA-align/.
Prokaryotes and some unicellular eukaryotes routinely overcome evolutionary pressures with the help of horizontally acquired genes. In contrast, it is unusual for multicellular eukaryotes to adapt through horizontal gene transfer (HGT). Recent studies identified several cases of adaptive acquisition in the gut-dwelling multicellular fungal phylum Neocallimastigomycota. Here, we add to these cases the acquisition of a putative bacterial toxin immunity gene, PoNi, by an ancient common ancestor of four extant Neocallimastigomycota genera through HGT from an extracellular Ruminococcus bacterium. The PoNi homologs in these fungal genera share extraordinarily high (>70%) amino acid sequence identity with their bacterial donor xenolog, providing definitive evidence of HGT as opposed to lineage-specific gene retention. Furthermore, PoNi genes are nested on native sections of chromosomal DNA in multiple fungal genomes and are also found in polyadenylated fungal transcriptomes, confirming that these genes are authentic fungal genomic regions rather than sequencing artifacts from bacterial contamination. The HGT event, which is estimated to have occurred at least 66 (±10) million years ago in the gut of a Cretaceous mammal, gave the fungi a putative toxin immunity protein (PoNi) which likely helps them survive toxin-mediated attacks by bacterial competitors in the mammalian gut microbiome.
h i g h l i g h t s • The I-TASSER gateway is a server for protein structure and function predict... more h i g h l i g h t s • The I-TASSER gateway is a server for protein structure and function prediction. • The XSEDE-Comet supercomputer provides the computational backend for the gateway. • The gateway has become a popular service for the biological community.
bioRxiv (Cold Spring Harbor Laboratory), Nov 27, 2021
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. We prop... more Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. We proposed a new method (TripletGO) to deduce GO terms of proteincoding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profiling, genetic sequence alignment, protein sequence alignment and naïve probability, respectively. TripletGO was tested on a large set of 5,754 genes from 8 species (human, mouse, arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2,433 proteins with available expression data from the CAFA3 experiment and achieved function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet-network based profiling method with the feature space mapping technique which can accurately recognize function patterns from transcript expressions. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanglab.ccmb.med.umich.edu/TripletGO/.
Motivation: The full description of nucleic acid conformation involves eight torsion angles per n... more Motivation: The full description of nucleic acid conformation involves eight torsion angles per nucleotide. To simplify this description, we previously developed a representation of the nucleic acid backbone that assigns each nucleotide a pair of pseudo-torsion angles (eta and theta defined by P and C4 0 atoms; or eta 0 and theta 0 defined by P and C1 0 atoms). A Java program, AMIGOS II, is currently available for calculating eta and theta angles for RNA and for performing motif searches based on eta and theta angles. However, AMIGOS II lacks the ability to parse DNA structures and to calculate eta 0 and theta 0 angles. It also has little visualization capacity for 3D structure, making it difficult for users to interpret the computational results. Results: We present AMIGOS III, a PyMOL plugin that calculates the pseudo-torsion angles eta, theta, eta 0 and theta 0 for both DNA and RNA structures and performs motif searching based on these angles. Compared to AMIGOS II, AMIGOS III offers improved pseudo-torsion angle visualization for RNA and faster nucleic acid worm database generation; it also introduces pseudo-torsion angle visualization for DNA and nucleic acid worm visualization. Its integration into PyMOL enables easy preparation of tertiary structure inputs and intuitive visualization of involved structures.
Deep learning-based contact prediction in C-I-TASSER; violin plots for portions of residues predi... more Deep learning-based contact prediction in C-I-TASSER; violin plots for portions of residues predicted by TMHMM2.0 to be within transmembrane helices for JCVI-syn3.0 proteins that are annotated versus unannotated by C-I-TASSER/ COFACTOR with C-score > 0.5 for specific GO terms in the MF, BP, and CC aspects; structural alignment of the predicted structure of MMSYN1_0877 with the nine closest structural homologues identified in the Protein Data Bank; substrate binding domains for ECF systems targeting riboflavin; and a random PPI network for syn3.0, where 2483 of all 95,703 protein pairs are randomly selected as the positive PPI pairs (PDF)
Recent improvements in computational and experimental techniques for obtaining protein structures... more Recent improvements in computational and experimental techniques for obtaining protein structures have resulted in an explosion of 3D coordinate data. To cope with the ever-increasing sizes of structure databases, this work proposes the Protein Data Compression (PDC) format, which compresses coordinates and temperature factors of full-atomic and Cα-only protein structures. Without loss of precision, PDC results in 69% to 78% smaller file sizes than Protein Data Bank (PDB) and macromolecular Crystallographic Information File (mmCIF) files with standard GZIP compression. It uses ∼60% less space than existing compression algorithms specific to macromolecular structures. PDC optionally performs lossy compression with minimal sacrifice of precision, which allows reduction of file sizes by another 79%. Conversion between PDC, mmCIF and PDB formats is typically achieved within 0.02 s. The compactness and fast reading/writing speed of PDC make it valuable for storage and analysis of large quantity of tertiary structural data.
Background: Although mmCIF is the current official format for deposition of protein and nucleic a... more Background: Although mmCIF is the current official format for deposition of protein and nucleic acid structures to the protein data bank (PDB) database, the legacy PDB format is still the primary supported format for many structural bioinformatics tools. Therefore, reliable software to convert mmCIF structure files to PDB files is needed. Unfortunately, existing conversion programs fail to correctly convert many mmCIF files, especially those with many atoms and/or long chain identifies. Results: This study proposed BeEM, which converts any mmCIF format structure files to PDB format. BeEM conversion faithfully retains all atomic and chain information, including chain IDs with more than 2 characters, which are not supported by any existing mmCIF to PDB converters. The conversion speed of BeEM is at least ten times faster than existing converters such as MAXIT and Phenix. Part of the reason for the speed improvement is the avoidance of conversion between numerical values and text strings. Conclusion: BeEM is a fast and accurate tool for mmCIF-to-PDB format conversion, which is a common procedure in structural biology. The source code is available under the BSD licence at https:// github. com/ kad-ecoli/ BeEM/.
Despite considerable research progress on SARS-CoV-2, the direct zoonotic origin (intermediate ho... more Despite considerable research progress on SARS-CoV-2, the direct zoonotic origin (intermediate host) of the virus remains ambiguous. The most definitive approach to identify the intermediate host would be the detection of SARS-CoV-2-like coronaviruses in wild animals. However, due to the high number of animal species, it is not feasible to screen all the species in the laboratory. Given that binding to ACE2 proteins is the first step for the coronaviruses to invade host cells, we propose a computational pipeline to identify potential intermediate hosts of SARS-CoV-2 by modeling the binding affinity between the Spike receptor-binding domain (RBD) and host ACE2. Using this pipeline, we systematically examined 285 ACE2 variants from mammals, birds, fish, reptiles, and amphibians, and found that the binding energies calculated for the modeled Spike-RBD/ACE2 complex structures correlated closely with the effectiveness of animal infection as determined by multiple experimental data sets. Built on the optimized binding affinity cutoff, we suggest a set of 96 mammals, including 48 experimentally investigated ones, which are permissive to SARS-CoV-2, with candidates from primates, rodents, and carnivores at the highest risk of infection. Overall, this work not only suggests a limited range of potential intermediate SARS-CoV-2 hosts for further experimental investigation, but also, more importantly, it proposes a new structure-based approach to general zoonotic origin and susceptibility analyses that are critical for human infectious disease control and wildlife protection.
In this article, we report 3D structure prediction results by two of our best server groups ("Zha... more In this article, we report 3D structure prediction results by two of our best server groups ("Zhang-Server" and "QUARK") in CASP14. These two servers were built based on the D-I-TASSER and D-QUARK algorithms, which integrated four newly developed components into the classical protein folding pipelines, I-TASSER and QUARK, respectively. The new components include: (i) a new multiple sequence alignment (MSA) collection tool, DeepMSA2, which is extended from the DeepMSA program; (ii) a contact-based domain boundary prediction algorithm, FUpred, to detect protein domain boundaries; (iii) a residual convolutional neural network-based method, DeepPotential, to predict multiple spatial restraints by co-evolutionary features derived from the MSA; and (iv) optimized spatial restraint energy potentials to guide the structure assembly simulations. For 37 FM targets, the average TM-scores of the first models produced by D-I-TASSER and D-QUARK were 96% and 112% higher than those constructed by I-TASSER and QUARK, respectively. The data analysis indicates noticeable improvements produced by each of the four new components, especially for the newly added spatial restraints from DeepPotential and the well-tuned force field that combines spatial restraints, threading templates, and generic knowledge-based potentials. However, challenges still exist in the current pipelines. These include difficulties in modeling multi-domain proteins due to low accuracy in interdomain distance prediction and modeling protein domains from oligomer complexes, as the coevolutionary analysis cannot distinguish inter-chain and intra-chain distances. Specifically tuning the deep learning-based predictors for multi-domain targets and protein complexes may be helpful to address these issues. 
This is the author manuscript accepted for publication and has undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as
The LOMETS2 server (https://zhanglab.ccmb.med. umich.edu/LOMETS/) is an online meta-threading ser... more The LOMETS2 server (https://zhanglab.ccmb.med. umich.edu/LOMETS/) is an online meta-threading server system for template-based protein structure prediction. Although the server has been widely used by the community over the last decade, the previous LOMETS server no longer represents the stateof-the-art due to aging of the algorithms and unsatisfactory performance on distant-homology template identification. An extension of the server built on cutting-edge methods, especially techniques developed since the recent CASP experiments, is urgently needed. In this work, we report the recent advancements of the LOMETS2 server, which comprise a number of major new developments, including (i) new state-of-the-art threading programs, including contact-map-based threading approaches, (ii) deep sequence search-based sequence profile construction and (iii) a new web interface design that incorporates structure-based function annotations. Large-scale benchmark tests demonstrated that the integration of the deep profiles and new threading approaches into LOMETS2 significantly improve its structure modeling quality and template detection, where LOMETS2 detected 176% more templates with TM-scores >0.5 than the previous LOMETS server for Hard targets that lacked homologous templates. Meanwhile, the newly incorporated structure-based function prediction helps extend the usefulness of the online server to the broader biological community.
Motivation: The recently released PyMod GUI integrates many of the individual steps required for ... more Motivation: The recently released PyMod GUI integrates many of the individual steps required for protein sequence-structure analysis and homology modeling within the interactive visualization capabilities of PyMOL. Here we describe the improvements introduced into the version 2.0 of PyMod. Results: The original code of PyMod has been completely rewritten and improved in version 2.0 to extend PyMOL with packages such as Clustal Omega, PSIPRED and CAMPO. Integration with the popular web services ESPript and WebLogo is also provided. Finally, a number of new MODELLER functionalities have also been implemented, including SALIGN, modeling of quaternary structures, DOPE scores, disulfide bond modeling and choice of heteroatoms to be included in the final model.
We report the results of two fully automated structure prediction pipelines, "Zhang-Server" and "... more We report the results of two fully automated structure prediction pipelines, "Zhang-Server" and "QUARK", in CASP13. The pipelines were built upon the C-I-TASSER and C-QUARK programs, which in turn are based on I-TASSER and QUARK but with three new modules: (a) a novel multiple sequence alignment (MSA) generation protocol to construct deep sequence-profiles for contact prediction; (b) an improved metamethod, NeBcon, which combines multiple contact predictors, including ResPRE that predicts contact-maps by coupling precision-matrices with deep residual convolutional neural-networks; and (c) an optimized contact potential to guide structure assembly simulations. For 50 CASP13 FM domains that lacked homologous templates, average TM-scores of the first models produced by C-I-TASSER and C-QUARK were 28% and 56% higher than those constructed by I-TASSER and QUARK, respectively. For the first time, contact-map predictions demonstrated usefulness on TBM domains with close homologous templates, where TM-scores of C-I-TASSER models were significantly higher than those of I-TASSER models with a P-value <.05. Detailed data analyses showed that the success of C-I-TASSER and C-QUARK was mainly due to the increased accuracy of deep-learning-based contact-maps, as well as the careful balance between sequence-based contact restraints, threading templates, and generic knowledge-based potentials. Nevertheless, challenges still remain for predicting quaternary structure of multi-domain proteins, due to the difficulties in domain partitioning and domain reassembly. In addition, contact prediction in terminal regions was often unsatisfactory due to the sparsity of MSAs. Development of new contact-based domain partitioning and assembly methods and training contact models on sparse MSAs may help address these issues.
International Journal of Molecular Sciences, Mar 9, 2021
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY
Acta Crystallographica Section D: Structural Biology, Nov 22, 2017
Helical transmembrane proteins are a ubiquitous and important class of proteins, but present difficulties for crystallographic structure solution. Here, the effectiveness of the AMPLE molecular replacement pipeline in solving α-helical transmembrane-protein structures is assessed using a small library of eight ideal helices, as well as search models derived from ab initio models generated both with and without evolutionary contact information. The ideal helices prove to be surprisingly effective at solving higher resolution structures, but ab initio-derived search models are able to solve structures that could not be solved with the ideal helices. The addition of evolutionary contact information results in a marked improvement in the modelling and makes additional solutions possible.
Motivation: Comparison of RNA 3D structures can be used to infer functional relationships of RNA molecules. Most of the current RNA structure alignment programs are built on size-dependent scales, which complicate the interpretation of structure and functional relations. Meanwhile, the low speed prevents the programs from being applied to large-scale RNA structural database search. Results: We developed an open-source algorithm, RNA-align, for RNA 3D structure alignment which has the structure similarity scaled by a size-independent and statistically interpretable scoring metric. Large-scale benchmark tests show that RNA-align significantly outperforms other state-of-the-art programs in both alignment accuracy and running speed. The major advantage of RNA-align lies in the quick convergence of the heuristic alignment iterations and the coarse-grained secondary structure assignment, both of which are crucial to the speed and accuracy of RNA structure alignments. Availability and implementation: https://zhanglab.ccmb.med.umich.edu/RNA-align/.
Prokaryotes and some unicellular eukaryotes routinely overcome evolutionary pressures with the help of horizontally acquired genes. In contrast, it is unusual for multicellular eukaryotes to adapt through horizontal gene transfer (HGT). Recent studies identified several cases of adaptive acquisition in the gut-dwelling multicellular fungal phylum Neocallimastigomycota. Here, we add to these cases the acquisition of a putative bacterial toxin immunity gene, PoNi, by an ancient common ancestor of four extant Neocallimastigomycota genera through HGT from an extracellular Ruminococcus bacterium. The PoNi homologs in these fungal genera share extraordinarily high (>70%) amino acid sequence identity with their bacterial donor xenolog, providing definitive evidence of HGT as opposed to lineage-specific gene retention. Furthermore, PoNi genes are nested on native sections of chromosomal DNA in multiple fungal genomes and are also found in polyadenylated fungal transcriptomes, confirming that these genes are authentic fungal genomic regions rather than sequencing artifacts from bacterial contamination. The HGT event, which is estimated to have occurred at least 66 (±10) million years ago in the gut of a Cretaceous mammal, gave the fungi a putative toxin immunity protein (PoNi) which likely helps them survive toxin-mediated attacks by bacterial competitors in the mammalian gut microbiome.
Papers by Chengxin Zhang