Sequence motifs that are potentially recognized by DNA-binding proteins occur far more often in genomic DNA than do observed in vivo protein-DNA interactions. To determine how chromatin influences the utilization of particular DNA-binding sites, we compared the in vivo genome-wide binding location of the yeast transcription factor Leu3 to the binding location observed on the same genomic DNA in the absence of any protein cofactors. We found that the DNA-sequence motif recognized by Leu3 in vitro and in vivo was functionally indistinguishable, but Leu3 bound different genomic locations under the two conditions. Accounting for nucleosome occupancy in addition to DNA-sequence motifs significantly improved the prediction of protein-DNA interactions in vivo, but not the prediction of sites bound by purified Leu3 in vitro. Use of histone modification data does not further improve binding predictions, presumably because their effect is already manifest in the global histone distribution. Measurements of nucleosome occupancy in strains that differ in Leu3 genotype show that low nucleosome occupancy at loci bound by Leu3 is not a consequence of Leu3 binding. These results permit quantitation of the epigenetic influence that chromatin exerts on DNA binding-site selection, and provide evidence for an instructive, functionally important role for nucleosome occupancy in determining patterns of regulatory factor targeting genome-wide.
Hidden Markov models representing 167 protein sequence families were used to infer the presence or absence of homologs within the transcriptomes of 183 algal species/strains. Statistical analyses of the distribution of HMM hits across major clades of algae, or at branch points on the phylogenetic tree of 98 chlorophytes, confirmed and extended known cases of metabolic loss and gain, most notably the loss of the mevalonate pathway for terpenoid synthesis in green algae but not, as we show here, in the streptophyte algae. Evidence for novel events was found as well, most remarkably in the recurrent and coordinated gain or loss of enzymes for the glyoxylate shunt. We find, as well, a curious pattern of retention (or re-gain) of HMG-CoA synthase in chlorophytes that have otherwise lost the mevalonate pathway, suggesting a novel, co-opted function for this enzyme in select lineages. Finally, we find striking, phylogenetically linked distributions of coding sequences for three pathways th...
More than 3 percent of the protein sequences inferred from the Caenorhabditis elegans genome contain sequence motifs characteristic of zinc-binding structural domains, and of these more than half are believed to be sequence-specific DNA-binding proteins. The distribution of these zinc-binding domains among the genomes of various organisms offers insights into the role of zinc-binding proteins in evolution. In addition, the complete genome sequence of C. elegans provides an opportunity to analyze, and perhaps predict, pathways of transcriptional regulation.
Using a physically principled method of scoring genomic sequences for the potential to be bound by transcription factors, we have developed an algorithm for assessing the conservation of predicted binding occupancy that does not rely on sequence alignment of promoters. The method, which we call orthologweighting, assesses the degree to which the predicted binding occupancy of a transcription factor in a reference gene is also predicted in the promoters of orthologous genes. The analysis was performed separately for over 100 different transcription factors in S. cerevisiae. Statistical significance was evaluated by simulation using permuted versions of the position weight matrices. Orthologweighting produced about twice as many significantly high scoring genes as were obtained from the S. cerevisiae genome alone. Gene Ontology analysis found a similar twofold enrichment of genes. Both analyses suggest that orthologweighting improves the prediction of true regulatory targets. Interestingly, the method has only a marginal effect on the prediction of binding by chromatin immunoprecipitation (ChIP) assays. We suggest several possibilities for reconciling this result with the improved enrichment that we observe for functionally related promoters and for promoters that are under positive selection.
Journal of integrative bioinformatics, Jan 18, 2012
Advances in high-throughput sequencing technology have enabled the identification of transcription factor (TF) binding sites at genome scale. TF binding studies are important for medical applications and stem cell research. Somatic cells can be reprogrammed to a pluripotent state by the combined introduction of factors such as Oct4, Sox2, c-Myc, and Klf4. These reprogrammed cells share many characteristics with embryonic stem cells (ESCs) and are known as induced pluripotent stem cells (iPSCs). The signaling requirements for maintenance of human and murine ESCs differ considerably. Genome-wide ChIP-seq TF binding maps in mouse stem cells include Oct4, Sox2, Nanog, Tbx3, and Smad2, as well as a group of other factors. ChIP-seq allows the study of new candidate transcription factors for reprogramming. It was shown that Nr5a2 could replace Oct4 for reprogramming. Epigenetic modifications play an important role in the regulation of gene expression, adding additional complexity to transc...
The sequence specificity of DNA-binding proteins is the primary mechanism by which the cell recognizes genomic features. Here, we describe systematic determination of yeast transcription factor DNA-binding specificities. We obtained binding specificities for 112 DNA-binding proteins representing 19 distinct structural classes. One-third of the binding specificities have not been previously reported. Several binding sequences have striking genomic distributions relative to transcription start sites, supporting their biological relevance and suggesting a role in promoter architecture. Among these are Rsc3 binding sequences, containing the core CGCG, which are found preferentially ~100 bp upstream of transcription start sites. Mutation of RSC3 results in a dramatic increase in nucleosome occupancy in hundreds of proximal promoters containing a Rsc3 binding element, but has little impact on promoters lacking Rsc3 binding sequences, indicating that Rsc3 plays a broad role in targeting nucleosome exclusion at yeast promoters.
The structure of the Drosophila engrailed homeodomain has been solved by molecular replacement and refined to an R-factor of 19.7% at a resolution of 2.1 Å. This structure offers a high-resolution view of an important family of DNA-binding proteins and allows comparison to the structure of the same protein bound to DNA. The most significant difference between the current structure and that of the 2.8-Å engrailed-DNA complex is the close packing of an extended strand against the rest of the protein in the unbound protein. Structural features of the protein not previously noted include a "herringbone" packing of 4 aromatic residues in the core of the protein and an extensive network of salt bridges that covers much of the helix 1-helix 2 surface. Other features that may play a role in stabilizing the native state include the interaction of buried carbonyl oxygen atoms with the edge of Phe 49 and a bias toward statistically preferred side-chain dihedral angles. There is substantial disorder at both ends of the 61 amino acid protein. A 51-amino acid variant of engrailed (residues 6-56) was synthesized and shown by CD and thermal denaturation studies to be structurally and thermodynamically similar to the full-length domain.
Homeodomains are 60 amino acid DNA binding domains found in numerous eukaryotic transcription factors. The homeodomain family is a useful system for studying sequence-structure relationships because several hundred sequences are known and the structures of several homeodomains have been determined. Covariation of amino acid residues in the homeodomain family has been investigated to see whether strongly covariant residue pairs can be understood in terms of the structure and function of these domains. Among 16 strongly covariant pairs examined, 2 are explained by the ability to form salt bridges, and 9 appear related to the DNA binding function of the proteins. For the remaining 5 pairs, the rationale for covariance remains unclear and the likelihood of artifactual correlations is discussed in the context of experimental and evolutionary biases in the selection of sequences. No significant correlation was found between covariance and structural proximity in the hydrophobic core.
Papers by Neil Clarke