Tim Hubbard

Papers by Tim Hubbard

Global implementation of genomic medicine: We are not alone

Science Translational Medicine, Jan 3, 2015

Around the world, innovative genomic-medicine programs capitalize on singular capabilities arising from local health care systems, cultural or political milieus, and unusual selected risk alleles or disease burdens. Such individual efforts might benefit from the sharing of approaches and lessons learned in other locales. The U.S. National Human Genome Research Institute and the National Academy of Medicine recently brought together 25 of these groups to compare projects, to examine the current state of implementation and desired near-term capabilities, and to identify opportunities for collaboration that promote the responsible practice of genomic medicine. Efforts to coalesce these groups around concrete but compelling signature projects should accelerate the responsible implementation of genomic medicine in efforts to improve clinical care worldwide.

Human Genome: Draft Sequence

GENCODE: producing a reference annotation for ENCODE

Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results. Results: The GENCODE gene features are divided into eight different categories, of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions. Conclusions: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which reflects the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.

Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17

Genome Biology, 2007

Assessing the pig genome project. The sequencing, annotation and comparative analysis of an 8Mb...

Protein Folds in the All-β and All-α Classes

Annual Review of Biophysics and Biomolecular Structure, 1997

Analysis of the structures in the Protein Data Bank, released in June 1996, shows that the number of different protein folds, i.e. the number of different arrangements of major secondary structures and/or chain topologies, is 327. Of these folds, approximately 25% belong to the all-α class, 20% belong to the all-β class, 30% belong to the α/β class, and 25% belong to the α + β class. We describe the types of folds now known for the all-β and all-α classes, emphasizing those that have been discovered recently. Detailed theories for the physical determinants of the structures of most of these folds now exist, and these are reviewed.

New tools and expanded data analysis capabilities at the protein structure prediction center

Proteins: Structure, Function, and Bioinformatics, 2007

We outline the main tasks performed by the Protein Structure Prediction Center in support of the CASP7 experiment and provide a brief review of the major measures used in the automatic evaluation of predictions. We describe in more detail the software developed to facilitate analysis of modeling success over and beyond the available templates and the adopted Java-based tool enabling visualization of multiple structural superpositions between target and several models/templates. We also give an overview of the CASP infrastructure provided by the Center and discuss the organization of the results web pages available through http://predictioncenter.org.

Prediction of the structure of GroES and its interaction with GroEL

Proteins: Structure, Function, and Genetics, 1995

The three-dimensional structure of the GroES monomer and its interaction with GroEL has been predicted using a combination of prediction tools and experimental data obtained by biophysical [electron microscope (EM), Fourier transform infrared (FTIR), and nuclear magnetic resonance (NMR)] and biochemical techniques. The GroES monomer, according to the prediction, is composed of eight β-strands forming a β-barrel with loose ends. In the model, β-strands 5-8 run along the outer surface of GroES, forming an antiparallel β-sheet with β4 loosely bound to one of the edges. β-strands 1-3 would then be parallel and placed in the interior of the molecule. Loops 1-3 would face the internal cavity of the GroEL-GroES complex, and together with conserved residues in loops 5 and 7, would form the active surface interacting with GroEL.

ITFoM – The IT Future of Medicine

Procedia Computer Science, 2011

Evidence for Transcript Networks Composed of Chimeric RNAs in Human Cells

PLoS ONE, 2012

The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein-coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well-annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence

Nucleic Acids Research, 2005

NestedMICA is a new, scalable, pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range of scenarios, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs which were good predictors for all five abundant classes of annotated binding sites.

The vertebrate genome annotation (Vega) database

Nucleic Acids Research, 2007

The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) was first made public in 2004 and has been designed to view manual annotation of human, mouse and zebrafish genomic sequences produced at the Wellcome Trust Sanger Institute. Since its initial release, the number of human annotated loci has more than doubled to close to 33 000 and now contains comprehensive annotation on 20 of the 24 human chromosomes, four whole mouse chromosomes and around 40% of the zebrafish Danio rerio genome. In addition, we offer manual annotation of a number of haplotype regions in mouse and human and regions of comparative interest in pig and dog that are unique to Vega.

Landscape of transcription in human cells

Nature, 2012

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.

Prepublication data sharing

Nature, 2009

Rapid release of prepublication data has served the field of genomics well. Attendees at a worksh...

Domain Insertions in Protein Structures

Journal of Molecular Biology, 2004

Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C terminus of an insert is significantly smaller when compared to the N-to-C distance in a parent context or a single domain context. Insertions are found always to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.

Intermediate sequences increase the detection of homology between sequences

Journal of Molecular Biology, 1997

A physical map of the mouse genome

Nature, 2002

A physical map of a genome is an essential guide for navigation, allowing the location of any gene or other landmark in the chromosomal DNA. We have constructed a physical map of the mouse genome that contains 296 contigs of overlapping bacterial clones and 16,992 unique markers. The mouse contigs were aligned to the human genome sequence on the basis of 51,486 homology matches, thus enabling use of the conserved synteny (correspondence between chromosome blocks) of the two genomes to accelerate construction of the mouse map. The map provides a framework for assembly of whole-genome shotgun sequence data, and a tile path of clones for generation of the reference sequence. Definition of the human-mouse alignment at this level of resolution enables identification of a mouse clone that corresponds to almost any position in the human genome. The human sequence may be used to facilitate construction of other mammalian genome maps using the same strategy.

Initial sequencing and comparative analysis of the mouse genome

Nature, 2002

The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.

Genome-wide end-sequenced BAC resources for the NOD/MrkTac and NOD/ShiLtJ mouse genomes

Genomics, 2010

Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting β-cells by an autoimmune driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments.

Keywords: Bacterial artificial chromosome; NOD/MrkTac; NOD/ShiLtJ; Mouse genome; Non-obese diabetic (NOD); Type 1 diabetes (T1D); Insulin-dependent diabetes (IDD)

New horizons in sequence analysis

Current Opinion in Structural Biology, 1997

An ever increasing number of protein sequences are being compared, partly because of the availability of full sets of protein sequences from several completed genome-sequencing projects. The resulting problem of scale has shifted the emphasis of sequence analysis method development from sensitivity and flexibility, which relies on manual intervention and interpretation, to the automatic generation of results of known reliability.

Large-Scale Mutagenesis in p19ARF- and p53-Deficient Mice Identifies Cancer Genes and Their Collaborative Networks

Cell, 2008

p53 and p19ARF are tumor suppressors frequently mutated in human tumors. In a high-throughput screen in mice for mutations collaborating with either p53 or p19ARF deficiency, we identified 10,806 retroviral insertion sites, implicating over 300 loci in tumorigenesis. This dataset reveals 20 genes that are specifically mutated in either p19ARF-deficient, p53-deficient or wild-type mice (including Flt3, mmu-mir-106a-363, Smg6, and Ccnd3), as well as networks of significant collaborative and mutually exclusive interactions between cancer genes. Furthermore, we found candidate tumor suppressor genes, as well as distinct clusters of insertions within genes like Flt3 and Notch1 that induce mutants with different spectra of genetic interactions. Cross-species comparative analysis with aCGH data of human cancer cell lines revealed known and candidate oncogenes (Mmp13, Slamf6, and Rreb1) and tumor suppressors (Wwox and Arfrp2). This dataset should prove to be a rich resource for the study of genetic interactions that underlie tumorigenesis.

(D) Identification of CISs near Myc with different kernel sizes. Red line represents insertion density for 300 kb kernel, green line for 30 kb, and the blue line for 5 kb. Blue and red denote sense and antisense insertions, respectively. (E) Insertions from p53−/−, p19ARF−/−, and wild-type tumors were analyzed together with a 30 kb kernel to determine insertion density over the genome (left panel). The cutoff (p value < 0.05) for significant insertion density is indicated (red line). CISs (p value < 0.05) are indicated by green vertical bars. A list of the insertion density of the 15 most significant CISs is included (right panel).

Global implementation of genomic medicine: We are not alone

Science translational medicine, Jan 3, 2015

Around the world, innovative genomic-medicine programs capitalize on singular capabilities arisin... more Around the world, innovative genomic-medicine programs capitalize on singular capabilities arising from local health care systems, cultural or political milieus, and unusual selected risk alleles or disease burdens. Such individual efforts might benefit from the sharing of approaches and lessons learned in other locales. The U.S. National Human Genome Research Institute and the National Academy of Medicine recently brought together 25 of these groups to compare projects, to examine the current state of implementation and desired near-term capabilities, and to identify opportunities for collaboration that promote the responsible practice of genomic medicine. Efforts to coalesce these groups around concrete but compelling signature projects should accelerate the responsible implementation of genomic medicine in efforts to improve clinical care worldwide.

Human Genome: Draft Sequence

GENCODE: producing a reference annotation for ENCODE

Background: The GENCODE consortium was formed to identify and map all protein-coding genes within... more Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results. Results: The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions. Conclusions: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of GENCODE annotation with RefSeq and ENSEMBL show only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP and the GENCODE collaboration is continuing to refine its annotation of 1% human genome with the aid of experimental validation.

Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17

Genome Biology, 2007

Assessing the pig genome project <p>The sequencing, annotation and comparative analysis of an 8Mb... more

Protein Folds in the All-Β and All-Α Classes

Annual Review of Biophysics and Biomolecular Structure, 1997

▪ Analysis of the structures in the Protein Databank, released in June 1996, shows that the num... more ▪ Analysis of the structures in the Protein Databank, released in June 1996, shows that the number of different protein folds, i.e. the number of different arrangements of major secondary structures and/or chain topologies, is 327. Of these folds, approximately 25% belong to the all-α class, 20% belong to the all-β class, 30% belong to the α/β class, and 25% belong to the α + β class. We describe the types of folds now known for the all-β and all-α classes, emphasizing those that have been discovered recently. Detailed theories for the physical determinants of the structures of most of these folds now exist, and these are reviewed.

New tools and expanded data analysis capabilities at the protein structure prediction center

Proteins: Structure, Function, and Bioinformatics, 2007

We outline the main tasks performed by the Protein Structure Prediction Center in support of the ... more We outline the main tasks performed by the Protein Structure Prediction Center in support of the CASP7 experiment and provide a brief review of the major measures used in the automatic evaluation of predictions. We describe in more detail the software developed to facilitate analysis of modeling success over and beyond the available templates and the adopted Java-based tool enabling visualization of multiple structural superpositions between target and several models/templates. We also give an overview of the CASP infrastructure provided by the Center and discuss the organization of the results web pages available through http:// predictioncenter.org.

Prediction of the structure of GroES and its interaction with GroEL

Proteins: Structure, Function, and Genetics, 1995

The three-dimensional structure of the GroES monomer and its interaction with GroEL has been pred... more The three-dimensional structure of the GroES monomer and its interaction with GroEL has been predicted using a combination of prediction tools and experimental data obtained by biophysical [electron microscope (EM), Fourier transform infrared (FTIR), and nuclear magnetic resonance (NMR)] and biochemical techniques. The GroES monomer, according to the prediction, is composed of eight @-strands forming a @barrel with loose ends. In the model, p-strands 5-8 run along the outer surface of GroES, forming an antiparallel p-sheet with p4 loosely bound to one of the edges. p-strands 1 3 would then be parallel and placed in the interior of the molecule. Loops 1 3 would face the internal cavity of the GroEL-GroES complex, and together with conserved residues in loops 5 and 7, would form the active surface interacting with GroEL.

ITFoM – The IT Future of Medicine

Procedia Computer Science, 2011

If citing, it is advised that you check and use the publisher's definitive version for pagination... more If citing, it is advised that you check and use the publisher's definitive version for pagination, volume/issue, and date of publication details. And where the final published version is provided on the Research Portal, if citing you are again advised to check the publisher's website for any subsequent corrections.

Evidence for Transcript Networks Composed of Chimeric RNAs in Human Cells

PLoS ONE, 2012

The classic organization of a gene structure has followed the Jacob and Monod bacterial gene mode... more The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human has blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 59 and 39 transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggest that chimeric transcripts should not be studied in isolation, but together, as an RNA network.

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence

Nucleic Acids Research, 2005

NestedMICA is a new, scalable, pattern-discovery system for finding transcription factor binding ... more NestedMICA is a new, scalable, pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range scenario, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs which were good predictors for all five abundant classes of annotated binding sites.

The vertebrate genome annotation (Vega) database

Nucleic Acids Research, 2007

The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) was first made public... more The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) was first made public in 2004 and has been designed to view manual annotation of human, mouse and zebrafish genomic sequences produced at the Wellcome Trust Sanger Institute. Since its initial release, the number of human annotated loci has more than doubled to close to 33 000 and now contains comprehensive annotation on 20 of the 24 human chromosomes, four whole mouse chromosomes and around 40% of the zebrafish Danio rerio genome. In addition, we offer manual annotation of a number of haplotype regions in mouse and human and regions of comparative interest in pig and dog that are unique to Vega.

Landscape of transcription in human cells

Nature, 2012

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific ... more Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.

Prepublication data sharing

Nature, 2009

Rapid release of prepublication data has served the field of genomics well. Attendees at a worksh... more

Domain Insertions in Protein Structures

Journal of Molecular Biology, 2004

Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C terminus of an insert is significantly smaller when compared to the N-to-C distance in a parent context or a single domain context. Insertions are found always to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.

Intermediate sequences increase the detection of homology between sequences

Journal of Molecular Biology, 1997

A physical map of the mouse genome

Nature, 2002

A physical map of a genome is an essential guide for navigation, allowing the location of any gene or other landmark in the chromosomal DNA. We have constructed a physical map of the mouse genome that contains 296 contigs of overlapping bacterial clones and 16,992 unique markers. The mouse contigs were aligned to the human genome sequence on the basis of 51,486 homology matches, thus enabling use of the conserved synteny (correspondence between chromosome blocks) of the two genomes to accelerate construction of the mouse map. The map provides a framework for assembly of whole-genome shotgun sequence data, and a tile path of clones for generation of the reference sequence. Definition of the human-mouse alignment at this level of resolution enables identification of a mouse clone that corresponds to almost any position in the human genome. The human sequence may be used to facilitate construction of other mammalian genome maps using the same strategy.

Initial sequencing and comparative analysis of the mouse genome

Nature, 2002

The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.

Genome-wide end-sequenced BAC resources for the NOD/MrkTac and NOD/ShiLtJ mouse genomes

Genomics, 2010

Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting β-cells by an autoimmune-driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments. Keywords: Bacterial artificial chromosome; NOD/MrkTac; NOD/ShiLtJ; Mouse genome; Non-obese diabetic (NOD); Type 1 diabetes (T1D); Insulin-dependent diabetes (IDD).

New horizons in sequence analysis

Current Opinion in Structural Biology, 1997

An ever increasing number of protein sequences are being compared, partly because of the availability of full sets of protein sequences from several completed genome-sequencing projects. The resulting problem of scale has shifted the emphasis of sequence analysis method development from sensitivity and flexibility, which relies on manual intervention and interpretation, to the automatic generation of results of known reliability.

Large-Scale Mutagenesis in p19ARF- and p53-Deficient Mice Identifies Cancer Genes and Their Collaborative Networks

Cell, 2008

p53 and p19ARF are tumor suppressors frequently mutated in human tumors. In a high-throughput screen in mice for mutations collaborating with either p53 or p19ARF deficiency, we identified 10,806 retroviral insertion sites, implicating over 300 loci in tumorigenesis. This dataset reveals 20 genes that are specifically mutated in either p19ARF-deficient, p53-deficient or wild-type mice (including Flt3, mmu-mir-106a-363, Smg6, and Ccnd3), as well as networks of significant collaborative and mutually exclusive interactions between cancer genes. Furthermore, we found candidate tumor suppressor genes, as well as distinct clusters of insertions within genes like Flt3 and Notch1 that induce mutants with different spectra of genetic interactions. Cross-species comparative analysis with aCGH data of human cancer cell lines revealed known and candidate oncogenes (Mmp13, Slamf6, and Rreb1) and tumor suppressors (Wwox and Arfrp2). This dataset should prove to be a rich resource for the study of genetic interactions that underlie tumorigenesis. Figure legend (partial): (D) Identification of CISs near Myc with different kernel sizes. Red line represents insertion density for 300 kb kernel, green line for 30 kb, and the blue line for 5 kb. Blue and red denote sense and antisense insertions, respectively. (E) Insertions from p53−/−, p19ARF−/−, and wild-type tumors were analyzed together with a 30 kb kernel to determine insertion density over the genome (left panel). The cutoff (p value < 0.05) for significant insertion density is indicated (red line). CISs (p value < 0.05) are indicated by green vertical bars. A list of the insertion density of the 15 most significant CISs is included (right panel).