GKV 1157
Ensembl 2016
Andrew Yates1, Wasiu Akanni1, M. Ridwan Amode1, Daniel Barrell1,2, Konstantinos Billis1,
Denise Carvalho-Silva1, Carla Cummins1, Peter Clapham2, Stephen Fitzgerald1,
Laurent Gil1, Carlos García Girón1, Leo Gordon1, Thibaut Hourlier1, Sarah E. Hunt1,
Sophie H. Janacek1, Nathan Johnson1, Thomas Juettemann1, Stephen Keenan1, Ilias Lavidas1,
Fergal J. Martin1, Thomas Maurel1, William McLaren1, Daniel N. Murphy1, Rishi Nag1,
Michael Nuhn1, Anne Parker1, Mateus Patricio1, Miguel Pignatelli1, Matthew Rahtz2,
Harpreet Singh Riat1, Daniel Sheppard1, Kieron Taylor1, Anja Thormann1,
Received September 19, 2015; Revised October 19, 2015; Accepted October 19, 2015
ABSTRACT
The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.

* To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: [email protected]
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research, 2016, Vol. 44, Database issue D711

INTRODUCTION
Ensembl (http://www.ensembl.org) generates genomic datasets through a system that is designed to analyse, store and distribute data, and which enables interpretation through open data release. While acting as a hub of reference and baseline data similar to the UCSC Genome Browser (1) and RefSeq (2), we also distribute datasets we create and promote standards and interoperability between genomic resources. We engage with the scientific community through an active outreach program and helpdesk. In addition we collaborate with and often play active leadership roles in projects such as ENCODE (3), the Genome Reference Consortium (GRC) (4), the Global Alliance for Genomics and Health (GA4GH) and GENCODE (5). Ensembl is updated four to five times per year with each release representing a data and software freeze. This procedure ensures that all our data are consistent, no matter the method of access. Every release is accompanied by archived versions of our website and BioMart data mining tool with a three year rolling retention policy. All public data releases regardless of age are available from our FTP site, MySQL servers and public Git repositories. In addition a REST API provides program language agnostic access to the current data release.

Our analysis methods construct annotation through the processing and summarization of experimental evidence. Gene annotation relies on the alignment of cDNAs and proteins from resources such as RefSeq and UniProt (6) alongside building transcription models from RNA-seq alignment data. Our Regulatory Build is based on high quality experimental evidence from projects such as ENCODE and Roadmap Epigenomics (7) and is capable of annotating a diverse set of features across many distinct cell types (8). All gene and regulatory annotation is accessioned and versioned between releases enabling downstream analysis to accurately refer back to these annotations. We also produce comparative genomics resources, which build on top of this gene annotation to calculate gene evolution and orthology information and use genomic DNA to build whole genome pairwise and multiple sequence alignments. Finally our variation resources integrate disparate data sources (including dbSNP (9), HGMD (10), ClinVar (11)) and present them through a consistent integrated interface. Variant con-

have followed its recent decision to improve mouse gene annotation to a similar level to human and have adopted a new gene annotation release cycle. Computationally annotated mouse gene annotation is currently merged every release with manual annotation from the HAVANA project (19). The human genome receives an update every other release with zebrafish or rat receiving gene annotation updates in those releases when human does not. We have incorporated several minor assembly updates (three for human and two for mouse) including GRCh38.p3 and GRCm38.p4. Both were released in Ensembl 81 (July 2015). The human and mouse gene annotations are supplemented by three methods to help quantify transcript support and provide subsets of the GENCODE dataset. Transcript support levels (TSL)
based genotype layer to reduce storage, processing and access time for these data. In addition, we have redesigned the variation database schema to model individuals with multiple samples.

Alongside variation data we also bring in phenotype, trait and disease annotations for 14 species totalling 2.8 million annotations of genes, short variants, structural variants and QTLs. For human these data span over 15 000 phenotypes, traits and diseases from 17 sources including ClinVar, OMIM (Online Mendelian Inheritance in Man) (23) and the NHGRI-EBI GWAS Catalog (24). Eight sources are incorporated for other species including RGD (25), OMIA (26), AnimalQTL (27), ZFIN (28).

In addition to the above work we supplement our data

clustering of proteins in linear time compared to our previous approach. We are also developing two methods of gene tree reconstruction. The first constructs gene trees de novo whilst the second enables gene tree modification by removing genes or inserting new genes into the tree. Our new methods are being benchmarked via the Quest for Orthologues service (http://orthology.benchmarkservice.org), which tests a range of metrics including tree-consistency approaches, gene ontology and enzyme classification tests (35). These developments will address the increasing numbers of species available for comparative analysis in Ensembl and Ensembl Genomes and improve the stability of the predicted gene-trees and orthologies.
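
As a rough illustration of gene tree modification by removing a gene from an existing tree, the sketch below prunes one leaf from a small tree and collapses any internal node left with a single child. The `Node` class, the `prune` and `to_newick` helpers and the gene names are all hypothetical, chosen only for illustration. This is not Ensembl's implementation, which the text does not describe in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical gene tree node: leaves carry a gene name, internal nodes carry children."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def prune(node: Node, gene: str) -> Optional[Node]:
    """Return a copy of the tree without `gene`, or None if nothing is left."""
    if node.is_leaf():
        return None if node.name == gene else Node(node.name)
    pruned_children = (prune(child, gene) for child in node.children)
    kept = [child for child in pruned_children if child is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # An internal node with a single remaining child carries no
        # branching information, so it is replaced by that child.
        return kept[0]
    return Node(node.name, kept)


def to_newick(node: Node) -> str:
    """Serialise the tree topology in Newick notation (no branch lengths)."""
    if node.is_leaf():
        return node.name or ""
    inner = ",".join(to_newick(child) for child in node.children)
    return "(" + inner + ")" + (node.name or "")


# Illustrative three-gene tree; the gene names are made up.
tree = Node(children=[
    Node(children=[Node("geneA_human"), Node("geneA_mouse")]),
    Node("geneA_zebrafish"),
])
pruned = prune(tree, "geneA_mouse")

print(to_newick(tree) + ";")    # ((geneA_human,geneA_mouse),geneA_zebrafish);
print(to_newick(pruned) + ";")  # (geneA_human,geneA_zebrafish);
```

Returning a new tree rather than editing in place keeps the original available for comparison, which is convenient when checking how a modification changes the inferred relationships.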
control all downloads using an automatic checksum computation. We have also improved the installation script’s warning handling. In addition version information has been added to the cache to improve debugging and data source tracking.

Web
This year has seen a number of incremental improvements to our website. Data export has been re-engineered to simplify extracting data from our resource and now enables the download of sequences, pairwise alignments, multiple alignments, orthologues and gene-trees. We also provide improved high-resolution images for publication and high contrast images for use in presentations. User data import has been enhanced through better support for UCSC Track Hubs and we now accept hubs with data from multiple species. Additionally we support composite tracks and have improved track labelling. Finally our website supports track visibility settings allowing a hub to have a number of tracks enabled by default.

Gene Expression Atlas (GXA) baseline expression data is integrated into our website using a GXA provided JavaScript widget. Gene expression baseline levels are available for a number of studies including FANTOM5 (44) and GTEx (45). Our location view has been enhanced with two new interaction modes. In ‘Select’ mode, clicking and dragging with the mouse will select a resizable region and present a menu with options to either zoom or mark a region. The marked area is shown as a grey box and will remain marked until actively removed. This mark persists into our image exports as demonstrated in Figure 1. In ‘Drag’ mode, clicking on the image enables rudimentary scrolling navigation by dragging the image to the left or right. Finally we released a new mobile optimized website, as shown in Figure 2, which can be accessed at http://m.ensembl.org or via optional redirection from our main website. The website is optimized for reduced display sizes and offers targeted views of genes, variants and phenotypes. Mobile users can opt to return to the full site when they require more advanced functionality.

REST service
The last year has seen substantial growth in both data and usage for our programming language agnostic REST API (46). All DNA from the previous five human assemblies

BioMart
Our BioMart databases continue to be updated every release in order to provide the latest annotation and imported data (48). We have made protein domain coordinates, transcript length and the previously described GENCODE Basic, TSL and APPRIS datasets available. Our relationship with biomaRt and Bioconductor/R has resulted in the release of dedicated subroutines to allow R developers to easily query our live, GRCh37 and archive BioMart services (49). Finally, in release 79 (March 2015), we redesigned our regulation BioMart to improve query performance and to meet the projected demands in data volumes. We have created seven datasets targeting distinct classes of data such as regulatory features as annotated by our methods, as well as binding motifs and miRNA targets. Consequently, data restricted by genomic location can be retrieved six times faster than with our previous BioMart.

OUTREACH AND TRAINING
External user support is provided by means of face-to-face training courses, online training materials, social media and email help channels. Annually we deliver roughly 100 workshops at research institutes and conferences around the world in person and through live webinars. Online training covering five Ensembl courses is available through the EMBL-EBI Train Online interface (http://www.ebi.ac.uk/training/online/subjects/11), while our YouTube channel contains 35 training videos (https://www.youtube.com/user/EnsemblHelpdesk). Training material is also available via our help pages and workshops can be requested via our helpdesk.

Queries about working with Ensembl data, interfaces and APIs can be directed to our helpdesk ([email protected]) or our public developers mailing list ([email protected]). We are active on social media channels such as Twitter (https://twitter.com/ensembl), Facebook (https://www.facebook.com/Ensembl.org) and our blog (http://www.ensembl.info/). For example, we regularly use the #citedEnsembl hashtag on Twitter to highlight published research that has used Ensembl resources.

3. ENCODE Project Consortium. (2012) An integrated encyclopaedia of DNA elements in the human genome. Nature, 489, 57–74.
4. Church,D.M., Schneider,V.A., Graves,T., Auger,K., Cunningham,F., Bouk,N., Chen,H.-C., Agarwala,R., McLaren,W.M., Ritchie,G.R.S. et al. (2011) Modernizing reference genome assemblies. PLoS Biol., 9, e1001091.
5. Harrow,J., Frankish,A., Gonzalez,J.M., Tapanari,E., Diekhans,M., Kokocinski,F., Aken,B.L., Barrell,D., Zadissa,A., Searle,S. et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res., 22, 1760–1774.
6. UniProt Consortium. (2014) Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res., 42, D191–D198.
7. Roadmap Epigenomics Consortium. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.
8. Zerbino,D.R., Wilder,S.P., Johnson,N., Juettemann,T. and Flicek,P.R. (2015) The Ensembl Regulatory Build. Genome Biol., 16, 56.
Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res., 43, D789–D798.
24. Welter,D., MacArthur,J., Morales,J., Burdett,T., Hall,P., Junkins,H., Klemm,A., Flicek,P., Manolio,T., Hindorff,L. et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res., 42, D1001–D1006.
25. Shimoyama,M., Pons,J., Hayman,G.T., Laulederkind,S.J.F., Liu,W., Nigam,R., Petri,V., Smith,J.R., Tutaj,M., Wang,S.-J. et al. (2015) The Rat Genome Database 2015: genomic, phenotypic and environmental variations and disease. Nucleic Acids Res., 43, D743–D750.
26. Lenffer,J., Nicholas,F.W., Castle,K., Rao,A., Gregory,S., Poidinger,M., Mailman,M.D. and Ranganathan,S. (2006) OMIA (Online Mendelian Inheritance in Animals): an enhanced platform and integration into the Entrez search interface at NCBI. Nucleic Acids Res., 34, D599–D601.
27. Hu,Z.-L., Park,C.A., Wu,X.-L. and Reecy,J.M. (2013) Animal QTLdb: an improved database tool for livestock animal
37. Forbes,S.A., Beare,D., Gunasekaran,P., Leung,K., Bindal,N., Boutselakis,H., Ding,M., Bamford,S., Cole,C., Ward,S. et al. (2015) COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res., 43, D805–D811.
38. Exome Aggregation Consortium. (2015) Cambridge, http://exac.broadinstitute.org.
39. Tennessen,J.A., Bigham,A.W., O’Connor,T.D., Fu,W., Kenny,E.E., Gravel,S., McGee,S., Do,R., Liu,X., Jun,G. et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337, 64–69.
40. Liu,X., Jian,X. and Boerwinkle,E. (2013) dbNSFP v2.0: a Database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations. Hum. Mutat., 34, E2393–E2402.
41. Petryszak,R., Burdett,T., Fiorelli,B., Fonseca,N.A., Gonzalez-Porta,M., Hastings,E., Huber,W., Jupp,S., Keays,M., Kryvych,N. et al. (2014) Expression Atlas update––a database of gene and transcript expression from microarray- and sequencing-based