
D710–D716 Nucleic Acids Research, 2016, Vol. 44, Database issue
Published online 19 December 2015


doi: 10.1093/nar/gkv1157

Ensembl 2016
Andrew Yates1, Wasiu Akanni1, M. Ridwan Amode1, Daniel Barrell1,2, Konstantinos Billis1,
Denise Carvalho-Silva1, Carla Cummins1, Peter Clapham2, Stephen Fitzgerald1,
Laurent Gil1, Carlos García Girón1, Leo Gordon1, Thibaut Hourlier1, Sarah E. Hunt1, Sophie
H. Janacek1, Nathan Johnson1, Thomas Juettemann1, Stephen Keenan1, Ilias Lavidas1,
Fergal J. Martin1, Thomas Maurel1, William McLaren1, Daniel N. Murphy1, Rishi Nag1,
Michael Nuhn1, Anne Parker1, Mateus Patricio1, Miguel Pignatelli1, Matthew Rahtz2,
Harpreet Singh Riat1, Daniel Sheppard1, Kieron Taylor1, Anja Thormann1,

Downloaded from https://academic.oup.com/nar/article/44/D1/D710/2502651 by guest on 25 July 2024


Alessandro Vullo1, Steven P. Wilder1, Amonida Zadissa1, Ewan Birney1, Jennifer Harrow2,
Matthieu Muffato1, Emily Perry1, Magali Ruffier1, Giulietta Spudich1, Stephen J. Trevanion1,
Fiona Cunningham1, Bronwen L. Aken1, Daniel R. Zerbino1 and Paul Flicek1,2,*

1 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 2 Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK

Received September 19, 2015; Revised October 19, 2015; Accepted October 19, 2015

ABSTRACT
The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.

INTRODUCTION
Ensembl (http://www.ensembl.org) generates genomic datasets through a system that is designed to analyse, store and distribute data, and which enables interpretation through open data release. While acting as a hub of reference and baseline data similar to the UCSC Genome Browser (1) and RefSeq (2), we also distribute datasets we create and promote standards and interoperability between genomic resources. We engage with the scientific community through an active outreach program and helpdesk. In addition we collaborate with and often play active leadership roles in projects such as ENCODE (3), the Genome Reference Consortium (GRC) (4), the Global Alliance for Genomics and Health (GA4GH) and GENCODE (5). Ensembl is updated four to five times per year with each release representing a data and software freeze. This procedure ensures that all our data are consistent, no matter the method of access. Every release is accompanied by archived versions of our website and BioMart data mining tool with a three year rolling retention policy. All public data releases regardless of age are available from our FTP site, MySQL servers and public Git repositories. In addition a REST API provides program language agnostic access to the current data release.

Our analysis methods construct annotation through the processing and summarization of experimental evidence. Gene annotation relies on the alignment of cDNAs and proteins from resources such as RefSeq and UniProt (6) alongside building transcription models from RNA-seq align-

* To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: [email protected]


© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ment data. Our Regulatory Build is based on high quality experimental evidence from projects such as ENCODE and Roadmap Epigenomics (7) and is capable of annotating a diverse set of features across many distinct cell types (8). All gene and regulatory annotation is accessioned and versioned between releases enabling downstream analysis to accurately refer back to these annotations. We also produce comparative genomics resources, which build on top of this gene annotation to calculate gene evolution and orthology information and use genomic DNA to build whole genome pairwise and multiple sequence alignments. Finally our variation resources integrate disparate data sources (including dbSNP (9), HGMD (10), ClinVar (11)) and present them through a consistent integrated interface. Variant consequences are calculated with reference to our gene and regulatory annotation and quantified by standard protein consequence analyses.

Annotation is made available through a set of mature Perl Application Programming Interfaces (APIs), which broker data from our databases. These same APIs are used to build our website and analysis methods and they are available externally for others to use in building their own methods and tools. Our infrastructure, whilst originally developed to operate on chordates and core model organisms, has been successfully deployed for a wide range of taxa as shown by our sister project Ensembl Genomes (12). Tools such as the Ensembl Variant Effect Predictor (VEP) can be applied to any genome due to our common programming interface and support for standard data formats (13). All Ensembl software is available under an Apache 2.0 license and is free for all to use.

Release 82 (September 2015) makes 69 species available from our main website, 18 species from our pre-release website (http://pre.ensembl.org) and GRCh37 annotation served from a dedicated website (http://grch37.ensembl.org). All chordates have been analysed by our gene annotation methods. Variation data is available for 22 species and regulatory data is available for human and mouse. All three websites provide sequence search. Additional tools are available on our main and GRCh37 websites, including the VEP and an assembly coordinate conversion tool based on CrossMap (14). The main website is hosted from four locations based in the UK, Singapore, US East and West coasts; the final three are deployed on Amazon Web Services. We also provide the ability to attach standard bioinformatic data formats including BigBed, BigWig (15), VCF (16) and BAM (17) to visualize external data in the context of our own data and support the UCSC track hub format to orchestrate track configuration (18).

This report focuses on new data and important technological changes to the project. We explain how these improvements enhance Ensembl and aid the analysis and interpretation of genomic data.

GENOME ANNOTATION

Protein coding and non-protein coding gene annotation
Over the past year, we have concentrated on supporting our most accessed species and annotating a selection of new genomes. As a member of the GENCODE consortium, we have followed its recent decision to improve mouse gene annotation to a similar level to human and have adopted a new gene annotation release cycle. Computationally annotated mouse gene annotation is currently merged every release with manual annotation from the HAVANA project (19). The human genome receives an update every other release with zebrafish or rat receiving gene annotation updates in those releases when human does not. We have incorporated several minor assembly updates (three for human and two for mouse) including GRCh38.p3 and GRCm38.p4. Both were released in Ensembl 81 (July 2015). The human and mouse gene annotations are supplemented by three methods to help quantify transcript support and provide subsets of the GENCODE dataset. Transcript support levels (TSL) are an expression of how well mRNA and EST libraries align to transcripts across splice junctions. Transcripts are assigned a numeric value from one to five indicating the level of support. APPRIS is used to identify principal transcript isoforms of genes from proteomic datasets (20). Finally the GENCODE Basic representative transcript set prioritizes full-length protein coding transcripts over partial or non-protein coding transcripts and is based on rules agreed by the GENCODE consortium (21).

In addition, we have annotated two major assembly updates for rat (Rnor 6.0) and zebrafish (GRCz10). Both gene sets include manual annotation from the HAVANA project. We have also recently updated our lincRNA annotation methods to use candidate transcript models built from RNA-seq data. These are tested for protein coding potential by searching for Pfam domains and alignments to the UniProt database. Models that show no protein coding potential are labelled as lincRNA. We have applied this method to generate lincRNAs for rat and sheep and aim to extend the method to other species over the coming year. We have also produced preliminary transcript models for crab-eating macaque (Macaca fascicularis) and sperm whale (Physeter macrocephalus) by aligning experimental data and homologous proteins; these are available on our Pre! website.

To compare our annotation to external gene sets we have imported both Consensus CDS (CCDS) and RefSeq transcripts for human, mouse and selected other species, into our infrastructure (22). We annotate RefSeq transcripts when their mRNA sequence does not exactly match the underlying genomic sequence and provide details on where the mismatch or insertion-deletion occurs e.g. 5′ UTR, CDS, 3′ UTR. RefSeq transcripts that exactly match an overlapping Ensembl-generated model, with respect to the entire model or just coding exons, are annotated accordingly. All annotation is available to downstream analysis tools including the VEP and via our unified APIs.

Variation annotation
Our variation resources integrate essentially all publicly available germline and somatic variant data for 20 vertebrate species. Over the past year the number of SNPs, indels and structural variants in the databases has almost doubled to 468 million variants. We have seen a dramatic increase in genotypes for human (206 000 million), cow (160 million) and sheep (6300 million). This led us to develop a new VCF-

based genotype layer to reduce storage, processing and access time for these data. In addition, we have redesigned the variation database schema to model individuals with multiple samples.

Alongside variation data we also bring in phenotype, trait and disease annotations for 14 species totalling 2.8 million annotations of genes, short variants, structural variants and QTLs. For human these data span over 15 000 phenotypes, traits and diseases from 17 sources including ClinVar, OMIM (Online Mendelian Inheritance in Man) (23) and the NHGRI-EBI GWAS Catalog (24). Eight sources are incorporated for other species including RGD (25), OMIA (26), AnimalQTL (27), ZFIN (28).

In addition to the above work we supplement our data with available citations, pass variants through our quality-control procedures and predict the consequence of variants on our gene sets and regulatory regions. For every possible amino acid change in our 10 most popular species, we run SIFT with enhanced quality information (29) and PolyPhen-2 (30) (human only). We also compute Human Genome Variation Society (HGVS) nomenclature for every variant and recently moved to 3′ shifting of indels to conform to the HGVS specification.

Regulatory annotation
Ensembl Regulation annotation describes the functional role of non-genic genomic elements, in particular enhancers, promoters and insulators, using biochemical assays such as ChIP-Seq or DNAse1 hypersensitivity. The re-designed method, which was deployed last year on 18 human cell lines, has been extended to mouse and covers 8 cell lines and tissues (8).

In anticipation of high-level cis-regulatory datasets, which will link regulatory elements to neighbouring genes, we can now render interaction data on our graphical genome location view. Interaction elements are described by the existing WashU Epigenome Browser formats and then loaded onto our website via user upload or by specifying an HTTP URL (31). The data are then visualized as arcs spanning across the region in question, as illustrated in Figure 1.

Comparative annotation
Ensembl’s comparative analysis integrates the genome sequences and gene annotations of all available species into a single comprehensive resource. We have updated our whole-genome alignments due to updates to both rat and zebrafish assemblies. The zebrafish assembly update resulted in recomputing 20 pairwise whole genome alignments and our fish Enredo Pecan Ortheus (EPO) multiple alignments (32). We have also retired our fish-specific EPO method resulting in a single EPO production pipeline applicable to fish, mammal and sauropsid multiple sequence alignments.

Major development work is ongoing to move towards a new protein clustering and classification system. Our current method is based on clustering blastp distances using hcluster_sg (33). It will be replaced with a more straightforward HMM classification based upon PANTHER (34). Moving to an HMM classification will enable analysis and clustering of proteins in linear time compared to our previous approach. We are also developing two methods of gene tree reconstruction. The first constructs gene trees de novo whilst the second enables gene tree modification by removing genes or inserting new genes into the tree. Our new methods are being benchmarked via the Quest for Orthologues service (http://orthology.benchmarkservice.org), which tests a range of metrics including tree-consistency approaches, gene ontology and enzyme classification tests (35). These developments will address the increasing numbers of species available for comparative analysis in Ensembl and Ensembl Genomes and improve the stability of the predicted gene-trees and orthologies.

GRCh37 human assembly support
Our GRCh37 website, supporting the previous human assembly, has received two major releases over the past year. Our first (March 2015) included new variant data from the 1000 Genomes Project phase 3 (36), dbSNP, COSMIC v71 (37) and HGMD. The second update (October 2015) incorporated variants from the Exome Aggregation Consortium (ExAC) (38) and NHLBI Exome Sequencing Project (39). A full selection of resources is available for GRCh37 including BioMart and public MySQL access. Our FTP site (ftp://ftp.ensembl.org/pub/grch37) provides VEP cache and FASTA updates. We also maintain a GRCh37 REST API hosted at http://grch37.rest.ensembl.org.

WEBSITE, TOOLS AND INFRASTRUCTURE

VEP
This year we have improved the VEP’s ability to report transcript attributes such as a transcript’s existence in the GENCODE Basic set and a transcript’s TSL support level (both subject to availability). We also record, in the VEP’s output, if phenotype/disease data are available. Additionally, it is possible to report predictions on RefSeq transcripts and the VEP will now indicate if the transcript matches a model annotated by Ensembl. Finally the VEP now supports selenocysteine modifications.

The standalone VEP has been enhanced by several new plugins. It is now possible to retrieve ExAC allele frequencies from downloaded VCFs, to query for splice site predictions from dbscSNV (40) and finally to retrieve gene expression levels from Gene Expression Atlas data via their web service API (41). Other plugins allow the VEP to locate the nearest gene to a variant and indicate if a variant has been shifted in HGVS notation. Our online tool interface has also been updated to provide an immediate overview of a single variant’s consequences. The ‘Instant VEP’ tool queries our live REST API to return consequence data in less than a second.

Extensive development work on the VEP has resulted in significant reductions in runtime. For example, analysis of NA12878 from Illumina’s Platinum Genome dataset (annotating 4 498 138 variants) using the GRCh38 assembly and release 81 data took 113 min to complete using four compute cores (42). The same analysis performed using release 77 data and analysis, October 2014, took 199 min to complete. To improve the installation procedure we now quality


Figure 1. Ensembl’s location view showing the drawing of long-range interaction arcs and new region marking tool. The grey boxes indicate HindIII
fragments and the arcs represent selected significant interactions between promoters and their distal interacting elements, measured using high resolution
Capture Hi-C in the GM12878 lymphoblastoid cell line and displayed for the GRCh38 assembly (43). The summary Ensembl Regulatory Build and the
GM12878 specific regulatory activities are also shown. Our region marking tool is shown as a light grey box surrounding the transcripts PSMD9 and
RP11–87C12.2.
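
The interaction arcs in Figure 1 are drawn from pairwise interaction records. As a minimal sketch of how such records could be read before drawing, the Python below assumes a tab-separated text layout of the form `chrom,start,end<TAB>chrom,start,end<TAB>score`, which is our understanding of one WashU Epigenome Browser interaction format rather than something specified in this article. The names `Region`, `Interaction`, `parse_interactions` and `arc_span` are hypothetical, and the coordinates are made up.

```python
from typing import List, NamedTuple, Optional, Tuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


class Interaction(NamedTuple):
    source: Region
    target: Region
    score: float


def parse_region(field: str) -> Region:
    """Parse one 'chrom,start,end' field (assumed layout)."""
    chrom, start, end = field.split(",")
    return Region(chrom, int(start), int(end))


def parse_interactions(text: str) -> List[Interaction]:
    """Parse tab-separated lines holding two regions and a score."""
    interactions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        left, right, score = line.split("\t")
        interactions.append(
            Interaction(parse_region(left), parse_region(right), float(score))
        )
    return interactions


def arc_span(interaction: Interaction) -> Optional[Tuple[int, int]]:
    """Return the (start, end) an arc would cover, or None if the
    two regions lie on different chromosomes."""
    a, b = interaction.source, interaction.target
    if a.chrom != b.chrom:
        return None
    return min(a.start, b.start), max(a.end, b.end)


# Made-up coordinates, for illustration only.
EXAMPLE = (
    "chr12,1000,2000\tchr12,9000,9500\t7.5\n"
    "chr12,3000,3500\tchr5,100,200\t2\n"
)
```

Calling `parse_interactions(EXAMPLE)` yields two records; `arc_span` gives the extent of the first and `None` for the second, which spans two chromosomes and so could not be drawn as a single arc in one location view.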

control all downloads using an automatic checksum computation. We have also improved the installation script’s warning handling. In addition version information has been added to the cache to improve debugging and data source tracking.

Web
This year has seen a number of incremental improvements to our website. Data export has been re-engineered to simplify extracting data from our resource and now enables the download of sequences, pairwise alignments, multiple alignments, orthologues and gene-trees. We also provide improved high-resolution images for publication and high contrast images for use in presentations. User data import has been enhanced through better support for UCSC Track Hubs and we now accept hubs with data from multiple species. Additionally we support composite tracks and have improved track labelling. Finally our website supports track

(versions NCBI34 to GRCh38) are available by specifying the desired assembly version. Omitting a version assumes the latest assembly. Our VEP endpoint now supports annotation using HGVS variant nomenclature (e.g. AGT:c.803T>C) and querying for variants from a protein has been significantly improved. Building on our release in 2014, eight endpoints now support batch querying via the HTTP POST method including our sequence, identifier lookup and archive endpoints. We also support a number of GA4GH methods for retrieving sample genotype calls, variant calls on a reference sequence and for discovering available variant datasets and are actively working on a GA4GH variant annotation prototype.
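
The HGVS query and the batch POST support described above can be illustrated with a short Python sketch. It only builds the requests; nothing is sent unless `send` is called. The base URL `https://rest.ensembl.org`, the path layouts `/vep/:species/hgvs/:notation` and `/lookup/id`, and the `ids` key in the POST body reflect our understanding of the public REST service and should be treated as assumptions to check against its documentation. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # assumed base URL
JSON_HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}


def vep_hgvs_request(hgvs, species="human", server=REST_SERVER):
    """Build a GET request for the consequences of one HGVS string."""
    # '>' must be percent-encoded inside a URL path; ':' can stay literal.
    notation = urllib.parse.quote(hgvs, safe=":")
    url = "{}/vep/{}/hgvs/{}".format(server, species, notation)
    return urllib.request.Request(url, headers=JSON_HEADERS)


def batch_request(endpoint, ids, server=REST_SERVER):
    """Build a POST request carrying several identifiers in one JSON body."""
    body = json.dumps({"ids": list(ids)}).encode("utf-8")
    return urllib.request.Request(
        server + endpoint, data=body, headers=JSON_HEADERS, method="POST"
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON reply (needs network)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `send(vep_hgvs_request("AGT:c.803T>C"))` would ask for the consequences of the variant quoted in the text, and `send(batch_request("/lookup/id", ["ENSG00000139618", "ENSG00000157764"]))` would look up two gene identifiers in a single round trip instead of two.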



Figure 2. The new Ensembl mobile site showing BRCA2 and detailing available synonyms, genomic location and links to external resources such as CCDS. Search is available in the top right corner on all mobile site pages and each page has the ability to be shared over social media and email.

eHive
eHive is the pipeline management system that powers a significant proportion of our compute, over 300 CPU years of compute per year and is versioned outside of the Ensembl release cycle (47). This year we have released version 2.3, which now supports a generic guest language interface to facilitate the writing of runnables in languages other than Perl through a standardized interprocess communication protocol. We have written a reference implementation in Python allowing Python and Perl code to be executed in the same workflow. Support for Java is under active development. Finally version 2.3 enhances our standard modules by improving their ability to capture and respond to erroneous system commands alongside improved security when interacting with databases. All development is continuously tested on Travis CI (a public continuous integration service) with 70% of the code tree covered by tests.
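
To give a feel for what a runnable written in Python might look like, here is a toy model of a job's life cycle. It is not the eHive API: `ToyRunnable` is a hypothetical stand-in written for this illustration, and the three phases `fetch_input`, `run` and `write_output` are an assumption about how a runnable is typically organised, not something stated in this article.

```python
class ToyRunnable:
    """Hypothetical stand-in for a pipeline job; this is not the eHive API."""

    def __init__(self, **params):
        # Explicit parameters override the defaults declared by the subclass.
        self.params = {**self.param_defaults(), **params}
        self.emitted = []

    def param_defaults(self):
        return {}

    def fetch_input(self):
        """Read and validate parameters."""

    def run(self):
        """Do the computation."""

    def write_output(self):
        """Hand results on to the next step."""

    def dataflow(self, record):
        # A real system would pass this record to downstream jobs.
        self.emitted.append(record)

    def execute(self):
        # Run the three phases in order and return what was emitted.
        self.fetch_input()
        self.run()
        self.write_output()
        return self.emitted


class GCContent(ToyRunnable):
    """Example job: fraction of G and C bases in a sequence."""

    def param_defaults(self):
        return {"sequence": ""}

    def fetch_input(self):
        self.sequence = self.params["sequence"].upper()
        if not self.sequence:
            raise ValueError("sequence must not be empty")

    def run(self):
        gc = sum(self.sequence.count(base) for base in "GC")
        self.gc_fraction = gc / len(self.sequence)

    def write_output(self):
        self.dataflow({"gc_fraction": self.gc_fraction})
```

Splitting a job into separate input, compute and output phases keeps each phase small and lets a wrapper process drive the same sequence of calls whatever language the job is written in, which is the kind of arrangement a guest language interface needs.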

visibility settings allowing a hub to have a number of tracks enabled by default.

Gene Expression Atlas (GXA) baseline expression data is integrated into our website using a GXA provided JavaScript widget. Gene expression baseline levels are available for a number of studies including FANTOM5 (44) and GTEx (45). Our location view has been enhanced with two new interaction modes. In ‘Select’ mode, clicking and dragging with the mouse will select a resizable region and present a menu with options to either zoom or mark a region. The marked area is shown as a grey box and will remain marked until actively removed. This mark persists into our image exports as demonstrated in Figure 1. In ‘Drag’ mode, clicking on the image enables rudimentary scrolling navigation by dragging the image to the left or right. Finally we released a new mobile optimized website, as shown in Figure 2, which can be accessed at http://m.ensembl.org or via optional redirection from our main website. The website is optimized for reduced display sizes and offers targeted views of genes, variants and phenotypes. Mobile users can opt to return to the full site when they require more advanced functionality.

REST service
The last year has seen substantial growth in both data and usage for our programming language agnostic REST API (46). All DNA from the previous five human assemblies

BioMart
Our BioMart databases continue to be updated every release in order to provide the latest annotation and imported data (48). We have made protein domain coordinates, transcript length and the previously described GENCODE Basic, TSL and APPRIS datasets available. Our relationship with biomaRt and Bioconductor/R has resulted in the release of dedicated subroutines to allow R developers to easily query our live, GRCh37 and archive BioMart services (49). Finally, in release 79 (March 2015), we redesigned our regulation BioMart to improve query performance and to meet the projected demands in data volumes. We have created seven datasets targeting distinct classes of data such as regulatory features as annotated by our methods, as well as binding motifs and miRNA targets. Consequently, queries restricted by genomic location return results six times faster than with our previous BioMart.

OUTREACH AND TRAINING
External user support is provided by means of face-to-face training courses, online training materials, social media and email help channels. Annually we deliver roughly 100 workshops at research institutes and conferences around the world in person and through live webinars. Online training covering five Ensembl courses is available through the EMBL-EBI Train Online interface (http://www.ebi.ac.

uk/training/online/subjects/11), while our YouTube channel contains 35 training videos (https://www.youtube.com/user/EnsemblHelpdesk). Training material is also available via our help pages and workshops can be requested via our helpdesk.

Queries about working with Ensembl data, interfaces and APIs can be directed to our helpdesk ([email protected]) or our public developers mailing list ([email protected]). We are active on social media channels such as Twitter (https://twitter.com/ensembl), Facebook (https://www.facebook.com/Ensembl.org) and our blog (http://www.ensembl.info/). For example, we regularly use the #citedEnsembl hashtag on Twitter to highlight published research that has used Ensembl resources.

ACKNOWLEDGEMENTS
Thank you to Steve Moss for his previous work on the Ensembl REST API. We also thank Roy Storey for his eHive pull request, which triggered the development and deployment of eHive’s test suite. We also thank Mikhail Spivakov for providing the sample interaction data shown in Figure 1.

FUNDING
Ensembl receives majority funding from the Wellcome Trust (grant numbers WT095908 and WT098051) with additional funding for specific project components from the National Human Genome Research Institute (U41HG007234, 1R01HD074078, and U41HG007823), the Biotechnology and Biological Sciences Research Council (BB/I025506/1, BB/I025360/2, BB/K009524/1, BB/L024225/1, BB/M018458/1 and BB/M020398/1), the Centre for Therapeutic Target Validation (CTTV) and the European Molecular Biology Laboratory. The research leading to these results has received funding from the European Union’s Seventh Framework Programme [FP7/2007-2013] under grant agreement [HEALTH-F4-2010-241504] (EURATRANS). The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 282510 (BLUEPRINT). The research leading to these results has received funding from the European Union’s Seventh Framework Capacities Specific Programme under grant agreement n° 284209 (BioMedBridges). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement n° 634143 (MedBioinformatics). Funding for open access charge: Wellcome Trust.

Conflict of interest statement. None declared.

REFERENCES
1. Rosenbloom,K.R., Armstrong,J., Barber,G.P., Casper,J., Clawson,H., Diekhans,M., Dreszer,T.R., Fujita,P.A., Guruvadoo,L., Haeussler,M. et al. (2015) The UCSC Genome Browser database: 2015 update. Nucleic Acids Res., 43, D670–D681.
2. Pruitt,K.D., Brown,G.R., Hiatt,S.M., Thibaud-Nissen,F., Astashyn,A., Ermolaeva,O., Farrell,C.M., Hart,J., Landrum,M.J., McGarvey,K.M. et al. (2014) RefSeq: an update on mammalian reference sequences. Nucleic Acids Res., 42, D756–D763.
3. ENCODE Project Consortium. (2012) An integrated encyclopaedia of DNA elements in the human genome. Nature, 489, 57–74.
4. Church,D.M., Schneider,V.A., Graves,T., Auger,K., Cunningham,F., Bouk,N., Chen,H.-C., Agarwala,R., McLaren,W.M., Ritchie,G.R.S. et al. (2011) Modernizing reference genome assemblies. PLoS Biol., 9, e1001091.
5. Harrow,J., Frankish,A., Gonzalez,J.M., Tapanari,E., Diekhans,M., Kokocinski,F., Aken,B.L., Barrell,D., Zadissa,A., Searle,S. et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res., 22, 1760–1774.
6. UniProt Consortium. (2014) Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res., 42, D191–D198.
7. Roadmap Epigenomics Consortium. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.
8. Zerbino,D.R., Wilder,S.P., Johnson,N., Juettemann,T. and Flicek,P.R. (2015) The Ensembl Regulatory Build. Genome Biol., 16, 56.
9. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
10. Stenson,P.D., Ball,E.V., Mort,M., Phillips,A.D., Shiel,J.A., Thomas,N.S.T., Abeysinghe,S., Krawczak,M. and Cooper,D.N. (2003) Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat., 21, 577–581.
11. Landrum,M.J., Lee,J.M., Riley,G.R., Jang,W., Rubinstein,W.S., Church,D.M. and Maglott,D.R. (2014) ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res., 42, D980–D985.
12. Kersey,P.J., Allen,J.E., Christensen,M., Davis,P., Falin,L.J., Grabmueller,C., Hughes,D.S.T., Humphrey,J., Kerhornou,A., Khobova,J. et al. (2014) Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic Acids Res., 42, D546–D552.
13. McLaren,W., Pritchard,B., Rios,D., Chen,Y., Flicek,P. and Cunningham,F. (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics, 26, 2069–2070.
14. Zhao,H., Sun,Z., Wang,J., Huang,H., Kocher,J.-P. and Wang,L. (2014) CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics, 30, 1006–1007.
15. Kent,W.J., Zweig,A.S., Barber,G., Hinrichs,A.S. and Karolchik,D. (2010) BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics, 26, 2204–2207.
16. Danecek,P., Auton,A., Abecasis,G., Albers,C.A., Banks,E., DePristo,M.A., Handsaker,R., Lunter,G., Marth,G., Sherry,S.T. et al. (2011) The Variant Call Format and VCFtools. Bioinformatics, 27, 2156–2158.
17. Li,H., Handsaker,B., Wysoker,A., Fennell,T., Ruan,J., Homer,N., Marth,G., Abecasis,G. and Durbin,R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.
18. Raney,B.J., Dreszer,T.R., Barber,G.P., Clawson,H., Fujita,P.A., Wang,T., Nguyen,N., Paten,B., Zweig,A.S., Karolchik,D. et al. (2014) Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics, 30, 1003–1005.
19. Wilming,L.G., Gilbert,J.G.R., Howe,K., Trevanion,S., Hubbard,T. and Harrow,J.L. (2007) The vertebrate genome annotation (Vega) database. Nucleic Acids Res., 36, D753–D760.
20. Rodriguez,J.M., Carro,A., Valencia,A. and Tress,M.L. (2015) APPRIS WebServer and WebServices. Nucleic Acids Res., 43, W455–W459.
21. Frankish,A., Uszczynska,B., Ritchie,G.R., Gonzalez,J.M., Pervouchine,D., Petryszak,R., Mudge,J.M., Fonseca,N., Brazma,A., Guigo,R. et al. (2015) Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics, 16, S2.
22. Pruitt,K.D., Harrow,J., Harte,R.A., Wallin,C., Diekhans,M., Maglott,D.R., Searle,S., Farrell,C.M., Loveland,J.E., Ruef,B.J. et al. (2009) The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res., 19, 1316–1323.
23. Amberger,J.S., Bocchini,C.A., Schiettecatte,F., Scott,A.F. and Hamosh,A. (2015) OMIM.org: Online Mendelian Inheritance in

Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res., 43, D789–D798.
24. Welter,D., MacArthur,J., Morales,J., Burdett,T., Hall,P., Junkins,H., Klemm,A., Flicek,P., Manolio,T., Hindorff,L. et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res., 42, D1001–D1006.
25. Shimoyama,M., Pons,J., Hayman,G.T., Laulederkind,S.J.F., Liu,W., Nigam,R., Petri,V., Smith,J.R., Tutaj,M., Wang,S.-J. et al. (2015) The Rat Genome Database 2015: genomic, phenotypic and environmental variations and disease. Nucleic Acids Res., 43, D743–D750.
26. Lenffer,J., Nicholas,F.W., Castle,K., Rao,A., Gregory,S., Poidinger,M., Mailman,M.D. and Ranganathan,S. (2006) OMIA (Online Mendelian Inheritance in Animals): an enhanced platform and integration into the Entrez search interface at NCBI. Nucleic Acids Res., 34, D599–D601.
27. Hu,Z.-L., Park,C.A., Wu,X.-L. and Reecy,J.M. (2013) Animal QTLdb: an improved database tool for livestock animal QTL/association data dissemination in the post-genome era. Nucleic Acids Res., 41, D871–D879.
28. Sprague,J., Bayraktaroglu,L., Clements,D., Conlin,T., Fashena,D., Frazer,K., Haendel,M., Howe,D.G., Mani,P., Ramachandran,S. et al. (2006) The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res., 34, D581–D585.
29. Ng,P.C. and Henikoff,S. (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res., 31, 3812–3814.
30. Ivan Adzhubei,D.M.J. (2013) Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2. Curr. Protoc. Hum. Genet., doi:10.1002/0471142905.hg0720s76.
31. Zhou,X., Lowdon,R.F., Li,D., Lawson,H.A., Madden,P.A.F., Costello,J.F. and Wang,T. (2013) Exploring long-range genome interactions using the WashU Epigenome Browser. Nat. Methods, 10, 375–376.
32. Paten,B., Herrero,J., Beal,K., Fitzgerald,S. and Birney,E. (2008) Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res., 18, 1814–1828.
33. Ruan,J., Li,H., Chen,Z., Coghlan,A., Coin,L.J.M., Guo,Y., Hériché,J.-K., Hu,Y., Kristiansen,K., Li,R. et al. (2008) TreeFam: 2008 Update. Nucleic Acids Res., 36, D735–D740.
34. Mi,H., Muruganujan,A. and Thomas,P.D. (2013) PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res., 41, D377–D386.
35. Sonnhammer,E.L.L., Gabaldón,T., Sousa da Silva,A.W., Martin,M., Robinson-Rechavi,M., Boeckmann,B., Thomas,P.D., Dessimoz,C. and Quest for Orthologs consortium. (2014) Big data and other challenges in the quest for orthologs. Bioinformatics, 30, 2993–2998.
36. The 1000 Genomes Project Consortium. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.
37. Forbes,S.A., Beare,D., Gunasekaran,P., Leung,K., Bindal,N., Boutselakis,H., Ding,M., Bamford,S., Cole,C., Ward,S. et al. (2015) COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res., 43, D805–D811.
38. Exome Aggregation Consortium. (2015) Cambridge, http://exac.broadinstitute.org.
39. Tennessen,J.A., Bigham,A.W., O’Connor,T.D., Fu,W., Kenny,E.E., Gravel,S., McGee,S., Do,R., Liu,X., Jun,G. et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337, 64–69.
40. Liu,X., Jian,X. and Boerwinkle,E. (2013) dbNSFP v2.0: a Database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations. Hum. Mutat., 34, E2393–E2402.
41. Petryszak,R., Burdett,T., Fiorelli,B., Fonseca,N.A., Gonzalez-Porta,M., Hastings,E., Huber,W., Jupp,S., Keays,M., Kryvych,N. et al. (2014) Expression Atlas update––a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments. Nucleic Acids Res., 42, D926–D932.
42. Illumina, Inc. (2015) Illumina Platinum Genomes. http://www.illumina.com/platinumgenomes/.
43. Mifsud,B., Tavares-Cadete,F., Young,A.N., Sugar,R., Schoenfelder,S., Ferreira,L., Wingett,S.W., Andrews,S., Grey,W., Ewels,P.A. et al. (2015) Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat. Genet., 47, 598–606.
44. Lizio,M., Harshbarger,J., Shimoji,H., Severin,J., Kasukawa,T., Sahin,S., Abugessaisa,I., Fukuda,S., Hori,F., Ishikawa-Kato,S. et al. (2015) Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol., 16, 22.
45. Lonsdale,J., Thomas,J., Salvatore,M., Phillips,R., Lo,E., Shad,S., Hasz,R., Walters,G., Garcia,F., Young,N. et al. (2013) The Genotype-Tissue Expression (GTEx) project. Nat. Genet., 45, 580–585.
46. Yates,A., Beal,K., Keenan,S., McLaren,W., Pignatelli,M., Ritchie,G.R.S., Ruffier,M., Taylor,K., Vullo,A. and Flicek,P. (2015) The Ensembl REST API: Ensembl Data for Any Language. Bioinformatics, 31, 143–145.
47. Severin,J., Beal,K., Vilella,A.J., Fitzgerald,S., Schuster,M., Gordon,L., Ureta-Vidal,A., Flicek,P. and Herrero,J. (2010) eHive: an artificial intelligence workflow system for genomic analysis. BMC Bioinformatics, 11, 240.
48. Kinsella,R.J., Kähäri,A., Haider,S., Zamora,J., Proctor,G., Spudich,G., Almeida-King,J., Staines,D., Derwent,P., Kerhornou,A. et al. (2011) Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford), 2011, bar030.
49. Durinck,S., Spellman,P.T., Birney,E., Bolstad,B., Dettling,M., Dudoit,S. and Huber,W. (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc., 4, 1184–1191.
