Nash 2019
doi: 10.1093/bioinformatics/xxxxx
Abstract
Motivation: Clusters of extremely conserved non-coding elements (CNEs) mark genomic regions devoted to cis-regulation of key developmental genes in Metazoa. We have recently shown that their span coincides with that of topologically associating domains (TADs), making them useful for estimating conserved TAD boundaries in the absence of Hi-C data. The standard approach - detecting CNEs in genome alignments and then establishing the boundaries of their clusters - requires tuning of several parameters and breaks down when comparing closely related genomes.
Results: We present a novel, kurtosis-based measure of pairwise non-coding conservation that requires no pre-set thresholds for conservation level and length of CNEs. We show that it performs robustly across a large span of evolutionary distances, including across the closely related genomes of primates for which standard approaches fail. The method is straightforward to implement and enables detection and comparison of clusters of CNEs and estimation of underlying TADs across a vastly increased range of Metazoan genomes.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
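The kurtosis-based measure summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their scripts are in the repository listed under Data). It assumes the two genomes are supplied as a pair of equal-length aligned strings, treats any run of identical, non-gap, non-N columns as a perfectly conserved sequence, assigns each run to a fixed-width bin by its start column, and uses the non-excess moment ratio m4 / m2^2 as the kurtosis. The function names, the `bin_size` parameter and the binning rule are hypothetical choices made for illustration only.

```python
from itertools import groupby


def perfect_run_lengths(ref, qry):
    """Return (start_column, length) for each run of identical aligned bases.

    Assumption: ref and qry are equal-length rows of one pairwise alignment;
    gap ('-') and 'N' columns never count as conserved.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    pos = 0
    columns = zip(ref.upper(), qry.upper())
    for is_match, group in groupby(
        columns, key=lambda p: p[0] == p[1] and p[0] not in "-N"
    ):
        n = sum(1 for _ in group)
        if is_match:
            runs.append((pos, n))
        pos += n
    return runs


def kurtosis(values):
    """Non-excess kurtosis m4 / m2**2; None when it is undefined."""
    n = len(values)
    if n < 2:
        return None
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return None
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2


def binned_kurtosis(ref, qry, bin_size):
    """Kurtosis of perfect-run lengths per bin (hypothetical binning rule:
    each run belongs to the bin containing its start column)."""
    bins = {}
    for start, length in perfect_run_lengths(ref, qry):
        bins.setdefault(start // bin_size, []).append(length)
    return {b: kurtosis(lengths) for b, lengths in sorted(bins.items())}
```

For example, `binned_kurtosis("ACGTAC", "ACCTAC", bin_size=10)` finds two perfect runs, of lengths 2 and 3, both in bin 0. In this sketch a bin containing many short runs and a few very long ones has a heavy-tailed length distribution and therefore a high kurtosis, without any minimum length or identity threshold being set.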
3.3 Kurtosis-based GRB Identification in Moderately to Distantly Related Species
In the past, GRB identification has succeeded for moderate to distant evolutionary comparisons because the CNE density across the genome forms discrete peaks that are easily distinguished from the genomic background (Engström et al. 2007; Kikuta et al. 2007; Akalin et al. 2009; Harmston et al. 2017). To test how well the kurtosis-based measure of conservation can discriminate highly conserved regions of the genome from non-conserved regions, we used binned kurtosis values to identify GRBs from human to moderately and distantly related species which have previously been used for CNE-based GRB prediction (Harmston et al. 2017). The number and size of GRBs identified for each comparison are presented in Table 1. The number of GRBs identi-
clearly do not coincide with TADs, it is possible that at greater evolutionary distances, the kurtosis-based conservation measure is only identifying the core, highly conserved regions of each GRB, and thus underestimating their true extent, possibly because of some turnover of the boundary positions themselves. Based on the concordance between the
al. 2017). Since kurtosis-based GRB prediction is also sequence-based, it is not immune to this problem. For the human-gorilla GRBs a similar issue is visible. The largest third of GRBs display no visible funnel in the DI heatmaps; however, there is a noisy funnel visible in the rest of the GRBs. Overall, these results suggest that kurtosis-based conservation can identify signatures of non-coding conservation in very closely related species, but that GRB boundary prediction becomes less precise in the most closely related comparisons.
Next, we compared our kurtosis-based GRBs to GRBs identified using the CNE-based approach described in Harmston et al. 2017. The CNE-based GRB prediction yielded 744 human to rhesus GRBs with a mean width of 482.9 kb and 2220 human to gorilla GRBs with a mean width of 504.4 kb. The number of GRBs identified in human-rhesus is greater than for the other species comparisons used so far, but not exceedingly so. For the human-gorilla comparison, however, there were a very large number of predicted GRBs. In Figure 3C, the average Hi-C DI is plotted across the predicted GRBs from both sets. We can clearly see that, for the human-rhesus comparison, the kurtosis-based GRBs have a stronger peak of the positive and negative DI (at their starts and ends respectively) than the CNE-based GRBs. There is also a much sharper boundary effect in the kurtosis-based GRBs, with the DI signal spreading well beyond the boundaries of the CNE-based GRBs. In the human-gorilla comparison the kurtosis-based GRB boundaries also coincide with peaks in the positive and negative DI, while the CNE-based GRBs show no enrichment of DI score at either boundary.

Acknowledgements
We thank Dr Ge Tan for generating a number of the CNE datasets used in this analysis, and Dr Nathan Harmston for processing the Hi-C data. We are also grateful to Dr Leonie Roos, Dr Anja Baresic, Dr Sasha Murrell and Dr Ben Murrell for

4 Discussion
In this paper we have defined a novel measure of pairwise sequence conservation based on the kurtosis of the distribution of the lengths of sequences perfectly conserved between two genomes. We have shown that the kurtosis-based measure is highly correlated with CNE density and can be used to generate high quality GRB predictions for moderate to distant species comparisons. Importantly, our method enables accurate prediction of GRB-scale regulatory domains, but does not identify the individual conserved elements themselves. This presents the potential for complementary use of kurtosis-based GRB identification and traditional CNE identification in future analyses.
We have also shown that kurtosis-based GRB prediction far outperforms CNE-based GRB prediction in closely related species. The identification of GRBs between human and gorilla is a surprising result, as previously it has been impossible to define conserved regulatory domains between such closely related species. Humans and gorillas share over 98% of their genome sequence, and so to be able to use sequence conservation to define regulatory regions that coincide with TADs is strong testament to our method's ability to account for the general background conservation between two genomes.
Most importantly, unlike CNE-based conservation analysis, our method works without requiring any predefined minimum length or sequence identity thresholds for a sequence to be considered conserved. Having a threshold-free approach for measuring conservation allows us to directly compare the results of species comparisons spanning a range of evolutionary distances. This feature, combined with the success we have had in identifying GRB-like structures in extremely closely related species, opens up the possibility of systematically investigating the evolutionary dynamics of GRBs in multiple closely related metazoan lineages, potentially yielding a greater understanding of the origin and evolution of long-range gene regulation in metazoan genomes.
Further, our method may have utility in the analysis of GRB developmental gene regulation in species that have undergone extreme genome compaction such as the puffer fish, Tetraodon nigroviridis, and the sea squirt, Oikopleura dioica (Denoeud et al. 2010). The tiny size of these genomes makes it very difficult to define the minimum length a stretch of conserved sequence should be to be considered a conserved element, and as described above, comparing the results of this analysis with those performed in larger genomes is problematic. Our method may provide the ability to accurately define GRB boundaries in compact genomes and therefore deliver insights into the effects of genome compaction on long-range gene regulation.

Data
The data generated for this study, and the scripts used to generate the data, can be found at https://github.com/alexander-nash/kurtosis_conservation

References
Akalin,A. et al. (2009) Transcriptional features of genomic regulatory blocks. Genome Biol., 10, R38.
Balanda,K.P. and Macgillivray,H.L. (1988) Kurtosis: A Critical Review. The American Statistician, 42, 111–119.
Bejerano,G. et al. (2004) Ultraconserved Elements in the Human Genome. Science, 304, 1321–1325.
Bhatia,S. et al. (2014) A survey of ancient conserved non-coding elements in the PAX6 locus reveals a landscape of interdigitated cis-regulatory archipelagos. Developmental Biology, 387, 214–228.
DeCarlo,L.T. (1997) On the Meaning and Use of Kurtosis. Psychological Methods, 2, 292–307.
Denoeud,F. et al. (2010) Plasticity of Animal Genome Architecture Unmasked by Rapid Evolution of a Pelagic Tunicate. Science, 330, 1381–1385.
Dixon,J.R. et al. (2012) Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485, 376–380.
Engström,P.G. et al. (2007) Genomic regulatory blocks underlie extensive microsynteny conservation in insects. Genome Res., 17, 1898–1908.
Harmston,N. et al. (2017) Topologically associating domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation. Nature Communications, 8, 441.
Kikuta,H. et al. (2007) Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Res., 17, 545–555.
Kimura-Yoshida,C. et al. (2004) Characterization of the pufferfish Otx2 cis-regulators reveals evolutionarily conserved genetic mechanisms for vertebrate head specification. Development, 131, 57–71.
Langmead,B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25.
Navratilova,P. et al. (2009) Systematic human/zebrafish comparative identification of cis-regulatory activity around vertebrate developmental transcription factor genes. Developmental Biology, 327, 526–540.
Pennacchio,L. et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444, 499–502.
Rao,S.S.P. et al. (2014) A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159, 1665–1680.
Ritter,D.I. et al. (2010) The Importance of Being Cis: Evolution of Orthologous Fish and Mammalian Enhancer Activity. Molecular Biology and Evolution, 27, 2322–2332.
Ross,G.J. (2015) Parametric and nonparametric sequential change detection in R: The cpm package. Journal of Statistical Software, 66.
Ruppert,D. (1987) What is kurtosis? An influence function approach. The American Statistician, 41, 1–5.
Sandelin,A. et al. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics, 5, 99.
Scally,A. et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature, 483, 169–175.
Spieler,D. et al. (2014) Restless Legs Syndrome-associated intronic common variant in Meis1 alters enhancer function in the developing telencephalon. Genome Res., 24, 592–603.
Tan,G. (2017) CNEr: CNE Detection and Visualization. R package version 1.16.0.
Woolfe,A. et al. (2005) Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biology, 3, e7.
Zabidi,M. et al. (2014) Enhancer–core-promoter specificity separates developmental and housekeeping gene regulation. Nature, 518, 556–559.