Nash 2019

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Bioinformatics, YYYY, 0–0

doi: 10.1093/bioinformatics/xxxxx
Advance Access Publication Date: DD Month YYYY
Manuscript Category

Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018


Comparative Genomics
A Novel Measure of Non-coding Genome Con-
servation Identifies Genomic Regulatory
Blocks Within Primates
Alexander J. Nash1, 2 and Boris Lenhard1, 2, 3*
1
Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane
Road, London W12 0NN, UK, 2Institute of Clinical Sciences, Faculty of Medicine, Imperial College
London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK. 3Sars International Centre
for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway

*To whom correspondence should be addressed.

Associate Editor: XXXXXXX


Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract
Motivation: Clusters of extremely conserved non-coding elements (CNEs) mark genomic regions
devoted to cis-regulation of key developmental genes in Metazoa. We have recently shown that their
span coincides with that of topologically associating domains (TADs), making them useful for estimat-
ing conserved TAD boundaries in the absence of Hi-C data. The standard approach - detecting CNEs
in genome alignments and then establishing the boundaries of their clusters - requires tuning of sev-
eral parameters and breaks down when comparing closely related genomes.
Results: We present a novel, kurtosis-based measure of pairwise non-coding conservation that re-
quires no pre-set thresholds for conservation level and length of CNEs. We show that it performs ro-
bustly across a large span of evolutionary distances, including across the closely related genomes of
primates for which standard approaches fail. The method is straightforward to implement and enables
detection and comparison of clusters of CNEs and estimation of underlying TADs across a vastly
increased range of Metazoan genomes.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

ly specific gene expression, and together recapitulate the complex tem-


poral and spatial expression patterns of developmental genes (Navratilo-
1 Introduction
va et al. 2009; Bhatia et al. 2014; Pennacchio et al. 2006; Kimura-
Regulation of developmental genes requires intricate control of the Yoshida et al. 2004; Spieler et al. 2014).
timing, location and magnitude of their gene expression. This fine level The requirement for an enhancer to remain in physical proximity of
of control is primarily provided by multiple enhancers that act together the gene it regulates has constrained the evolution of vertebrate genomes,
to establish highly specific spatiotemporal expression patterns. In meta- resulting in long syntenic arrays of CNEs clustered around their target
zoan genomes, many of these genes are maintained within syntenic ar- genes. Each of these arrays is a functional, long-range regulatory unit
rays of evolutionarily conserved enhancers; these enhancers are known known as a genomic regulatory block (GRB) (Kikuta et al. 2007; Eng-
as conserved non-coding elements (CNEs) (Sandelin et al. 2004; Beje- ström et al. 2007; Ritter et al. 2010). Most GRBs regulate a single
rano et al. 2004; Woolfe et al. 2005). CNEs are extremely highly con- target gene, but often span additional bystander genes. Bystander genes
served, containing stretches of tens to hundreds of base pairs of nearly are unresponsive to this form of long-range regulation, primarily due to
perfectly conserved sequence between humans and teleost fish, surviving differences in promoter structure and associated epigenetic modifications
approximately 450 million years of evolutionary separation (Woolfe et (Engström et al. 2007; Akalin et al. 2009; Zabidi et al. 2014). For a
al. 2005). The vast majority of tested CNEs have been shown to act as GRB to regulate its target gene, there must be frequent physical interac-
transcriptional enhancers which are individually capable of driving high- tion between CNEs and the promoter of the gene they regulate. Indeed,

© The Author(s) 2018. Published by Oxford University Press.


This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any
medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
topologically associated domains (TADs), regions of the genome that 2 Methods
preferentially interact with themselves (Dixon et al. 2012; Rao et al.
2014) have been shown to coincide with GRBs, suggesting they are two 2.1 CNE Identification
manifestations of the same underlying phenomenon (Harmston et al.
CNEs were identified using the R Bioconductor package, CNEr
2017). The physical location of TAD and GRB boundaries are strongly

Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018


(Tan 2017, https://github.com/ge11232002/CNEr.). The standard pipe-
correlated in both vertebrates and invertebrates, suggesting that the insu-
line, described in the CNEr vignette, was followed. CNE density across
lation provided by TAD boundaries might serve to prevent the ectopic
the genome was calculated by running a 300 kb sliding window across
interaction of CNEs within a TAD with developmental genes in neigh-
the genome, with 30 kb steps, and calculating the number of CNEs per
boring TADs (Harmston et al. 2017). Further, it was shown that TADs
kb in each window.
that are associated with GRBs have stronger self-interaction than those
The minimum length and identity thresholds for CNE identification
that are not. This is consistent with the observation that developmental
must be adjusted for each species comparison due to the continuous
genes require complex gene regulation mediated by frequent enhancer -
divergence of CNEs since the last common ancestor of the two species
promoter contacts, while other classes of genes, such as housekeeping
being compared. The identification thresholds used for each species
genes, do not (Harmston et al. 2017; Zabidi et al. 2014).
comparison are listed in Supplementary Table S1.
Studying the dynamics of GRB evolution and the functional rela-
tionship between GRBs and TADs relies on robust methods to identify
2.2 CNE-based GRB Identification
GRBs across a wide range of evolutionary timescales. Currently, CNE
identification, and therefore GRB identification, hinges tightly on the CNE-dense regions of the genome were identified using an unsuper-
selection of a threshold beyond which a conserved region is defined as a vised two-state hidden Markov model (HMM) that partitions the genome
CNE. While GRB boundaries in distantly related species are robust to into high and low CNE density regions (as described in Harmston et al.
the CNE identification threshold used (Harmston et al. 2017), in closely 2017). In brief, the genome was segmented into high- and low-density
related species this approach breaks down, as the neighboring, neutrally regions, and those CNEs within the high-density regions, which were
evolving sequence has not diverged enough to be able to non-arbitrarily separated by less than a predefined genomic distance, were merged to
define CNE identification thresholds. Due to the resulting increasing form blocks. Human - rhesus monkey and human - gorilla GRBs were
average length of conserved sequences, it is often necessary to choose generated for this paper, while previously published human - opossum
very long thresholds for minimal CNE length (> 400 bp), thereby casting GRBs were retrieved from Harmston et al. 2017.
doubt on the biological relevance of comparing the distribution of such
elements with those identified in more distant comparisons.
Since GRBs reveal the span of regions populated by enhancers tar- 2.3 Genome-wide Kurtosis Calculation
geting specific developmental genes, and can serve as sequence-based For each species comparison, the kurtosis of the distribution of the
proxies for TADs, their comparative studies are of paramount im- lengths of all identical sequences was calculated in bins across the ge-
portance for understanding long-range gene regulation and the 3D chro- nome. Initially, all runs of 100% sequence identity were extracted from
matin structure that supports it. At the genome level, the critical GRB the pairwise whole-genome alignment and filtered for annotated repeats
features to determine are their boundaries. In this paper we address this and exonic sequences. The genome was then divided into 30kb bins and
problem by defining and exploring a threshold-free measure of pairwise the lengths of all runs of identity within each bin were calculated. 30kb
sequence conservation based on the kurtosis of the distribution of the was selected as a window size as this is the window size we traditionally
lengths of all sequences perfectly conserved between two genomes. use for CNE density calculation, thereby maximizing the comparability
Kurtosis measures the movement of probability mass from the shoulders of the two approaches. The kurtosis of the distribution of lengths in each
of a distribution into its centre and tails, and thus a distribution with high bin was then calculated as follows:
kurtosis can be considered “fat tailed” (Balanda and Macgillivray 1988;
DeCarlo 1997). We use kurtosis to measure the effect of the number of 𝑞!.!! 𝐹 − 𝑞!.!" (𝐹)
𝑅 𝐹 =
extreme observations on the distribution of the lengths of runs of perfect 𝐺!"
sequence identity between two genomes. We show that this measure is
where F is the distribution of the lengths of runs of perfect sequence
highly correlated with CNE density and can be effectively used to predict
identity in a bin, and 𝐺!" is the range of the middle 50% of the distribu-
high quality GRBs for the species comparisons used in Harmston et al.
tion of lengths of all runs of identity, from all bins (background distribu-
2017. Further, we use this kurtosis-based measure to predict GRBs from
tion); calculated as follows:
genome alignments between human and non-human primates, and show
that it is superior to CNE density at these short evolutionary distances. 𝐺!" = 𝑞!.!" 𝐽 − 𝑞!.!" (𝐽)
The ability of our method to detect GRBs across close evolutionary
distances, without the requirement for arbitrary conservation thresholds, where J is the distribution of the lengths of runs of perfect sequence
will enable the study of GRB evolution and the detection of recent line- identity across the whole genome. For each bin, R(F) is a ratio of the
age-specific changes in gross GRB structure. range of 99% of all lengths of runs of identical sequence, in a bin, to the
range of 50% of all lengths of runs of identity for the whole genome. In
practice it measures the number, and extremity, of long runs of perfect
identity, in each bin, compared to the background conservation for the
whole genome. This is an adaptation of the robust kurtosis measure
proposed in (Ruppert 1987).

2.4 Correlation of Kurtosis and CNE density


Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018
Fig. 1. Kurtosis and CNE density are highly correlated. (A) The distribution of the kurtosis values calculated for human to each other species. The kurtosis values were calculated
for the distribution of the lengths of perfectly conserved sequences in bins across the genome for each species comparison separately. (B) The correlation between CNE density and
kurtosis inside and outside of CNE-based GRBs. (C) The correlation of CNE density (calculated for multiple conservation thresholds) and kurtosis inside and outside of GRBs as a
function of evolutionary distance. Here, each point for a species represents the correlation of kurtosis-based conservation with CNE density at a different CNE identification threshold.
CNE density and kurtosis are highly correlated within GRBs, regardless of the evolutionary distance of the comparison. Outside of GRBs there is a decreasing linear relationship be-
tween the correlation and the evolutionary distance of the comparison.

Maximum kurtosis and CNE density was calculated in 90 kb win-


dows across the genome, with 1000 windows randomly sampled from 𝐻! : 𝑋! ~ 𝐹! 𝑥! ; 𝜃! , 𝑖 = 1,2, … , 𝑛,
previously defined human - opossum GRBs and 1000 from non-GRB
regions. This was performed for human to dog, chicken and spotted gar 𝐹! 𝑥; 𝜃! , 𝑖 = 1,2, … , 𝑘,
𝐻! : 𝑋! ~
comparisons at each CNE identification threshold listed in Table S1. The 𝐹! 𝑥; 𝜃! , 𝑖 = 𝑘 + 1, 𝑘 + 2, … , 𝑛,
Spearman’s correlation coefficient between maximum scores in each
where 𝜃! represent the unknown parameters of each distribution. In this
window was then calculated. For the purpose of visualisation, a linear
scenario the two distributions 𝐹! and 𝐹! represent the distribution of
model was fitted to the data for each comparison at each CNE identifica-
values coming from non-GRB and GRB regions of the genome respec-
tion threshold.
tively. The presence of a change point can be tested using a two-sampled
Mann-Whitney test and the null hypothesis rejected if the test statistic
2.5 Kurtosis-based GRB Identification exceeds a predefined cut-off. For a series of observations 𝑥! , . . . , 𝑥! the
Kurtosis-based GRBs were generated by using the change point test statistic is calculated at every 𝑥! , for 1 < k < t, and the maximum test
modelling (CPM) approach to identify change points in the binned kurto- statistic obtained for all values of k is used. As successive observations
sis data, indicating a shift to higher mean kurtosis values (Ross 2015). are made (successive windows along the genome), the test statistic is
Under this framework, kurtosis values in bins across the genome are calculated again at every 𝑥! but now for 1 < k < t +1. If no significant
treated as a series of n independent observations 𝑥! , . . . , 𝑥! . The assump- change point is detected, the next observation, 𝑥!!! is received and the
tion that all observations (genomic windows) are identically distributed testing is performed again on 𝑥! , . . . , 𝑥!!! . However if a change is detect-
according to an undefined distribution 𝐹! , can then be tested by choosing ed at 𝑥! the process begins again with 𝑥!!! as the first observation in the
between the following hypotheses:
new series of observations to be tested. For further details refer to Ross species comparisons (gorilla to opossum) the distributions of kurtosis
2015. This analysis was performed using the cpm package in R, and the values are centred on 4.5. As the evolutionary distance of the comparison
ARL0 parameter was set to 370. This is the least stringent ARL0 value increases, so the number of bins containing a value of zero increases, and
implemented in the package and ensures that all potential change points the median kurtosis value drops. The increasing number of zero bins is
are detected, at the risk of including more false positives. Greater sensi- due to an increasing number of bins that do not contain any alignable

Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018


tivity combined with a merging step (described below) was preferred to sequence. This trend shows that, as expected, with increasing evolution-
stringent change point detection that potentially misses GRB boundaries. ary distance there will be larger portions of the genome that are una-
Once significant change points in the binned kurtosis values have lignable due to continual sequence divergence. Since the kurtosis calcu-
been identified, these are treated as potential GRB boundaries. The mean lation does not depend on the total amount of aligned sequence per bin,
kurtosis within each range is then calculated, and adjacent ranges are the kurtosis-based measure is also robust to variation in proportion of
merged if the mean kurtosis in both is above a specified quantile of all other unalignable sequences along the genome, such as repetitive ele-
binned kurtosis values. The quantile used was determined empirically ments. It is also striking that the range of the kurtosis values increases
based on the predicted GRBs ability to recapitulate known GRB bounda- with evolutionary distance. This trend reflects the potential for more
ries. For all species comparisons, the quantile used was 0.7. extreme outliers relative to the genomic background in more distant
comparisons. These extreme outliers are the CNEs that we normally
identify using traditional CNE identification approaches. Another nota-
2.6 Hi-C Data Processing ble feature of the distributions is the increased dispersion with increasing
evolutionary distance. This is most likely due to the increased variability
hESC and IMR90 Hi-C data were obtained from the Gene Expres-
in the number and length of runs of sequence identity from bin to bin in
sion Omnibus (GEO; GSE35156) and processed as described in Harm-
ston et al. 2017. To visualise how well kurtosis-based GRBs recapitulate the more distant comparisons.
TAD boundaries, we produced heatmaps of Hi-C DI within genomic To compare kurtosis and CNE density across the genome, we sam-
pled 1000 random 90 kb windows from previously defined CNE-based
windows centred on GRBs and ordered from largest to smallest GRBs.
human - opossum GRBs, and non-GRB regions of the genome. We then
calculated the maximum kurtosis and CNE density in each window. We
3 Results repeated this using CNE density calculated at multiple thresholds for
each species comparison. Next, we calculated the Spearman’s correlation
3.1 CNE identification coefficient between kurtosis and CNE density for each species compari-
son inside and outside of GRBs. There is a strong correlation between
To test the improvement in performance of the kurtosis-based ap-
kurtosis and CNE density, and this correlation is greater within GRBs
proach, we first produced CNE sets for standard GRB span detection and
than outside GRBs (Figure 1B). This trend is confirmed in Figure 1C,
its comparison with our new method. The first step in this analysis was
which shows the Spearman’s correlation coefficient between kurtosis and
to identify CNEs between human and a range of species chosen to repre-
CNE density, calculated for all CNE identification thresholds used for
sent distinct vertebrate lineages. For each species comparison we used
each species comparison. This data is also presented in detail in Table
multiple CNE identification thresholds to facilitate a comprehensive
S3. It is striking that regardless of the evolutionary distance of the com-
comparison between the proposed kurtosis-based conservation measure
parison, kurtosis and CNE density values are similarly correlated within
and CNE density calculated for a range of conservation thresholds. The
GRBs, whereas outside of GRBs it appears that the correlation drops
results of CNE identification are presented in Supplementary Table S2.
with increasing evolutionary distance. The reduced correlation outside of
As expected, the number of CNEs identified for each species comparison
GRBs may be caused by multiple properties of kurtosis and CNE densi-
decreases as the stringency of the threshold increases. The mean width of
ty:
the elements identified also decreases as the minimum required identity
is increased. In general, the stringency of the threshold used for CNE
(1) Outside of GRBs, CNE density is consistently either zero or
identification is reduced as the evolutionary distance between the species
close to zero, while kurtosis fluctuates around 4.5 from bin to
compared increases. This is to account for the continual sequence diver-
bin, thereby reducing the correlation. Within GRBs, both the
gence in conserved regions during the time that the two genomes have
CNE density and kurtosis will be high in the majority of bins.
been evolving independently. The effect of this sequence divergence is
clearly discernible from the number of CNEs identified in dog, opossum,
chicken and spotted gar at 80% identity over 50bp (30bp in spotted gar). (2) Outside of GRBs, there may be stretches of identical non-coding
The divergence time between human and each of these species ranges sequence that are shorter than the minimum length of the
from 96 - 435 million years, and with the increasing time so the number threshold used for calling CNEs, and are therefore not identi-
of CNEs identified drops from 3,763,684 to just 33,172. fied. These stretches will still result in distributions with rela-
tively high kurtosis. Within GRBs there are many identifiable
3.2 Kurtosis-based conservation is strongly correlated with CNEs and thus both CNE density and kurtosis will be high.
CNE density
Next, for each of the same species comparisons, we calculated the Overall, the consistency of the distribution of kurtosis values for
kurtosis of the distribution of the lengths of perfectly conserved sequenc- species comparisons spanning vastly different evolutionary timescales,
es (i.e. of gapless alignment blocks with no substitutions) in bins across and its high correlation with CNE density in conserved regions of the
the genome. Figure 1A shows the distribution of kurtosis values across genome, suggest that the kurtosis of the lengths of runs of sequence
the genome for each comparison. The distributions are very similar identity can be used as an effective threshold-free proxy for sequence
across species comparisons, illustrating that the results of the method are conservation in genomic windows.
comparable across a wide range of evolutionary distances. In the closer
Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018

3.3 Kurtosis-based GRB Identification in Moderately to Dis- forms discrete peaks that are easily distinguished from the genomic
tantly Related Species background (Engström et al. 2007; Kikuta et al. 2007; Akalin et al.
In the past, GRB identification has succeeded for moderate to distant 2009; Harmston et al. 2017). To test how well the kurtosis-based meas-
evolutionary comparisons because the CNE density across the genome ure of conservation can discriminate highly conserved regions of the
genome from non-conserved regions, we used binned kurtosis values to clearly do not coincide with TADs, it is possible that at greater evolu-
identify GRBs from human to moderately and distantly related species tionary distances, the kurtosis-based conservation measure is only identi-
which have previously been used for CNE-based GRB prediction fying the core, highly conserved regions of each GRB, and thus underes-
(Harmston et al. 2017). The number and size of GRBs identified for timating their true extent – possibly because of some turnover of the
each comparison are presented in Table 1. The number of GRBs identi- boundary positions themselves. Based on the concordance between the

Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018


fied in each comparison is similar, but there is a slight decrease in total kurtosis-based GRB predictions and the CNE density for all species
GRBs as the evolutionary distance of the comparison increases. The comparisons, it is likely that CNE-based GRB prediction will suffer from
average width of the identified GRBs also decreases with increasing the same problem.
evolutionary distance. The decreasing average width reflects the erosion Taken together, the concordance between kurtosis-based GRB pre-
of sequence conservation over time, making accurate prediction of GRB dictions and CNE density, the high degree of overlap between kurtosis-
boundaries difficult over large evolutionary distances. This effect is also based and CNE-based GRBs, and the strong correlation between GRB
observed in GRBs identified using CNE density and has been previously and TAD boundaries, suggests that kurtosis-based conservation can be
described (Harmston et al. 2017). The decreasing number of GRBs may used to accurately predict high quality GRBs.
be due to the identification of relatively rapidly evolving GRBs in the
closer comparisons that are not identifiable in the more distant compari- 3.4 Kurtosis-based GRB Identification in Alignments Be-
sons. tween Human and Non-human Primates
Table 1. Kurtosis-based GRB Number and Size CNE identification thresholds necessitate the implementation of an
arbitrary cut-off for what is defined as a CNE and what is not. At the
Query Species Number Mean Width (kb)
edge of the threshold, a single mismatch in two aligned sequences is
sufficient for an otherwise highly conserved region to be declared non-
Dog 559 1,233.1
conserved. In the context of GRB identification, this is seldom a problem
Opossum 487 1,195
for evolutionarily distant species comparisons, but at shorter evolution-
Chicken 426 978.7
ary timescales it becomes increasingly difficult to determine how long a
Spotted Gar 400 804.8
stretch of perfect sequence identity should be for the region to be de-
clared a CNE. This is the context in which kurtosis-based conservation
As an initial assessment of the quality of the identified GRBs, we may see the most utility. By its nature, kurtosis-based conservation takes
visualized CNE density within genomic windows centred on the kurto- into account the background level of conservation for a particular species
sis-based GRBs for each species comparison (Figure 2A). GRBs were comparison, and only defines those regions with unexpectedly long runs
ordered by width, and thus any feature that is enriched within GRBs of identity as highly conserved.
forms a characteristic funnel pattern. From these heatmaps, it is immedi- We predicted GRBs for human to two non-human primates, the rhe-
ately apparent that there is a very strong enrichment of CNE density sus monkey and the gorilla, to test the limits of kurtosis-based conserva-
within kurtosis-based GRBs. This enrichment is robust to the CNE iden- tion for GRB detection. Humans and rhesus monkeys (referred to as
tification threshold used (Figure S1). Interestingly, as the stringency of rhesus from here on) diverged approximately 30 million years ago, while
the CNE identification threshold is increased, there are an increasing humans and gorillas diverged only 8.6 million years ago. Using kurtosis-
number of GRBs that contain no enrichment for CNE density (Figure based conservation, we predicted 523 human - rhesus GRBs (mean width
S1). This likely reflects the ability of the kurtosis-based measure to iden- = 1,279.9 kb) and 483 human - gorilla GRBs (mean width = 1,242.9 kb).
tify runs of non-coding identity that fail to pass the more stringent CNE This is a reassuringly similar number of GRBs to the sets identified by
identification thresholds. comparison to more distant vertebrates, suggesting that even at such
As with CNE-based GRBs, the boundaries of kurtosis-based GRBs short evolutionary timescales the method can predict comparable GRB
are very similar between species comparisons (Figure S2), and there is sets.
significant overlap between the kurtosis-based GRBs and those predicted To assess the quality of these GRBs, we plotted CNE density and
using CNE density (Figure S3). In general, the kurtosis-based method Hi-C DI across genomic windows centred on the kurtosis-based GRB
predicts fewer GRBs than the CNE-based method, with the former ap- predictions, as previously described (Figure 3A-C). For the human-
pearing to be a high-confidence subset of the latter. rhesus GRBs there is a strong enrichment of CNE density within the
To further evaluate the accuracy of the kurtosis-based GRB bounda- predicted GRBs, indicating that for this species comparison kurtosis is a
ries, we took advantage of the fact that GRB boundaries frequently coin- good proxy for conserved non-coding conservation. For the human to
cide with TAD boundaries (Harmston et al. 2017), and hypothesised gorilla comparison there is also a visible CNE density enrichment within
that a better prediction of GRB boundaries should result in an increased the predicted GRBs, but the strength of the enrichment is much reduced.
agreement with TAD boundaries. To test this, we plotted the Hi-C DI The average mismatch rate between human and gorilla is only 1.75%,
from hESC cells within the same GRB-containing genomic windows and therefore it is very surprising that there is any CNE density enrich-
(Figure 2B). In these plots the intensity of red and blue in a region show ment within the kurtosis-based GRBs (Scally et al. 2012). This result is
the frequency with which this region interacts with downstream and strong evidence that kurtosis-based conservation can identify highly
upstream loci respectively. Visualized this way, TADs appear as a span conserved regions of the genome. Examining the DI heatmaps, it is clear
of red followed by a span of blue. For GRBs defined from human to dog, that the rhesus GRBs have a visible funnel, although it is not as strong as
opossum and chicken, there is a very clear funnel present in the DI in the more distant comparisons. The largest rhesus GRBs have the
heatmaps. The funnels have a well-defined red boundary followed by a weakest correspondence with the DI, and appear to span multiple TADs.
well-defined blue boundary, indicating that the GRBs coincide well with These are probably physically close GRBs that have been merged by the
TADs. There is no visible funnel in the human to spotted gar GRBs, with GRB prediction. This may also account for the increased mean GRB
only a hint of a funnel visible in the very largest GRBs, many of which width in this set. Separating adjacent synteny blocks using sequence
are also the most deeply conserved (Figure 2A). While these GRBs conservation alone is a known difficulty in GRB prediction (Harmston et
Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018

al. 2017). Since kurtosis-based GRB prediction is also sequence-based, CNE-based GRB prediction yielded 744 human to rhesus GRBs with a
it is not immune to this problem. For the human-gorilla GRBs a similar mean width of 482.9 kb and 2220 human to gorilla GRBs with a mean
issue is visible. The largest third of GRBs display no visible funnel in the width of 504,4 kb. The number of GRBs identified in human-rhesus is
DI heatmaps, however there is a noisy funnel visible in the rest of the greater than for the other species comparisons used so far, but not ex-
GRBs. Overall these results suggest that kurtosis-based conservation can ceedingly so. For the human-gorilla comparison, however, there were a
identify signatures of non-coding conservation in very closely related very large number of predicted GRBs. In Figure 3C, the average Hi-C DI
species, but that GRB boundary prediction becomes less precise in the is plotted across the predicted GRBs from both sets. We can clearly see
most closely related comparisons. that, for the human-rhesus comparison, the kurtosis-based GRBs have a
Next, we compared our kurtosis-based GRBs to GRBs identified us- stronger peak of the positive and negative DI (at their starts and ends
ing the CNE-based approach described in Harmston et al. 2017. The respectively) than the CNE-based GRBs. There is also a much sharper
boundary effect in the kurtosis-based GRBs, with the DI signal spreading Acknowledgements
well beyond the boundaries of the CNE-based GRBs. In the human- We thank Dr Ge Tan for generating a number of the CNE datasets used in this
gorilla comparison the kurtosis-based GRBs boundaries also coincide analysis, and Dr Nathan Harmston for processing the Hi-C data. We are also grate-
with peaks in the positive and negative DI, while the CNE-based GRBs
ful to Dr Leonie Roos, Dr Anja Baresic, Dr Sasha Murrell and Dr Ben Murrell for
show no enrichment of DI score at either boundary.

Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty1014/5233000 by guest on 19 December 2018


comments on the manuscript.
These results conclusively demonstrate that the kurtosis-based con- Funding
servation measure can identify highly conserved regions of the genome,
even in very closely related species, and that kurtosis-based GRB predic- This work was funded by the Medical Research Council [MC UP 1102/1 to B.L.,
tions recapitulate TAD boundaries better than the CNE-based GRB 1584095 to A.N.]
predictions at these evolutionary timescales. Conflict of Interest: none declared.

4 Discussion References
Akalin,A. et al. (2009) Transcriptional features of genomic regulatory blocks.
In this paper we have defined a novel measure of pairwise sequence
Genome Biol., 10, R38.
conservation based on the kurtosis of the distribution of the lengths of Balanda,K.P. and Macgillivray,H.L. (1988) Kurtosis: A Critical Review. The
sequences perfectly conserved between two genomes. We have shown American Statistician. 42, 111–119.
that the kurtosis-based measure is highly correlated with CNE density Bejerano,G. et al. (2004) Ultraconserved Elements in the Human Genome. Science,
and can be used to generate high quality GRB predictions for moderate 304, 1321–1325.
Bhatia,S. et al. (2014) A survey of ancient conserved non-coding elements in the
to distant species comparisons. Importantly, our method enables accurate PAX6 locus reveals a landscape of interdigitated cis-regulatory archipelagos.
prediction of GRB-scale regulatory domains, but does not identify the Developmental Biology, 387, 214–228.
individual conserved elements themselves. This presents the potential for DeCarlo,L.T. (1997) On the Meaning and Use of Kurtosis. Psychological Methods,
complementary use of kurtosis-based GRB identification and traditional 2, 292–307.
Denoeud,F. et al. (2010) Plasticity of Animal Genome Architecture Unmasked by
CNE identification in future analyses.
Rapid Evolution of a Pelagic Tunicate. Science, 330, 1381-1385.
We have also shown that kurtosis-based GRB prediction far outper- Dixon,J.R. et al. (2012) Topological domains in mammalian genomes identified by
forms CNE-based GRB prediction in closely related species. The identi- analysis of chromatin interactions. Nature, 485, 376–380.
fication of GRBs between human and gorilla is a surprising result as Engström,P.G. et al. (2007) Genomic regulatory blocks underlie extensive micro-
previously it has been impossible to define conserved regulatory do- synteny conservation in insects. Genome Res., 17, 1898–908.
Harmston,N. et al. (2017) Topologically associating domains are ancient fea- tures
mains between such closely related species. Humans and gorillas share that coincide with Metazoan clusters of extreme noncoding conservation. Na-
over 98% of their genome sequence, and so to be able to use sequence ture Communications 8, 441
conservation to define regulatory regions that coincide with TADs is Kikuta,H. et al. (2007) Genomic regulatory blocks encompass multiple neigh-
strong testament to our method’s ability to account for the general back- boring genes and maintain conserved synteny in vertebrates. Genome Res., 17,
545–555.
ground conservation between two genomes.
Kimura-Yoshida,C. et al. (2004) Characterization of the pufferfish Otx2 cis-
Most importantly, unlike CNE-based conservation analysis, our regulators reveals evolutionarily conserved genetic mechanisms for vertebrate
method works without requiring any predefined minimum length or head specification. Development, 131, 57–71.
sequence identity thresholds for a sequence to be considered conserved. Langmead,B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA
Having a threshold-free approach for measuring conservation allows us sequences to the human genome. Genome Biol., 10, R25.
Navratilova,P. et al. (2009) Systematic human / zebrafish comparative identifi
to directly compare the results of species comparisons spanning a range cation of cis-regulatory activity around vertebrate developmental transcription
of evolutionary distances. This feature, combined with the success we factor genes. Developmental Biology, 327, 526–540.
have had in identifying GRB-like structures in extremely closely related Pennacchio,L. et al. (2006) In vivo enhancer analysis of human conserved non-
species, opens up the possibility of systematically investigating the evo- coding sequences. Nature, 444, 499–502.
Rao,S.S.P. et al. (2014) A 3D map of the human genome at kilobase resolution
lutionary dynamics of GRBs in multiple closely related metazoan line-
reveals principles of chromatin looping. Cell, 159, 1665–1680.
ages, potentially yielding a greater understanding of the origin and evo- Ritter,D.I. et al. (2010). The Importance of Being Cis: Evolution of Orthologous
lution of long-range gene regulation in metazoan genomes. Fish and Mammalian Enhancer Activity Research article. Molecular Biology
Further, our method may have utility in the analysis of GRB devel- and Evolution, 27, 2322–2332.
opmental gene regulation in species that have undergone extreme ge- Ross,G.J. (2015) Parametric and nonparametric sequential change detection in R:
The cpm package. Journal of Statistical Software, 66.
nome compaction such as the puffer fish, Tetraodon nigroviridis, and the Ruppert,D. (1987). What is kurtosis? An influence function approach. American
sea squirt, Oikopleura dioica (Denoeud et al. 2010). The tiny size of Statistician, 41, 1–5.
these genome makes it very difficult to define the minimum length a Sandelin,A. et al. (2004) Arrays of ultraconserved non-coding regions span the loci
stretch of conserved sequence should be to be considered a conserved of key developmental genes in vertebrate genomes. BMC Genomics, 5, 99.
Scally,A. et al. (2012) Insights into hominid evolution from the gorilla genome
element, and as described above, comparing the results of this analysis
sequence. Nature, 483, 169–175.
with those performed in larger genomes is problematic. Our method may Spieler,D. et al. (2014). “Restless Legs Syndrome-Associated intronic common
provide the ability to accurately define GRB boundaries in compact variant in Meis1 alters enhancer function in the developing telencephalon”. In:
genomes and therefore deliver insights into the effects of genome com- Genome Research 24.4, pp. 592–603.
paction of long-range gene regulation. Tan,G. (2017) CNEr: CNE Detection and Visualization. R package version 1.16.0.
Woolfe,A. et al. (2005) Highly conserved non-coding sequences are associated
Data with vertebrate development. PLoS Biology, 3, e7.
Zabidi,M. et al. (2014) Enhancer––core-promoter specificity separates develop-
The data generated for this study, and the scripts used to generate the data, can be mental and housekeeping gene regulation. Nature, 518, 556-559.
found at https://github.com/alexander-nash/kurtosis_conservation

You might also like