The Open Bioinformatics Journal, 2010, 4, 41-49
41
Open Access
Differences in Promoters of Orthologous Genes
Youlian Pan*,1, Sieu Phan1, Fazel Famili1, Maria Luz Jaramillo2, Anne Engelina Gesina Lenferink2
and Edwin Wang2
1
Institute for Information Technology, NRC, 1200 Montreal Road, Ottawa, Ontario, K1A 0R6, Canada; 2Biotechnology
Research Institute, NRC, 6100 Royalmount Ave., Montreal, Quebec, H4P 2R2, Canada
Abstract: Human genetic experiments are often conducted based on the orthologous genes in other mammals such as
mouse and rat. The resulting conclusions of such experiments are often limited in their applicability to the human situation. This has raised a question as to why the orthologous genes with closely related or even identical coding regions behave differently in various mammals, and motivated us to study the promoter of these genes. We proposed a functional
promoter similarity index (FPSI) based on the number of putative, but statistically significant associations (p 0.05) between transcription factors and their target orthologous genes. We deduced such association through searching known
transcription factor binding sites from promoters of the genes. The FPSI was validated using microarray gene expression
data. We did pair-wise study of seven vertebrate genomes (human, chimpanzee, mouse, rat, dog, chicken, and zebrafish).
The FPSIs of orthologous genes are generally high between human and chimpanzee, with a mean FPSI of 0.79, but gradually decrease when human is compared to the mouse (0.22), rat (0.2), dog (0.2), chicken (0.13) or zebrafish (0.06). We
then performed an analogous analysis for 2128 human cancer-associated genes and the results were similar, but had
significantly improved FPSIs between these human genes and their orthologs in mouse, rat, and dog. The differences in
the promoter regions of orthologous genes appear to be genome wide and negatively correlated with divergence time of
the organisms. Such correlation suggests that the FPSI could be used as a measure of phylogenetic conservation.
Keywords: Orthologs, Promoter Similarity, Transcription Factor, cis-Regulatory Element, Conservation.
INTRODUCTION
Analysis of transcriptional regulation of a gene is one of
the greatest challenges faced by researchers both in biology
and in computational sciences. The availability of genomic
sequences in public databases, such as the University of
California Santa Cruz (UCSC) genome browser [1] and Ensembl genome browser [2], allows for the prediction of cisregulatory elements in the promoter region of a gene which
gives a glimpse of its transcriptional regulation. Tremendous
efforts have been made in this area by many laboratories
around the world and numerous computational tools have
been developed for the identification of cis-elements and
their binding transcription factors (TFs) over the past decades (see reviews [3-5]). In addition, multiple cis-elements
that interact with the same TF have been identified through
biological experiments. Based on the alignment of these ciselements, a consensus motif and a positional weight matrix
(PWM) can be constructed for each TF. These cis-elements
and PWMs are available in public databases, such as
TRANSFAC [6] and JASPAR [7] and can be used to search
for putative TF binding sites (TFBSs) by PWM-based methods, such as Hidden Markov Model and others as reviewed
in [3-5].
Homologous genes are derived from a common ancestral
gene. Two classes of homology can be defined according to
the mode in which these genes have diverged from their last
*Address correspondence to this author at the Institute for Information
Technology, NRC, 1200 Montreal Road, Ottawa, Ontario, K1A 0R6,
Canada; Tel: 1-613-993-0853; Fax: 1-613-952-0215;
E-mail:
[email protected]
1875-0362/10
common ancestor [8, 9]. The first class consists of the
orthologs, which have diverged through speciation, whereas
the second class, the paralogs, resulted from sequence duplication within the same genome. Nevertheless the two classes
cannot be totally separated since paralogs can give rise to
orthologs through subsequent speciation [10, 11]. In addition, a given gene in one genome can have one or more
orthologs in another genome. Further details of the subcategory terminologies of orthologs and paralogs are reviewed in
[12]. Nonetheless, it should be pointed out that in the genomics community ‘the same gene in different species’ is referred as orthologous genes [13].
Earlier studies on mutations mostly focused on the coding region. For example, single nucleotide polymorphisms
(SNPs), which are single nucleotide mutations within the
coding sequence of a gene, or, as was discovered more recently through the HapMap project, in the intergenic regions
[14]. The focus has now gradually shifted to considering
whether cis-regulatory and coding mutations make different
contributions to the phenotypic difference. Several cases
suggest that some phenotypic changes are more likely to
have resulted from cis-regulatory mutations than from mutations in the coding regions of a gene [15, 16].
Cancer is a prevalent clinical problem in modern society,
which is characterized by uncontrollable cell growth, evasion
of death, immortality and the ability to invade and avoid
detection. The American Cancer Society estimated about
1,529,560 new cancer cases are expected to be diagnosed
and over 569,490 deaths in 2010 [17], whereas Canadian
Cancer Statistics estimated 173,800 new cancer cases and
2010 Bentham Open
42 The Open Bioinformatics Journal, 2010, Volume 4
Pan et al.
76,200 deaths in 2010 [18]. It is well established that mutations in specific genes have been associated with the neoplastic transformation and development of specific cancer
types [19, 20]. Coding regions of many human cancerassociated genes (CAGs) are largely identical to their
orthologs in mouse and other mammalian organisms. Cancer
genes are on average more conserved than other genes [21,
22]. Genes that contribute to cancer fusion are also more
conserved [23]. Similarly, the essential genes are more conserved than the nonessential genes are [24].
genes is mitigated; that is no matter how many copies of a
TFBS appear on the promoter of a gene, as long as they collectively qualify the threshold of significance measure (p
0.05), we consider one association based on Equation (1).
Because this method measures promoter’s functional, rather
than sequential, similarity, we call it Functional Promoter
Similarity Index (FPSI). This approach has been successfully
applied in one of our recent studies [28].
In modern laboratories studying human diseases, metabolic functions, and genetics, experiments are often conducted on the mouse and rat models. However, the resulting
conclusions of such experiments are often limited in their
application to humans [25]. This has raised a question why
the orthologous genes that suppose to perform the same
function behave so differently in different mammals, and
prompted us to study the cis-regulatory elements of orthologous genes.
Promoter sequences (1000 bp upstream and 200 bp
downstream of the Transcription Start Site, RefSeq tables) of
human (version=hg18), chimpanzee (version=panTro2),
mouse (version=mm9), rat (version=rn4), dog (version=
canFam2), chicken (version= galGal3) and zebrafish (version= danRer5) were obtained from UCSC Genome Browser
[1] on May 29, 2009.
We conducted a genome-wide pair-wise comparison between promoters of orthologous genes in human, chimpanzee, mouse, rat, dog, chicken, and zebrafish. In the following
sections, we present the algorithm measuring the promoter
similarity, a brief description of the methods that were used
in this study, result of genome-wide comparison, discussion,
and finally conclusions.
MATERIALS AND METHODOLOGY
We searched the highly significant putative TFBSs from
promoter regions of all genes based on the PWMs obtained
from the TRANSFAC database [6] using Profile Hidden
Markov Model (PHMM) [26]. The PHMM is a well established method for sequence motif search. However, the
PHMM has its drawback of selecting a threshold for significant motifs. We therefore conducted a comparative study
using several PWMs for humans and yeast and proposed a
threshold selection criterion based on single sequence in one
of our earlier papers [27]. In this paper, we selected a stringent threshold (p 0.05) for each of the 573 PWMs. We
then searched the highly significant putative TFBSs from the
promoter regions. We deduced the TF-gene associations
based on these TFBSs and then compared them between the
orthologous genes.
Promoter Similarity Measure Between Orthologs
A putative association between a TF and a gene is deduced based on the existence of a significant putative binding site (BS) of the TF in the promoter region of the gene.
Let and be the sets of distinct TFs that are found to be
associated with the two genes X and Y, respectively. Usually,
and share some common instances. Let n(Z) be the number of TFs in a set Z, the promoter similarity, Sim(X, Y), between two genes X and Y can be defined as:
Sim( X ,Y ) =
n( )
,
n( )
(1)
Since is a subset of and n( ) n( ) ,
the value of Sim(X, Y) is between 0 and 1. Sim(X, Y) is defined in such a way that the influence of the abundance of a
specific TFBS in certain pairs of promoters of orthologous
Data Sources
Although some promoter elements may lie a few tens of
thousands bp upstream of transcription start site (TSS, see
[5] and refs therein), the majority of TFBSs are usually more
concentrated in the first 1000 bp or even in a closer proximity of TSS [5, 29]. Because of the fact that we require a statistical significance level of p 0.05, all motifs of length 7
bp or below are not satisfied (see [27] for details). We included a small proportion of downstream sequence to account for alternative splicing, which result in different TSSs
for the same genes and is obvious in the RefSeq tables. The
difference in TSS for the same gene is believed to be within
200 bp [29].
The orthologous genes were obtained from the NCBI
Homologene Database [30] (build 63, May 21, 2009).
PWMs were obtained from the TRANSFAC professional
database [6], release 2009.1. We retrieved 2128 human
CAGs from [31].
Experiments
Promoter regions of the orthologous genes were compared and their similarity was calculated based on Equation
(1). We first performed a validation of FPSI (Equation 1) by
using microarray data on orthologous gene expression of
lung adenocarcinoma between human and mouse [32] obtained from the ArrayExpress Gene Expression Atlas [33].
We then performed a pair-wise comparison of the seven organisms for all genes listed in the RefSeq tables and labelled
it as the genome-wide promoter comparison. For genes with
multiple paralogs in certain organism, we selected the best
similarity between the orthologs. For example, species A and
B have n and m paralogs, respectively; we did n m pairwise comparisons and the best FPSI among the n m pairs is
selected to represent the similarity between the orthologs.
We then retrieved the FPSIs of the 2128 human CAGs from
the genome-wide dataset. The distributions of the human
CAGs were investigated over the functional promoter similarity profile by dividing the similarity profile into bins of
0.1 FPSI span.
RESULTS
Validation of the Promoter Similarity Index
We divided the microarray gene expression data on
orthologs of lung adenocarcinoma [32] into two groups.
The Open Bioinformatics Journal, 2010, Volume 4
0.26
A
0.24
0.22
0.20
B
0.16
0.15
0.14
Different
Same
Different
Same
Gene expression modulation
Fig. (1). Validation of the FPSI. Error bar = standard error.
Group I consists of genes that have the same gene expression
modulation; both are up- or down-modulated in the orthologous genes. Group II consists of genes that have different
gene expression modulation between the orthologs. We calculate promoter similarity by using Equation (1) and alignment identity using BLASTN between the entire 1200 bp
promoter sequences of the orthologs of human and mouse.
The result indicates that the FPSIs of Group I genes are sigTable 1.
43
brafish. Between human and non-primates, such as mouse,
rat and dog, 80% of the orthologs have a FPSI < 0.4 (Fig. 2,
Supplemental File 2). It is even lower between human and
non-mammal vertebrates, such as chicken and zebrafish
(82% and 93% with a FPSI < 0.3, respectively, Table 1).
This progressive change of overall FPSI among these species
are revealed through FPSI between each pair as shown in
Table 1.
0.17
Alignment identity
Functional promoter similarity index
Promoters of Orthologous Genes
The distribution profile of overall FPSI of orthologous
genes between chimpanzee and other vertebrates largely
resembles that seen between human and these vertebrates
(comparison between Figs. 2, 3A, Table 1). It progressively
worsens when chimpanzee is compared with other mammals
such as mouse and rat, and non-mammals such as chicken
and zebrafish.
An interesting finding of this analysis is that the overall
FPSI between the two rodents species (i.e., mouse and rat) is
quite good at a mean value of 0.374 (underlined in Table 1),
being more than double of each rodent compared to dog (Table 1, Fig. 3B). In general, however, the FPSIs between primates and other non-primate mammals (mouse, rat and dog)
appear to get progressively worsened as they are compared
to non-mammalian vertebrates, such as zebrafish and
Genome-Wide Mean FPSIs Between Each Pair of Organisms
Human
Chimpanzee
Chimpanzee
Mouse
Rat
Dog
Chicken
Zebrafish
0.786
0.218
0.200
0.201
0.130
0.065
0.200
0.189
0.192
0.119
0.066
0.374
0.167
0.122
0.070
0.160
0.127
0.074
0.134
0.073
Mouse
Rat
Dog
Chicken
nificantly higher than those of Group II (Fig. 1). Nevertheless, these two groups cannot be significantly separated by
the identity in sequence alignment. Furthermore, we did
comparison between the FPSIs on one hand and sequence
conservation in the promoters and in the coding sequences
on the other. We found that FPSI does not correlate with
either of them. This experiment indicates that FPSI is a good
measure for functional similarity in promoters.
0.057
chicken (Table 1). Notably, the FPSI is the lowest when the
six vertebrates are compared to zebrafish, over 92% below
0.3 (Supplemental File 2).
The closeness of FPSIs among the 20 pairs of comparison is consistent with the t statistics (Table 2). For example,
0.5
Chimpanzee
Genome-Wide Promoter Similarity Profiles Between
Orthologs
Frequency
Frequency
Using Equation 1, we first examined the frequency distribution of the FPSIs among human genes and their
orthologs in other species (Fig. 2). The number of genes analyzed for each species is available in Supplemental File 1. As
can be expected, the highest FPSI occurs between human
genes and their orthologs in chimpanzee (Table 1), with 68%
of the orthologs having a FPSI 0.8. Furthermore, 27% of
orthologs have identical promoters (FPSI = 1.0, Supplemental File 2), reminiscent of their phylogenetic closeness. As
can be seen in Fig. (2), the FPSI profile of the orthologs seen
with chimpanzee and other species progressively worsens
when humans are compared to other mammals such as
mouse and rat, and non-mammals such as chicken and ze-
0.4
Mouse
Rat
Human vs
0.3
Dog
Chicken
Zebrafish
0.2
0.1
0.0
0
0.2
0.4
0.6
0.8
1
Similarity
Functional promoter
similarity index
Fig. (2). Genome wide FPSI of orthologous genes between human
genes and their orthologs in other vertebrates.
44 The Open Bioinformatics Journal, 2010, Volume 4
Pan et al.
mates compared to the others. On the other hand, when the
two primates compared to the three non-primates mammals,
their promoter similarities are very close (Rows 1 & 2 in
Table 1). This is also revealed through t statistics (Table 2).
This observation prompted us to study the relationship between the FPSI and time of divergence between the pair of
organisms in question. We used the TimeTree Knowledge
Base [34, 35] to estimate the divergence time. The promoter
similarity appears to be negatively correlated with the time
of divergence between the pair (Fig. 4, Supplemental File 3).
0.5
Mouse
Frequency
Frequency
0.4
A
Rat
0.3
Chimpanzee vs.
Dog
Chicken
0.2
Zebrafish
0.1
0.0
0
0.2
0.4
0.6
0.8
Similarity
0.4
B
1
Rat
Dog
Mouse vs.
0.3
Chicken
Zebrafish
0.2
0.1
Frequency
Frequency
0.0
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0
0
0.2
0.4
0.6
Similarity
0.8
1
Rat vs Dog
Rat vs Chicken
C
Rat vs Zebrafish
Dog vs Chicken
Dog vs Zebrafish
Chicken vs Zebrafish
0
0.2
0.4
0.6
0.8
500
Divergence time (Mya)
Frequency
Frequency
0.5
400
200
100
0
1
0.0
Similaritysimilarity index
Functional promoter
H&M
H&R
H&D
H&C2
H&Z
C1&M
C1&R
C1&D
0.4
0.6
0.8
Fig. (4). Correlation between genome-wide FPSI and time of divergence.
human and chimpanzee are closest. Their FPSIs are the best
and far ahead of the other 19 pairs, the t statistics also indicate their FPSIs are drastically different from the other 19
pairs. Similarly, the second closest pair is between the two
rodents, mouse and rat. Their FPSIs are second to the human-chimpanzee comparison, and far ahead of the remaining
18 pairs. The t statistics indicate that they are significantly
different from the other 19 pairs; and their differences from
other pairs are not as drastic as those between the two pri-
H&C1
0.2
Functional promoter similarity index
Fig. (3). Genome wide FPSI of orthologous genes in vertebrates
other than human.
Table 2.
y = 6.1528x-1.66
R² = 0.9416
300
Comparison of Human CAGs with their Orthologs in
other Vertebrates
The genome-wide FPSI profiles reveal some genes
within a genome are more conserved than others. In this regard, we took human CAGs as an example. The comparison
between human CAGs (see methods) and their orthologs in
chimpanzee indicates that these genes do not show a signifi-
t Statistics Between FPSIs Among the 20 Pairs of Comparisons
H&M
H&R
H&D
H&C2
H&Z
C1&M
C1&R
C1&D
C1&C2
C1&Z
M&R
M&D
M&C2
M&Z
R&D
R&C2
R&Z
D&C2
D&Z
C2&Z
177.1
133.6
73.1
162.6
241.1
174.8
126.1
68.1
159.4
232.1
81.0
83.8
164.3
238.9
59.3
111.5
186.5
64.2
104.2
202.1
4.5
2.2
24.7
66.6
6.4
6.7
3.1
26.7
62.2
33.0
7.2
27.0
64.2
5.6
16.3
43.6
8.5
22.1
52.7
0.1
15.0
35.9
0.1
2.1
1.0
17.1
34.7
31.1
4.3
16.8
34.5
3.8
11.6
28.3
6.4
17.6
33.5
8.6
17.7
0.1
1.4
0.8
10.0
17.4
19.8
3.3
9.6
17.0
3.2
8.0
15.8
5.4
13.0
18.1
19.5
19.0
11.8
6.9
2.6
18.6
45.8
4.8
1.9
18.0
2.7
0.6
13.8
0.3
8.2
18.8
54.3
29.8
15.0
15.2
0.5
67.5
14.5
17.0
2.4
9.2
11.3
2.9
6.9
1.3
2.8
2.5
1.0
21.2
51.1
36.0
4.7
21.2
52.2
3.9
13.0
36.7
6.7
19.2
44.7
0.3
13.8
28.9
31.4
2.8
13.5
28.6
2.7
9.5
24.0
5.3
15.6
28.5
8.2
14.8
19.4
2.3
7.8
14.4
2.4
6.6
13.4
4.5
11.4
15.5
14.5
46.9
6.3
0.8
13.8
3.8
1.3
10.5
1.4
6.4
15.1
66.2
14.2
16.2
1.7
9.0
10.9
2.4
6.8
1.1
3.1
25.3
47.3
66.3
19.3
36.0
58.2
22.4
39.2
63.3
5.9
13.8
0.6
4.6
12.5
2.8
10.0
15.0
15.5
3.5
0.7
11.7
1.1
7.0
16.6
8.7
10.4
1.3
6.4
0.5
4.5
2.9
8.1
1.8
7.2
9.7
8.8
0.6
6.5
11.9
C1&C2
C1&Z
M&R
M&D
M&C2
M&Z
R&D
R&C2
R&Z
D&C2
D&Z
Note: H=human, C1=chimpanzee, M=mouse, R=rat, D=dog, C2=chicken, Z=zebrafish.
5.9
0.1
4.6
5.2
7.6
2.4
Promoters of Orthologous Genes
Table 3.
The Open Bioinformatics Journal, 2010, Volume 4
45
Mean FPSIs Between Human Cancer-Associated Genes (CAGs), Non-CAGs, and Transcription Factor Genes (TFs), and
their Orthologs in Chimpanzee, Mouse, Rat, Dog, Chicken, and Zebrafish
Chimpanzee
Mouse
Rat
Dog
Chicken
Zebrafish
CAGs
0.782
0.237
0.219
0.236
0.137
0.069
non-CAGs
0.787
0.216
0.197
0.193
0.129
0.064
TFs
0.792
0.330
0.278
0.349
0.225
0.084
Note: CAGs: cancer-associated genes, TFs: transcription factor genes. Underlined: significantly higher (mean ± SE, SE: standard error) than the mean FPSIs of non-CAG or genomewide mean.
cantly altered FPSI than other genes (Table 3). However, as
we can see from Table 1, the overall FPSI is already very
high between these two primates. Nevertheless, when human
CAGs compared with their orthologs in the two rodents and
dog, their FPSIs appear to be significantly better than other
genes. We then studied the relative distribution of CAGs
over all genes in various FPSI regions and found a higher
relative density of CAGs at the higher FPSI regions between
human CAGs and their orthologs in mouse and rat (Fig. 5).
The density of CAGs is the proportion of these genes
over the entire gene population in the bin. It is noted from
Fig. (5), panels B and C, that the density appears to increase
Density
A
0.08
0.04
0.00
0.16
Density
B
0.12
0.08
0.04
0.00
0.24
Density
0.20
The mean values presented in Tables 1 and 3 are the
arithmetic average of the individual FPSI values. We also
calculated the weighted means and present the result in the
Supplemental File 4. The weighted mean values are smaller
than those presented in Tables 1 and 3, but the overall trend
is the same.
Functional Characterization
0.16
0.12
with an increasing FPSI up to 0.7 and 0.6, respectively, and
then it becomes unsteady beyond this point. This unsteadiness is due to the fact that the number of orthologs in these
FPSI regions is very small, only about 5% of genes (see Fig.
2); the values become noisier when the total number of genes
in that FPSI region becomes smaller.
C
0.16
0.12
The promoter difference between orthologs appears to be
related to the time of divergence between the species. The
promoters of CAGs appear to be more conserved than others.
In order to find whether the promoter conservation is related
with biological functions, we identified two groups of genes,
one with more conserved promoters and the other with less
conserved promoters. The more conserved gene group consists of 62 human genes whose FPSIs are higher than 0.7
when human is compared with both chimpanzee and mouse.
The less conserved gene group consists of 171 human genes
whose FPSIs are 0.0 (no common TF) when human is compared with both chimpanzee and mouse (Supplemental File
5). We then performed a gene functional characterization of
each group using Gene Ontology AnaLyzer (GOAL) [36].
This reveals that the genes with more conserved promoters
are closely related with various developmental process
(GO:0007275, GO:0032502, GO:0048856, GO:0048731,
GO:0009653, GO:0048513, GO:0001822), while genes with
less conserved promoters are related with regulation of
transport, protein binding and signalling (GO:0051050,
GO:0005515, GO:0007242, GO:0019932, GO:0032501) and
their functional representations are less significant even
though there are twice number of genes in this group than in
the most conserved group (Supplemental File 5).
0.08
DISCUSSION
0.04
In this study, we applied very stringent threshold (p
0.05) in searching the TFBSs based on its probability on a
single sequence (1200 bp) [27]. For such reason, we did not
consider a motif of 7 bp or less. This study indicates that
FPSI is closely related with microarray gene expression
modulation among the orthologous genes. But the identity at
the sequence level revealed by BLASTN alignment does not
correlate with microarray gene expression modulation. This
is not surprising because the functional cis-regulatory ele-
0.00
Functional promoter similarity index
Fig. (5). Density of cancer-associated genes over various regions of
the FPSI. A: human vs. chimpanzee, B: human vs. mouse, C: human vs. rat.
46 The Open Bioinformatics Journal, 2010, Volume 4
ments imbedded in the promoter are typically short (~6-20
bp) and usually not detected by BLAST. Also, a TF could
bind to several very different sequence motifs. For such reason, we chose TFs that bind to these cis-regulatory elements
to measure the functional similarity of the promoters.
Functional TFBSs are often position specific [29, 37].
Given limited availability of information in known functional regions of some TFBSs, which are not exclusive and
our search space are small (1200 bp), we decided not to consider the positions of TFBSs in Equation 1. Instead, we used
a very stringent threshold in motif search. Recent research
also indicates that some functional TFBSs are phylogenetically conserved across various species and many such methods have been developed over the past decades (see [5] and
refs therein). For example, the COmparative Regulatory Genomics (CORG) platform [38] contains promoters and 5’
UTR regions of 16,127 groups of orthologous genes from 5
vertebrate species (human, mouse, rat, fugu, and zebrafish).
Some TFBSs, such as Erg-1 site, are conserved only within
mammals, but diverged in fish. Phylogenetic footprinting in
functional motif discovery has been successful. However, it
is a fact that many TFs bind to different motifs [5-7]. Given
limited information in this regard, restricting motif search by
phylogenetic conservation would eliminate some potential
association between TFs and their target genes. Such restriction would hinder us from discovering the divergence of
promoter function between orthologs. Nevertheless, without
considering these two factors described here, our FPSIs are
significantly related with microarray gene expression modulation. This indicates that our decision on the measure of
functional promoter similarity is sound.
The overall FPSIs between orthologous genes detected in
this study are not high. At the outset of this study, we were
interested in determining whether gene regulation may account in large part for the phenotypic differences observed
amongst species. This study indicates that FPSIs between the
orthologs vary significantly among the species studied here.
Even in the two closely related primates, human and chimpanzee, only 27% of promoters are identical between the
orthologs. This finding that the cis-regulatory machineries
are less likely conserved among species than their protein
coding regions, suggest that regulation of gene products
rather than the composition/sequence of these proteins may
account for many of the phenotypic differences among species. This result may account for the finding that surveys of
developmental gene expression often reveal differences in
timing, location and level, even among closely related species (see [39] and refs wherein). Indeed, several cases exist
which indicate that some phenotypic changes are more likely
to have resulted from cis-regulatory mutations than from
mutations in the coding regions of a gene (see [16] and refs
therein). For example, it was recently reported that only
about half of mRNA transcripts of the one-to-one orthologs
were detected in placental labyrinth of both human or mouse
[40]. We found that the mean FPSI of this group of orthologs
is significantly higher than that of the other orthologs. This
indicates that the developmental genes are more conserved in
promoter, a consistent observation with [41].
Genes encoding TFs are generally highly conserved [42].
In this study, we found that the promoter regions of TF genes
are more conserved than other genes (Table 3). Earlier stud-
Pan et al.
ies revealed that more than 26% of the known cancer genes
are actually TFs [43]. This conservation is evident both in
sequence and in function [44]. At the structural level, the
DNA-binding domains of many orthologous TFs are very
comparable over large phylogenetic distances, allowing them
to bind to identical DNA motifs and regulate the same target
genes. Additionally, some TFs can bind on different motifs
and perform the same function. For example, many Drosophila genes with maternally inherited transcripts were found
to have alternative promoters utilized later in development
[45]; human TF MBD1 can bind on four different motifs
(TRASFAC: R25533, R25534, R25535 and R25536) [6].
Our approach in calculating promoter similarity is distinguished by the fact that we choose the calculation to be
based on the TFs, which are most likely bound to the promoter, rather than on the DNA motif. This would reduce
artefacts caused by the difference at the sequence level due
to the fact that the same TF could bind to several cisregulatory elements that are very different at the sequence
level [5-7].
However, in the tissue-specific transcriptional regulation,
this conservation does not always exist. A recent discovery
from liver-specific TFs (FOXA2, HNF1A, HNF4A and
HNF6) reveals that the cis-regulatory network diverged extensively between mouse and human orthologs. Despite the
conserved functions of these TFs, 41-89% of their BSs appear to be species specific [15].
Mutations in the coding region of orthologous genes are
well studied in the context of their associations with certain
diseases such as cancer [20, 43, 46]. In light of this, we were
interested in determining whether promoter regions of the
CAGs differed as compared to the majority of other orthologous genes. Our results show that CAGs tended to be more
conserved in their promoter functions (Table 3, Fig. 5). Previous work shows that cancer genes and essential genes are
more conserved in function and in coding sequences than
other genes [21-24]. The corresponding conservation of
promoter function suggests that regulation of the timing,
location and expression levels of these essential genes plays
a critical basic role in growth regulation and differentiation
across species [39]. This hypothesis is supported by the various roles of them indicating that the CAGs have more conserved promoter functions. These span a range of biological
functions (Supplemental File 6) including protein kinase
activity (GO:0004672), cell cycle processes (GO:0022402),
phosphotransferase activity (GO:0016773), phosphatase activity (GO:0016791) and cell differentiation (GO:0030154).
The fact that these genes play different functional roles is in
accordance with the view that cancer is caused by defects in
genes from multiple functional categories, according to Hanahan and Weinberg’s hallmarks of cancer [47].
It is interesting to note that the functional promoter similarity of orthologous genes is correlated with the divergence
time of the pair of organisms under consideration. Even
though the coding regions of the orthologous genes are very
similar or even identical, their promoter regions can be very
different. For example, a pair-wise comparison between human and mouse or between human and rat reveals that the
FPSIs of the majority (70%) of these orthologs are below
0.3 (Fig. 2, Supplemental File 2). This observation also holds
true for the chimpanzee-mouse and chimpanzee-rat compari-
Promoters of Orthologous Genes
The Open Bioinformatics Journal, 2010, Volume 4
47
sons (Fig. 3A). This prompted us to investigate the promoter
similarity between the two rodents: mouse and rat. Not unexpectedly, the FPSI between these two rodent species
(mouse and rat) is significantly better than comparing each
of them to either of the primates or to dog (Figs. 2, 3; Table
1). This is consistent with their time of divergence as estimated from the TimeTree Knowledge Base [34, 35]. For
example, the divergence time between mouse and rat is 25
Mya, which is about 1/4 of that between dog and each of the
two rodents (98 Mya). The relative divergence time is inversely proportional to the FPSI, which is more than double
(Table 1, Fig. 4). It is more interesting to note that the shape
of FPSI distribution curve of the Human-Chimpanzee pair is
very different from the other pair-wise comparisons (Fig. 2).
This is consistent with the time of divergence. The divergence time between human and chimpanzee is 6.5 Mya,
while that of other pair-wise distances (excepting the mouserat pair) is at least one order of magnitude higher (Supplemental file 3). The good correlation between FPSI and the
divergence time suggests that the FPSI could be a measure
for phylogenetic conservation. However, the number of species examined in this study is moderate; this remains to be
explored by studying additional organisms and in a broader
scale.
Biological Sciences, NRC, and Xuhua Xia, University of
Ottawa, for the constructive comments on this work. This
research is jointly supported by NRC’s Genomics and Health
Initiative and NRC’s Institute for Information Technology.
This study explored the differences in cis- and transregulatory elements between the orthologous genes. Structural differences, such as chromosomal rearrangements,
segmental duplications and copy numbers, are other important contributing factors and worth exploring as indicated in
[50-53].
CONCLUSIONS
We proposed a functional promoter similarity index to
measure similarity in transcriptional regulation between
orthologs in human, chimpanzee, mouse, rat, dog, chicken,
and zebrafish. This index is significantly correlated with
microarray gene expression modulation. This study shows
that the promoter functions are significantly different between orthologous genes although protein coding sequences
closely resemble to each other. The CAGs tend to be more
conserved in the promoter function as compared with other
genes. The high degree of functional promoter similarity of
CAGs suggests that their regulation is essential for growth
and development across different species.
There is a general understanding that promoter of a gene
is less conserved than its coding region [16, 39 and refs
therein]. This is evident from studies of individual genes [48,
49]. However, there were not much of genome-wide studies
in this regard. This study enlightens that such differences are
genome-wide across various vertebrates. We also found that
the genome-wide promoter dissimilarity between orthologs
is closely correlated with the time of divergence between the
organisms under consideration and is higher when comparing human to chimpanzee than when comparing either of the
primates to a rodent or another vertebrate. Such close correlation indicates that FPSI could be used as a measure of
phylogenetic conservation. This merits further study.
ACKNOWLEDGEMENTS
We are grateful to the two anonymous referees for the
constructive comments and criticism on the manuscript. We
are thankful to our colleagues Brandon Smith, Institute for
ABBREVIATIONS
BS
= Binding site
CAG
= Cancer-associated gene
DNA
= Deoxyribonucleic acid
FPSI
= Functional promoter similarity index (defined by
Equation 1)
NRC
= National Research Council Canada
PHMM = Profile Hidden Markov Model
PWM
= Positional weight matrix
SNP
= Single nucleotide polymorphism
TF
= Transcription factor
TFBS
= Transcription factor binding site
TSS
= Transcription start site
UCSC
= University of California, Santa Cruz
UTR
= Untranslated region
SUPPLEMENTARY MATERIAL
Supplementary material is available on the publishers
Web site along with the published article.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
B. Rhead, D. Karolchik, R. M. Kuhn, A. S. Hinrichs, A. S. Zweig,
P. A. Fujita, M. Diekhans, K. E. Smith, K. R. Rosenbloom, B. J.
Raney, A. Pohl, M. Pheasant, L. R. Meyer, K. Learned, F. Hsu, J.
Hillman-Jackson, R. A. Harte, B. Giardine, T. R. Dreszer, H. Clawson, G. P. Barber, D. Haussler, and W. J. Kent, “The UCSC Genome Browser Database: update 2010”, Nucleic Acids Res., vol. 38,
pp. D613-D619, 2010.
T. J. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E.
Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S.
Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S.
Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N.
Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B.
Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater,
D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J.
Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham,
V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A.
Kasprzyk, G. Proctor, J. Smith, S. Searle, and P. Flicek., “Ensembl
2009”. Nucleic Acids Res., vol. 37, pp. D690-D697, 2009.
W. W. Wasserman and A. Sandelin, “Applied bioinformatics for
the identification of regulatory elements”, Nat. Rev. Genet., vol. 5,
pp. 276-287, 2004.
M. Tompa, N. Li, T. L. Bailey, G. M. Church, B. De Moor, E.
Eskin, A. V. Favorov, M. C. Frith, Y. Fu, W. J. Kent, V. J.
Makeev, A. A. Mironov, W. S. Noble, G. Pavesi, G. Pesole, M.
Régnier, N. Simonis, S. Sinha, G. Thijs, J. van Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, and Z. Zhu, “Assessing
computational tools for the discovery of transcription factor binding sites”, Nat. Biotechnol., vol. 23, pp. 137-144, 2005.
Y. Pan, “Advances in the discovery of cis-regulatory elements”,
Curr. Bioinformatics, vol. 1, pp. 321-336, 2006.
V. Matys, O. V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A.
Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N.
Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. E. Kel, and
E. Wingender, “TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes”, Nucleic Acids Res., vol.
34, pp. D108-D110, 2006.
48 The Open Bioinformatics Journal, 2010, Volume 4
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
A. Sandelin, W. Alkema, P. Engstrom, W. W. Wasserman, and B.
Lenhard, “JASPAR: an open-access database for eukaryotic transcription factor binding profiles”, Nucleic Acids Res., vol. 32, pp.
D91-D94, 2004.
W. M. Fitch, “Distinguishing homologous from analogous proteins”, Syst. Zool., vol. 19, pp. 99-113, 1970.
T. Gabaldón, “Large-scale assignment of orthology: back to phylogenetics?” Genome Biol., vol. 9, article 235, 2008.
W.M. Fitch, “Homology – a personal view on some of the problems”, Trend Genet., vol. 16, pp. 227-231, 2000.
R. A. Jensen, “Orthologs and paralogs – we need to get it right”
(Included also respond from J. Gerlt and P. Babbitt), Genome Biol.,
vol. 2, pinteractions1002.1-1002.3, 2001.
E. V. Koonin, “Orthologs, paralogs, and evolutionary genomics”,
Annu. Rev. Genet., vol. 39, pp. 309-338, 2005.
E. V. Koonin, “An apology for orthologs – or brave new memes”,
Genome Biol., vol. 2, comment 1005, 2001.
The International HapMap Consortium, “A second generation
human haplotype map of over 3.1 million SNPs”, Nature, vol. 449,
pp. 851-862, 2007.
D. T. Odom, R. D. Dowell, E. S. Jacobsen, W. Gordon, T. W.
Danford, K. D. MacIsaac, P. A. Rolfe, C. M. Conboy, D. K. Gifford, and E. Fraenkel, “Tissue-specific transcriptional regulation
has diverged significantly between human and mouse”, Nat.
Genet., vol. 39, pp. 730-732, 2007.
G. A. Wray “The evolutionary significance of cis-regulatory mutations”, Nat. Genet., vol. 8, pp. 206-216, 2007.
American Cancer Society, Cancer facts & figures 2010. American
Cancer Society Inc, Atlanta, 2010.
Canadian Cancer Society’s Steering Committee, Canadian Cancer
Statistics 2010. Toronto (ISSN 0835-2976), 2010.
B. Vogelstein and K. W. Kinzler, “Cancer genes and the pathways
they control”, Nat. Med., vol.10, pp. 789-799, 2004.
P. A. Futreal, L. Coin, M. Marshall, T. Down, T. Hubbard, R.
Wooster, N. Rahman, and M. R. Stratton, “A consensus of human
cancer genes”, Nat. Rev. Cancer., vol. 4, pp. 177-183, 2004.
M.A. Thomas, B. Weston, M. Joseph, W. Wu, A. Nekrutenko, and
P. J. Tonellato, “Evolutionary dynamics of oncogenes and tumor
suppressor genes: higher intensities of purifying selection than
other genes”, Mol. Biol. Evol., vol. 20, pp. 964-968, 2003.
S.J. Furney, D.G. Higgins, C.A. Ouzounis, and N. López-Bigas,
“Structural and functional properties of genes involved in human
cancer”, BMC Genomics, vol. 7, article 3, 2006.
S. Narsing, Z. Jelsovsky, A. Mbah, and G. Blanck, “Genes that
contribute to cancer fusion genes are large and evolutionarily conserved”, Cancer Genetics and Cytogenetics, vol. 191, pp. 78-84,
2009.
I.K. Jordan, I.B. Rogozin, Y.I. Wolf, V. Eugene, and E.V. Koonin,
“Essential genes are more evolutionarily conserved than are nonessential genes in bacteria”, Genome Res., vol. 12, pp. 962-968, 2002.
C.W. Hay and K. Docherty, “Comparative analysis of insulin gene
promoters: implications for diabetes research”, Diabetes, vol. 55,
pp. 3201-3213, 2006.
S.R. Eddy, “Profile hidden Markov models”, Bioinformatics, vol.
14, pp. 755-763, 1998.
Y. Pan and S. Phan, “Threshold for positional weight matrix”, Eng.
Lett., vol. 16, pp. 498-504, 2008.
Z. Liu, S. Phan, F. Famili, Y. Pan, A. E. G. Lenferink, C. Cantin,
C. Collins, and M.D. O’Connor-Mccourt, “A multi-strategy approach to informative gene identification from gene expression
data”, J. Bioinform. Comput. Biol., vol. 8, 19-23, 2010.
B. Smith, H. Fang, Y. Pan, P. R. Walker, A. F. Famili, and M.
Sikorska, “Evolution of motif variants and positional bias of the
cyclic-AMP response element”, BMC Evol. Biol., vol. 7, article
S15, 2007.
The NCBI Homologene Database. [online] available:
ftp://ftp.ncbi.nih.gov/pub/HomoloGene/build63/, [accessed: May
21, 2009].
Q. Cui, Y. Ma, M. Jaramillo, H. Bari, A. Awan, S. Yang, S. Zhang,
L. Liu, M. Lu, M. O'Connor-McCourt, E. O. Purisima, and E.
Wang, “A map of human cancer signalling”. Mol. Syst. Biol., vol.
3, article 152, 2007.
R. S. Stearman, L. Dwyer-Nield, L. Zerbe, S. A. Blaine, Z. Chan,
P. A. Bunn Jr, G. L. Johnson, F. R. Hirsch, D. T. Merrick, W. A.
Franklin, A. E. Baron, R. L. Keith, R. A. Nemenoff, A. M. Malkinson, and M. W. Geraci, “Analysis of orthologous gene expression
Pan et al.
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
between human pulmonary adenocarcinoma and a carcinogeninduced murine model”, Am. J. Clin. Pathol., vol. 167, pp. 17631775, 2005.
M. Kapushesky, I. Emam, E. Holloway, P. Kurnosov, A. Zorin, J.
Malone, G. Rustici, E. Williams, H. Parkinson, and A. Brazma,
“Gene Expression Atlas at the European Bioinformatics Institute”,
Nucleic Acids Res., vol. 38, pp. D690-D698, 2010.
S.B. Hedges, J. Dudley, and S. Kumar, “TimeTree: a public knowledge-base of divergence times among organisms”, Bioinformatics,
vol. 22, pp. 2971-2972, 2006.
S.B. Hedges and S. Kumar, Eds., The Timetree of Life. Oxford:
Oxford University Press, 2009.
A.B. Tchagang, A. Gawronski, H. Bérubé, S. Phan, F. Famili, and
Y. Pan, “GOAL: A software tool for assessing biological significance of genes group”, BMC Bioinformatics, vol. 11, article 229,
2010.
E. Blanco, X. Messeguer, T. F. Smith, and R. Guigó, “Transcription factor map alignment of promoter regions”, PLoS Comput.
Biol., vol. 2, article e49, 2006.
C. Dieterich, S. Grossmann, A. Tanzer, S. Röpcke, P. F. Arndt, P.
F. Stadler, and M. Vingron, “Comparative promoter region analysis
powered by CORG”, BMC Genomics, vol. 6, article 24, 2005.
G. A. Wray, M. W. Hahn, E. Abouheif, J. P. Balhoff, M. Pizer, M.
V. Rockman, and L. A. Romano, “The evolution of transcriptional
regulation in eukaryotes”, Mol. Biol. Evol., vol. 20, pp. 1377-1419,
2003.
B. Cox, M. Kotlyar, A. I. Evangelou, V. Ignatchenko, A. Ignatchenko, K. Whiteley, I. Jurisica, S. L. Adamson, J. Rossant, and
T. Kislinger, “Comparative systems biology of human and mouse
as a tool to guide the modelling of human placental pathology”,
Mol. Syst. Biol., vol. 5, article 279, 2009.
A. Woolfe, M. Goodson, D. K. Goode, P. Snell, G. K. McEwen, T.
Vavouri, S. F. Smith, P. North, H. Callaway, K. Kelly, K. Walter, I.
Abnizova, W. Gilks, Y. J. Edwards, J. E. Cooke, and G. Elgar,
“Highly conserved non-coding sequences are associated with vertebrate development”, PLoS Biol., vol. 3, article e7, 2005.
K. S. Zaret, “Regulatory phases of early liver development: paradigms of organogenesis”, Nat. Rev. Genet., vol. 3, pp. 499-512,
2002.
X. S. Puente, G. Velasco, A. Gutierrez-Fernandes, J. Berranpetit,
M.-C. King, and C. Lopez-Otin, “Comparative analysis of cancer
genes in the human and chimpanzee genomes”, BMC Genomics,
vol. 7, article 15, 2006.
R.P. Zinzen, and E. E. M. Furlong, “Divergence in cis-regulatory
networks: taking the ‘species’ out of cross-species analysis”, Genome Biol., vol. 9, article 240, 2008.
E.A. Rach, H.Y. Yuan, W.H. Majoros, P. Tomancak, and U. Ohler,
“Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome”, Genome Biol., vol. 10, article R73, 2009.
P.C. Chen, S. Dudley, W. Hagen, D. Dizon, L. Paxton, D.
Reichow, S.R. Yoon, K. Yang, N. Arnheim, R. M. Liskay, and S.
M. Lipkin, “Contributions by MutL homologues Mlh3 and Pms2 to
DNA mismatch repair and tumor suppression in the mouse”, Cancer Res., vol. 65, pp. 8662-8670, 2005.
D. Hanahan and R. A. Weinberg, “The hallmarks of cancer”, Cell,
vol. 100, pp. 57-70, 2000.
V. F. Bumaschny, F. S. de Souza, R. A. López Leal, A. M. Santangelo, M. Baetscher, D. H. Levi, M. J. Low, and M. Rubinstein,
“Transcriptional regulation of pituitary POMC is conserved at the
vertebrate extremes despite great promoter sequence divergence”,
Mol. Endocrinol., vol. 21, pp. 2738-2749, 2007.
M. Zhan, T. Miura, X. Xu, and M. S. Rao, “Conservation and
variation of gene regulation in embryonic stem cells assessed by
comparative genomics”, Cell Biochem. Biophys., vol. 43, pp. 379405, 2005.
R. Blekhman, A. Oshlack, and Y. Gilad, “Segmental duplications
contribute to gene expression differences between humans and
chimpanzees”, Genetics, vol. 182, pp. 627-630, 2009.
L. Dumas, Y. H. Kim, A. Karimpour-Fard, M. Cox, J. Hopkins, J.
R. Pollack, and J. M. Sikela, “Gene copy number variation spanning 60 million years of human and primate evolution”, Genome
Res., vol. 17, pp.1266-1277, 2007.
L. Hu, D. Segrè, and T. F. Smith, “Evolutionary changes in gene
regulation from a comparative analysis of multiple Drosophila species”, Genome Inform., vol. 18, pp. 12-21, 2007.
Promoters of Orthologous Genes
[53]
The Open Bioinformatics Journal, 2010, Volume 4
L. Huminiecki and K. H. Wolfe, “Divergence of spatial gene expression profiles following species-specific gene duplications in
Received: July 13, 2010
49
human and mouse”, Genome Res., vol. 14, pp. 1870-1879, 2004.
Revised: August 24, 2010
Accepted: October 01, 2010
© Pan et al.; Licensee Bentham Open.
This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the
work is properly cited.