Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction
Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction
Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction
3, JUNE 2008
Abstract—Genomic sequence processing has been an active codon “ATG.” Looking from the end of DNA (upstream) to
area of research for the past two decades and has increasingly its end (downstream), the exon-to-intron border is known as
attracted the attention of digital signal processing researchers in the donor splice site and consists of a consensus dinucleotide
recent years. A challenging open problem in deoxyribonucleic
acid (DNA) sequence analysis is maximizing the prediction ac- “GT” as the first two nucleotides of the intron, whereas the
curacy of eukaryotic gene locations and thereby protein coding intron-to-exon border is known as the acceptor splice site,
regions. In this paper, DNA symbolic-to-numeric representations which consists of a consensus dinucleotide “AG” as the last
are presented and compared with existing techniques in terms of two nucleotides of the intron. The accurate identification of
relative accuracy for the gene and exon prediction problem. Novel genomic protein coding regions, along with the recognition of
signal processing-based gene and exon prediction methods are
then evaluated together with existing approaches at a nucleotide other signals and/or regions (shown in Fig. 1) would result in
level using the Burset/Guigo1996, HMR195, and GENSCAN an ideal gene finding and annotation system.
standard genomic datasets. A new technique for the recognition Despite the existence of various data-driven gene finding
of acceptor splice sites is then proposed, which combines signal programs, such as AUGUSTUS [1], FGENES [2], geneid [3],
processing-based gene and exon prediction methods with an GeneMark.hmm [4], Genie [5], GENSCAN [6], HMMgene
existing data-driven statistical method. By comparison with the
acceptor splice site detection method used in the gene-finding pro- [7], Morgan [8], and MZEF [9], the accuracy of gene predic-
gram GENSCAN, the proposed DSP-statistical hybrid technique tion is still limited. Previous investigations of computational
reveals a consistent reduction in false positives at different levels gene finding programs [10]–[13] reveal that these data-driven
of sensitivity, averaging a 43% reduction when evaluated on the approaches seem to rely more on compositional statistics of the
GENSCAN test set. sequences (e.g., content) than the genomic signals (e.g.,
Index Terms—Autoregressive processes, correlation, deoxyri- promoters, acceptor/donor sites, start/stop codons) involved in
bonucleic acid (DNA), discrete cosine transforms (DCTs), discrete the translation process from DNA to protein, and are heavily
Fourier transforms (DFTs), Gaussian mixture models. dependent on the statistics of the sequences they learn from
and are, thus, not equally suitable for all types of sequences.
I. INTRODUCTION Furthermore, the accuracy is dependent on the length and
position of the exons [14], [15]. High prediction accuracy can
EOXYRIBONUCLEIC acid (DNA), the material of often be attributed to friendly training and test sequences, in
D heredity in most living organisms, consists of genic
and intergenic regions, as shown in Fig. 1. In eukaryotes,
whose formation certain rules were followed, such as including
sequences consisting of one complete gene with consensus
genes are further divided into relatively small protein coding intronic dinucleotides “GT” and “AG,” respectively, for their
segments known as exons, interrupted by noncoding spacers donor and acceptor splice sites, excluding those containing al-
known as introns. In eukaryotes such as human, the intergenic ternatively spliced genes and having any in-frame stop codons.
and intronic regions often make up more than 95% of their However, it has been observed that gene prediction accuracy
genomes. Codons (i.e., triplets of possible four types of DNA can be substantially increased by combining different methods
nucleotides , , , and ) in exons encode 20 amino acids [16], [17]. The discrete nature of the DNA information, being
and 3 terminator signals, known as stop codons (i.e., TAA, discrete in both “time” and “amplitude,” invites investigation
TAG, and TGA). Initial exons of the genes begin with a start by digital signal processing (DSP) techniques. The conversion
of DNA nucleotide symbols into discrete numerical values en-
Manuscript received September 11, 2007; revised March 8, 2008. This work
ables novel and useful DSP-based applications for the solu-
was supported in part by the National University of Sciences and Technology tion of different sequence analysis related problems such as
(NUST), Pakistan, and in part by the University of New South Wales (UNSW), gene finding and annotation, and such applications have been
Australia, under a UNSW Faculty Research Grant 2007 for genomic signal pro-
cessing research. The associate editor coordinating the review of this manuscript
overviewed by previous authors [18], [19]. The present role
and approving it for publication was Dr. Ioan Tabus. of DSP applications in this area is summarized in Fig. 2. In
M. Akhtar is with the National University of Sciences and Technology, order to apply DSP methods, the DNA sequences are first con-
Rawalpindi Cantt, Pakistan, and also with the University of New South Wales,
Sydney, NSW 2052, Australia (e-mail: [email protected]).
verted into suitable numeric values. DSP-based methods for pe-
J. Epps and E. Ambikairajah are with the School of Electrical Engineering riodicity detection are then applied to the numerical sequences
and Telecommunications, University of New South Wales, Sydney, NSW 2052, to obtain 1-D or multidimensional features. The resultant fea-
Australia (e-mail: [email protected], [email protected]). tures are then passed on for back-end processing to classify be-
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. tween protein coding and noncoding regions. An empirically
Digital Object Identifier 10.1109/JSTSP.2008.923854 derived decision threshold can be used for 1-D classification,
1932-4553/$25.00 © 2008 IEEE
AKHTAR et al.: SIGNAL PROCESSING IN SEQUENCE ANALYSIS: ADVANCES IN EUKARYOTIC GENE PREDICTION 311
Fig. 1. Eukaryotic DNA consists of genic and intergenic regions. Donor and acceptor sites (i.e., 5 and 3 ends of all introns) are used to splice exons on both
sides of an intron in a process known as splicing.
Fig. 2. DSP techniques are applied to DNA sequences following conversion into numerical signals, to extract features. The classification results can be used to
provide accurate detection of exonic end-points (acceptor/donor splice sites).
whereas multidimensional classification can be achieved using desirable properties of a DNA numerical representation in-
well-known pattern recognition tools such as Gaussian mixture clude: 1) each nucleotide has equal “weight” (e.g., magnitude),
models (GMMs) or support vector machines (SVMs). since there is no biological evidence to suggest that one is
To our knowledge, despite the signal processing research ac- more “important” than another; 2) distances between all pairs
tivity in this area, no comparisons with well-established existing of nucleotides should be equal, since there is no biological
data-driven methods for eukaryotic gene prediction (e.g., GEN- evidence to suggest that any pair is “closer” than another;
SCAN) are available. We address this shortcoming herein, in 3) representations should be compact, in particular, redundancy
addition to proposing newly developed DNA symbolic-to-nu- should be minimized; and 4) representations should allow
meric mappings and gene prediction features. It is our belief access to a range of mathematical analysis tools.
that a system combining improved DSP techniques with existing The binary or Voss representation [20] is currently the
data-driven methods could offer a level of gene prediction accu- most popular scheme, which maps the nucleotides , , ,
racy higher than that offered by existing data-driven methods. and into the four binary indicator sequences , ,
This paper is organized as follows. Section II reviews existing , and showing the presence (e.g., 1) or absence
methods for DNA numerical representation and DSP-based (e.g., 0) of the respective nucleotides. Both the -curve [21]
methods for gene and exon feature extraction from the DNA and tetrahedron [22] methods reduce the number of indicator
sequence. In Section III, newly developed DNA represen- sequences from four to three in a manner symmetric to all four
tations and gene prediction features are discussed. Selected components. The Voss and tetrahedron representations have
existing statistical approaches and a proposed DSP-statistical been shown to be equivalent representations for the purpose of
combination for acceptor splice site detection are presented in power spectra computation [28]. The complex representation
Section IV. Existing and newly proposed DNA representations, [18], [23] reflects some of the complementary features of the
DSP-based gene and exon prediction features, and acceptor nucleotides in its mathematical properties. As an alternative
splice site detection methods are then compared using standard to the typical complex representation of DNA nucleotides,
datasets and evaluation measures, as explained in Section V. certain complex weights for each of the four bases can also be
Evaluation results and discussion are given in Section VI. calculated and employed with the binary indicator sequences
[18]. In the quaternion representation of DNA symbols [24],
II. DNA NUMERICAL REPRESENTATION AND pure quaternions are assigned to each symbol. It has been
FEATURE EXTRACTION conjectured that the quaternion approach can improve DNA
pattern detection in the spectral domain through use of the
A. DNA Numerical Representations quaternionic Fourier transform [24]. In the EIIP (electron-ion
In recent years, a number of schemes have been introduced interaction potential) method [25], the electron-ion interaction
to map DNA nucleotides into numerical values. Some possible potential (related to the quasi-valence number) associated with
312 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 2, NO. 3, JUNE 2008
each nucleotide is used to map DNA character strings into , and ). The periodicity of 3 in exon regions of a
numerical sequences. The EIIP is just one example of a real DNA sequence suggests that the DFT coefficient corresponding
number representation. Another can be obtained by attaching to (where is chosen to be a multiple of 3) in each
digits to the four nucleotides: , , , DFT sequence should be large [18]. Note that the calculation
and [23]. However, this structure implies that purines of the DFT at a single frequency is sufficient, so
( or ) are in some respect “greater than” pyrimidines ( or that the Goertzel algorithm [37], which reduces the cost of
). Similarly, the representation , , , and single point DFT computation by almost a factor of two, can
suggests that and . This representation is be employed. Various DFT based spectral measures exploiting
an example of a Galois field assignment, upon which symbolic the period-3 behaviour of exons for the identification of these
Galois field operations are possible [26]. Another real-number regions have been proposed. The spectral content (SC) measure
representation maps , , , and [32] combines the individual DFTs (i.e., , , ,
, similar to the complementary property of the and ) to obtain a total Fourier magnitude spectrum of the
complex method. These assignments of real numbers to each DNA sequence, as follows:
of the four DNA characters do not necessarily reflect the
structure present in the original DNA sequences. Alternatively, (2)
to calculate weights representing the actual participation of
each symbol in the detected pattern, a linear transform and The GeneScan program [32], based on the SC measure, com-
optimization can be performed on the DNA sequences [29]. putes the signal-to-noise ratio of the peak at as
The internucleotide distance method [27] replaces each DNA , where represents a longer-term average of the
nucleotide with an integer representing the distance between spectral content defined in (2). Regions having are as-
the current nucleotide and the next similar nucleotide. sumed to be protein coding (exon) regions. The optimized SC
Each of the existing DNA representations offer different measure [18] assigns complex weights , , , and to each of
properties, and map the DNA sequences into between one and the four DFTs , , , and in (2). These
four numerical sequences. Many existing methods, such as weights are calculated using an optimization technique applied
Voss [20], -curve [21], and tetrahedron [22], map the DNA to the known genes of a given organism. However, one can also
sequence to three or four numerical sequences, potentially apply complex conjugate pairs and . The spec-
introducing redundancy in the representation. The assignment tral rotation (SR) measure [33] rotates four DFT vectors ,
of arbitrary numbers to each of the four DNA characters in , , and clockwise, each by an angle equiva-
EIIP [25] and other real number representations [23], [26] lent to the average phase angle value in coding regions, to make
does not necessarily reflect the structure present in the original all of them “point” in the same direction. The SR measure also
DNA sequence. Representations such as quaternions [24] are divides each term by the corresponding phase angle deviations
limited to specific mathematical analysis tools. For example, to give more weight to exonic distributions. The feature
a discrete quaternion Fourier transform (DQFT) [41] based
spectral analysis is required to detect certain DNA patterns.
Furthermore, existing DNA representations do not fully exploit (3)
the structural differences of protein coding and noncoding
regions to facilitate digital signal processing based gene and where and are the means and standard deviations of
exon prediction features. These issues are addressed in the the phase angle value in coding regions, has been used for the
DNA representations proposed in Section III-A. detection of exons, and was shown to give better performance
than the SC (2) measure at a 10% false positive gene detection
B. DSP-Based Features for Gene and Exon Prediction rate [33]. Note that all DFT-based techniques suffer from spec-
Periodicities of 3, 10.5, 200, and 400 have been reported in tral leakage, due to the finite-length analysis window, which in-
genomic sequences [30]. In exons, the occurrence of identical troduces small contributions from signal frequencies other than
nucleotides in identical codon positions is the basis for a period- those at the frequency .
icity of three interpretation in these regions [31]. The period-3 Autoregressive (AR) methods provide an alternative, more
behavior of exons has been widely used to identify these regions compact characterization of the signal spectrum. Particular ad-
using DSP-based methods, following conversion to numerical vantages of the AR technique are that it requires relatively few
sequences. base pairs to calculate the AR model (which is convenient if
The discrete Fourier transform (DFT), the most commonly the exon regions are short and/or closely spaced), and that it
used method for spectrum analysis of a finite-length numerical provides a compact estimate of the signal spectrum. It has been
sequence of length , is defined as [36] shown that AR spectral estimation using the Burg algorithm and
improved covariance analysis performs better than the DFT for
the detection of period-3 behaviour in short genomic sequences
(1)
[34]. However, the selection of the model order is crucial in this
approach, since choosing too low or too high will result in un-
Equation (1) can be used to calculate DFTs for numerical necessarily smoothed or spurious modeling of spectral peaks,
sequences representing DNA sequence portions, for example respectively. Note also that AR models cannot reasonably be ap-
each of the four binary indicator sequences (i.e., , , plied to binary indicator sequences, since these could not have
AKHTAR et al.: SIGNAL PROCESSING IN SEQUENCE ANALYSIS: ADVANCES IN EUKARYOTIC GENE PREDICTION 313
resulted from an AR process. A possible solution to the problem 2) Frequency of Nucleotide Occurrence: It has been shown
is bandpass filtering of numerical sequences before their appli- in [42] that the four DNA nucleotides differ considerably in
cation to AR modeling. their occurrence in exonic regions, and that the fractional oc-
In order to reduce the spectral leakage present in DFT-based currence of any particular nucleotide is reasonably consistent
exon prediction, a larger window size is required, which implies across the Burset/Guigo1996 [10], HMR195 [11], and GEN-
longer computation time and also compromises the base-do- SCAN learning [45] datasets considered therein. It has been
main resolution. The infinite impulse response (IIR) antinotch further observed that the frequency of nucleotide occurrence in
(AN) filter approach in [35] attempts to address these problems. exons is a key parameter for any DNA representation to be used
The magnitude response of the antinotch filter has a sharp peak for the detection of these regions. According to the frequency
at , which preserves only period-3 components. The of nucleotide mapping [42], nucleotides are represented by their
individual outputs of the four binary indicator sequences can fractional occurrences in exons of a training database.
be combined in a sum-of-squares manner. It has been shown in
[35] that digital filter based period-3 detection results are com- B. DSP-Based Features for Gene and Exon Prediction
parable to those of the DFT-based SC measure given by (2). 1) Paired and Weighted Spectral Rotation (PWSR) Measure:
The autocorrelation function (ACF) is a measure of how well The PWSR measure [46] incorporates a statistical property of
a signal matches a time-shifted version of itself, as a function of eukaryotic sequences, according to which introns are rich in the
the time shift. Practically, the ACF will produce a peak if sig- nucleotides “ ” and “ ” whereas exons are rich in nucleotides
nificant correlation exists at . Besides period-3 detection, “ ” and “ .” This information leads to an alternative property
DNA sequences have been widely analyzed for other correla- to the well-known period-3 behavior of exons. In this method,
tions in [20], [31], [38], [39]. Li [40] gives a critical review of the DNA sequences are first converted into two binary indicators,
the study of correlation structures. - and - . Using training data from DNA sequences
The identification of protein coding regions is difficult of the same organism, the means and standard deviations
mainly due to the noncontiguous and noncontinuous nature of of the distributions of DFT phase angle averaged over coding re-
eukaryotic genes. Despite the existence of many DSP-based gions, i.e., one phase angle value per coding region, are calcu-
approaches and also data-driven approaches, the accuracy of lated. Weights based on the frequency of occurrence of nu-
exon prediction is still limited and needs to be improved. More- cleotides “ or ” and “ or ” in coding regions of the training
over, the existing approaches only rely on the identification of data are also calculated. The expression given in (4) can then be
period-3 property of exons, and do not fully exploit efficient used as a feature, along one direction of the DNA sequence
DNA representations and other complementary features re-
quired to separate protein coding and noncoding nucleotides. -
- -
These problems have been addressed in new DSP-based gene -
and exon prediction features, proposed in Section III-B.
-
- - (4)
-
III. NEW DNA REPRESENTATIONS AND DSP-BASED FEATURES
FOR GENE AND EXON PREDICTION where denotes the forward and reverse directions of
DNA sequence, and - - are the mean
A. DNA Numerical Representations and standard deviation values obtained from distributions of the
1) Paired Numeric: The paired numeric representation for DFT phase angle averaged over coding regions of the training
gene and exon prediction [42] exploits one of the differential data, are frequency of occurrence weights from training
properties of exons and introns, according to which introns are data, and are the sliding DFT windows of two indicator
rich in nucleotides “ ” and “ ” whereas exons are rich in nu- sequences. The PWSR is calculated in both directions of the
cleotides “ ” and “ ” [43]. Furthermore, the DFT phase angle DNA sequence, and combined as
histogram distributions for coding and noncoding regions of (5)
human datasets have been shown to give smaller and nearly
equal values of angular mean for distributions of nucleotides Note that due to paired indicators, a DFT in the reverse di-
“ ” and “ ” than those of “ ” and “ ” [42]. To fully exploit rection of the same DNA strand is equivalent to a DFT on its
both of these properties, these nucleotides - - can be complementary strand.
paired in a complementary manner and values of and can 2) Paired Spectral Content (PSC) Measure: The PWSR
be used to denote - and - nucleotide pairs, respectively. A measure is a data-driven frequency domain method for gene and
similar approach was used by Datta and Asif [44]; however, no exon prediction, which requires training from DNA sequences of
motivation for the “ - ” and “ - ” pairing was given. “ - ” the same organism. A more general method, known as the paired
and “ - ” are complementary pairs, joining opposite strands spectral content (PSC) measure [47], first converts the DNA se-
of double helix DNA through hydrogen bonds. However, this is quence into single numerical sequences using the paired-numeric
not the reason for their pairing here, as only one strand is used representation, as discussed in Section III-A-I, then combines
for the computational analysis of DNA. This representation in- forward and backward DFTs on the same DNA sequence
corporates a very useful DNA structural property, in addition to
reducing complexity. (6)
314 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 2, NO. 3, JUNE 2008
where and are DFTs of the indicator sequence the largest singular values of all frames obtained using all four
in the forward and reverse directions. Contrary to the SR binary indicator sequences can then be used for the coding/non-
and PWSR measures, the PSC measure can be applied to the se- coding decision. SVD-based period-3 detection has also been
quences taken from any organism, i.e., PSC is not an organism- enhanced using bandpass filtering of the individual binary indi-
specific measure. cator sequences, emphasizing the period-3 behavior [52].
3) “Time-Domain” Algorithms: In the following approaches 5) Time-Frequency Hybrid (TFH) Measure: The time-fre-
[48], DNA sequences are first converted into Voss indicator se- quency hybrid (TFH) measure [46] combines magnitude and
quences, which are passed through a second-order resonant filter phase-based features, acknowledging earlier results by Kotlar
with a center frequency of (similar to [35]) before being and Lavner [33] showing that additionally considering the DFT
input to either algorithm. This prefiltering helps to remove spec- phase angle is more informative than the magnitude alone. Fea-
tral components at , , , which arise from tures from the time-domain AMDF method and frequency-do-
the application of correlation-based approaches to a binary in- main PWSR measure are normalized to the range [0, 1] and then
dicator sequence at a base-domain lag of 3. summed to produce a TFH feature.
Average magnitude difference function (AMDF)—the av- 6) Multidimensional Features: For all existing methods and
erage magnitude difference function (AMDF) has long been new methods discussed in Sections III-B1–V, features are com-
used in speech processing, and is defined for a discrete signal bined, typically using a sum-of-squares approach as in (2), to
as a function of the period as [49] produce a 1-D feature for comparison with some predetermined
threshold. Since a sum-of-squares approach will not necessarily
(7) produce optimal feature fusion, multidimensional features have
been recently proposed in [53] and [54]. One such scheme uses
the PWSR and AMDF features from the TFH measure in [46],
where is the window length. The AMDF, with , is an and transforms each separately using the DCT, to decorrelate
efficient time-domain algorithm for gene and exon prediction them with energy localized to the first few coefficients. A 5-D
[48]. Practically, the AMDF will produce a deep null if signifi- feature, comprising one transformed PWSR coefficient and all
cant correlation exists at period . four transformed AMDF coefficients, is then formed. In an-
Time domain periodogram (TDP)—the time domain peri- other such scheme, the linear predictor coefficients are treated as
odogram (TDP) is an algorithm used for periodicity detection multidimensional AR-based features, modeling coding and non-
in sunspots and pitch detection for speech processing [50]. Ac- coding regions in terms of their spectral characteristics within
cording to this algorithm, the -point data are first arranged in a the given window length. The optimized AR based feature, with
matrix, with rows containing subsequences of length equal model order 12 and window size 180 bp is then concatenated
to the period being tested, where is the window length. with the 5-D TFH feature, as shown in [54]. The resultant higher
The columns of the matrix are summed to obtain the TDP vector dimensional feature set is then used for training and testing of
of size , as follows: the multidimensional feature based classification system.
(11)
(13)
where is the probability of generating nucleotide at posi-
tion of the signal, which can be estimated from the positional
frequencies of nucleotides in the training sets of the true and is proposed to discriminate the true and false acceptor sites,
false acceptor site sequences. Therefore, the positive and nega- where and are, respectively, the upper and lower indices
tive probabilistic models correspond to learning using true and for the period-3 summations in the putative coding (denoted
false acceptor site sequences, respectively. The log of the ratio ) and noncoding (denoted ) regions. DSP-based methods
of the WMM generated under a positive model to the WMM are attractive because they mostly do not require any training
generated under a negative model can be used as a score to dis- on the genomic data before use, unlike the WMM, WAM, and
criminate true acceptor splice sites from false. WWAM approaches, and also because they are derived from
Weight array method (WAM)—the WAM explores and cap- different information from these approaches. Hence, we also
tures the dependencies between adjacent positions, in contrast to combine the DSP-based method with WAM to improve the dis-
the WMM, which considers each position independently [58], crimination power of acceptor splice site detection. Empirically,
[59]. In [57], probabilities of generating a signal of length we found the WAM model of the region , and DSP-
under positive and negative Weight Array Models of the pyrim- based method over the region to be op-
idine-rich acceptor region are computed as timum for the recognition of human acceptor splice sites.
(12) V. EVALUATION
In this section, the evaluations of different DNA representa-
where is the conditional probability of generating nu- tions, feature extraction methods, and acceptor splice site detec-
cleotide at position , given nucleotide at position . tion methods (reviewed in Sections II–IV) are described, using
This quantity can be calculated from the ratio of the frequency standard eukaryotic datasets.
of dinucleotides and at positions and , to the fre-
quency of the nucleotide at position . A. Data Sets
Windowed weight array method (WWAM)—the WWAM is a Table I summarizes the Burset/Guigo1996 [10], HMR195
second-order WAM model, in which nucleotides of the branch [11], and GENSCAN [45] standard datasets, used herein.
point region are generated conditional on the nu- During the original formation of these datasets, certain common
cleotides of the previous two positions [45]. In order to have rules were followed. For example, each sequence consists of
enough data to model these second-order conditional probabili- one complete gene starting with an “ATG” codon and ending
ties, data from a window of adjacent signal positions are pooled. with one of the three possible stop codons (TAA, TAG, or
Here, the second-order conditional probability at position is TGA). The protein coding genes do not have any in-frame stop
calculated as the average of the conditional probabilities at po- codons. Moreover, the multiexon genes have “GT” and “AG”
sitions , , , , and . The WWAM has been consensus dinucleotides for the donor and acceptor splice sites,
combined with the WAM over the region to compute respectively.
signal ratio scores for acceptor site recognition in GENSCAN The data sets referred to, respectively, as the GENSCAN
[6], [45]. learning and test sets comprise the 188 multiexon gene se-
quences listed in [45, Appendix A] and 64 available multiexon
B. Proposed DSP-Statistical Hybrid Approach gene sequences listed in [45, Appendix B]. The number of
Since the exon region starts from the next nucleotide to the true/false acceptor site sequences of GENSCAN learning
consensus dinucleotide “AG” of the true acceptor sites in the and test sets were 1031/156107 and 317/44301, respectively,
direction, the detection of the presence or absence of period-3 when extracted from windows of 140 nucleotides around each
behavior, as determined by signal processing-based methods, in consensus dinucleotide “AG.”
316 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 2, NO. 3, JUNE 2008
TABLE I
SUMMARY OF DATASETS
TABLE III
SUMMARY OF PERIOD-3 DETECTION RESULTS ON THREE DATASETS
TABLE IV
SUMMARY OF ACCEPTOR SPLICE SITE DETECTION
RESULTS USING GENSCAN TEST SET
VII. CONCLUSION
In summary, a number of digital signal processing-based
Fig. 9. ROC plot for acceptor site detection, using GENSCAN test set. methods for eukaryotic gene prediction have been proposed,
and these have been evaluated alongside many other DSP-based
methods. Firstly, DNA symbolic-to-numeric mappings were
false positive and percentage specificity at different levels of compared in terms of both computational complexity and
percentage sensitivity, for all methods, using the GENSCAN relative accuracy for the gene and exon prediction problem.
test set. The proposed DSP-statistical hybrid method clearly From these experiments, the recently proposed paired numeric
achieves a larger area under the ROC curve, consistently fewer representation was shown to give an improvement of 2% over
false positives and higher percent specificities compared with the Voss binary indicator sequences in terms of area under
all three existing methods. By comparison with the WWAM the ROC curve for gene and exon prediction, with 75% less
method used in gene-finding program GENSCAN [6], the downstream processing, when evaluated on the GENSCAN test
number of false positives across different sensitivity levels in set.
the proposed method shows an average relative improvement All 1-D output feature methods for gene and exon predic-
of 43%. tion were then evaluated on the standard genomic datasets
According to the results of subsection B, further gains Burset/Guigo1996, HMR195 and GENSCAN. In terms of
might be expected from using multidimensional feature-based gene and exon prediction accuracy, the recently proposed
methods; however, these require suitable training data for TFH, AMDF, TDP, SVD, PWSR, and PSC methods exhibited
estimating the GMM parameters and, hence, suffer similar relatively more accurate gene and exon prediction, improving
drawbacks to existing data-driven techniques in terms of re- on the well-known DFT based-SC measure by 4% to 9% in
quiring sufficient organism-specific training data. terms of area under the ROC curve. In light of the weaknesses
320 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 2, NO. 3, JUNE 2008
and strengths of the 1-D genomic period-3 detection methods, [11] S. Rogic, A. K. Mackworth, and B. F. Ouellette, “Evaluation of gene-
we recommend the AMDF and TFH for nondata-driven and finding programs on mammalian sequences,” Genome Res., vol. 11, no.
5, pp. 817–832, 2001.
data-driven gene detection, respectively. Furthermore, recently [12] V. Makarov, “Computer programs for eukaryotic gene prediction,”
proposed multidimensional output feature methods were shown Briefings Bioinf., vol. 3, no. 2, pp. 195–199, 2002.
to give improved gene and exon prediction over their 1-D [13] A. Nagar, S. Purushothaman, and H. Tawfik, “Evaluation and fuzzy
classification of gene finding programs on human genome sequences,”
counterparts. By comparison with the most accurate 1-D mea- FSKD, pp. 821–829, 2005.
sure, the multidimensional methods yielded improvements of [14] S. Logeswaran, E. Ambikairajah, and J. Epps, “A method for detecting
5% to 11% in terms of relative increase in exonic nucleotides short initial exons,” in Proc. IEEE Workshop Genomic Signal Pro-
cessing and Statistics, 2006, pp. 61–62.
detected at a 10% false positive rate, when evaluated on the [15] Y. Saeys, P. Rouze, and Y. V. de Peer, “In search of the short ones:
GENSCAN test set. Evaluations of all schemes herein have Improved prediction of short exons in vertebrates, plants, fungi and
been performed on large databases and using metrics calculated protists,” Bioinformatics, vol. 23, no. 4, pp. 414–420, 2007.
[16] K. Murakami and T. Takagi, “Gene recognition by combination of
at the nucleotide level, in contrast to much of the previous several gene-finding programs,” Bioinformatics, vol. 14, no. 8, pp.
literature on the topic. 665–675, 1998.
Finally, we have also proposed a new DSP-statistical hybrid [17] V. Pavlovic, A. Garg, and S. Kasif, “A Bayesian framework for com-
bining gene predictions,” Bioinformatics, vol. 18, no. 1, pp. 19–27,
technique for acceptor splice site detection. Results show that 2002.
DSP-based approaches to gene and exon prediction alone [18] D. Anastassiou, “Genomic signal processing,” IEEE Signal Process.
are unlikely to rival current data-driven techniques such as Mag., vol. 18, no. 4, pp. 8–20, Apr. 2001.
[19] X. Zhang, F. Chen, Y. Zhang, S. C. Agner, M. Akay, Z. Lu, M. M.
GENSCAN or AUGUSTUS. The proposed DSP-statistical Y. Waye, and S. K. Tsui, “Signal processing techniques in genomic
combination for the detection of acceptor splice sites, which engineering,” Proc. IEEE, vol. 90, no. 12, pp. 1822–1833, Dec. 2002.
achieves a performance improvement of 43% over WWAM [20] R. F. Voss, “Evolution of long-range fractal correlations and 1/f noise in
DNA base sequences,” Phy. Rev. Lett., vol. 68, no. 25, pp. 3805–3808,
(used in GENSCAN), is illustrative of the potential DSP-based 1992.
techniques still offer in terms of improving the state of the art. [21] R. Zhang and C. T. Zhang, “Z curves, an intuitive tool for visualizing
Future directions may include more accurate identification of and analyzing the DNA sequences,” J. Biomol. Struct. Dyn., vol. 11,
no. 4, pp. 767–782, 1994.
exonic/intronic end-point signals (i.e., start codon, donor splice [22] B. D. Silverman and R. Linsker, “A measure of DNA periodicity,” J.
site, acceptor splice site, and stop codons) using multidimen- Theor. Biol., vol. 118, pp. 295–300, 1986.
sional DSP-based features, and combining signal processing [23] P. D. Cristea, “Genetic signal representation and analysis,” in Proc.
SPIE Inf. Conf. Biomedical Optics Symp., 2002, vol. 4623, pp. 77–84.
based work with data-driven methods to advance the state of [24] A. K. Brodzik and O. Peters, “Symbol-balanced quaternionic period-
the art in eukaryotic gene prediction algorithms. icity transform for latent pattern detection in DNA sequences,” in Proc.
IEEE ICASSP, 2005, vol. 5, pp. v/373–v/376.
[25] J. Ning, C. N. Moore, and J. C. Nelson, “Preliminary wavelet analysis
ACKNOWLEDGMENT of genomic sequences,” in Proc. IEEE Bioinformatics Conf., 2003, pp.
509–510.
The authors would like to thank the anonymous reviewers [26] G. L. Rosen, “Signal processing for biologically-inspired gradient
source localization and DNA sequence analysis,” Ph.D. dissertation,
and editor whose helpful suggestions resulted in substantial im- Georgia Inst. Technol., Atlanta, 2006.
provement of this paper. [27] A. S. S. Nair and T. Mahalakshmi, “Visualization of genomic data
using inter-nucleotide distance signals,” presented at the IEEE Int.
Conf. Genomic Signal Processing, 2005.
REFERENCES [28] E. Coward, “Equivalence of two Fourier methods for biological se-
quences,” J. Math. Biol., vol. 36, pp. 64–70, 1997.
[1] M. Stanke, R. Steinkamp, S. Waack, and B. Morgenstern, “AU- [29] W. Wang and D. H. Johnson, “Computing linear transforms of sym-
GUSTUS: A web server for gene finding in eukaryotes,” Nucl. Acids bolic signals,” IEEE Trans. Signal Process., vol. 50, no. 3, pp. 628–634,
Res., Web Server Issue, vol. 32, pp. W309–W312, 2004. Mar. 2002.
[2] V. V. Solovyev, A. A. Salamov, and C. B. Lawrence, “Identification of [30] E. N. Trifonov, “3-, 10.5-, 200- and 400-base periodicities in genome
human gene structure using linear discriminant functions and dynamic sequences,” Phys. A, vol. 249, pp. 511–516, 1998.
programming,” in Proc. 3rd Int. Conf. Intelligent Systems for Molec- [31] J. W. Fickett, “Recognition of protein coding regions in DNA se-
ular Biology, 1995, pp. 367–375. quences,” Nucl. Acids Res., vol. 10, pp. 5303–5318, 1982.
[3] G. Parra, E. Blanco, and R. Guigo, “GeneID in drosophila,” Genome [32] S. Tiwari, S. Ramaswamy, A. Bhattacharya, S. Bhattacharya, and R.
Res., vol. 10, no. 4, pp. 511–515, 2000. Ramaswamy, “Prediction of probable genes by Fourier analysis of ge-
[4] A. V. Lukashin and M. Borodovsky, “GeneMark.hmm: New solutions nomic sequences,” Comput. Appl. Biosci., vol. 13, pp. 263–270, 1997.
for gene finding,” Nucl. Acids Res., vol. 26, no. 4, pp. 1107–1115, 1998. [33] D. Kotlar and Y. Lavner, “Gene prediction by spectral rotation mea-
[5] D. Kulp, D. Haussler, M. G. Reese, and F. H. Eeckman, “A generalized sure: A new method for identifying protein-coding regions,” Genome
hidden Markov model for the recognition of human genes in DNA,” in Res., vol. 18, pp. 1930–1937, 2003.
Proc. 4th Int. Conf. Intelligent Systems for Molecular Biology, 1996, [34] N. Rao and S. J. Shepherd, “Detection of 3-periodicity for small ge-
pp. 134–142. nomic sequences based on AR techniques,” in Proc. IEEE Int. Conf.
[6] C. Burge and S. Karlin, “Prediction of complete gene structure in Comm., Circuits Syst., 2004, vol. 2, pp. 1032–1036.
human genomic DNA,” J. Mol. Biol., vol. 268, no. 1, pp. 78–94, 1997. [35] P. P. Vaidyanathan and B.-J. Yoon, “Gene and exon prediction using
[7] A. Krogh, “Two methods for improving performance of an HMM and allpass-based filters,” presented at the IEEE Workshop Genomic Signal
their applications for gene-finding,” in Proc. 5th Int. Conf. Intelligent Processing and Statistics, Raleigh, NC, 2002.
Systems for Molecular Biology, 1997, pp. 179–186. [36] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach,
[8] S. Salzberg, A. L. Delcher, K. H. Fasman, and J. Henderson, “A deci- 2nd ed. Singapore: McGraw-Hill, 2002.
sion tree system for finding genes in DNA,” J. Comput. Biol., vol. 5, [37] G. Goertzel, “An algorithm for the evaluation of finite trigonometric
no. 4, pp. 667–680, 1998. series,” Amer. Math. Monthly, vol. 65, no. 1, pp. 34–35, 1958.
[9] M. Q. Zhang, “Identification of protein coding regions in the human [38] W. Li and T. G. Marr, “Understanding long-range correlations in DNA
genome by quadratic discriminant analysis,” Proc. Nat. Acad. Sci., vol. sequences,” Phys. D, vol. 75, pp. 392–416, 1994.
94, no. 2, pp. 565–568, 1997. [39] H. Herzel and I. Große, “Correlations in DNA sequences: The role of
[10] M. Burset and R. Guigo, “Evaluation of gene structure prediction pro- protein coding segments,” Phys. Rev. E, vol. 55, no. 1, pp. 800–810,
grams,” Genomics, vol. 34, pp. 353–367, 1996. 1997.
AKHTAR et al.: SIGNAL PROCESSING IN SEQUENCE ANALYSIS: ADVANCES IN EUKARYOTIC GENE PREDICTION 321
[40] W. Li, “The study of correlation structure of DNA sequences: A critical [62] M. Akhtar, E. Ambikairajah, and J. Epps, “Optimizing period-3
review,” Comput. Chem., vol. 21, no. 4, pp. 257–271, 1997. methods for eukaryotic gene prediction,” in Proc. IEEE ICASSP,
[41] S. J. Sangwine, “The discrete quaternion Fourier transform,” in Proc. 2008, pp. 621–624.
6th Int. Conf. Image Processing and its Applications, 1997, vol. 2, pp.
790–793.
[42] M. Akhtar, J. Epps, and E. Ambikairajah, “On DNA numerical repre-
sentations for period-3 based exon prediction,” presented at the IEEE Mahmood Akhtar received the B.Sc. (Honors)
Workshop on Genomic Signal Processing and Statistics, Tuusula, Fin- degree in electrical engineering from the University
land, 2007. of Engineering and Technology (UET), Lahore,
[43] P. D. Cristea, “Conversion of nucleotides sequences into genomic sig- Pakistan, in 2003, and the M.S. degree in computer
nals,” J. Cell. Mol. Med., vol. 6, no. 2, pp. 279–303, 2002. engineering from the National University of Sciences
[44] S. Datta and A. Asif, “A fast DFT based gene prediction algorithm for and Technology (NUST), Pakistan, in 2005. He is
identification of protein coding regions,” in Proc. IEEE ICASSP, 2005, currently pursuing the Ph.D. degree in the School
vol. 5, pp. 653–656. of Electrical Engineering and Telecommunications,
[45] C. Burge, “Identification of genes in human genomic DNA,” Ph.D. dis- University of New South Wales, Australia.
sertation, Stanford Univ., Stanford, CA, 1997. His main research interest is in genomic signal pro-
[46] M. Akhtar and J. E. E. Ambikairajah, “Time and frequency domain cessing with specific focus on DNA representations,
methods for gene and exon prediction in eukaryotes,” in Proc. IEEE feature extractions, and classifications of genomic protein coding and noncoding
ICASSP, 2007, pp. 573–576. regions. Other research interests lie in the areas of public health bio-surveillance
[47] M. Akhtar, J. Epps, and E. Ambikairajah, “Paired spectral content mea- modeling, speech, audio, and image processing. He has authored or coauthored
sure for gene and exon prediction in eukaryotes,” in Proc. IEEE Int. around 16 publications.
Conf. Information and Emerging Technologies, 2007, pp. 127–130.
[48] E. Ambikairajah, J. Epps, and M. Akhtar, “Gene and exon prediction
using time-domain algorithms,” in Proc. IEEE 8th Int. Symp. Signal
Processing and its Applications, 2005, pp. 199–202.
[49] M. Ross, H. Shaffer, A. Cohen, R. Freudberg, and H. Manley, “Average Julien Epps (M’99) received the B.E. and Ph.D. de-
magnitude difference function pitch extractor,” IEEE Trans. Acoustics, grees from the University of New South Wales, Aus-
Speech, Signal Process., vol. ASSP-22, no. 5, pp. 353–362, May 1974. tralia, in 1997 and 2001, respectively.
[50] E. Ambikairajah and M. J. Carey, “The time-domain periodogram al- After an appointment as a Postdoctoral Fellow at
gorithm,” Signal Process., vol. 5, pp. 491–513, 1983. the University of New South Wales, he worked on
[51] P. P. Kanjilal, J. Bhattacharya, and G. Saha, “Robust method for pe- speech recognition and language processing research
riodicity detection and characterization of irregular cyclical series in as a Research Engineer at Motorola Labs and then
terms of embedded periodic components,” Phys. Rev. E, vol. 59, no. 4, as a Senior Researcher at National ICT Australia. He
pp. 4013–4025, 1999. joined the UNSW School of Electrical and Telecom-
[52] M. Akhtar, E. Ambikairajah, and J. Epps, “Detection of period-3 be- munications as a Senior Lecturer in 2007. He has au-
havior in genomic sequences using singular value decomposition,” in thored or coauthored around 80 publications and has
Proc. IEEE Int. Conf. Emerging Technologies, 2005, pp. 13–17. served as a reviewer for several IEEE, IET, and other journals and numerous
[53] M. Akhtar, E. Ambikairajah, and J. Epps, “GMM-based classification conferences. His research interests include speaker verification, speech recog-
of genomic sequences,” in Proc. IEEE 15th Int. Conf. Digital Signal nition, speech and audio coding, auditory modeling, speech enhancement, and
Processing, 2007, pp. 103–106. genomic signal processing.
[54] M. Akhtar, E. Ambikairajah, and J. Epps, “Comprehensive autoregres-
sive modeling for classification of genomic sequences,” presented at the
IEEE 6th Int. Conf. Information, Communications, Signal Processing,
2007. Eliathamby Ambikairajah (M’90) received the
[55] P. Pollastro and S. Rampone, “ HS D : Homo sapiens splice site data Ph.D. degree from Keele University, U.K.
set,” Nucl. Acids Res. Annu. Database Issue, 2003. He was appointed as Head of Electronic En-
[56] R. Staden, “Computer methods to locate signals in nucleic acid se- gineering and later Dean of Engineering at the
quences,” Nucl. Acids Res., vol. 12, pp. 505–519, 1984. Athlone Institute of Technology, Ireland. He was
[57] C. Burge, “Modeling dependencies in pre-mRNA splicing signals,” an invited Research Fellow with British Telecom
in Computational Methods in Molecular Biology, S. L. Salzberg, D. Laboratories (BTL), Martlesham Heath, U.K., for
B. Searls, and S. Kasif, Eds. New York: Elsevier, 1998, ch. 8, pp. ten years (1989–1999). He joined the University
129–164. of New South Wales, Australia, in 1999, where
[58] M. Q. Zhang and T. G. Marr, “A weight array method for splicing he is currently the Deputy Head of School and
signal analysis,” CABIOS, vol. 9, no. 5, pp. 499–509, 1993. the Director of Academic Studies in the School of
[59] S. L. Salzberg, “A method for identifying splice sites and translational Electrical Engineering and Telecommunications. His research interests include
start sites in eukaryotic mRNA,” Comput. Appl. Biosci., vol. 13, no. 4, speech and audio compression, speech enhancement and recognition, biometric
pp. 365–376, 1997. technology, and biomedical signal processing. He has authored and coauthored
[60] T. Fawcett, ROC Graphs: Notes and Practical Considerations around 150 conference and journal papers and is also a regular reviewer for
for Researchers HP Laboratories, 2003 [Online]. Available: several IEEE, IET, and other journals and conferences.
http://www.hpl.hp.com/personal/Tom Fawcett/papers/ROC101 Prof. Ambikairajah received the Vice-Chancellor’s Award for Teaching Ex-
[61] A. Rushdi and J. Tuqan, “Gene identification using the Z -curve repre- cellence in April 2004 for his innovative use of educational technology. He is
sentation,” in Proc. IEEE ICASSP, 2006, vol. 2, pp. 1024–1027. currently a Fellow and a Chartered Engineer of IET (UK) and EA (Australia).