Papers by Katerina Perdikuri
One of the most important goals in computational molecular biology is locating repeated patterns in nucleic or protein sequences, and identifying structural or functional motifs that are common to a set of such sequences. Although the problem of computing the repetitions in biological sequences has been extensively studied in the relevant literature, the problem of computing the repetitions in biological weighted sequences has not been efficiently solved. In this work we present an O(n^2) algorithm for computing the set of repetitions in a biological weighted sequence with probability of appearance larger than 1/k, where k is a given constant. Our algorithm can be applied to the detection of repeated patterns in biological weighted sequences such as assembled DNA sequences.
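To make the 1/k threshold concrete, here is a minimal brute-force sketch, not the O(n^2) algorithm of the paper. It assumes a weighted sequence is given as a list of per-position symbol distributions, that the probability of a word at a position is the product of its symbol probabilities, and that a "repeated pattern" is loosely any word that exceeds the threshold at two or more start positions; the paper's formal definition of a repetition may be stricter. The names `valid_factors` and `repeated_words` are hypothetical.

```python
from collections import defaultdict

def valid_factors(wseq, k):
    """Map each word to the start positions where its probability exceeds 1/k.

    wseq is a list of dicts {symbol: probability}, one dict per position
    (an assumed representation of a weighted sequence).
    """
    threshold = 1.0 / k
    occurrences = defaultdict(list)
    n = len(wseq)
    for start in range(n):
        # frontier holds (word, probability) pairs, extended one position at a time
        frontier = [("", 1.0)]
        for pos in range(start, n):
            extended = []
            for word, prob in frontier:
                for symbol, p in wseq[pos].items():
                    q = prob * p
                    if q > threshold:
                        extended.append((word + symbol, q))
            if not extended:
                break  # no word starting here survives the threshold any longer
            for word, _ in extended:
                occurrences[word].append(start)
            frontier = extended
    return occurrences

def repeated_words(wseq, k):
    """Keep only the words that are valid at two or more start positions."""
    return {w: pos for w, pos in valid_factors(wseq, k).items() if len(pos) >= 2}

# Position 1 is ambiguous: C or G, each with probability 0.5.
example = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"A": 1.0}, {"C": 1.0}]
print(repeated_words(example, 4))  # {'A': [0, 2], 'AC': [0, 2], 'C': [1, 3]}
```

For a fixed start position and length, the word probabilities sum to at most 1, so fewer than k words can exceed 1/k; this keeps the frontier small even in the naive version.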
... Iliopoulos, Costas S. and Makris, C. and Panagis, I. and Perdikuri, Katerina and Theodoridis, E. and Tsakalidis, A. (2003) Computing the Repetitions in a Weighted Sequence using Weighted Suffix Trees. In: European Conference on Computational Biology. ...
In this chapter we deal with various string manipulation problems which originate from the fields of computational biology and musicology. These problems are: "approximate string matching with gaps", "inference of maximal pairs in a set of strings" and "handling of weighted sequences". We provide new upper bounds for solving these problems, and for the third we propose a novel data structure for the representation of weighted sequences, which inherits most of the properties of the suffix tree.
Algorithms for computing typical regularities in strings with don't care symbols are presented. The period of a string of length n over an alphabet Σ can be computed in O(n log n log |Σ|) worst-case time. The computation of all possible borders, the border array and all covers of a string requires quadratic time in the worst case, but in practice the algorithms perform very well. The expected running time is linear.
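As an illustration of what a border means in this setting, the following is a naive quadratic check, not the algorithms of the paper. It assumes the don't care symbol is written `*` and that two symbols match when they are equal or when either one is a don't care; `DONT_CARE`, `symbols_match` and `borders` are hypothetical names.

```python
DONT_CARE = "*"  # assumed notation for the don't care symbol

def symbols_match(a, b):
    """Two symbols match if they are equal or either is a don't care."""
    return a == b or a == DONT_CARE or b == DONT_CARE

def borders(x):
    """Return every length L (0 < L < len(x)) whose prefix matches the suffix of length L."""
    n = len(x)
    return [
        length
        for length in range(1, n)
        if all(symbols_match(x[i], x[n - length + i]) for i in range(length))
    ]

print(borders("abab"))  # [2]
print(borders("a*ab"))  # [2]
print(borders("a*a"))   # [1, 2]
```

Each candidate length is checked symbol by symbol, which gives quadratic time in the worst case. Because matching with don't cares is not transitive, the usual shortcut of deriving longer borders from shorter ones does not carry over directly.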
Digital Signal Processing techniques constitute the basic scientific approach used in most of the current advances in medicine. In particular, the development of algorithms to extract, predict and model raw biomedical data series has revolutionized many routine, but data-intensive, areas of current medical practice. In this contribution, we present an evolutionary technique for modelling and analysing Non-linear Time Series (NLTS). The proposed methodology has already been used in two cases of great biomedical importance, and we therefore explore its effectiveness on other biomedical signals.
Fundamenta Informaticae
In this paper we introduce the Weighted Suffix Tree, an efficient data structure for computing string regularities in weighted sequences of molecular data. Molecular Weighted Sequences can model important biological processes such as the DNA Assembly Process or the DNA-Protein Binding Process. Thus pattern matching or identification of repeated patterns in biological weighted sequences is a very important procedure in the translation of gene expression and regulation. We present time and space efficient algorithms for constructing the weighted suffix tree and some applications of the proposed data structure to problems taken from the Molecular Biology area such as pattern matching, repeats discovery, discovery of the longest common subsequence of two weighted sequences and computation of covers.
Proceedings of the eleventh international conference on Information and knowledge management - CIKM '02, 2002
Today, business, scientific and personal databases are growing at an exponential rate. However, what is truly valuable is the knowledge that can be extracted from the stored data. Knowledge Discovery in patent databases was traditionally based on manual analysis carried out by statistical experts. Nowadays the increasing interest of many actors has led to the development of
Fun with Algorithms, 2000
In this paper we develop new and efficient algorithms for the problems of pattern matching and identification of repeated patterns in biological weighted sequences. Biological Weighted Sequences can model important biological processes such as the DNA Assembly Process or the DNA-Protein Binding Process. Thus, pattern matching or identification of repeated patterns in biological weighted sequences is a very important
Lecture Notes in Computer Science, 2004
In this paper we present three algorithms. The first extracts repeated motifs from a weighted sequence. The motifs correspond to words which occur at least q times and with Hamming distance e in a weighted sequence with probability ≥ 1/k each time, where k is a small constant. The second algorithm extracts common motifs from a set of N ≥ 2 weighted sequences with Hamming distance e. In the second case, the motifs must occur twice with probability ≥ 1/k, in 1 ≤ q ≤ N distinct sequences of the set. The third algorithm extracts maximal pairs from a weighted sequence. A pair in a sequence is the occurrence of the same substring twice. In addition, the algorithms presented in this paper improve slightly on previous work on these problems.
Indexing video content is one of the most important problems in video databases. In this paper we present a simple optimal algorithm for this problem that answers certain content queries invoking video functions in linear time and space in terms of the number of objects appearing in the video. To accomplish this, we make a straightforward reduction of this
In this chapter we deal with various string manipulation problems which originate from the fields of computational biology and musicology. These problems are: "approximate string matching with gaps", "inference of maximal pairs in a set of strings" and "handling of weighted sequences". We provide new upper bounds for solving these problems and for the third we propose a novel
IFIP International Federation for Information Processing, 2004
In this paper we introduce the Weighted Suffix Tree, an efficient data structure for computing string regularities in weighted sequences of molecular data. Molecular Weighted Sequences can model important biological processes such as the DNA Assembly Process or the DNA-Protein Binding Process. Thus pattern matching or identification of repeated patterns in biological weighted sequences is a very important procedure in the translation of gene expression and regulation. We present time and space efficient algorithms for constructing the weighted suffix tree and some applications of the proposed data structure to problems taken from the Molecular Biology area such as pattern matching, repeats discovery, discovery of the longest common subsequence of two weighted sequences and computation of covers.
Journal of Discrete Algorithms, 2007
In this paper we present three algorithms for the Motif Identification Problem in Biological Weighted Sequences. The first algorithm extracts repeated motifs from a biological weighted sequence. The motifs correspond to repetitive words which are approximately equal, under a Hamming distance, with probability of occurrence ≥ 1/k, where k is a small constant. The second algorithm extracts common motifs from a set of N ≥ 2 weighted sequences. In this case, the motifs consist of words that must occur with probability ≥ 1/k, in 1 ≤ q ≤ N distinct sequences of the set. The third algorithm extracts maximal pairs from a biological weighted sequence. A pair in a sequence is the occurrence of the same word twice. In addition, the algorithms presented in this paper improve previous work on these problems.
Journal of Computational Biology, 2006
Biological Weighted Sequences are used extensively in Molecular Biology as profiles for protein families, in the representation of binding sites and often for the representation of sequences produced by a shotgun sequencing strategy. In this paper we address three fundamental problems in the area of Biological Weighted Sequences: i) Computation of Repetitions, ii) Pattern Matching and iii) Computation of Regularities. To the best of our knowledge, this is the first time these problems are tackled in the relevant literature. Our algorithms can be used as basic building blocks for more sophisticated algorithms applied to weighted sequences. Preliminary versions of the results in this paper were presented at the conferences Fun with Algorithms [Iliopoulos et al. 2004b], CompBionets [Christodoulakis et al. 2004a] and ICCMSE [Christodoulakis et al. 2004b].
International Conference on Information Technology and Applications, 2005
In this paper we present algorithms for the localization and extraction of interesting motifs from biological sequences. We are especially interested in weighted sequences, which are extensively used in molecular biology as profiles for protein families and for the representation of binding sites. It is our belief that these algorithms can also be applied to other information technology applications such