Approximate string matching Research Papers

We improve the fastest known algorithm for approximate string matching, which can be used only for low error levels. By using a new method to verify potential matches and a new optimization technique for biased texts (such as English),... more

- by Ricardo Baeza-yates
  Engineering, Information Retrieval, IPL, Mathematical Sciences

In this article, a word-oriented approximate string matching approach for searching Arabic text is presented. The distance between a pair of words is determined on the basis of aligning the two words by using occurrence heuristic tables.... more

- by Antonello Rizzi
  Computer Science, FPGA, Dynamic programming, Cognitive Computation

This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine... more

The web has become a resourceful tool for almost all domains today. Search engines prominently use the inverted indexing technique to locate the web pages containing the user's query. The performance of an inverted index fundamentally depends upon... more

This paper focuses on the problem of alias detection based on orthographic variations of Arabic names. Alias detection is the process of identifying different variants of the same name. To detect aliases based on orthographic variations,... more

The δ-approximate string matching problem, recently introduced in connection with applications to music retrieval, is a generalization of the exact string matching problem for alphabets of integer numbers. In the δ-approximate variant,... more

A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of... more

- by Arlindo Oliveira
  Engineering, Algorithms, Data Structure, Mathematical Sciences

We present new algorithms for approximate string matching based on simple but efficient ideas. First, we present an algorithm for string matching with mismatches based on arithmetical operations that runs in linear worst-case time for... more

- by Ricardo Baeza-yates
  Engineering, Algorithms, Mathematical Sciences, Pattern Matching

Approximate string matching is an important operation in information systems because an input string is often an inexact match to the strings already stored. Commonly known accurate methods are computationally expensive as they compare... more

- by Olumide Owolabi
  Dictionary, Approximate string matching, N gram

Given two strings, a pattern P of length m and a text T of length n over some alphabet Σ, we consider the string matching problem under k mismatches. The well-known Shift-Add algorithm (Baeza-Yates and Gonnet, 1992) solves the problem in... more

- by Szymon Grabowski
  Engineering, Algorithms, Algorithm, Information Processing
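
As a point of reference for the k-mismatches problem stated in the abstract above, the following is a minimal brute-force sketch. It is not the Shift-Add algorithm and not the method of the paper; the function name `k_mismatch_search` is a hypothetical name chosen for illustration. It simply counts mismatching positions in every text window of length m and gives up on a window as soon as the count exceeds k.

```python
def k_mismatch_search(pattern: str, text: str, k: int) -> list[int]:
    """Return all start positions where pattern matches text with at most k mismatches.

    Brute-force baseline (hypothetical helper, O(mn) worst case).
    """
    m, n = len(pattern), len(text)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for i in range(m):
            if pattern[i] != text[start + i]:
                mismatches += 1
                if mismatches > k:
                    break  # this window already has too many mismatches
        else:
            # The inner loop finished without break: at most k mismatches.
            hits.append(start)
    return hits


print(k_mismatch_search("abc", "abdabcxbc", 1))
```

With k = 1 the windows "abd", "abc" and "xbc" qualify, so the call reports the start positions 0, 3 and 6.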

We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches which are δ-approximate in the sense of the distance measured as maximum difference between symbols. The... more
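
The matching condition described above (the maximum difference between aligned symbols must stay within a bound) can be illustrated with a minimal sketch. It assumes that symbols are integers, for example pitch values, and that the bound is a parameter called `delta`; the function name `delta_match` is hypothetical. It is a naive check of every window, not one of the algorithms of the paper.

```python
def delta_match(pattern: list[int], text: list[int], delta: int) -> list[int]:
    """Return start positions s where max |pattern[i] - text[s + i]| <= delta.

    Naive window-by-window check (hypothetical helper for illustration).
    """
    m = len(pattern)
    return [
        s
        for s in range(len(text) - m + 1)
        if all(abs(p - t) <= delta for p, t in zip(pattern, text[s:s + m]))
    ]


# Integer symbols standing in for pitches (illustrative values only).
print(delta_match([60, 62, 64], [59, 62, 65, 60, 61, 64, 70], 1))
```

Here the windows starting at positions 0 and 3 differ from the pattern by at most 1 at every position, so both are reported.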

Indexing Methods for Approximate String Matching. Gonzalo Navarro, Ricardo Baeza-Yates, Erkki Sutinen, Jorma Tarhio. Abstract: Indexing for approximate text searching is a novel problem that has received significant attention because of... more

In this chapter we deal with various string manipulation problems that originate from the fields of computational biology and musicology. These problems are: "approximate string matching with gaps", "inference of maximal... more

- by Katerina Perdikuri
  Computational Biology, Data Structure, Suffix Tree, Upper Bound

This presentation looks at how spammers modify content in spam, for example by deliberately misspelling words like Viagra to get through spam filters. We use the dynamic programming algorithm to identify variations of these words.... more

The Boyer-Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length... more

- by Esko Ukkonen
  Pure Mathematics, Experimental Evaluation, String Matching, Edit Distance

The newest generation of sequencing instruments, such as Illumina/Solexa Genome Analyzer and ABI SOLiD, can generate hundreds of millions of short DNA "reads" from a single run. These reads must be matched against a reference genome to... more

We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999] searches for a pattern of length m in a text of length n permitting k differences in... more

- by Heikki Hyyrö
  Bioinformatics, Distance, Natural language, Search Algorithm

- by Filipo Mignosi
  Mathematics, Computer Science, Modeling, Data Structure

A compressed full-text self-index for a text T is a data structure that requires reduced space and is able to search for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest in, for example, computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has competitive performance and provides a useful space-time tradeoff compared to classical indexes.

1 Introduction and Related Work

Approximate string matching (ASM) is an important problem that arises in applications related to text searching, pattern recognition, signal processing, and computational biology, to name a few. It consists of locating all the occurrences of a given pattern string P[0, m − 1] in a larger text string T[0, u − 1], letting the occurrences be at distance ed() at most k from P. In this paper we focus on edit distance, that is, the minimum number of insertions, deletions, and substitutions of single characters needed to convert one string into the other. The classical sequential search solution runs in O(um) worst-case time (see [1]). An optimal average-case algorithm requires time O(u(k + log_σ m)/m) [2, 3], where σ is the size of the alphabet Σ.

Those good average-case algorithms are called filtration algorithms: they traverse the text fast while checking for a simple necessary condition, and only when this holds do they verify the text area using a

- by Arlindo Oliveira
  Mathematics, Computer Science, Computational Biology, Space Time
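
The sequential O(um) search mentioned in the introduction above can be sketched with the classical column-by-column dynamic programming, assuming unit-cost edit distance as defined there. This is an illustrative sketch only, not the indexed algorithm of the paper; the function name `approx_search` and the choice to report end positions are assumptions made for the example.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return end positions j (0-based, inclusive) such that some substring of
    text ending at j is within edit distance k of pattern.

    Column-wise dynamic programming, O(len(text) * len(pattern)) time.
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring of
    # the text ending at the current position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value of col[i - 1] in the previous column
        col[0] = 0          # an occurrence may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            hits.append(j)
    return hits


print(approx_search("abc", "xxabcxx", 1))
```

For the pattern "abc" and k = 1, the text "xxabcxx" yields end positions 3, 4 and 5, corresponding to the substrings "ab", "abc" and "abcx".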

We introduce a problem called Maximum Common Characters in Blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We... more

- by Alysson Costa
  Linear Programming, Integer Programming, Multidisciplinary, Unification

We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are the representation of each word in the... more

- by Avi Shmidman and Moshe Koppel
  Approximate string matching

In this paper we consider several new versions of approximate string matching with gaps. The main characteristic of these new versions is the existence of gaps in the matching of a given pattern in a text. Algorithms are devised for each... more

We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by... more

- by Jani Tanninen
  Applied Mathematics, Load Balance, Indexation, Discrete Algorithms
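
The index construction described in the abstract above (collecting disjoint substrings of length q at fixed intervals and storing their positions) can be sketched as follows. Only the construction step is shown, since the search-time filtering is cut off in the abstract. The function name `build_q_samples_index`, the sampling step parameter `h`, and the dictionary layout are assumptions for illustration, not the data structure of the paper.

```python
from collections import defaultdict


def build_q_samples_index(text: str, q: int, h: int) -> dict[str, list[int]]:
    """Map each sampled q-gram to the list of text positions where it was sampled.

    Samples start at positions 0, h, 2h, ...; h >= q keeps them disjoint.
    """
    if q <= 0 or h < q:
        raise ValueError("need q > 0 and h >= q so that samples do not overlap")
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(0, len(text) - q + 1, h):
        index[text[pos:pos + q]].append(pos)
    return dict(index)


print(build_q_samples_index("abracadabra", q=2, h=3))
```

For "abracadabra" with q = 2 and h = 3, the samples are taken at positions 0, 3, 6 and 9, giving the q-grams "ab", "ac", "da" and "ra".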

In this paper we describe a factorial language, denoted by L(S, k, r), that contains all words that occur in a string S up to k mismatches every r symbols. Then we give some combinatorial properties of a parameter, called repetition... more

- by C. Epifanio
  Data Structure, Combinatorics on Words, Formal language, Indexation

Distinctive visual cues are of central importance for image retrieval applications, in particular, in the context of visual location recognition. While in indoor environments typically only few distinctive features can be found, outdoors... more

Indexing Methods for Approximate String Matching. Gonzalo Navarro, Ricardo Baeza-Yates, Erkki Sutinen, Jorma Tarhio. Abstract: Indexing for approximate text searching is a novel problem that has received significant attention because of... more

- by E. Sutinen and Ricardo Baeza-yates
  Signal Processing, Computational Biology, Indexation, Approximate string matching

Treating electronic ink as first-class data, as opposed to simply a substitute for keyboard input, offers intriguing possibilities. The pen has well-known advantages in terms of portability and user acceptance, and ink is an extremely... more

This study focuses on the intellectual accessibility of information in indigenous languages, using Zulu, one of the main indigenous languages in South Africa, as a test case. Both Cross-Lingual Information Retrieval (CLIR) and metadata... more

In this paper we consider the confidentiality aspects of particular Grid applications, such as genetic applications. The search for DNA similarities is one of the interesting areas of genetic biology. However, DNA sequences... more

- by Yves Roggeman
  Genetics, String Matching, Edit Distance, Grid System

Searching a large data set for the strings that are most similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are... more

- by I. Bartolini
  Computer Science, Data Mining, Data Analysis, Pattern Recognition

The k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (insertions, deletions, substitutions) allowed in a match, and asks for every location in... more

- by William I Chang
  Genetics, Computer Science, Molecular Biology, Pattern Recognition

Here, we present PatMatch, an efficient, web-based pattern-matching program that enables searches for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences. The... more

- by eeee eeee
  Genetics, Computer Science, Biology, Medicine

This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of... more

Approximate string matching is an important problem in Computer Science. The standard solution for this problem is a dynamic programming algorithm that takes O(mn) time and space for two strings of lengths m and n. Landau and Vishkin... more
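
The O(mn) time and space dynamic programming solution mentioned above can be sketched as a full-table computation. The sketch assumes the distance is unit-cost edit distance (insertions, deletions and substitutions each cost 1), which the truncated abstract does not state explicitly; the function name `edit_distance` is chosen for illustration. It does not show the Landau and Vishkin approach.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b using a full (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]


print(edit_distance("kitten", "sitting"))
```

The table has (m + 1)(n + 1) cells and each is filled in constant time, which is where the O(mn) time and space bound comes from. For "kitten" and "sitting" the distance is 3.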

We explain new ways of constructing search algorithms using fuzzy sets and fuzzy automata. This technique can be used to search or match strings in special cases when some pairs of symbols are more similar to each other than the others.... more

- by aboul ella hassanien
  Soft Computing, Search Algorithm, String Matching, Fuzzy Set

Approximate string matching is an important paradigm in domains ranging from speech recognition to information retrieval and molecular biology. In this paper, we introduce a new formalism for a class of applications that takes two strings... more

Abstract: Several new number representations based on a Residue Number System are presented which use the smallest prime numbers as moduli and are suited for parallel computations on a reconfigurable mesh architecture. The bit model of... more

Index-based search algorithms are an important part of genomic search, and index construction is the key to an index-based search algorithm for computing similarities between two DNA sequences. In this paper, we propose an efficient... more

- by Seung-Ho Kang
  Mathematics, Computer Science, Algorithms, Medicine

This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution derived from the realm of approximate... more

- by Daniel Lopresti
  Computer Science, Information Management, Data Mining, Packaging

We propose novel machine learning methods for exploring the domain of music performance praxis. Based on simple measurements of timing and intensity in 12 recordings of a Schubert piano piece, short performance sequences are fed into a... more

In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of the automaton S_k in order to accept in... more

- by Filipo Mignosi
  Mathematics, Computer Science, Combinatorics on Words, Application
