Approximate string matching
Recent papers in Approximate string matching
We improve the fastest known algorithm for approximate string matching, which can be used only for low error levels. By using a new method to verify potential matches and a new optimization technique for biased texts (such as English),... more
In this article, a word-oriented approximate string matching approach for searching Arabic text is presented. The distance between a pair of words is determined on the basis of aligning the two words by using occurrence heuristic tables.... more
This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine... more
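As a rough illustration of the longest common subsequence task named in the entry above, here is a minimal Python sketch of the textbook dynamic program. It is not the BATS library's implementation or API; the function name `lcs_length` is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common subsequence ending just before both symbols.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one symbol from either string and keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Keeping only two rows uses O(min-free) bookkeeping of one row per step, which is enough when only the length is needed; recovering the subsequence itself would require the full table or a divide-and-conquer variant.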
The web has become a resourceful tool for almost all domains today. Search engines prominently use the inverted indexing technique to locate the web pages containing the user's query. The performance of an inverted index fundamentally depends upon... more
This paper focuses on the problem of alias detection based on orthographic variations of Arabic names. Alias detection is the process of identifying different variants of the same name. To detect aliases based on orthographic variations,... more
The δ-approximate string matching problem, recently introduced in connection with applications to music retrieval, is a generalization of the exact string matching problem for alphabets of integer numbers. In the δ-approximate variant,... more
A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of... more
We present new algorithms for approximate string matching based on simple but efficient ideas. First, we present an algorithm for string matching with mismatches based on arithmetical operations that runs in linear worst-case time for... more
Approximate string matching is an important operation in information systems because an input string is often an inexact match to the strings already stored. Commonly known accurate methods are computationally expensive as they compare... more
Given two strings, a pattern P of length m and a text T of length n over some alphabet Σ, we consider the string matching problem under k mismatches. The well-known Shift-Add algorithm (Baeza-Yates and Gonnet, 1992) solves the problem in... more
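To make the k-mismatches problem concrete, the following Python sketch follows the general Shift-Add idea: one packed integer holds a mismatch counter per pattern prefix, and each text symbol updates all counters with one shift and one addition. It is a simplification, not the published algorithm: it relies on Python's arbitrary-precision integers and makes every counter field wide enough to never overflow, instead of the overflow handling a fixed machine word would need. The name `shift_add_mismatches` is hypothetical.

```python
def shift_add_mismatches(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    if m == 0:
        return []
    b = m.bit_length() + 1            # bits per counter; a counter never exceeds m
    field_mask = (1 << b) - 1
    state_mask = (1 << (m * b)) - 1   # keep exactly m counter fields
    # Field i of table[c] is 1 when pattern[i] != c (a mismatch contribution).
    all_mismatch = sum(1 << (i * b) for i in range(m))
    table = {
        c: sum(1 << (i * b) for i in range(m) if pattern[i] != c)
        for c in set(pattern)
    }
    state = 0
    starts = []
    for j, c in enumerate(text):
        # Shift moves counter i-1 into field i; the add accounts for symbol c.
        state = ((state << b) + table.get(c, all_mismatch)) & state_mask
        # Field m-1 counts mismatches of the whole pattern ending at position j.
        if j >= m - 1 and ((state >> ((m - 1) * b)) & field_mask) <= k:
            starts.append(j - m + 1)
    return starts
```

For example, `shift_add_mismatches("abdabc", "abc", 1)` reports starts 0 and 3, since "abd" differs from "abc" in one position and "abc" matches exactly.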
We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches which are δ-approximate in the sense of the distance measured as maximum difference between symbols. The... more
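The definition in the entry above (a match is δ-approximate when the maximum difference between aligned symbols is at most δ) can be sketched with a brute-force Python check over integer sequences, for instance MIDI pitch numbers. This is only an illustration of the definition, not one of the algorithms the paper proposes; the name `delta_match` and the use of integer lists are assumptions.

```python
def delta_match(text: list[int], pattern: list[int], delta: int) -> list[int]:
    """Start positions where each aligned pair of symbols differs by at most delta."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        # A window matches when its largest symbol-wise difference is <= delta.
        if all(abs(text[i + j] - pattern[j]) <= delta for j in range(m))
    ]
```

With `delta = 0` this reduces to exact matching; the brute-force scan takes O(mn) time in the worst case.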
Indexing Methods for Approximate String Matching. Gonzalo Navarro, Ricardo Baeza-Yates, Erkki Sutinen, Jorma Tarhio. Abstract: Indexing for approximate text searching is a novel problem that has received significant attention because of... more
In this chapter we deal with various string manipulation problems which originate from the field of computational biology and musicology. These problems are: "approximate string matching with gaps", "inference of maximal... more
This presentation looks at how spammers modify content in spam, for example by deliberately misspelling words like Viagra to get through spam filters. We use the dynamic programming algorithm to identify variations of these words.... more
The Boyer-Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length... more
The newest generation of sequencing instruments, such as Illumina/Solexa Genome Analyzer and ABI SOLiD, can generate hundreds of millions of short DNA "reads" from a single run. These reads must be matched against a reference genome to... more
We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in... more
A compressed full-text self-index for a text T is a data structure requiring reduced space and capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the... more
We introduce a problem called Maximum Common Characters in Blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We... more
In this paper we consider several new versions of approximate string matching with gaps. The main characteristic of these new versions is the existence of gaps in the matching of a given pattern in a text. Algorithms are devised for each... more
We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by... more
In this paper we describe a factorial language, denoted by L(S, k, r), that contains all words that occur in a string S up to k mismatches every r symbols. Then we give some combinatorial properties of a parameter, called repetition... more
Distinctive visual cues are of central importance for image retrieval applications, in particular, in the context of visual location recognition. While in indoor environments typically only few distinctive features can be found, outdoors... more
Treating electronic ink as first-class data (as opposed to simply a substitute for keyboard input) offers intriguing possibilities. The pen has well-known advantages in terms of portability and user acceptance, and ink is an extremely... more
In this paper we consider the confidentiality aspects of particular Grid applications, such as genetic applications. The search for DNA similarities is one of the interesting areas of genetic biology. However, DNA sequences... more
Searching a large data set for the strings that are most similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are... more
The k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (insertions, deletions, substitutions) allowed in a match, and asks for every location in... more
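The k differences problem stated above can be sketched with the classic column-wise dynamic program in which a match may start at any text position. This is the textbook O(mn) baseline, not the method of the listed paper, and it reports end positions of matches; the name `k_differences_ends` is hypothetical.

```python
def k_differences_ends(text: str, pattern: str, k: int) -> list[int]:
    """Text positions where an occurrence of pattern with <= k differences ends."""
    m = len(pattern)
    # col[i] = fewest edits turning pattern[:i] into some substring of the text
    # ending at the current position; before any text is read, col[i] = i.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(
                prev_diag + cost,  # substitution or exact symbol match
                old + 1,           # extra text symbol (insertion)
                col[i - 1] + 1,    # missing pattern symbol (deletion)
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only one column is stored, so the sketch uses O(m) space while still taking O(mn) time.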
Here, we present PatMatch, an efficient, web-based pattern-matching program that enables searches for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences. The... more
This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of... more
Approximate string matching is an important problem in Computer Science. The standard solution for this problem is an O(mn) running time and space dynamic programming algorithm for two strings of length m and n. Landau and Vishkin... more
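The standard O(mn) time and space dynamic program mentioned above can be sketched in Python as follows, assuming unit costs for insertions, deletions, and substitutions (the abstract does not state the cost model). The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b using the full (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i symbols of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j symbols of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + cost,  # substitution or match
                d[i - 1][j] + 1,         # deletion from a
                d[i][j - 1] + 1,         # insertion into a
            )
    return d[m][n]
```

Keeping the whole table matches the O(mn) space bound quoted in the entry; if only the distance is needed, two rows suffice.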
We explain new ways of constructing search algorithms using fuzzy sets and fuzzy automata. This technique can be used to search or match strings in special cases when some pairs of symbols are more similar to each other than the others.... more
Approximate string matching is an important paradigm in domains ranging from speech recognition to information retrieval and molecular biology. In this paper, we introduce a new formalism for a class of applications that takes two strings... more
Several new number representations based on a Residue Number System are presented which use the smallest prime numbers as moduli and are suited for parallel computations on a reconfigurable mesh architecture. The bit model of... more
Index-based search algorithms are an important part of a genomic search, and how to construct indices is the key to an index-based search algorithm to compute similarities between two DNA sequences. In this paper, we propose an efficient... more
This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution derived from the realm of approximate... more
We propose novel machine learning methods for exploring the domain of music performance praxis. Based on simple measurements of timing and intensity in 12 recordings of a Schubert piano piece, short performance sequences are fed into a... more
In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of the automaton S_k in order to accept in... more