Lecture 01
Lecture 01
Lecture 01
Juha Kärkkäinen
Contents
0. Introduction
1. Sets of strings
• Search trees, string sorting, binary search
4. Text Indexes
• Suffix tree and suffix array
• Preprocess a long text for fast string matching and
all kinds of other tasks
2
0. Introduction
Strings and sequences are one of the simplest, most natural and most used
forms of storing information.
• Many algorithmic techniques are first developed for strings and later
generalized for more complex types of data such as graphs.
3
About this course
On the lectures, we will focus on the clean, basic problem. Exercises may
include some variations and extensions. We will mostly ignore any
application specific issues.
4
Strings
Examples of alphabets:
• Σ = {a, b, . . . , z}
• Unicode alphabet
• ...
5
Alphabet or not:
• UTF-8
– Encoding of Unicode symbols using 1–4 bytes
– Not an alphabet, Unicode is the underlying alphabet
– Alternatively, we can treat a UTF-8 encoded string as
a sequence of bytes.
6
A string is a sequence of symbols. The set of all strings over an alphabet Σ
is
∞
[
∗
Σ = Σk = Σ0 ∪ Σ1 ∪ Σ2 ∪ . . .
k=0
where
k
z }| {
k
Σ = Σ × Σ × ··· × Σ
= {a1 a2 . . . ak | ai ∈ Σ for 1 ≤ i ≤ k}
= {(a1 , a2 , . . . , ak ) | ai ∈ Σ for 1 ≤ i ≤ k}
is the set of strings of length k. In particular, Σ0 = {ε}, where ε is the
empty string.
7
There are many notations for strings.
• α, β ∈ Σ∗
• x = a1a2 . . . ak where ai ∈ Σ for all i
• w = u · v = uv, u, v ∈ Σ∗ (w is the concatenation of u and v)
8
Individual characters or their positions usually do not matter. The
significant entities are the substrings or factors.
9
Models of Computation
The usual model for analyzing algorithms is RAM (Random Access
Machine), where all basic operations such as memory accesses, comparisons
and arithmetic operations take constant time.
However, RAM can be unrealistically powerful:
A more realistic model is word RAM, that allows only w-bit integers, where
w is the machine word size.
10
The word RAM model still allows packing about w/ log σ characters into a
single machine word. We call this the packed string model.
11
We consider three alphabet models.
12
The alphabet models are really used for classifying and analysing algorithms
rather than alphabets:
• Algorithms for the general alphabet model that use only comparisons
are highly general and can be used in almost any situation.
• Using the more powerful operations of the integer alphabet model can
make algorithms faster. However, it makes the algorithms less general,
and can complicate the analysis.
• The constant alphabet model often indicates one of two things:
– The effect of the alphabet on the complexity is complicated and the
constant alphabet model is used to simplify the analysis.
– The time or space complexity of the algorithm is heavily (e.g.,
linearly) dependent on the alphabet size and the algorithm is
effectively unusable for large alphabets.
13
Example 0.3: Suppose we want to count how many distinct characters
occur in a given string. We can do this by maintaining a set of distinct
characters. Each character of the string is added to the set if not already
there. The alphabet model determines the choice of the set data structure:
• With general alphabet model, we can use a balanced binary search tree.
14
1. Sets of Strings
Basic operations on a set of objects include:
There are many other operations too but we will concentrate on these here.
15
An efficient execution of the operations requires that the set is stored as a
suitable data structure.
• An ordered array supports lookup and range searching in the same time
as binary search trees using binary searching. It is simpler, faster and
more space efficient in practice, but does not support insertions and
deletions.
16
The above time complexities assume that basic operations on the objects
including comparisons can be performed in constant time. When the objects
are strings, this is no more true:
• The worst case time for a string comparison is the length of the shorter
string. Even the average case time for a random set of n strings is
O(logσ n) in many cases, including for basic operations in a balanced
binary search tree. We will later show an even stronger result for
sorting later. And sets of strings are rarely fully random.
• Computing a hash function is slower too. A good hash function
depends on all characters and cannot be computed faster than the
length of the string.
Prefix query: Find all strings in R that have the query string S as a
prefix. This is a special type of range query.
Lcp (longest common prefix) query: What is the length of the
longest prefix of the query string S that is also a prefix of some
string in R.
Thus we need special set data structures and algorithms for strings.
17
Trie
A simple but powerful data structure for a set of strings is the trie. It is a
rooted tree with the following properties:
• For every node v, the edges from v to its children have different labels.
e s a
18
The trie is conceptually simple but it is not simple to implement efficiently.
The time and space complexity of a trie depends on the implementation of
the child function:
19
There are many implementation options for the child function including:
Array: Each node stores an array of size σ. The space complexity is O(σN ),
where N is the number of nodes in trie(R). The time complexity of the
child operation is O(1). Requires an integer alphabet.
Binary tree: Replace the array with a binary tree. The space complexity is
O(N ) and the time complexity O(log σ). Works for a general alphabet.
Hash table: One hash table for the whole trie, storing the values
child(v, c) 6= ⊥. Space complexity O(N ), time complexity O(1).
Requires an integer alphabet.
20
Prefix free string sets
Many data structures and algorithms for a string set R become simpler if R
is prefix free.
This has little or no effect on most operations on the set. The length of
each string increases by one only, and the additional symbol could be there
only virtually.
Example 1.4: The set {ali, alice, anna, elias, eliza} is not prefix free
because ali is a prefix of alice, but {ali$, alice$, anna$, elias$, eliza$} is
prefix free.
21
If R is prefix free, the leaves of trie(R) represent exactly R. This simplifies
the implementation of the trie.
Example 1.5: The trie for {ali$, alice$, anna$, elias$, eliza$}.
a e
l n l
i n i
$ a a z
c
$ s a
e
$ $
$
22
Compact Trie
Tries suffer from a large number of nodes, close to ||R|| in the worst case.
• For a string set R, we use |R| to denote the number of strings in R and
||R|| to denote the total length of the strings in R.
The space requirement can be problematic, since typically each node needs
much more space than a single symbol.
Compact tries reduce the number Example 1.6: Compact trie for
of nodes by replacing branchless {ali$, alice$, anna$, elias$, eliza$}.
path segments with a single edge. a
• The label of a new edge is the eli
concatenation of the labels of li
the edges it replaced.
nna$
$
as$ za$
ce$
23
The space complexity of a compact trie is O(|R|) (in addition to the
strings):
• In a compact trie, every internal node (except possibly the root) has at
least two children. In such a tree, there is always at least as many
leaves as internal nodes. Thus the number of nodes is at most 2|R|.
• The egde labels are factors of the input strings. If the input strings are
stored separately, the edge labels can be represented in constant space
using pointers to the strings.
The time complexities are the same or better than for tries:
24
Ternary Trie
• Each internal node has three children: smaller, equal and larger.
• The branching is based on a single symbol at a given position as in a
trie. The position is zero (first symbol) at the root and increases along
the middle branches but not along side branches.
There are also compact ternary tries based on compacting branchless path
segments.
25
Example 1.7: Ternary tries for {ali$, alice$, anna$, elias$, eliza$}.
a
e a
l
n l l
i i eli
n i c
c
$ a a $ e$ nna$ a
e
z
$ s s$
$ a za$
$
$
Ternary tries have the same asymptotic size as the corresponding (σ-ary)
tries.
26
A ternary trie is balanced if each left and right subtree contains at most half
of the strings in its parent tree.
d b
A B D E
C D E A B C
27
In a balanced ternary trie each step down either
Thus the time complexity of insertion, deletion, lookup and lcp query is
O(|S| + log n).
28