Toc
Toc
Toc
DFA stands for deterministic finite automaton. Basically, it is a directed graph where each
node has edges coming out of it labeled with the letters from a fixed finite alphabet Σ. One
node is designated as the start node, and a set of nodes are designated as final nodes. If
|Σ| = k, then each node has k distinctly labeled outgoing edges, one for each letter in Σ.
The strings accepted by a DFA are the ones corresponding to walks from the start state to
a final state following the correctly labeled edges.
DFA Exercises
1. Give a DFA for Σ = {0, 1} and strings that have an odd number of 1’s and any number
of 0’s.
2. Give a DFA for Σ = {a, b, c} that accepts any string with aab as a substring.
3. Give a DFA for Σ = {a, b} that accepts any string with aababb as a substring.
4. Suppose you are given many texts (strings) T1 , . . . , Tn and one pattern string P . You
want to determine which texts have the pattern P as a substring. What is the total
runtime of doing this using DFAs?
Solutions
1.
0 0
2.
b, c c a a, b, c
a a b
b, c
1
3.
a
b a a, b
a
a a b a b b
b
b
P
4. Once the DFA is built, the total runtime is O( Ti ). The question is how long does it
take to construct the DFA? We will see a particularly efficient way in the next section
giving an O(|P |) runtime for DFA construction.
KMP Exercises
1. For the last DFA (aababb) in the previous group of exercises, construct the correspond-
ing KMP table.
2. Build the corresponding KMP table for the string abcaabc.
3. Show how to build the above array in O(|P |) time.
4. Show how to simulate the corresponding DFA in O(|T |) time.
5. Show how to find all occurrences of a pattern P in a text T using the KMP algorithm.
2
6. Given a string z find the shortest string x such that z = xk for some k > 0. That is, z
is x repeated k times.
Solutions
1. The table is
index 0 1 2 3 4 5 6
value 0 0 1 0 1 0 0
2. The table is
index 0 1 2 3 4 5 6 7
value 0 0 0 0 1 1 2 3
3.
public s t a t i c int [ ] buildKMPTable ( S t r i n g p a t t e r n )
{
int [ ] t a b l e = new int [ p a t t e r n . l e n g t h ( ) + 1 ] ;
f o r ( int i = 2 ; i < t a b l e . l e n g t h ; ++i )
{
int j = t a b l e [ i − 1 ] ;
do
{
i f ( p a t t e r n . charAt ( j ) == p a t t e r n . charAt ( i −1) )
{ t a b l e [ i ] = j +1; break ; }
else j = table [ j ] ;
} while ( j != 0 ) ;
}
return t a b l e ;
}
4.
public s t a t i c int s i m u l a t e ( int [ ] t a b l e , S t r i n g p a t t e r n , S t r i n g
text )
{
int s t a t e = 0 ;
f o r ( int i = 0 ; i < p a t t e r n . l e n g t h ( ) ; ++i )
{
while ( true )
{
i f ( p a t t e r n . charAt ( i ) == t e x t . charAt ( s t a t e ) )
{
s t a t e ++;
3
break ;
}
e l s e i f ( s t a t e == 0 ) break ;
state = table [ state ] ;
}
}
return s t a t e ;
}
5. Look for the pattern P $ in the string T and determine how many times you arrive at
the next-to-last state.
6. Using KMP you can find the suffixes of P that are also prefixes. Suppose P = vw and
P = wx where v, w, x are strings. That is, w is both a suffix and a prefix of P . If |v|
divides |P | then P = v |P |/|v| . Looping through the KMP table, find the shortest such
v.
4
your position in the string, and hence could not lead to a node with at least 2 children.
To determine what suffix a leaf node corresponds to, simply find the first $i that occurs in
thePedge that leads to it, and ignore all characters following it. The resulting tree has size
O( |xi |).
3. Given a suffix tree for the text T $ and a pattern string P , show how to find all
occurrences of P in O(|P | + occ) time.
4. Given a suffix tree for the string x$ and an integer k, show how to find all maximum
length substrings s that occur at least k times in x
5. Given a suffix tree for the string x$, find the smallest string z such that x = z k for
some k ≥ 1.
6. Given a generalized suffix tree for the strings x1 , . . . , xn , find all substrings of maximum
length that are common to every string.
7. Given a list of strings x1 , . . . , xn , compute for every k the length of the longest string
common to k of the xi .
Solutions
1. There are O(|n|) nodes in the tree (a complete binary tree is the worst case).
2.
5
a b
ababb$ b abb$ $
b$
abb$ b$
3. Travel edges according to the letters in P . You will either not be able to (0 occurrences),
you will end at a node, or you will end in the middle of an edge. The number of leaves
below your position in the tree is the number of occurrences. This can be counted with
a DFS.
4. (a) First do a DFS and compute the length of the maximum length substring that
occurs at least k times. To do this, consider all nodes with at least k leaves below
them, and take the maximum length of all prefixes that lead to them. Then do a
second DFS, and output all of the prefixes that have this computed length.
(b) Do a DFS. At each node, you then have access to all of the leaves that lie below
it. Each leaf gives you an index range of the corresponding suffix. This allows you
to compute the index ranges of all occurrences of the substring x corresponding
to the node. Sort these index ranges, and greedily choose a maximum disjoint set
of intervals. These give you the number of non-overlapping substrings. Use this
to find the answer, as in the previous part.
5. As before, this reduces to finding all suffixes that are prefixes. To do this, simply
traverse the path of the suffix tree corresponding to the string itself. If at any point
you reach a node that has a $ edge coming out of it, that marks the end of a suffix.
Hence you have found a prefix that is a suffix.
6. Build the generalized suffix tree. For each node, determine how many leaves below it
correspond to distinct xi s. Consider every node that has n distinct leaves below it.