Exercises of Compiler

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 51
At a glance
Powered by AI
The key takeaways are the differences between compilers and interpreters, and their relative advantages. Compilers translate source code to machine code for faster execution, while interpreters directly execute operations with better error handling. Compilers can produce assembly code for easier debugging compared to machine code.

A compiler translates source code to another language like machine code, while an interpreter directly executes operations on inputs. Compilers usually generate faster machine code while interpreters provide better error diagnostics.

Compilers produce faster machine code while interpreters provide better error handling. Compilers also allow separate compilation while interpreters require the entire program to be available. However, interpreters are more portable across systems.

Exercises for Section 1.

1
1.1.1
What is the difference between a compiler and an interpreter?

Answer
A compiler is a program that can read a program in one language - the source language
- and translate it into an equivalent program in another language – the target language
and report any errors in the source program that it detects during the translation
process.

Interpreter directly executes the operations specified in the source program on inputs
supplied by the user.

1.1.2
What are the advantages of: (a) a compiler over an interpreter (b) an interpreter over a
compiler?

Answer
a. The machine-language target program produced by a compiler is usually much faster
than an interpreter at mapping inputs to outputs.

b. An interpreter can usually give better error diagnostics than a compiler, because it
executes the source program statement by statement.

1.1.3
What advantages are there to a language-processing system in which the compiler
produces assembly language rather than machine language?

Answer
The compiler may produce an assembly-language program as its output, because
assembly language is easier to produce as output and is easier to debug.

1.1.4
A compiler that translates a high-level language into another high-level language is
called a source-to-source translator. What advantages are there to using C as a target
language for a compiler?
Answer
For the C language there are many compilers available that compile to almost every
hardware.

1.1.5
Describe some of the tasks that an assembler needs to perform.

Answer
It translates from the assembly language to machine code. This machine code is
relocatable.

Exercises for Section 1.3


1.3.1
Indicate which of the following terms:

a. imperative

b. declarative

c. von Neumann

d. object-oriented

e. functional

f. third-generation

g. fourth-generation

h. scripting

apply to which of the following languages:

1. C
2. C++
3. Cobol
4. Fortran
5. Java
6. Lisp
7. ML
8. Perl
9. Python
10. VB.

Answer
imperative: C, C++

object-oriented: C++, Java

functional: ML

scripting: Perl, Python

Exercises for Section 2.2


2.2.1
Consider the context-free grammar:

S -> S S + | S S * | a

1. Show how the string aa+a* can be generated by this grammar.


2. Construct a parse tree for this string.
3. What language does this grammar generate? Justify your answer.

Answer
1. S -> S S -> S S + S -> a S + S -> a a + S -> a a + a *

2.
3. L = {Postfix expression consisting of digits, plus and multiple signs}

2.2.2
What language is generated by the following grammars? In each case justify your
answer.

1. S -> 0 S 1 | 0 1
2. S -> + S S | - S S | a
3. S -> S ( S ) S | ε
4. S -> a S b S | b S a S | ε
5. S -> a | S + S | S S | S * | ( S )

Answer
1. L = {0n1n | n>=1}
2. L = {Prefix expression consisting of plus and minus signs}
3. L = {Matched brackets of arbitrary arrangement and nesting, includes ε}
4. L = {String has the same amount of a and b, includes ε}
5. L = {Regular expressions used to describe regular languages} refer to wiki

2.2.3
Which of the grammars in Exercise 2.2.2 are ambiguous?

Answer
1. No
2. No
3. Yes

4. Yes
5. Yes

2.2.4
Construct unambiguous context-free grammars for each of the following languages. In
each case show that your grammar is correct.

1. Arithmetic expressions in postfix notation.


2. Left-associative lists of identifiers separated by commas.
3. Right-associative lists of identifiers separated by commas.
4. Arithmetic expressions of integers and identifiers with the four binary operators +,
-, *, /.
5. Add unary plus and minus to the arithmetic operators of 4.

Answer
1. E -> E E op | num

2. list -> list , id | id

3. list -> id , list | id


4. expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> id | num | (expr)

5. expr -> expr + term | expr - term | term


term -> term * unary | term / unary | unary
unary -> + factor | - factor | factor
factor - > id | num | (expr)

2.2.5
1. Show that all binary strings generated by the following grammar have values
divisible by 3. Hint. Use induction on the number of nodes in a parse tree.

num -> 11 | 1001 | num 0 | num num

2. Does the grammar generate all binary strings with values divisible by 3?

Answer
1. Proof

Any string derived from the grammar can be considered to be a sequence


consisting of 11 and 1001, where each sequence element is possibly suffixed
with a 0.

Let n be the set of positions where 11 is placed. 11 is said to be at position i if the


first 1 in 11 is at position i, where i starts at 0 and grows from least significant to
most significant bit.
Let m be the equivalent set for 1001.
The sum of any string produced by the grammar is:

sum

= Σn (21 + 20) 2 n + Σm (23 + 20) 2m

= Σ n 3 2 n + Σm 9 2 m

This is clearly divisible by 3.

1. No. Consider the string "10101", which is divisible by 3, but cannot be derived
from the grammar.

Readers seeking a more formal proof can read about it below:

Proof:
Every number divisible by 3 can be written in the form 3k. We will consider k >
0 (though it would be valid to consider k to be an arbitrary integer).
Note that every part of num(11, 1001 and 0) is divisible by 3, if the grammar
could generate all the numbers divisible by 3, we can get a production for binary
k from num's production:
3k = num -> 11 | 1001 | num 0 | num num
k = num/3 -> 01 | 0011 | k 0 | k k
k -> 01 | 0011 | k 0 | k k
It is obvious that any value of k that has more
than 2 consecutive bits set to 1 can
never be produced. This can be confirmed by the example given in the
beginning:
10101 is 3*7, hence, k = 7 = 111 in binary. Because 111 has more than 2
consecutive 1's in binary, the grammar will never produce 21.

2.2.6
Construct a context-free grammar for roman numerals.

Note: we just consider a subset of roman numerals which is less than 4k.

Answer
wikipedia: Roman_numerals

 via wikipedia, we can categorize the single roman numerals into 4 groups:

 I, II, III | I V | V, V I, V II, V III | I X


then get the production:
digit -> smallDigit | I V | V smallDigit | I X
smallDigit -> I | II | III | ε
 and we can find a simple way to map roman to arabic numerals. For example:

o XII => X, II => 10 + 2 => 12


o CXCIX => C, XC, IX => 100 + 90 + 9 => 199
o MDCCCLXXX => M, DCCC, LXXX => 1000 + 800 + 80 => 1880
 via the upper two rules, we can derive the production:

romanNum -> thousand hundred ten digit

thousand -> M | MM | MMM | ε

hundred -> smallHundred | C D | D smallHundred | C M

smallHundred -> C | CC | CCC | ε

ten -> smallTen | X L | L smallTen | X C


smallTen -> X | XX | XXX | ε

digit -> smallDigit | I V | V smallDigit | I X

smallDigit -> I | II | III | ε

2.3 Exercises for Section 2.3


2.3.1
Construct a syntax-directed translation scheme that translates arithmetic expressions
from infix notation into prefix notation in which an operator appears before its operands;
e.g. , -xy is the prefix notation for x - y. Give annotated parse trees for the inputs 9-5+2
and 9-5*2.

Answer
productions:
expr -> expr + term
| expr - term
| term
term -> term * factor
| term / factor
| factor
factor -> digit | (expr)
translation schemes:
expr -> {print("+")} expr + term
| {print("-")} expr - term
| term
term -> {print("*")} term * factor
| {print("/")} term / factor
| factor
factor -> digit {print(digit)}
| (expr)

2.3.2
Construct a syntax-directed translation scheme that translates arithmetic expressions
from postfix notation into infix notation. Give annotated parse trees for the inputs 95-
2 and 952-.

Answer
productions:
expr -> expr expr +
| expr expr -
| expr expr *
| expr expr /
| digit
translation schemes:
expr -> expr {print("+")} expr +
| expr {print("-")} expr -
| {print("(")} expr {print(")*(")} expr {print(")")} *
| {print("(")} expr {print(")/(")} expr {print(")")} /
| digit {print(digit)}

Another reference answer


E -> {print("(")} E {print(op)} E {print(")"}} op | digit {print(digit)}

2.3.3
Construct a syntax-directed translation scheme that translates integers into roman
numerals.

Answer
assistant function:
repeat(sign, times) // repeat('a',2) = 'aa'
translation schemes:
num -> thousand hundred ten digit
{ num.roman = thousand.roman || hundred.roman || ten.roman || digit.roman;
print(num.roman)}
thousand -> low {thousand.roman = repeat('M', low.v)}
hundred -> low {hundred.roman = repeat('C', low.v)}
| 4 {hundred.roman = 'CD'}
| high {hundred.roman = 'D' || repeat('X', high.v - 5)}
| 9 {hundred.roman = 'CM'}
ten -> low {ten.roman = repeat('X', low.v)}
| 4 {ten.roman = 'XL'}
| high {ten.roman = 'L' || repeat('X', high.v - 5)}
| 9 {ten.roman = 'XC'}
digit -> low {digit.roman = repeat('I', low.v)}
| 4 {digit.roman = 'IV'}
| high {digit.roman = 'V' || repeat('I', high.v - 5)}
| 9 {digit.roman = 'IX'}
low -> 0 {low.v = 0}
| 1 {low.v = 1}
| 2 {low.v = 2}
| 3 {low.v = 3}
high -> 5 {high.v = 5}
| 6 {high.v = 6}
| 7 {high.v = 7}
| 8 {high.v = 8}

2.3.4
Construct a syntax-directed translation scheme that trans lates roman numerals into
integers.

Answer
productions:
romanNum -> thousand hundred ten digit
thousand -> M | MM | MMM | ε
hundred -> smallHundred | C D | D smallHundred | C M
smallHundred -> C | CC | CCC | ε
ten -> smallTen | X L | L smallTen | X C
smallTen -> X | XX | XXX | ε
digit -> smallDigit | I V | V smallDigit | I X
smallDigit -> I | II | III | ε
translation schemes:
romanNum -> thousand hundred ten digit {romanNum.v = thousand.v || hundred.v || ten.v
|| digit.v; print(romanNun.v)}
thousand -> M {thousand.v = 1}
| MM {thousand.v = 2}
| MMM {thousand.v = 3}
| ε {thousand.v = 0}
hundred -> smallHundred {hundred.v = smallHundred.v}
| C D {hundred.v = smallHundred.v}
| D smallHundred {hundred.v = 5 + smallHundred.v}
| C M {hundred.v = 9}
smallHundred -> C {smallHundred.v = 1}
| CC {smallHundred.v = 2}
| CCC {smallHundred.v = 3}
| ε {hundred.v = 0}
ten -> smallTen {ten.v = smallTen.v}
| X L {ten.v = 4}
| L smallTen {ten.v = 5 + smallTen.v}
| X C {ten.v = 9}
smallTen -> X {smallTen.v = 1}
| XX {smallTen.v = 2}
| XXX {smallTen.v = 3}
| ε {smallTen.v = 0}
digit -> smallDigit {digit.v = smallDigit.v}
| I V {digit.v = 4}
| V smallDigit {digit.v = 5 + smallDigit.v}
| I X {digit.v = 9}
smallDigit -> I {smallDigit.v = 1}
| II {smallDigit.v = 2}
| III {smallDigit.v = 3}
| ε {smallDigit.v = 0}
2.3.5
Construct a syntax-directed translation scheme that translates postfix arithmetic
expressions into equivalent prefix arithmetic expressions.

Answer
production:
expr -> expr expr op | digit
translation scheme:
expr -> {print(op)} expr expr op | digit {print(digit)}

Exercises for Section 2.4


2.4.1
Construct recursive-descent parsers, starting with the following grammars:

1. S -> + S S | - S S | a
2. S -> S ( S ) S | ε
3. S -> 0 S 1 | 0 1

Answer
See 2.4.1.1.c, 2.4.1.2.c, and 2.4.1.3.c for real implementations in C.

1) S -> + S S | - S S | a

void S(){
switch(lookahead){
case "+":
match("+"); S(); S();
break;
case "-":
match("-"); S(); S();
break;
case "a":
match("a");
break;
default:
throw new SyntaxException();
}
}
void match(Terminal t){
if(lookahead = t){
lookahead = nextTerminal();
}else{
throw new SyntaxException()
}
}
2) S -> S ( S ) S | ε

void S(){
if(lookahead == "("){
match("("); S(); match(")"); S();
}
}
3) S -> 0 S 1 | 0 1

void S(){
switch(lookahead){
case "0":
match("0"); S(); match("1");
break;
case "1":
// match(epsilon);
break;
default:
throw new SyntaxException();
}
}

Exercises for Section 2.6


2.6.1
Extend the lexical analyzer in Section 2.6.5 to remove comments, defined as follows:

1. A comment begins with // and includes all characters until the end of that line.
2. A comment begins with / and includes all characters through the next occurrence
of the character sequence /.

2.6.2
Extend the lexical analyzer in Section 2.6.5 to recognize the relational operators <, <=,
==, ! =, >=, >.

2.6.3
Extend the lexical analyzer in Section 2.6.5 to recognize floating point numbers such as
2., 3.14, and . 5.

Answer
Source code: commit 8dd1a9a

Code snippet(src/lexer/Lexer.java):

public Token scan() throws IOException, SyntaxException{


for(;;peek = (char)stream.read()){
if(peek == ' ' || peek == '\t'){
continue;
}else if(peek == '\n'){
line = line + 1;
}else{
break;
}
}

// handle comment
if(peek == '/'){
peek = (char) stream.read();
if(peek == '/'){
// single line comment
for(;;peek = (char)stream.read()){
if(peek == '\n'){
break;
}
}
}else if(peek == '*'){
// block comment
char prevPeek = ' ';
for(;;prevPeek = peek, peek = (char)stream.read()){
if(prevPeek == '*' && peek == '/'){
break;
}
}
}else{
throw new SyntaxException();
}
}

// handle relation sign


if("<=!>".indexOf(peek) > -1){
StringBuffer b = new StringBuffer();
b.append(peek);
peek = (char)stream.read();
if(peek == '='){
b.append(peek);
}
return new Rel(b.toString());
}
// handle number, no type sensitive
if(Character.isDigit(peek) || peek == '.'){
Boolean isDotExist = false;
StringBuffer b = new StringBuffer();
do{
if(peek == '.'){
isDotExist = true;
}
b.append(peek);
peek = (char)stream.read();
}while(isDotExist == true ? Character.isDigit(peek) : Character.isDigit(peek) ||
peek == '.');
return new Num(new Float(b.toString()));
}

// handle word
if(Character.isLetter(peek)){
StringBuffer b = new StringBuffer();
do{
b.append(peek);
peek = (char)stream.read();
}while(Character.isLetterOrDigit(peek));
String s = b.toString();
Word w = words.get(s);
if(w == null){
w = new Word(Tag.ID, s);
words.put(s, w);
}
return w;
}

Token t = new Token(peek);


peek = ' ';
return t;
}

Exercises for Section 2.8


2.8.1
For-statements in C and Java have the form:

for ( exprl ; expr2 ; expr3 ) stmt

The first expression is executed before the loop; it is typically used for initializing the
loop index. The second expression is a test made before each iteration of the loop; the
loop is exited if the expression becomes 0. The loop itself can be thought of as the
statement {stmt expr3 ; }. The third expression is executed at the end of each iteration; it
is typically used to increment the loop index. The meaning of the for-statement is similar
to

expr1 ; while ( expr2 ) {stmt expr3 ; }

Define a class For for for-statements, similar to class If in Fig. 2.43.

Answer
class For extends Stmt {
Expr E1;
Expr E2;
Expr E3;
Stmt S;
public For(Expr expr1, Expr expr2, Expr expr3, Stmt stmt){
E1 = expr1;
E2 = expr2;
E3 = expr3;
S = stmt;
}
public void gen(){
E1.gen();
Label start = new Label();
Label end = new Label();
emit("ifFalse " + E2.rvalue().toString() + " goto " + end);
S.gen();
E3.gen();
emit("goto " + start);
emit(end + ":")
}
}

2.8.2
The programming language C does not have a boolean type. Show how a C compiler
might translate an if-statement into three-address code.

Answer
Replace
emit("ifFalse " + E.rvalue().toString() + " goto " + after);
with
emit("ifEqual " + E.rvalue().toString() + " 0 goto " + after);
or
emit("ifEqualZero " + E.rvalue().toString() + " goto " + after);
Exercises for Section 3.1
3.1.1
Divide the following C++ program:
float limitedSquare(x){float x;
/* returns x-squared, nut never more than 100 */
return (x <= -10.0 || x >= 10.0) ? 100 : x*x;
}
into appropriate lexemes, using the discussion of Section 3.1.2 as a guide. Which
lexemes should get associated lexical values? What should those values be?

Answer
<float> <id, limitedSquaare> <(> <id, x> <)> <{>
<float> <id, x>
<return> <(> <id, x> <op,"<="> <num, -10.0> <op, "||"> <id, x> <op, ">="> <num,
10.0> <)> <op, "?"> <num, 100> <op, ":"> <id, x> <op, "*"> <id, x>
<}>

3.1.2
Tagged languages like HTML or XML are different from conventional programming
languages in that the punctuation (tags) are either very numerous (as in HTML) or a
user-definable set (as in XML). Further, tags can often have parameters. Suggest how
to divide the following HTML document:
Here is a photo of <b>my house</b>;
<p><img src="house.gif"/><br/>
see <a href="morePix.html">More Picture</a> if you
liked that one.</p>
into appropriate lexemes. Which lexemes should get associated lexical values, and
what should those values be?

Answer
<text, "Here is a photo of"> <nodestart, b> <text, "my house"> <nodeend, b>
<nodestart, p> <selfendnode, img> <selfendnode, br>
<text, "see"> <nodestart, a> <text, "More Picture"> <nodeend, a>
<text, "if you liked that one."> <nodeend, p>

Exercises for Section 3.3


3.3.1
Consult the language reference manuals to determine

1. the sets of characters that form the input alphabet (excluding those that may only
appear in character strings or comments)
2. the lexical form of numerical constants, and
3. the lexical form of identifiers,

for each of the following languages:

1. C
2. C++
3. C#
4. Fortran
5. Java
6. Lisp
7. SQL

3.3.2
Describe the languages denoted by the following regular expressions:

1. a(a|b)*a
2. ((ε|a)b*)*
3. (a|b)*a(a|b)(a|b)
4. a*ba*ba*ba*
5. !! (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*

Answer
1. String of a's and b's that start and end with a.
2. String of a's and b's.
3. String of a's and b's that the character third from the last is a.
4. String of a's and b's that only contains three b.
5. String of a's and b's that has a even number of a and b.

3.3.3
In a string of length n, how many of the following are there?

1. Prefixes.
2. Suffixes.
3. Proper prefixes.
4. ! Substrings.
5. ! Subsequences.
Answer
1. n+1
2. n+1
3. n-1
4. C(n+1,2) + 1 (need to count epsilon in)
5. Σ(i=0,n) C(n, i)

3.3.4
Most languages are case sensitive, so keywords can be written only one way, and the
regular expressions describing their lexeme is very simple. However, some languages,
like SQL, are case insensitive, so a keyword can be written either in lowercase or in
uppercase, or in any mixture of cases. Thus, the SQL keyword SELECT can also be
written select, Select, or sElEcT, for instance. Show how to write a regular expression
for a keyword in a case insensitive language. Illustrate the idea by writing the
expression for "select" in SQL.

Answer
select -> [Ss][Ee][Ll][Ee][Cc][Tt]

3.3.5
!Write regular definitions for the following languages:

1. All strings of lowercase letters that contain the five vowels in order.
2. All strings of lowercase letters in which the letters are in ascending lexicographic
order.
3. Comments, consisting of a string surrounded by / and /, without an intervening */,
unless it is inside double-quotes (")
4. !! All strings of digits with no repeated digits. Hint: Try this problem first with a few
digits, such as {O, 1, 2}.
5. !! All strings of digits with at most one repeated digit.
6. !! All strings of a's and b's with an even number of a's and an odd number of b's.
7. The set of Chess moves,in the informal notation,such as p-k4 or kbp*qn.
8. !! All strings of a's and b's that do not contain the substring abb.
9. All strings of a's and b's that do not contain the subsequence abb.

Answer
1、

want -> other* a (other|a)* e (other|e)* i (other|i)* o (other|o)* u (other|u)*


other -> [bcdfghjklmnpqrstvwxyz]
2、
a* b* ... z*
3、

\/\*([^*"]*|".*"|\*+[^/])*\*\/
4、

want -> 0|A?0?1(A0?1|01)*A?0?|A0?


A -> 0?2(02)*
Steps:

step1. Transition diagram

step2. GNFA

step3. Remove node 0 and simplify


step4. Remove node 2 and simplify
step5. Remove node 1 and simplify

5、

want -> (FE*G|(aa)*b)(E|FE*G)


E -> b(aa)*b
F -> a(aa)*b
G -> b(aa)*ab|a
F -> ba(aa)*b
Steps:

step1. Transition diagram


step2. GNFA

step3. Remove node A and simplify


step4. Remove node D and simplify
step5. Remove node C and simplify
8、

b*(a+b?)*
9、

b* | b*a+ | b*a+ba*

3.3.6
Write character classes for the following sets of characters:

1. The first ten letters (up to "j") in either upper or lower case.
2. The lowercase consonants.
3. The "digits" in a hexadecimal number (choose either upper or lower case for the
"digits" above 9).
4. The characters that can appear at the end of alegitimate English sentence (e.g. ,
exclamation point) .

Answer
1. [A-Ja-j]
2. [bcdfghjklmnpqrstvwxzy]
3. [0-9a-f]
4. [.?!]

3.3.7
Note that these regular expressions give all of the following symbols (operator
characters) a special meaning:
\ " . ^ $ [ ] * + ? { } | /
Their special meaning must be turned off if they are needed to represent themselves in
a character string. We can do so by quoting the character within a string of length one
or more; e.g., the regular expression "**" matches the string ** . We can also get the
literal meaning of an operator character by preceding it by a backslash. Thus, the
regular expression \*\* also matches the string **. Write a regular expression that
matches the string "\.

Answer
\"\\

3.3.9 !
The regular expression r{m, n} matches from m to n occurrences of the pattern r. For
example, a [ 1 , 5] matches a string of one to five a's. Show that for every regular
expression containing repetition operators of this form, there is an equivalent regular
expression without repetition operators.

Answer
r{m,n} is equals to r.(m).r | r.(m + 1).r | ... | r.(n).r

3.3.10 !
The operator ^ matches the left end of a line, and $ matches the right end of a line. The
operator ^ is also used to introduce complemented character classes, but the context
always makes it clear which meaning is intended. For example, ^aeiou*$ matches any
complete line that does not contain a lowercase vowel.

1. How do you tell which meaning of ^ is intended?


2. Can you always replace a regular expression using the ^ and $ operators by an
equivalent expression that does not use either of these operators?

Answer
1. if ^ is in a pair of brakets, and it is the first letter, it means complemented classes,
or it means the left end of a line.

3.4 节的练习
3.4.1
给出识别练习 3.3.2 中各个正则表达式所描述的语言状态转换图。

解答

解答步骤:NFA -> DFA -> 最少状态的 DFA(状态转换图)

1. a(a|b)*a

NFA:

DFA:

NFA DFA a

{0} A B

{1,2,3,5,8} B C D

{2,3,4,5,7,8,9} C C D

{2,3,5,6,7,8} D C D
最少状态的 DFA(状态转换图):

合并不可区分的状态 B 和 D

![3 4 1-1](assets/3.4.1-1.gif)
1. ((ε|a)b*)*

2. (a|b)*a(a|b)(a|b)

NFA:

DFA:

<table>
<thead>
<tr>
<th>NFA</th>
<th>DFA</th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<td>{0,1,2,4,7}</td>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>{1,2,3,4,6,7,8,9,11}</td>
<td>B</td>
<td>D</td>
<td>E</td>
</tr>
<tr>
<td>{1,2,4,5,6,7}</td>
<td>C</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>{1,2,3,4,6,7,8,9,10,11,13,14,16}</td>
<td>D</td>
<td><b>F</b></td>
<td><b>G</b></td>
</tr>
<tr>
<td>{1,2,4,5,6,7,12,13,14,16}</td>
<td>E</td>
<td><b>H</b></td>
<td><b>I</b></td>
</tr>
<tr>
<td>{1,2,3,4,6,7,8,9,10,11,13,14,15,16,<b>18</b>}</td>
<td><b>F</b></td>
<td><b>F</b></td>
<td><b>G</b></td>
</tr>
<tr>
<td>{1,2,4,5,6,7,12,13,14,16,17,<b>18</b>}</td>
<td><b>G</b></td>
<td><b>H</b></td>
<td><b>I</b></td>
</tr>
<tr>
<td>{1,2,3,4,6,7,8,9,11,15,<b>18</b>}</td>
<td><b>H</b></td>
<td>D</td>
<td>E</td>
</tr>
<tr>
<td>{1,2,4,5,6,7,17,<b>18</b>}</td>
<td><b>I</b></td>
<td>B</td>
<td>C</td>
</tr>
</tbody>
</table>

最少状态的 DFA(状态转换图):

合并不可区分的状态 A 和 C

![3 4 1-3](assets/3.4.1-3.gif)
1. a*ba*ba*ba*

3.4.2
给出识别练习 3.3.5 中各个正则表达式所描述语言的状态转换图。

3.4.3
构造下列串的失效函数。

1. abababaab
2. aaaaaa
3. abbaabb

解答

代码详见:src/failure-function.js

1. [ 0, 0, 1, 2, 3, 4, 5, 1, 2 ]
2. [ 0, 1, 2, 3, 4, 5 ]
3. [ 0, 0, 0, 1, 1, 2, 3 ]

3.4.4 !
对 s 进行归纳,证明图 3-19 的算法正确地计算出了失效函数。
图 3-19:计算关键字 b_1b_2…b_n 的失效函数

01 t = 0;
02 f(1) = 0;
03 for (s = 1; s < n; s ++){
04 while( t > 0 && b_s+1 != b_t+1) t = f(t);
05 if(b_s+1 == b_t+1){
06 t = t + 1;
07 f(s + 1) = t;
08 }else{
09 f(s + 1) = 0;
10 }
11 }

证明

1. 已知 f(1) = 0
2. 在第 1 次 for 循环时,计算 f(2) 的值,当第5行代码 b_2 == b_1 成立时,代码进
入到第7行得出 f(2) = 1,不成立时,则代码进入第9行得出 f(2) = 0。显然,这次
循环正确的计算出了 f(2) 。
3. 假设在第 i-1 次进入循环时,也正确的计算出了 f(i),也有 f(i) = t (无论 t 是大于 0
还是等于 0)
4. 那么在第 1 次进入循环时,分两种情况进行考虑:

i. t == 0

这种情况比较简单,直接从第 5 行开始,当 b_i+1 == b_1 时,f(i+1) = 1,


否则 f(i+1) = 0

ii. t > 0 while 循环会不断缩小 t 值,试图找出最大可能的使得 b_i+1 == b_t+1


成立的 t 值,如果找到了,则进入第 5 行执行,得到 f(i+1) = t+1;或者直
到 t == 0 时也没有找到,则跳出循环,这时进入第 5 行执行,过程类似于
前一种情况。

3.4.5 !!
说明图 3-19 中的第 4 行的复制语句 t = f(t) 最多被执行 n 次。进而说明整个算法的时间复
杂度是 O(n),其中 n 是关键字长度。
解答

详见 matrix 的博文 KMP算法详解。

3.4.6
应用 KMP 算法判断关键字 ababaa 是否为下面字符串的子串:

1. abababaab
2. abababbaa

解答

代码详见:src/failure-function.js

1. true
2. false

3.4.7 !!
说明图 3-20 中的算法可以正确的表示输入关键字是否为一个给定字符串的子串。

图 3-20:KMP 算法在 O(m + n) 的时间内检测字符串a_1a_3...a_n 中是否包含单个关键


字 b1b2...bn

s = 0;
for(i = 1; i <= m; i ++){
while(s > 0 && a_i != b_s+1) s = f(s);
if(a_i == b_s+1) s = s + 1;
if(s == n) return "yes";
}
return "no";

3.4.8
假设已经计算得到函数 f 且他的值存储在一个以 s 为下标的数字中,说明图 3-20 中算法
的时间复杂度为 O(m + n)。

解答

详见 matrix 的博文 KMP算法详解。


3.4.9
Fibonacci 字符串的定义如下:

1. s1 = b
2. s2 = a
3. 当 k > 2 时, sk = sk-1 sk-2

例如:s3 = ab, s4 = aba, s5 = abaab

1. sn 的长度是多少?
2. 构造 s6 的失效函数。
3. 构造 s7 的失效函数。
4. !! 说明任何 sn 的失效函数都可以被表示为:f(1) = f(2) = 0,且对于 2 < j <= |sn|,
f(j) = j - |sk-1|,其中 k 是使得 |sk| <= j + 1 的最大整数。
5. !! 在 KMP 算法中,当我们试图确定关键字 sk 是否出现在字符串 sk+1 中,最多
会连续多少次调用失效函数?

解答

1. 见 维基百科
2. s6 = abaababa

failure = [ 0, 0, 1, 1, 2, 3, 2, 3 ]

3. s7 = abaababaabaab

failure = [ 0, 0, 1, 1, 2, 3, 2, 3, 4, 5, 6, 4, 5 ]

Exercises for Section 3.5


3.5.1
Describe how to make the following modifications to the Lex program of Fig. 3.23:

1. Add the keyword while.


2. Change the comparison operators to be the C operators of that kind.
3. Allow the underscore ( _ ) as an additional letter.
4. ! Add a new pattern with token STRING. The pattern consists of a double quote (
" ) , any string of characters and a final double-quote. However, if a double-quote
appears in the string, it must be escaped by preceding it with a backslash () , and
therefore a backslash in the string must be represented by two backslashes. The
lexical value, which is the string without the surrounding double-quotes, and with
backslashes used to es.,. cape a character removed. Strings are to be installed
in a table of strings.

source

3.5.2
Write a Lex program that copies a file, replacing each non empty sequence of white
space by a single blank

3.5.3
Write a Lex program that copies a C program, replacing each instance of the keyword f
loat by double.。

3.5.4 !
Write a Lex program that converts a file to "Pig latin." Specifically, assume the file is a
sequence of words (groups . of letters) separated by whitespace. Every time you
encounter a word:

1. If the first letter is a consonant, move it to the end of the word and then add ay!
2. If the first letter is a vowel, just add ay to the end of the word.

All nonletters are copied int act to the output.

source

3.5.5 !
In SQL, keywords and identifiers are case-insensitive. Write a Lex program that
recognizes the keywords SELECT, FROM, and WHERE (in any combination of capital
and lower-case letters) , and token ID, which for the purposes of this exercise you may
take to be any sequence of letters and digits, beginning with a letter. You need not
install identifiers in a symbol table, but tell how the "install" function would differ from
that described for case-sensitive identifiers as in Fig. 3.23.

source
3.6 Exercises for Section 3.6
3.6.1 !
Figure 3.19 in the exercises of Section 3.4 computes the failure function for the KMP
algorithm. Show how, given that failure function, we can construct, from a keyword
b1b2...bn an n + 1-state DFA that recognizes .*b1b2...bn, where the dot stands for "any
character." Moreover, this DFA can be constructed in O(n) time.

Answer
Take the string "abbaabb" in exercise 3.4.3-3 as example, the failure function is:

 n : 1, 2, 3, 4, 5, 6, 7
 f(n): 0, 0, 0, 1, 1, 2, 3

The DFA is:

Pseudocode of building the DFA:

for (i = 0; i< n; i ++) {


move[s[i], c] = {
if ( c == b1b2…bn[i] ) {
goto s[i+1]
} else {
goto s[f(i)]
}
}
}
It is obviously that with the known f(n), this DFA can be constructed in O(n) time.

3.6.2
Design finite automata (deterministic or nondeterministic) for each of the languages of
Exercise 3.3.5.

3.6.3
For the NFA of Fig. 3.29, indicate all the paths labeled aabb. Does the NFA accept
aabb?

Answer
 (0) -a-> (1) -a-> (2) -b-> (2) -b-> ((3))
 (0) -a-> (0) -a-> (0) -b-> (0) -b-> (0)
 (0) -a-> (0) -a-> (1) -b-> (1) -b-> (1)
 (0) -a-> (1) -a-> (1) -b-> (1) -b-> (1)
 (0) -a-> (1) -a-> (2) -b-> (2) -b-> (2)
 (0) -a-> (1) -a-> (2) -b-> (2) -ε-> (0) -b-> (0)
 (0) -a-> (1) -a-> (2) -ε-> (0) -b-> (0) -b-> (0)

This NFA accepts "aabb"

3.6.4
Repeat Exercise 3.6.3 for the NFA of Fig. 3.30.

3.6.5
Give the transition tables for the NFA of:

1. Exercise 3.6.3.
2. Exercise 3.6.4.
3. Figure 3.26.

Answer
Table 1

state a b ε

0 {0,1} {0} ∅

1 {1,2} {1} ∅
state a b ε

2 {2} {2,3} {0}

3 ∅ ∅ ∅

Table 2

state a b ε

0 {1} ∅ {3}

1 ∅ {2} {0}

2 ∅ {3} {1}

3 {0} ∅ {2}

Table 3

state a b ε
0 ∅ ∅ {1,2}
1 {2} ∅ ∅
2 {2} ∅ ∅
3 ∅ {4} ∅
4 ∅ {4} ∅

Exercises for Section 3.7


3.7.1
Convert to DFA's the NFA's of:
1. Fig. 3.26.
2. Fig. 3.29.
3. Fig. 3.30.

Answer
1、

Transition table

NFA State DFA State a b

{0,1,3} A B C

{2} B B ∅

{4} C ∅ C

DFA

2、

Transition table

NFA State DFA State a b

{0} A B A

{0,1} B C B
NFA State DFA State a b

{0,1,2} C C D

{0,2,3} D C D

DFA

3、

Transition table

NFA State DFA State a b

{0,1,2,3} A A A

DFA

3.7.2
use Algorithm 3.22 to simulate the NFA's:

1. Fig. 3.29.
2. Fig. 3.30.

on input aabb.
Answer
1. -start->{0}-a->{0,1}-a->{0,1,2}-b->{0,2,3}-b->{0,2,3}
2. -start->{0,1,2,3}-a->{0,1,2,3}-a->{0,1,2,3}-b->{0,1,2,3}-b->{0,1,2,3}

3.7.3
Convert the following regular expressions to deterministic finite automata, using
algorithms 3.23 and 3.20:

1. (a|b)*
2. (a*|b*)*
3. ((ε|a)|b*)*
4. (a|b)*abb(a|b)*

Answer
1、

NFA

Transition table

NFA State DFA State a b

{0,1,2,3,7} A B C

{1,2,3,4,6,7} B B C
NFA State DFA State a b

{1,2,3,5,6,7} C B C

DFA

2、

NFA

Transition table

NFA State DFA State a

{0,1,2,3,4,5,8,9,10,11} A B C
NFA State DFA State a

{1,2,3,4,5,6,8,9,10,11} B B C

{1,2,3,4,5,7,8,9,10,11} C B C

DFA

3、

NFA

Transition table

NFA State DFA State a

{0,1,2,3,4,6,7,9,10} A B C
NFA State DFA State a

{1,2,3,4,5,6,7,9,10} B B C

{1,2,3,4,6,7,8,9,10} C B C

DFA

4、

NFA
Transition table

NFA State DFA State a

{0,1,2,4,7} A B C

{1,2,3,4,6,7,8} B B D

{1,2,4,5,6,7} C B C

{1,2,4,5,6,7,9} D B E

{1,2,4,5,6,7,10,11,12,14,17} E F G

{1,2,3,4,6,7,8,11,12,13,14,16,17} F F H
NFA State DFA State a

{1,2,4,5,6,7,11,12,13,15,16,17} G F G

{1,2,4,5,6,7,9,11,12,14,15,16,17} H F I

{1,2,4,5,6,7,10,11,12,14,15,16,17} I F G

DFA

Exercises for Section 3.8


3.8.1
Suppose we have two tokens: (1) the keyword if, and (2) identifiers, which are strings of
letters other than if. Show:

1. The NFA for these tokens, and


2. The DFA for these tokens.

Answer
1. NFA
NOTE: this NFA has potential conflict, we can decide the matched lexeme by 1.
take the longest 2. take the first listed.

2. DFA

3.8.2
Repeat Exercise 3.8.1 for tokens consisting of (1) the keyword while, (2) the keyword
when, and (3) identifiers consisting of strings of letters and digits, beginning with a letter.

Answer
1. NFA
2. DFA

bother to paint

3.8.3 !
Suppose we were to revise the definition of a DFA to allow zero or one transition out of
each state on each input symbol (rather than exactly one such transition, as in the
standard DFA definition). Some regular expressions would then have smaller "DFA's"
than they do under the standard definition of a DFA. Give an example of one such
regular expression.

Answer
Take the language defined by regular expression "ab" as the example, assume that the
set of input symbols is {a, b}

Standard DFA

Revised DFA
Obviously, the revised DFA is smaller than the standard DFA.

3.8.4 !!
Design an algorithm to recognize Lex-lookahead patterns of the form rl/r2, where rl and
r2 are regular expressions. Show how your algorithm works on the following inputs:

1. (abcd|abc)/d
2. (a|ab)/ba
3. aa*/a*

Exercises for Section 3.9


3.9.1
Extend the table of Fig. 3.58 to include the operators

1. ?
2. +

Answer

node n nullable(n) firstpos(n)

n = c_1 ? true firstpos(c_1)

n = c_1 + nullable(c_1) firstpos(c_1)

3.9.2
Use Algorithm 3.36 to convert the regular expressions of Exercise 3.7.3 directly to
deterministic finite automata.

Answer
1. (a|b)*
o Syntax tree

o firstpos and lastpos for nodes in the syntax tree

- The function followpos

<table>
<thead>
<tr>
<th>node n</th>
<th>followpos(n)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>{1, 2, 3}</td>
</tr>
<tr>
<td>2</td>
<td>{1, 2, 3}</td>
</tr>
<tr>
<td>3</td>
<td>∅</td>
</tr>
</tbody>
</table>
 Steps

The value of firstpos for the root of the tree is {1, 2, 3}, so this set is the start
state of D. Call this set of states A. We compute Dtran[A, a] and Dtran[A, b].
Among the positions of A, 1 correspond to a, while 2 correspond to b. Thus
Dtran[A, a] = followpos(1) = {1, 2, 3}, Dtran[A, b] = followpos(2) = {1, 2, 3}. Both
the results are set A, so dose not have new state, end the computation.

 DFA

1. (a*|b*)*

2. ((ε|a)|b*)*

3. (a|b)*abb(a|b)*

3.9.3 !
We can prove that two regular expressions are equivalent by showing that their
minimum-state DFA's are the same up to renaming of states. Show in this way that the
following regular expressions: (a|b)*, (a*|b*)*, and ((ε|a)b*)* are all equivalent. Note: You
may have constructed the DFA's for these expressions in response to Exercise 3.7.3.

Answer
Refer to the answers of 3.7.3 and 3.9.2-1

3.9.4 !
Construct the minimum-state DFA's for the following regular expressions:

1. (a|b)*a(a|b)
2. (a|b)*a(a|b)(a|b)
3. (a|b)*a(a|b)(a|b)(a|b)
Do you see a pattern?

3.9.5 !!
To make formal the informal claim of Example 3.25, show that any deterministic finite
automaton for the regular expression

(a|b)*a(a|b)...(a|b)

where (a|b) appears n - 1 times at the end, must have at least 2n states. Hint: Observe
the pattern in Exercise 3.9.4. What condition regarding the history of inputs does each
state represent?

You might also like