7 How To Implement A Recursive Descent Parser A Parser Is A Program Which Processes Input de
7 How To Implement A Recursive Descent Parser A Parser Is A Program Which Processes Input de
7 How To Implement A Recursive Descent Parser A Parser Is A Program Which Processes Input de
ned by a context-free grammar. The translation given in the previous section is not very useful in the design of such a program because of the non-determinism. Here I show how for a certain class of grammars this non-determinism can be eliminated and using the example of arithmetical expressions I will show how a JAVA-program can be constructed which parses and evaluates expressions. 7.1 What is a LL(1) grammar ? The basic idea of a recursive descent parser is to use the current input symbol to decide which alternative to choose. Grammars which have the property that it is possible to do this are called LL(1) grammars. First we introduce an end marker $, for a given G = (V;; S; P) we de
ne the augmented grammar G$ = (V 0;0; S0; P0) where V 0 = V [ fS0g where S0 is chosen s.t. S0 =2 V [ , 0 = [ f$g where $ is chosen s.t. $ =2 V [ , P0 = P [ fS0 ! S$g The idea is that L(G$) = fw$ j w 2 L(G)g Now for each nonterminal symbol A 2 V 0 [ 0 we de
ne First(A) = fa j a 2
^ A ) a
g Follow(A) = fa j a 2
^ S0 ) Aa
g i.e. First(A) is the set of terminal symbols with which a word derived from A may start and Follow(A) is the set of symbols which may occur directly after A. We use the augmented grammar to have a marker for the end of the word. For each production A ! 2 P we de
ne the set Lookahead(A ! ) which are the set of symbols which indicate that we are in this alternative. Lookahead(A ! B1B2 : : :Bn) = [ fFirst(Bi) j 81 k < i:Bk ) g [ Follow(A) if B1B2 : : :Bk ) ; otherwise We now say a grammar G is LL(1), i for each pair A ! ;A !
2 P with 6=
) = ; 7.2 How to calculate First and Follow We have to determine whether A ) . If there are no -production we know that the answer is always negative, otherwise If A ! 2 P we know that A ) . If A ! B1B2 : : :Bn where all Bi are nonterminal symbols and for all 1 i n: Bi ) then we also know A ) . 39 We calculate First and Follow in a similar fashion: First(a) = fag if a 2 . If A ! B1B2 : : :Bn and there is an i n s.t. 81 k < i:Bk ) then we add First(Bi) to First(A). And for Follow: $ 2 Follow(S) where S is the original start symbol. If there is a production A ! B
with
) then everything in Follow(A) is also in Follow(B). 7.3 Constructing an LL(1) grammar Let's have a look at the grammar G for arithmetical expressions again. G = (fE; T; Fg; f(; ); a;+; g;E; P) where P = fE ! T j E + T T ! F j T F F ! a j (E) We don't need the Follow-sets in the moment because the empty word doesn't occur in the grammar. For the nonterminal symbols we have First(F) = fa; (g First(T) = fa; (g First(E) = fa; (g and now it is easy to see that most of the Lookahead-sets agree, e.g. Lookahead(E ! T) = fa; (g Lookahead(E ! E + T) = fa; (g Lookahead(T ! F) = fa; (g Lookahead(T ! T F) = fa; (g Lookahead(F ! a) = fag Lookahead(F ! (E)) = f(g Hence the grammar G is not LL(1). However, luckily there is an alternative grammar G0 which de
nes the same language: G0 = (fE;E0; T; T0; Fg; f(; ); a;+; g;E; P0) where P0 = fE ! TE0 E0 ! +TE0 j T ! FT0 T0 ! *FT0 j F ! a j (E) 40 Since we have -productions we do need the Follow-sets. First(E) = First(T) = First(F) = fa; (g First(E0) = f+g First(T0) = f*g Follow(E) = Follow(E0) = f); $g Follow(T) = Follow(T0) = f+; ); $g Follow(F) = f+; *; ); $g Now we calculate the Lookahead-sets: Lookahead(E ! TE0) = fa; (g Lookahead(E0 ! +TE0) = f+g Lookahead(E0 ! ) = Follow(E0) = f); $g Lookahead(T ! +FT0) = fa; (g Lookahead(T0 ! *FT0) = f*g Lookahead(T0 ! ) = Follow(T0) = f+; ); $g Lookahead(F ! a) = fag Lookahead(F ! (E)) = f(g Hence the grammar G0 is LL(1). 7.4 How to implement the parser We can now implement a parser - one way would be to construct a deterministic PDA. However, using JAVA we can implement the parser using recursion - here the internal JAVA stack plays the role of the stack of the PDA. First of all we have to separate the input into tokens which are the terminal sy mbols of our grammar. To keep things simple I assume that tokens are separated by blanks, i.e. one has to type ( a + a ) * a for (a+a)*a. This has the advantage that we can use java.util.StringTokenizer. In a real implementation tokenizing is usually done by using
nite automata. I don't want to get lost in java details - in the main program I read a line and produce a tokenizer: String line=in.readLine(); st = new StringTokenizer(line+" $"); The tokenizer st and the current token are static variables. I implement the convenience method next which assigns the next token to curr. static StringTokenizer st; static String curr; static void next() { 41 try { curr=st.nextToken().intern(); } catch( NoSuchElementException e) { curr=null; } } We also implement a convenience method error(String) to report an error and terminate the program. Now we can translate all productions into methods using the Lookahead sets to determine which alternative to choose. E.g. we translate E0 ! +TE0 j into (using E1 for E0 to follow JAVA rules): static void parseE1() f if (curr=="+") f next(); parseT(); parseE1(); g else if(curr==")" curr=="$" ) f g else f error("Unexpected :"+curr); g The basic idea is to Translate each occurrence of a non terminal symbol into a test that this symbol has been read and a call of next(). Translate each nonterminal symbol into a call of the method with the same name. If you have to decide between dierent productions use the lookahead sets to determine which one to use. If you
nd that there is no way to continue call error(). We initiate the parsing process by calling next() to read the
rst symbol and then call parseE(). If after processing parseE() we are at the end marker, then the parsing has been successful. next(); parseE(); if(curr=="$") f System.out.println("OK "); g else f error("End expected"); g The complete parser can be found at http://www.cs.nott.ac.uk/~txa/g51mal/ParseE0.java. Actually, we can be a bit more realistic and turn the parser into a simple evalu ator by 42 Replace a by any integer. I.e. we use Integer.valueOf(curr).intValue(); to translate the current token into a number. JAVA will raise an exception if this fails. Calculate the value of the expression read. I.e. we have to change the method interfaces: static int parseE() static int parseE1(int x) static int parseT() static int parseT1(int x) static int parseF() The idea behind parseE1 and parseT1 is to pass the result calculated so far and leave it to the method to incorporate the missing part of the expression. I.e. in the case of parseE1 static int parseE1(int x) f if (curr=="+") f next(); int y = parseT(); return parseE1(x+y); g else if(curr==")" curr=="$" ) f return x; g else f error("Unexpected :"+curr); return x; g g Here is the complete program with evaluation http://www.cs.nott.ac.uk/~txa/g51mal/ParseE.java. We can run the program and observe that it handles precedence of operators and brackets properly: [txa@jacob misc]$ java ParseE 3 + 4 * 5 OK 23 [txa@jacob misc]$ java ParseE ( 3 + 4 ) * 5 OK 35 43 7.5 Beyond LL(1) - use LR(1) generators The restriction to LL(1) has a number of disadvantages: In many case a natural (and unambiguous) grammar like G has to be changed. There are some cases where this is actually impossible, i.e. although the language is deterministic there is no LL(1) grammar for this. Luckily, there is a more powerful approach, called LR(1). LL(1) proceeds from top to bottom, when we are looking at the parse tree, hence this is called topdo
wn parsing. In contrast LR(1) proceeds from bottom to top, i.e. it tries to construct the parse tree from the bottom upwards. The disadvantage with LR(1) and the related approach LALR(1) (which is slightly less powerful but much more ecient) is that it is very hard to construct LR-parsers from hand. Hence there are automated tools which get the grammar as an input and which produce a parser as the output. One of the
rst of those parser generators was YACC for C. Nowadays one can
nd parser generators for many languages such as JAVA CUP for Java [Hud99] and Happy for Haskell [Mar01]. 44