
Complexity and unambiguity of context-free grammars and languages

1971, Information and Control

Four of the criteria of complexity of the description of context-free languages by context-free grammars are considered. The unsolvability of the basic problems is proved for each of these criteria. For instance, it is unsolvable to determine the complexity of the language generated by a given grammar, or to find out the simplest grammar, or to decide whether a given grammar is the simplest one and so on. Next, it is shown that in some cases one can obtain unambiguity only by increasing complexity. Namely, for each of the four criteria, in any complexity class there are unambiguous languages, all simplest grammars of which are ambiguous. As one would expect, it is unsolvable whether for an arbitrary grammar G there are unambiguous grammars within the simplest grammars for the language generated by G.

INFORMATION AND CONTROL 18, 502-519 (1971)

J. GRUSKA†

Department of Computer, Information, and Control Sciences, University of Minnesota, Minneapolis, Minnesota 55455

1. INTRODUCTION AND SUMMARY

If the number of states is taken as a criterion of complexity of finite state acceptors, then effective procedures to construct a minimal finite state acceptor, equivalent to the one given, are well-known. The states of a finite-state acceptor correspond roughly to nonterminal symbols (variables) of a finite state grammar and vice versa. This leads to the idea of considering the number of nonterminal symbols as a criterion of complexity of context-free grammars (CFG's). However, in this case, as it is shown in Section 3, there is no effective procedure to construct the minimal grammar.

In addition to the number of nonterminal symbols, three other criteria of complexity of CFG's are explored in this paper. They are closely related to the concept of grammatical level and express in a way the intrinsic complexity, or loop complexity, of the description of context-free languages (CFL's).
For all of these criteria the unsolvability of the basic problems is proved. For instance, it is unsolvable to determine the complexity of the language generated by a given grammar, or to find the minimal grammar, or to decide whether a given grammar is the minimal one, and so on.

From a practical point of view, it is usually desirable to have, for a given context-free language $L$, a grammar which is unambiguous and as simple as possible. It is proved in Section 4, for the criteria of complexity of CFG's and CFL's defined in Section 2, that these two requirements of simplicity and unambiguity are, in general, in conflict. In other words, it may happen for an unambiguous CFL $L$ that the simplest grammar for $L$, with respect to one of the criteria of Section 2, must be ambiguous. Moreover, it is undecidable for an arbitrary CFG $G$ whether this is true for the language generated by $G$.

† Present address: Mathematical Institute of Slovak Academy of Sciences, Bratislava, Czechoslovakia.

2. PRELIMINARIES

1. In this paper, we shall consider only CFG's $G = \langle V, \Sigma, P, \sigma \rangle$ (see footnote 1) such that for each variable $A \in V - \Sigma$, (a) the set $\{x ;\ A \Rightarrow^* x \in \Sigma^*\}$ is nonempty and (b) there exist words $x$ and $y$ such that $\sigma \Rightarrow^* xAy$ in $G$.

2. If $G = \langle V, \Sigma, P, \sigma \rangle$ is a CFG, then a subset $G_0 \subseteq P$ is said to be a grammatical level of $G$ if $(A \to \alpha) \in G_0$ implies that [$(B \to \beta) \in G_0$ if and only if $A \Rightarrow^* xBy$ and $B \Rightarrow^* x_1 A y_1$ for some $x, x_1, y, y_1$ in $V^*$]. (In other words, a grammatical level $G_0$ of a CFG $G$ is a maximal set of rules of $G$ such that the symbols on the left sides of these rules are mutually dependent.) The number of variables on the left sides of the rules of a grammatical level $G_0$ is said to be the depth of $G_0$, and is denoted by $\mathrm{Depth}(G_0)$. A grammatical level $G_0$ of $G$ is termed nontrivial if $\mathrm{Depth}(G_0) > 1$.

3.
In a recent paper by Gruska (1969), the following criteria of complexity of CFG's were considered:

$\mathrm{Var}(G)$ = the number of variables of $G$.
$\mathrm{Depth}(G)$ = $\max\{\mathrm{Depth}(G_0) ;\ G_0 \text{ is a grammatical level of } G\}$.
$\mathrm{Lev}(G)$ = the number of grammatical levels of $G$.
$\mathrm{Lev}_n(G)$ = the number of nontrivial grammatical levels of $G$.

Footnote 1. A context-free grammar is a quadruple $G = \langle V, \Sigma, P, \sigma \rangle$ where $V$ is a finite set of symbols, $\Sigma \subset V$ with the elements of $\Sigma$ being called terminal symbols and the elements of $V - \Sigma$ being called nonterminals (or variables), $P$ is a finite set of rules of the form $A \to \alpha$ where $A \in V - \Sigma$, $\alpha \in V^*$, and $\sigma \in V - \Sigma$ is called the initial symbol of $G$. If $A \to \alpha$ is in $P$ and $w_1$ and $w_2$ are in $V^*$, we write $w_1 A w_2 \Rightarrow w_1 \alpha w_2$. Then $\Rightarrow^*$ is the transitive and reflexive closure of $\Rightarrow$, and we define $L(G) = \{w ;\ \sigma \Rightarrow^* w \in \Sigma^*\}$. A language $L$ is context-free if $L = L(G)$ for a context-free grammar $G$. The symbol $\varepsilon$ will denote the empty word.

4. If $K$ is one of the above criteria of complexity of CFG's, then $K$ induces a criterion of complexity of CFL's which is also denoted by $K$ and defined by $K(L) = \min\{K(G) ;\ L(G) = L\}$.

5. In the next section we will often use $\mathrm{Rul}(G)$ to denote the maximal length of the right sides of the rules of a CFG $G$.

6. As usual, our proofs of undecidability and unsolvability will be based on the unsolvability of the Post correspondence problem. To simplify the ensuing discussion we now introduce some notation. First let $\mathcal{P}(x, y)$ be a predicate which holds if and only if $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are n-tuples of nonempty words over the alphabet $\{a, b\}$ and, moreover, there exists a sequence of indices $i_1, i_2, \ldots, i_k$ with $1 \le i_j \le n$ and such that $x_{i_1} x_{i_2} \cdots x_{i_k} = y_{i_1} y_{i_2} \cdots y_{i_k}$.

2.1. THEOREM (Post). It is undecidable, for arbitrary n-tuples $x$ and $y$, whether $\mathcal{P}(x, y)$ holds.

7.
For n-tuples $x$ and $y$ of nonempty words over the alphabet $\{a, b\}$ we define the languages $L(x)$, $L(x, y)$ and the language $L_s$ by

$L(x) = \{ba^{i_1} \cdots ba^{i_k} c\, x_{i_k} \cdots x_{i_1} ;\ 1 \le i_j \le n\}$,
$L(x, y) = L(x)\, c\, L^R(y)$, and
$L_s = \{w_1 c w_2 c w_2^R c w_1^R ;\ w_1, w_2 \in \{a, b\}^*\}$,

where, for a word $w$, $w^R$ is the reverse of $w$ and, for a language $L$, $L^R = \{w^R ;\ w \in L\}$.

8. In this paper, we are concerned only with context-free grammars and languages and therefore, unless stated otherwise, by "grammar" we shall mean context-free grammar, and by "language" we shall mean context-free language.

3. UNSOLVABILITY OF BASIC COMPLEXITY PROBLEMS

Let $\mathcal{G}$ be the class of CFG's and $K$ be a mapping $K : \mathcal{G} \to I$, where $I$ is the set of nonnegative integers. Let the domain of $K$ be extended to the class of CFL's by defining, for a CFL $L$, $K(L) = \min\{K(G) ;\ L(G) = L\}$. $K$ may be interpreted as a criterion of complexity for CFG's and CFL's and then the following questions arise in a natural way:

Q1. Is there an algorithm to determine $K(G)$ for an arbitrary CFG $G$?
Q2. Is there an algorithm to determine $K(L(G))$ for an arbitrary CFG $G$?
Q3. Are there integers $n$ such that it is decidable, for an arbitrary CFG $G$, whether or not $K(L(G)) = n$?
Q4. Is it decidable, for an arbitrary CFG $G$, whether or not $K(G) = K(L(G))$? (That is, whether $G$ is the simplest grammar for $L(G)$ with respect to $K$.)
Q5. Is there an algorithm to construct, for an arbitrary CFG $G$, a CFG $G'$ such that $L(G) = L(G')$ and $K(G') = K(L(G))$? (In other words, whether there exists an algorithm to construct the simplest grammar for $L(G)$.)

Remark 3.1. For the criteria $K$ defined in Section 2, the answer to question Q1 is yes, as is easy to see, but, as will be shown in this section, the answers to Q2-Q5 are negative. To begin with, observe that the negative answer to Q3 for a $K$ implies the negative answer to Q2 and, moreover, if the answer to Q1 is positive, also the negative answer to Q5.
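As an aside on question Q1: for the four criteria of Section 2, $K(G)$ can be read off the dependency structure of the variables. The following is a minimal Python sketch, not taken from the paper. It assumes a hypothetical encoding in which a grammar is a dictionary from variables to right-hand sides and every symbol is a single character; the names `reachable`, `levels` and `criteria` are illustrative. It identifies each grammatical level with its class of mutually dependent left-side variables, which relies on every variable having at least one rule. The two example grammars are the ones used for the language of Lemma 4.1.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical encoding: variable -> list of right-hand sides, one character per symbol.
Grammar = Dict[str, List[str]]


def reachable(grammar: Grammar, start: str) -> Set[str]:
    """Variables B with start =>* xBy; the relation is reflexive, so start is included."""
    seen = {start}
    stack = [start]
    while stack:
        variable = stack.pop()
        for rhs in grammar[variable]:
            for symbol in rhs:
                if symbol in grammar and symbol not in seen:
                    seen.add(symbol)
                    stack.append(symbol)
    return seen


def levels(grammar: Grammar) -> Set[FrozenSet[str]]:
    """Left-side variables of each grammatical level (classes of mutual dependence)."""
    reach = {a: reachable(grammar, a) for a in grammar}
    return {frozenset(b for b in reach[a] if a in reach[b]) for a in grammar}


def criteria(grammar: Grammar) -> Dict[str, int]:
    """Var, Depth, Lev and Lev_n of a grammar, following the definitions of Section 2."""
    classes = levels(grammar)
    return {
        "Var": len(grammar),
        "Depth": max(len(c) for c in classes),
        "Lev": len(classes),
        "Lev_n": sum(1 for c in classes if len(c) > 1),
    }


# "S" stands for the initial symbol; "" is the empty word.
KNUTH_THREE_VARIABLES = {"S": ["aAbS", "bBaS", ""], "A": ["aAbA", ""], "B": ["bBaB", ""]}
KNUTH_ONE_VARIABLE = {"S": ["aSbS", "bSaS", ""]}

print(criteria(KNUTH_THREE_VARIABLES))  # {'Var': 3, 'Depth': 1, 'Lev': 3, 'Lev_n': 0}
print(criteria(KNUTH_ONE_VARIABLE))     # {'Var': 1, 'Depth': 1, 'Lev': 1, 'Lev_n': 0}
```

The sketch only evaluates $K(G)$ for a given grammar; it says nothing about $K(L(G))$, which is the subject of questions Q2-Q5.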
Therefore, in order to prove the negative answers to Q 2 - Q 5 for criteria K of the Section 2, it is sufficient to show the negative answers to Q3 and Q4. It will be done in this section in a series of lemmas. Unfortunately, their proofs are quite cumbersome but we were not able to find more elegant ones. LEMMA 3.2. For no integer n is it decidable for an arbitrary C F G G whether or not Var(L(G)) ~ n. Proof. Let x and y be n-tuples of nonempty words in {a, b}*, e be a symbol not in {a, b} and Lx,~ be the language defined by Lx, v = {a, b, c}* -- L(x, y) c~ L s . T h e proofs of Lemmas 4.2.4 and 4.2.6 in Ginsburg's book (1966) give an effective procedure to construct, given x and y, a linear C F G Gx, ~ generating the language L~. v . For this language we will be able to show below that (*) Var(Lx.~) = 1 if and only i f L ( x , y ) n L ~ = O. 506 GRUSKA On the other hand, L ( x , y) c~ L~ = 0, if and only if ~ ( x , y) holds. Thus, from ( , ) and from Post's theorem it immediately follows that if n = 1, then it is undecidable for an arbitrary grammar G whether or not Var(L(G)) = n. T o show it for n > 1 we use the results of Gruska (1967). Consider the languages L~ = L x . u U {d} and Lj = L~.~ w {dd}* w .." w {d#-~} * for j > 2, where d and e are distinct symbols not in {a, b, c}. Clearly, there is an effective procedure to construct a linear grammar for Lj., j ~ 2, whenever x, y and j Using are given. By Gruska (1967), Var({de2}* ~3 "" u {de~-l} *) = j - - 1 . this fact, one can show easily that for j ~ 2, Var(Lj) = j if and only if ~ ( x , y) does not hold. Hence, by Post's theorem, the lemma follows if ( . ) is true. I n order to prove ( . ) we proceed as follows. I f L ( x , y ) c ~ L , = 0, then trivially Var(L~.v) = 1 and, therefore, let us assume that L ( x , y ) n L~ ~ O. Then, there exists a sequence i 1 , i 2 ,..., i k of indices such that, if we denote I = bail ' ' ' baik, X = xi~ ".. x q , J = I R, Y = X R, then I ' ~ c X " * c Y ~ c J ~ ~ L ~ . 
v for no integer m >/ 1. W e want to show that in this case Var(Lx,v) > 1. T h e proof will be by contradiction, and we will make use of the fact that the words of L~.~ have a very regular structure, namely, (.~) for every word u ( v ) there is at most one word v ( u ) such that UCvcvRcu R is not in Lx. u . T o derive a contradiction, let G be a grammar with the only one variable, say a, which generates L , . v and let d > Rul(G) be an integer. If i > d, then clearly the word w = P c X i + l c y i + l c J ~+1 is in Lx, u and therefore there is an a such that ~--~ ~ =~ w and words ~0, ~1, Wo, wl such that a ~ C~oaa1 , w = %WoW 1 , a ~ w o , oh ~ w 1 and aoW1 J= e. Since i ~ d,~o does not contain the symbol c and therefore there must exist words z 0 and z 1 such that ZoZ 1 ~- I and a = P o z o for some i o >~ 0. Obviously, aoZaZoWoW1 ~ L~, v . On the other hand, since %w 1 =/: E and (**) holds, ZlZoW o 6 L , . u . But then, o~ ~ %ach ~ %ZlZoWoWa E L ~ , ~ , a contradiction. This completes the proof of (*) and the lemma. LEMMA 3.3. I t is u n d e c i d a b l e f o r a n a r b i t r a r y C F G G, w h e t h e r or n o t Var(G) = Var(L(G)). COMPLEXITY AND UNAMBIGUITY OF LANGUAGES 507 Proof. Let x a n d y be again n-tuples of n o n e m p t y words in {a, b}* a n d G'x, u be a g r a m m a r with the initial s y m b o l a and with the rules 2 cr --~ a~ra I bcrb l b~pb I a~pa bp~b l a p ( a l b~'p~'a l a~'p~'b ~--+ ~a[ ~ b [ a l b ~'~1~ p --+ ccrc I ccr'c[ d~d, ~' --> xz~'Yi R ] x~ d~ dyi R, l ~i~n. T h e language L'~, v generated b y this g r a m m a r has the form L~, u = {w; w = ulc "" cukdwodvkc "" c731 , h 7~/ 1, U~ a n d vi are in {a, b}*, uj :# %R i f j < k and either uk @ v1;R or uk = x~ '-" x,, = Yi~ "" Y ~ = vk R for some i i , i~ ,..., iz}. Note that the variables ~:' a n d p in the description of G'x, u are superfluous a n d can be easily reduced. 
T h u s , there is an effective procedure, say ~-, to construct, given x and y, a g r a m m a r with three variables for L'~, u . If ~ ( x , y) does not hold, such a g r a m m a r is n o t the simplest one since in this case one can also remove all rules with ~'. O n the other hand, if ~ ( x , y) holds, then, as will be shown below, Var(L'~,u) = 3 and therefore the g r a m m a r obtained b y ~r is the simplest one with respect to Var. Hence, b y Post's theorem, the l e m m a follows. It only remains to show that Var(L;,~) = 3 if ~@(x,y) holds. I n doing so we will often implicitly make use of the fact that each word of L~, u has exactly two d's, that there is no occurrence of c between these two d's and that there is the same n u m b e r of c's to the right a n d to the left of d's. I f ~ ( x , y) holds, t h e n there m u s t exist a sequence i 1 , i 2 .... , i~ of integers such that xilx~2 "" xi~ = y~ly,~ " " Y i ~ . P u t X = x~lxi2 "" xi~ and Y = X R. t Now, let us assume that G is a g r a m m a r for L~, v with a m i n i m a l n u m b e r of Here and in the sequel we will mostly describe the rules of a grammar in an abbreviated form, namely, we will write A ~ cq I c% I "'" ] c~ instead of A --* oq , A -~a2,...,A -~. 508 GRUSKA variables and let m ~ Rul(G). It is easy to see that Vat(G) ~ 3 and that for each variable B of G all terminal words derived from B must have the same n u m b e r of d's. If all terminal words derived from a variable of G had two d's, G would be linear and the only rules with d would be terminal ones and the word ada ~+1 da 2 ~L'~, u could not be derived in G. T h u s G has to have at least one variable generating words with less than two occurrences of d's and we have Var(G) > 1. Now assume that Var(G) ~- 2 and that a and A are two variables of G, a being the initial symbol of G. We start by observing that if B ~ a or B ---- A, then all words derived from B have the same n u m b e r of d's and c's. 
T o derive a contradiction we proceed as follows: T h e words w1 ----adada~+1 and w 2 = aca~nda~da~+lca ~n+l are in L' If a --~ ~ ~ w~, a =# ~i, i = 1, 2, then a t does not contain a. Otherwise ~, would have the form aa~ for some j ~ 0 yielding aJ+ldada J+l ~ L ( G ) , a contradiction. Thus, al contain only symbols from {a, c, d, A} and it implies that the words derived from A have no c and no d. t Consider now the word w ~ - a ~ e X ~ d a d Y ~ c a ~+t in L~, u . Let a ~--w0, Wl ,..., w~ = w be a derivation of w in G. We can assume that this derivation has already the property that if w~+t is obtained from Wl by using a rule a --+ a then w~ contains no A. But it means that an io must exist such that wi0 contains no c and wi0+~ = a~cuovca ~+~. Since w~0+~ ~ w, there must exist u, w, v such that u => ~, a => ~, v ~ ~7 and w = a ~ c ~ c a ~+z with no c in u-~. If g = gR, then ~ = UoC~CUoR for some u 0 ~ {a, b}* but this t cannot happen for a word in L~,~. If g v~ gn, then v7 ----u o d N d v o , u o ~ Vo~ and therefore also uocXdadXRcvo is in L'x,u. But then amcguocXdadX~cvogea m+l = a m c X ~ c X d a d Y c Y ~ c a ~ + l is inL'~.y. Therefore the assumption Var(G) = 2 leads to a contradiction. A detailed study of the proofs of the last two lemmas shows that we have actually proved more. T h e next corollary summarizes whas has been proved. I n doing so, the concepts of semilinear grammars and languages are used. A semilinear grammar--see Gruska (1970) and Ginsburg and Spanier (1968) where semilinear languages are called derivation bounded languages--is a grammar all grammatical levels of which are linear. A language generated by a semilinear grammar is called semilinear. For a semilinear language L, let Vary(L) = min{Var(G); L ( G ) = L, G is a semilinear grammar} and for a linear language L, let Vary(L) = min{Var(G); G is a linear grammar for L}. COROLLARY 3.4. (i) I t is unsolvable to determine Var(L(G)) (Var,(L(G)) f o r an arbitrary linear g r a m m a r G. 
COMPLEXITY AND UNAMBIGUITY OF LANGUAGES 509 (ii) For no integer n is it decidable for an arbitrary linear grammar G whether or not Var(L(G)) = n (Var~(L(G)) -----n). (iii) It is unsolvable to construct for an arbitrary linear grammar G a (linear) grammar G' generating L ( G ) and such that Var(G') ----Var(L(G)) ( V a r ( a ' ) = Var~(L(a)). (The statements (i)--(iii) remain true if the term "linear" is replaced by "semilinear" and Var, by Var~ .) (iv) It is undecidable for an arbitrary semilinear grammar G whether or not Var(G) = Var(L(G))(Var(G) = Var~(L(G)). Let us now proceed to study the criterion Lev. LEMMA 3.5. (a) For no integer n is it decidable for an arbitrary C F G G whether or not Lev(L(G)) = n. (b) It is undecidable for Lev(G) : Lev(L(G)). an arbitrary CFG G whether or not Proof. Let x and y be n-tuples of nonempty words in {a, b}* and G~, u be the grammar with the initial symbol a and with the rules which arise from the following rules by replacing r' by r or • and p by cac or d or c¢'c. a ~ aaa ] bab [ arpa J brpb [ apra ] bprb [ ar'pr'b l b-c'p'r'a, -r --~ za [ rb d a [ b ] eae, a' ~ x j y i a ] x,ca'cy, R [ x, dy~ R, 1 <~ i <~ n. The proof of L e m m a follows easily once we have proved (,) Lev(L(G~.u) ) ~ 1 if and only if ~(x, y) does not hold. Indeed, combining (.) with Post's theorem we get (b) and (a) for n = 1 and n = 2. To show (a) for n > 2, it is sufficient to consider the languages L,~ : L(G~.~,) w L'.-2 , where L t n-2 ~" {fg}* t5 {f2g}* W .., W {f~-~g}* and f, g are symbols not in {a, b, c, d, e). By Gruska (1969), Lev(L~_2) ~ n -- 1 if n > 2. Using this fact one can show easily that Lev(Ln) = n if and only if ~(x, y) holds. Hence, by Post's theorem, (a) follows for n > 2. It remains only to prove (,). If ~(x, y) does not hold, then we can remove from G~. u all rules with a' without affecting the language generated by G~.~. Hence, Lev(L(G~,u) ) = 1. Put Lz,u = L(Gz,v). 
If ~ ( x , y ) holds and Lev(Lx.u) : 1, then there must exist a grammar G forLx, ~ such that Lev(G) = 1. In order to finish the proof of (.) it is sufficient 510 GRUSKA to derive a contradiction from our last assumption. T o do that we proceed as follows. We can assume without loss of generality that G is ~-free and has no rules of the form A - + B with B being a variable. Let e be the initial symbol of G, n o the number of variables, and m > Rul(G). Since ~ ( x , y) holds, there are indices i 1 , i., ,..., i~ such that X = x~xi2 "'" xik = Yi~Yi~ "'" Y~k = yR. Now, consider the word z = a ( c X ~ ) g d(Ymc) N aa, where N > n o + 4 and parantheses are used only to abbreviate the description of z and are not symbols of z. W e shall show that z cannot be derived in G which will give a desired contradiction because z ~ L ~ . y . We start by proving a. (**) If u, v are terminal words, A is a variable, a * u A v *~ z, then either ] u [ ~ l acXmc[ or I v ] ~ [ c y m c a a ] . Since Lev(G) = 1, if (**) were not true, the words ff and g would exist such that acX~cffegcY~caa e L ( G ) which is impossible as one can easily verify from the description of G~,v. Thus, (**) holds. Now, let ¢ be a derivation tree for a derivation of z in G. For any node of ¢, let z e be the subword of z derived from ~: in ¢. Let ~7 be the node in ~b of the maximal order and such that z, - - udv for some u and v. By (**), max(] u 1, [ v 1) >/ [ Y'~c I~T-1. This in turn implies the existence of a node in ~b such that z e is the subword either (i) of (Y~c)n; or (ii) of (cXm) g and ]ze] ) [ y m c [ N - 1 m. Assume that (i) takes place. T h e case (ii) goes through similarly. Denote by ¢¢ the subtree of ¢ induced by ~:. A node ~ of Ce is called external if z, is a subword of cP~caa, otherwise ~ is called internal. Because of (**), in Ce there is only one path ~r which starts with se and contains only internal nodes. 
Since N > n o + 4, ~r contains j >~ n o @ 1 nodes ~:1, ~2 ,..., ~J such that the rules applied in ¢~ at ~, have the form A t - ~ u , c v , B i % with u~v, terminal and B, yielding an internal node. S i n c e j >~ n o + 1, there are 1 ~<j~ < J 2 ~<J such that _//a~ = -//3~ • This implies the existence of terminal words g and ~7in {a, b, c}* such that Aj~ * g a n g and ~ has at least one c. This in turn implies 3 For a word x, ] x [ is the number of symbols in x. COMPLEXITY AND UNAMBIGUITY OF LANGUAGES 511 that in G a word of the form uodv o with no e in u o and v 0 and unequal number of c's can be derived. T h e words of such a form are not in L~,~. Hence, the assumption Lev(G) = 1 leads to a contradiction proving (.) and thereby the Lemma. LEMMA 3.6. Let K be one of the criteria Lev~ and Depth. (a) For no integer n is it decidable for an arbitrary C F G G whether or not K(L(G)) = n. (b) It is undecidable for K ( C ) = K(L(G)). an arbitrary CFG G whether or not Proof. Let x a n d y be n-tuples of nonempty words in {a, b}*. As mentioned in the proof of Lemma 3.2, given x and y, a grammar Gx,v can be effectively constructed such that L(G~.v) = {a, b, c}* - - L ( x , y ) t~L~ and, moreover, Gx,v can be constructed in such a way that Depth(G~,y) ~ 1, Levn(Gx,v) = O. Let a be the initial symbol of G~,v and let a0, A, B, d, e, ~, S be symbols not used in G~.v. Let G~,y be the grammar arising from G~,v by adding the rules % ~ Ad, A -+ e A a S I eBbS[ e~d, B - , eB~2:[ e A a ~ 1 e~d, ~ - ~ ~a l ~b l a[ b, and choosing % to be the initial symbol of G~, v . Clearly, Lev~(G~,v) = 1 and Depth(G~,~) -= 2. By Ginsburg (1963), Lev~(L(G~,v)) = 0 if and only if ~(x, y) does not hold. Moreover, Depth(L(G~,~)) = 1 if and only if Lev~(L(G~,~)) ~ 0. Hence, by Post's theorem, (b) follows and also (a) for K ~ Levn and n ~ 0, 1 and for K ~ Depth and n = 1, 2. To show (a) for other values of n we proceed as follows. 
Given x and y as above and n > 1 we can effectively construct a grammar G~ generating the language L~ -~ L(G~,y) w Ln_l , where L~_ 1 is the language generated by a grammar with the rules cr---~ cyi ~ cri - ~ giaih I gihgn+iS~hg h, Si ~ gn+iSih I hcrg I h2g 2, 643/I8/5-8 l ~i~n--1, 512 ORUSKA and with ~ as the initial symbol. By Gruska (1969), Levn(L~_l) = n -- 1. Using this fact one can show easily that Levn(Ln) ~ n if and only if ~(x, y) does not hold. Thus, by Post's theorem, (a) follows for Lev~. The detailed proof that for n > 2 it is undecidable for an arbitrary grammar G whether or not Depth(G) = n is quite tedious and only the basic idea will be sketched here. Let G'£,u be grammar arising from G',, v by adding the rules A -+ pAaq, A~ --~ piAiq [ qpi+XAi+lqp, 1 <. i < n -- 2, A~_ 2 --+ pn-~An_2q [ qpn-lAqp, where p, q are symbols not in the alphabet of G'~,u . Using the technique of the proof of Lemma 2.1. in Ginsburg (1963) and of Theorem 4.2. in Gruska (1969), one can show that Depth(L(G~,v) ) = n if and only if ~(x, y) does not hold. Now the result follows at once from Post's theorem. Summarising the results of the last four lemmas we have THEOREM 3.7. For criteria Var, Lev, Levn and Depth the answers to questions Q2-Q5 are negative. Remark 3.8. Using the standard technique we can show that the last theorem remains valid also for the case that only grammars and languages with at most two terminal symbols are considered. 4. UNAMBIGUITY AND COMPLEXITY From a practical point of view it is often desirable to find for a given C F L a grammar which is unambiguous and as simple as possible. In the present section we will see that for each of the criteria of complexity K defined in Section 2 there exists an unambiguous language L such that K(G) > K(L) for every unambiguous grammar forL. Therefore, unambiguity and simplicity are in general in conflict. LEMMA 4.1. There exists an unambiguous C F L L such that Vat(G) > Var(L) and for every unambiguous grammar for L. 
Lev(G) > Lev(L)

Proof. Let $L_K = \{x ;\ x \in \{a, b\}^*,\ x$ has the same number of occurrences of a's and b's$\}$ be a (Knuth) language. By Knuth (1965), $L_K$ is an unambiguous language and $L_K$ is generated by the unambiguous grammar

$\sigma \to aAb\sigma \mid bBa\sigma \mid \varepsilon$, $A \to aAbA \mid \varepsilon$, $B \to bBaB \mid \varepsilon$. (1)

On the other hand, $L_K$ is also generated by the grammar $G'$ with the rules $\sigma \to a\sigma b\sigma \mid b\sigma a\sigma \mid \varepsilon$ and therefore $\mathrm{Var}(L_K) = 1 = \mathrm{Lev}(L_K)$. We want to show that $\mathrm{Var}(G) > 1$ and $\mathrm{Lev}(G) > 1$ for every unambiguous grammar for $L_K$.

To this end, let $G$ be any grammar for $L_K$ which has only one variable, say $\sigma$. If $G$ is a linear grammar then all rules of $G$ must be either of the form $\sigma \to u\sigma v$ or of the form $\sigma \to w$ where $u, v, w$ are in $\{a, b\}^*$ and $uv$ has the same number of a's and b's. But it means that if $d > \mathrm{Rul}(G)$, then the word $a^d b^{2d} a^d$ cannot be derived in $G$, a contradiction. Thus, $G$ cannot be linear, and at least one of the rules of $G$ must have the form $\sigma \to u\sigma w\sigma v$ for some $u, w, v$ in $\{a, b, \sigma\}^*$. Let $x, y, z$ be words over $\{a, b\}$ such that $u \Rightarrow^* x$, $w \Rightarrow^* y$, $v \Rightarrow^* z$. Then the word $xabyabzxyabz$ has two essentially different derivations in $G$. (Both derivations start with the rule $\sigma \to u\sigma w\sigma v$ and in the first one (second one) $u \Rightarrow^* x$, $\sigma \Rightarrow^* ab$ ($\sigma \Rightarrow^* abyabzx$), $w \Rightarrow^* y$, $\sigma \Rightarrow^* abzxyab$ ($\sigma \Rightarrow^* ab$), $v \Rightarrow^* z$.) Thus, $G$ must be ambiguous.

Assume now that $G'$ is a grammar for $L_K$ with only one grammatical level. Let $\sigma$ be again the initial symbol of $G'$. If $G'$ is not a linear grammar, then there must again exist words $u, w, v$ such that $\sigma \Rightarrow^* u\sigma w\sigma v$ and, similarly as above, we get that $G'$ must be ambiguous. Suppose therefore that $G'$ is linear. Because of the structure of the words of $L_K$, if $A \to uAv$ is a rule of $G'$, then $uv$ has to have the same number of occurrences of a's and b's.
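As an aside to the first part of the proof, the ambiguity of the one-variable grammar and the unambiguity of grammar (1) can be observed on small words by counting derivation trees. The following is a minimal Python sketch, not part of the paper. It assumes single-character symbols and that every nonempty right side begins with a terminal, which holds for both grammars and keeps the recursion finite; `count_trees` is a hypothetical helper and "S" stands for the initial symbol.

```python
from functools import lru_cache
from typing import Dict, List


def count_trees(grammar: Dict[str, List[str]], start: str, word: str) -> int:
    """Number of derivation trees of `word` from `start`.

    Assumes one-character symbols and that every nonempty right side begins
    with a terminal, so a variable is never re-expanded on the same span.
    """

    @lru_cache(maxsize=None)
    def from_string(symbols: str, i: int, j: int) -> int:
        # Ways in which the symbol string `symbols` derives word[i:j].
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:  # terminal symbol: it must match the next letter
            if i < j and word[i] == head:
                return from_string(rest, i + 1, j)
            return 0
        # Variable: try every split point k of word[i:j].
        return sum(
            from_variable(head, i, k) * from_string(rest, k, j)
            for k in range(i, j + 1)
        )

    @lru_cache(maxsize=None)
    def from_variable(variable: str, i: int, j: int) -> int:
        return sum(from_string(rhs, i, j) for rhs in grammar[variable])

    return from_variable(start, 0, len(word))


GRAMMAR_1 = {"S": ["aAbS", "bBaS", ""], "A": ["aAbA", ""], "B": ["bBaB", ""]}
ONE_VARIABLE = {"S": ["aSbS", "bSaS", ""]}

print(count_trees(ONE_VARIABLE, "S", "abab"))  # 2
print(count_trees(GRAMMAR_1, "S", "abab"))     # 1
```

A count above 1 for some word shows that this particular grammar is ambiguous; a count of 1 on sample words is of course only consistent with unambiguity and proves nothing in general.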
L e t n o be the n u m b e r of variables of G', d > Rul(G), and k be any integer such that k > 3 d n o • W e know that G is ambiguous if n o = 1, and, therefore let n o =# 1. Consider the word z = akb2ka k which is in L K . T h e first 2n o derivation steps of a derivation of z in G' must have the form a , uoA07)o , g o g l A l V l V O , . . . , UoU1 "'" U2no_lA2no_lV2no_l ... VlCA0 , where uou 1 "" u2.o_ 1 and r o y 1 ".. V2.o_ 1 are words over the alphabet {a). Since 2n o - 1 ~ n o , t h e r e m u s t e x i s t 0 < i 1 < i 2 ~ 2 n o - l s u c h t h a t A i l ~ A G . But it means that A h * a ~ o A q a ko for some Jo + ko > 0. This is impossible 514 GRUSKA if G' generates L K . Hence, G' cannot be linear, completing the proof of the lemma. Remark. It is possible to show that Var(G) > 2 for every unambiguous grammar for LK and therefore the grammar (1) is the simplest grammar for Lrc with respect to Var. LEMMA 4.2. There exists an unambiguous (linear) language L such that Depth(G) > Depth(L) and Levn(G) > Levn(L ) f o r every unambiguous grammar for L. Proof. Let L be the language generated by the grammar G o with the rules o" --+ a2ab I aacrb ] cac [ e. L is an unambiguous C F L becauseL can also be generated by an unambiguous C F G G1 with the rules a 1 --+ aa~l b [ cac I e. Clearly, Depth(L) = 1, Lev~(L) = 0. Let us now assume that G is an unambiguous grammar for L with Depth(G) = 1 and therefore L e v n ( G ) = 0. In such a case, by Gruska (1966), there must exist an unambiguous C F G G' for L such that Depth(G') = 1, L e v n ( G ' ) = 0, and for every variable A of G 1' with a possible exception of the initial symbol, there is a rule A -~- u A v in G. We can assume without loss of generality that G has already this property. G must be a linear grammar. Indeed, if G is not, then there are words u, v, w and variables A, B such that a * u A w B v . 
T h e words of L have the form uev, u and v in {a, b, c}* and for every u(v) there are only finitely many v(u) such that uev e L ( G ) . Therefore, if ~ * u A v B v would be possible, then, since the sets of words which can be derived from A and B are infinite, it would be possible in G to derive an infinite number of words of the form uev with the same u (or with the same v). Thus, G must be a linear grammar. From the structure of words of L and from the assumption that G is unambiguous it follows easily that (A) if A --~ u A v is a rule of G and u ~ {a}*, then v ~ (b}*, and either u = a 2[~1 or u ~ aal~l; (B) for each variable A of G there is an integer ia such that whenever A ~ u A v holds m G, and u ~ {a}*, then v E {b}* and u = aial ~1. COMPLEXITY AND UNAMBIGUITY OF LANGUAGES 515 Consider now the word z ~ a2Nca3Nca2N "'" ca~NebNcbN "'" cbN , 2N 2N where N > dn o , n o being the n u m b e r of variables of G, d > Rul(G). Let = z o , z 1 .... , z s = z (1) be a derivation of z in G, and ~4~ --+ u~Ai+lv~, i = 0, 1,..., s - - l (2) be the rules involved in the derivation steps from zi to zi+ 1 . Since N > d, and (B) holds, (2) cannot contain the rules of the form A , --~ uiA,v i with c in uiv~. Therefore each time a new c is introduced in (1) the rule being involved in that derivation step must have the form A -~. uBv, B ~ A , u ~ {a, b}* c{a, b}*. (3) Since Depth(G) = 1, it may happen at most n o times that a rule in (2) has a form (3). But N > no, and we get a contradiction to the fact that (1) is a derivation of z. Summarizing the last two lemmas, we get THEOREM 4.3. Let K be one of the criteria Var, Lev, L e v n , Depth. There exists an unambiguous C F L L such that K ( G ) > K ( L ) for every unambiguous grammar for L. Using the results of Gruska (1969) and the same technique as in the previous section one can show a little more. THEOREM 4.4. Let K be one of the criteria Var, Lev, Lev n , D e p t h and n an integer. 
There exists an unambiguous language $L_n$ such that $K(L_n) = n$ and $K(G) > K(L_n)$ for every unambiguous grammar $G$ for $L_n$.

5. UNDECIDABILITY OF THE WEAK AMBIGUITY PROBLEM

It is well known that the ambiguity problem for CFG's, i.e., the problem whether or not, given an arbitrary CFG $G$, $L(G)$ is an unambiguous CFL, is undecidable. The results of the previous section show that it may happen that for an unambiguous CFL the "simplest" CFG is ambiguous. Naturally, the following question arises: Given one of the criteria $K$ defined in Section 2, is it decidable, for an arbitrary grammar $G$, whether or not the simplest grammar for $L(G)$, with respect to $K$, must be ambiguous? It is proved in this section that the answer is negative.

LEMMA 5.1. It is undecidable for an arbitrary CFG $G$ whether or not there is an unambiguous CFG $G'$ such that $L(G') = L(G)$ and $\mathrm{Var}(G') = \mathrm{Var}(L(G))$.

Proof. Let $x$ and $y$ be n-tuples of nonempty words in $\{a, b\}^*$. Let $G$ be a grammar with the rules

$\sigma \to x_i \sigma a^i b d \mid y_i \sigma d a^i b \mid x_i \sigma a^i b e \mid \sigma e \mid c \mid cd$, $1 \le i \le n$.

By Greibach (1963), $G$ is ambiguous if and only if $\mathcal{P}(x, y)$ holds. Because of the undecidability of the Post correspondence problem, in order to prove the lemma it is sufficient to show that if there is a sequence of indices $i_1, \ldots, i_k$ such that

$x_{i_1} x_{i_2} \cdots x_{i_k} = y_{i_1} y_{i_2} \cdots y_{i_k}$, (1)

then there is no unambiguous grammar for $L(G)$ with only one variable. Assume on the contrary that a sequence of indices satisfying (1) exists and that there also exists an unambiguous grammar $G'$ such that $\mathrm{Var}(G') = 1$ and $L(G') = L(G)$. Since each word of $L(G)$ has exactly one occurrence of $c$, $G'$ must be a linear grammar.
It is also easy to see, since c is in L ( G ) , that all rules of G which have at most one e must have one of the following forms: (r ~ Ph "'" PJ~aYi~ "'" YJl ' (2a) --+ Ph "'" pj~oyj~ "'" yhe, (2b) (2c) (2d) a --+ xjopy 1 "" pj~ayj~ "" y~la~°be, - + Psi "'" pj Try~ "'" Y31 , a --~ pj~ "'" ph~ry~ "'" 7he, (2e) a - + XjoPh "" pj~ry~z "'" yj~a3°be, (2f) where either 7J~ = aJ~bd and p~. ~ x ~ , or ),~ = daJ~b and pj, = y~,, and n is either c or cd. COMPLEXITY AND UNAMBIGUITY OF LANGUAGES 517 Let m > Rul(G). Let X = x i l " " x i ~ , where i l , i 2 ,..., i k satisfy (1) and R = ai*bdai~-ibd " . dailb. Let K = m I X I. Consider the word w o = X~cR(dR)m-~e (where parentheses are used only to simplify the description of w o and are not elements of Wo). Clearly, w o e L ( G ) . Since the rules of G ' have form (2a) to (2f), the first rule used in a derivation of w 0 in G' must have the form c~ ~ x h "'" x~,ya~mbd "'" ahbe, (3) where 11 --" lm is a n o n e m p t y initial part of the sequence i I ,..., ik, il ,..., ik..... is ..... ik of K integers. Let/m+l ,..-, lk be the rest of this sequence. T h e word xl~+l "'" x~Kcda tKb "'" a ~ + I b d is in L ( G ) and, therefore, there exists a derivation of X"~c(dR)~e (4) in G' which starts with the rule (3). Consider now the word w 1 = Xmcd(dR)~ne which is in L ( G ) . I n this case, in view of the structure of the rules of G ' , the last rule used in the derivation of w I in G ' must have the form cr --~ y~ . . . . yZKcdda~Kb ... d d ~ b ; but this means that in G ' we must have a ~ y~ . . . . y ~ - i a d a ~ - l b "" d a h b e = w 2 and the derivation of w2 must start with a rule of the form a ~ y h ... yZ~aa~ b ... daZlbe (5) or with a rule cr --~ ae. (5a) T h e word yl~+t ... y~Kcda~Kb ... daZ~+~b is in L ( G ) . But this means that for the word (4) there are two derivations in G ' ; one which starts with the rule (3) 518 CRUS~.a and the second which starts with a rule (5) or (5a). Thus G' is an ambiguous grammar. LEMMA 5.2. 
It is undecidable, for an arbitrary CFG $G$, whether there is an unambiguous CFG $G'$ such that $L(G') = L(G)$ and $\mathrm{Lev}(G') = \mathrm{Lev}(L(G))$.

Proof. Let $x$, $y$ be $n$-tuples of nonempty words in $\{a, b\}^*$ and $n$ be an integer. Let $G$ be the grammar with the initial symbol $\sigma$ and the rules

$$\begin{aligned}
\sigma &\to A b \tau_1 \mid \tau_2 b Y b B, \\
A &\to a A a \mid b X b, \\
B &\to a B a \mid a b a \mid a e \sigma e a, \\
X &\to b a^i X x_i \mid c \mid e \sigma e, \qquad 1 \le i \le n, \\
Y &\to b a^i Y y_i \mid c \mid e \sigma e, \qquad 1 \le i \le n, \\
\tau_1 &\to a \tau_1 \mid a \mid e \sigma e, \\
\tau_2 &\to \tau_2 a \mid a \mid e \sigma e.
\end{aligned}$$

Clearly,

$$L(G) \cap \{a, b, c\}^* = \bigcup_{i,j \ge 1} a^i b L(x) b a^i b a^j \;\cup\; \bigcup_{i,j \ge 1} a^i b L(y) b a^j b a^j.$$

By Ginsburg (1966), $L(G) \cap \{a, b, c\}^*$ is an ambiguous language if $P(x, y)$ holds and, therefore, $L(G)$ is an ambiguous language if $P(x, y)$ holds. If $P(x, y)$ does not hold, then it is easy to verify that $G$ is an unambiguous grammar. $\mathrm{Lev}(G) = 1$ in both cases and, therefore, an unambiguous grammar $G'$ such that $\mathrm{Lev}(G') = 1$ and $L(G') = L(G)$ exists if and only if $P(x, y)$ does not hold. Hence the lemma.

LEMMA 5.3. Let $K$ be either $\mathrm{Depth}$ or $\mathrm{Lev}_n$. It is undecidable, for an arbitrary grammar $G$, whether there is an unambiguous grammar $G'$ such that $L(G') = L(G)$ and $K(G') = K(L(G))$.

Proof. Let $x$ and $y$ be as in the proof of the previous lemma and let $G_0$ be the grammar obtained from the grammar $G$ of the previous proof by dropping all rules containing the symbol $e$. Clearly, $\mathrm{Depth}(G_0) = 1$, $\mathrm{Lev}_n(G_0) = 0$, and $L(G_0) = L(G) \cap \{a, b, c\}^*$. Now the lemma follows by the same reasoning as in the proof of the foregoing lemma.

Thus we have

THEOREM 5.4. Let $K$ be one of the criteria $\mathrm{Var}$, $\mathrm{Lev}$, $\mathrm{Lev}_n$, $\mathrm{Depth}$. It is undecidable, for an arbitrary grammar $G$, whether there is an unambiguous grammar $G'$ such that $L(G) = L(G')$ and $K(G') = K(L(G))$.

Using the results of Gruska (1969) and the same technique as in Section 2, we can prove

THEOREM 5.5. Let $n$ be an integer and $K$ one of the criteria $\mathrm{Var}$, $\mathrm{Lev}$, $\mathrm{Lev}_n$, $\mathrm{Depth}$.
It is undecidable, for an arbitrary grammar $G$ such that $K(G) = n$, whether or not there is an unambiguous grammar $G'$ such that $L(G) = L(G')$ and $K(G') = K(L(G))$.

RECEIVED: May 15, 1970; REVISED: December 28, 1970

REFERENCES

GINSBURG, S. AND ROSE, G. F. (1963), Some recursively unsolvable problems in ALGOL-like languages, J. Assoc. Comput. Mach. 10, 29-47.

GINSBURG, S. (1966), "The Mathematical Theory of Context-Free Languages," McGraw-Hill, New York.

GINSBURG, S. AND SPANIER, E. H. (1968), "Derivation Bounded Languages," SDC Document TM738/041/00.

GREIBACH, S. A. (1963), Undecidability of the ambiguity problem for minimal linear grammars, Information and Control 6, 119-125.

GRUSKA, J. (1966), Two operations with formal languages and their influence upon structural unambiguity, Mat. Casopis Sloven. Akad. Vied. 16, 58-65.

GRUSKA, J. (1967), On a classification of context-free grammars, Kybernetika 3, 22-29.

GRUSKA, J. (1969), Some classifications of context-free languages, Information and Control 14, 152-173.

KNUTH, D. E. (1965), On the translation of languages from left to right, Information and Control 8, 607-639.
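As an illustration of the word construction used in the ambiguity argument preceding Lemma 5.2, the following Python sketch builds the words $w_0 = X^m c R (dR)^{m-1} e$, $X^m c (dR)^m e$ (the word (4)) and $w_1 = X^m c d (dR)^m e$ from a sample instance. It assumes that condition (1) is the equality $x_{i_1} \cdots x_{i_k} = y_{i_1} \cdots y_{i_k}$; the function names and the sample tuples are hypothetical and do not come from the paper. The second function spells out the two ways of reading the word (4), one through the $x$-words with $\pi = cd$ and one through the $y$-words with $\pi = c$, which is the coincidence the argument appears to exploit.

```python
def concat(words, seq):
    """Concatenate words[i-1] for each 1-based index i in seq."""
    return "".join(words[i - 1] for i in seq)


def proof_words(x, y, indices, m):
    """Build w0, the word (4), and w1 for an index sequence assumed to satisfy (1)."""
    # Assumption: (1) means x_{i1}...x_{ik} == y_{i1}...y_{ik}.
    if concat(x, indices) != concat(y, indices):
        raise ValueError("indices do not satisfy the assumed condition (1)")
    X = concat(x, indices)
    # R = a^{i_k} b d a^{i_{k-1}} b d ... d a^{i_1} b
    R = "d".join("a" * i + "b" for i in reversed(indices))
    w0 = X * m + "c" + R + ("d" + R) * (m - 1) + "e"
    w4 = X * m + "c" + ("d" + R) * m + "e"
    w1 = X * m + "cd" + ("d" + R) * m + "e"
    return w0, w4, w1


def two_readings(x, y, indices, m):
    """Spell the word (4) once with x-words and a^l b d blocks, once with y-words and d a^l b blocks."""
    seq = list(indices) * m  # l_1, ..., l_K: the index sequence repeated m times
    blocks = ["a" * l + "b" for l in reversed(seq)]
    via_x = concat(x, seq) + "cd" + "d".join(blocks) + "e"
    via_y = concat(y, seq) + "c" + "".join("d" + b for b in blocks) + "e"
    return via_x, via_y


# Hypothetical sample instance over {a, b}; the indices 3, 2, 3, 1 give equal concatenations.
x = ("a", "ab", "bba")
y = ("baa", "aa", "bb")
indices = (3, 2, 3, 1)

if __name__ == "__main__":
    w0, w4, w1 = proof_words(x, y, indices, m=2)
    via_x, via_y = two_readings(x, y, indices, m=2)
    print(w0)
    print(w4)
    print(w1)
    print(via_x == w4, via_y == w4)
```

The sketch only manipulates strings; it does not construct the grammar $G'$ or its derivations, so it shows why the same word admits two block decompositions, not the ambiguity proof itself.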