INFORMATION AND CONTROL 18, 502-519 (1971)
Complexity and Unambiguity
of Context-Free Grammars and Languages
J. GRUSKA†
Department of Computer, Information, and Control Sciences
University of Minnesota, Minneapolis, Minnesota 55455
Four of the criteria of complexity of the description of context-free languages
by context-free grammars are considered. The unsolvability of the basic
problems is proved for each of these criteria. For instance, it is unsolvable to
determine the complexity of the language generated by a given grammar, or to
find out the simplest grammar, or to decide whether a given grammar is the
simplest one and so on.
Next, it is shown that in some cases one can obtain unambiguity only by
increasing complexity. Namely, for each of the four criteria, in any complexity
class there are unambiguous languages, all simplest grammars of which are
ambiguous. As one would expect, it is unsolvable whether for an arbitrary
grammar G there are unambiguous grammars within the simplest grammars
for the language generated by G.
1. INTRODUCTION AND SUMMARY
If the number of states is taken as a criterion of complexity of finite state
acceptors, then effective procedures to construct a minimal finite state
acceptor, equivalent to the one given, are well-known. The states of a finite-state acceptor correspond roughly to nonterminal symbols (variables) of a
finite state grammar and vice versa. This leads to the idea of considering
the number of nonterminal symbols as a criterion of complexity of context-free grammars (CFG's). However, in this case, as is shown in Section 3,
there is no effective procedure to construct the minimal grammar. In addition
to the number of nonterminal symbols, three other criteria of complexity
of CFG's are explored in this paper. They are closely related to the concept
of grammatical level and express in a way the intrinsic complexity, or loop
complexity, of the description of context-free languages (CFL's).
For all of these criteria the unsolvability of the basic problems is proved.
For instance, it is unsolvable to determine the complexity of the language
† Present address: Mathematical Institute of the Slovak Academy of Sciences, Bratislava,
Czechoslovakia.
generated by a given grammar, or to find the minimal grammar, or to decide
whether a given grammar is the minimal one, and so on.
From a practical point of view, it is usually desirable to have, for a given
context-free language L, a grammar which is unambiguous and as simple
as possible. It is proved in Section 4, for the criteria of complexity of CFG's
and CFL's defined in Section 2, that these two requirements of simplicity
and unambiguity are, in general, in conflict. In other words, it may happen
for an unambiguous CFL L that the simplest grammar for L, with respect
to one of the criteria of Section 2, must be ambiguous. Moreover, it is
undecidable for an arbitrary CFG G whether this is true for the language
generated by G.
2. PRELIMINARIES
1. In this paper, we shall consider only CFG's $G = \langle V, \Sigma, P, \sigma \rangle$¹
such that for each variable $A \in V - \Sigma$,
(a) the set $\{x;\ A \Rightarrow^* x \in \Sigma^*\}$ is
nonempty and (b) there exist words $x$ and $y$ such that $\sigma \Rightarrow^* xAy$ in $G$.
2. If $G = \langle V, \Sigma, P, \sigma \rangle$ is a CFG, then a subset $G_0 \subset P$ is said to be a
grammatical level of $G$ if $(A \to \alpha) \in G_0$ implies that [$(B \to \beta) \in G_0$ if and
only if $A \Rightarrow^* xBy$ and $B \Rightarrow^* x_1 A y_1$ for some $x, x_1, y, y_1$ in $V^*$]. (In other
words, a grammatical level $G_0$ of a CFG $G$ is a maximal set of rules of $G$ such
that the symbols on the left sides of these rules are mutually dependent.)
The number of variables on the left sides of the rules of a grammatical level
$G_0$ is said to be the depth of $G_0$, and is denoted by $\mathrm{Depth}(G_0)$. A grammatical level $G_0$ of $G$ is termed nontrivial if $\mathrm{Depth}(G_0) > 1$.
3. In a recent paper by Gruska (1969), the following criteria of complexity
of CFG's were considered:
$\mathrm{Var}(G)$ = the number of variables of $G$.
$\mathrm{Depth}(G)$ = $\max\{\mathrm{Depth}(G_0);\ G_0$ is a grammatical level of $G\}$.
$\mathrm{Lev}(G)$ = the number of grammatical levels of $G$.
$\mathrm{Lev}_n(G)$ = the number of nontrivial grammatical levels of $G$.
¹ A context-free grammar is a quadruple $G = \langle V, \Sigma, P, \sigma \rangle$ where $V$ is a finite set
of symbols, $\Sigma \subset V$ with the elements of $\Sigma$ being called terminal symbols and the elements
of $V - \Sigma$ being called nonterminals (or variables), $P$ is a finite set of rules of the form $A \to \alpha$ where $A \in V - \Sigma$,
$\alpha \in V^*$, and $\sigma \in V - \Sigma$ is called the initial symbol of $G$. If $A \to \alpha$ is in $P$ and $w_1$ and $w_2$
are in $V^*$, we write $w_1 A w_2 \Rightarrow w_1 \alpha w_2$. Then $\Rightarrow^*$ is the transitive and reflexive closure
of $\Rightarrow$, and we define $L(G) = \{w;\ \sigma \Rightarrow^* w \in \Sigma^*\}$. A language $L$ is context-free if $L = L(G)$
for a context-free grammar $G$. The symbol $\epsilon$ will denote the empty word.
4. If K is one of the above criteria of complexity of CFG's, then K
induces a criterion of complexity of CFL's which is also denoted by K and
defined by
$K(L) = \min\{K(G);\ L(G) = L\}$.
5. In the next section we will often use Rul(G) to denote the maximal
length of the right sides of the rules of a CFG G.
6. As usual, our proofs of undecidability and unsolvability will be based
on the unsolvability of the Post correspondence problem. To simplify the
ensuing discussion we now introduce some notation.
First let $\mathcal{P}(x, y)$ be a predicate which holds if and only if $x = (x_1, x_2, \ldots, x_n)$
and $y = (y_1, y_2, \ldots, y_n)$ are n-tuples of nonempty words over the alphabet
{a, b} and, moreover, there exists a sequence of indices
$i_1, i_2, \ldots, i_k$
with $1 \le i_j \le n$ and such that
$x_{i_1} x_{i_2} \cdots x_{i_k} = y_{i_1} y_{i_2} \cdots y_{i_k}$.
2.1. THEOREM (Post). It is undecidable, for arbitrary n-tuples x and y,
whether $\mathcal{P}(x, y)$ holds.
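Although the predicate of Theorem 2.1 is undecidable, it is semi-decidable: a solution, if one exists, is found by enumerating index sequences. The following is a minimal sketch of such a bounded search; the function name `pcp_solution` and the cutoff `max_len` are illustrative assumptions and not part of the paper. A result of `None` only means that no solution of length at most `max_len` exists.

```python
from itertools import product

def pcp_solution(x, y, max_len):
    """Search for indices i_1..i_k (1-based, k <= max_len) with
    x[i_1]...x[i_k] == y[i_1]...y[i_k]. Returns the first sequence
    found in length-then-lexicographic order, or None."""
    assert len(x) == len(y), "x and y must be n-tuples of the same size"
    n = len(x)
    for k in range(1, max_len + 1):
        for seq in product(range(1, n + 1), repeat=k):
            left = "".join(x[i - 1] for i in seq)
            right = "".join(y[i - 1] for i in seq)
            if left == right:
                return seq
    return None  # inconclusive: only sequences up to max_len were tried

# Illustrative instance (hypothetical data, not taken from the paper).
x = ("a", "ab", "bba")
y = ("baa", "aa", "bb")
solution = pcp_solution(x, y, 4)
```

No choice of `max_len` turns this search into a decision procedure, which is exactly what Post's theorem rules out.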
7. For n-tuples x and y of nonempty words over the alphabet {a, b} we
define the languages L(x), L(x, y) and the language $L_s$ by
$L(x) = \{b a^{i_1} \cdots b a^{i_k} c\, x_{i_k} \cdots x_{i_1};\ 1 \le i_j \le n\}$,
$L(x, y) = L(x)\, c\, L^R(y)$,
and
$L_s = \{w_1 c w_2 c w_2^R c w_1^R;\ w_1, w_2 \in \{a, b\}^*\}$,
where, for a word w, $w^R$ is the reverse of w and for a language L,
$L^R = \{w^R;\ w \in L\}$.
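The connection between these languages and the Post correspondence problem can be checked on small instances. The sketch below assumes that the words of L(x) have the shape b a^{i_1} ... b a^{i_k} c x_{i_k} ... x_{i_1} (the separator c is a reading of the damaged formula, consistent with the way L(x) is used in the later proofs); the helper names are hypothetical. Because L(x) lists the words x_i in reversed index order, the index sequence passed in is the reverse of a correspondence solution.

```python
def word_of_Lx(x, seq):
    """b a^{i_1} ... b a^{i_k} c x_{i_k} ... x_{i_1} for a 1-based index sequence."""
    index_part = "".join("b" + "a" * i for i in seq)
    word_part = "".join(x[i - 1] for i in reversed(seq))
    return index_part + "c" + word_part

def word_of_Lxy(x, y, seq_x, seq_y):
    """A word of L(x) c L(y)^R built from two index sequences."""
    return word_of_Lx(x, seq_x) + "c" + word_of_Lx(y, seq_y)[::-1]

def in_Ls(w):
    """Membership in L_s = { w1 c w2 c w2^R c w1^R ; w1, w2 in {a,b}* }."""
    parts = w.split("c")
    if len(parts) != 4 or any(ch not in "ab" for p in parts for ch in p):
        return False
    w1, w2, w2_rev, w1_rev = parts
    return w2_rev == w2[::-1] and w1_rev == w1[::-1]

# Illustrative instance (hypothetical data): (3, 2, 3, 1) solves it,
# so the reversed sequence yields a word of L(x, y) that lies in L_s.
x = ("a", "ab", "bba")
y = ("baa", "aa", "bb")
seq = (1, 3, 2, 3)
hit = in_Ls(word_of_Lxy(x, y, seq, seq))
miss = in_Ls(word_of_Lxy(x, y, (1,), (1,)))
```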
8. In this paper, we are concerned only with context-free grammars and
languages and therefore, unless stated otherwise, by "grammar" we shall
mean context-free grammar, and by "language" we shall mean context-free
language.
3. UNSOLVABILITY OF BASIC COMPLEXITY PROBLEMS
Let $\mathcal{G}$ be the class of CFG's and K be a mapping $K : \mathcal{G} \to I$, where I is
the set of nonnegative integers. Let the domain of K be extended to the class
of CFL's by defining, for a CFL L, $K(L) = \min\{K(G);\ L(G) = L\}$. K may
be interpreted as a criterion of complexity for CFG's and CFL's and then
the following questions arise in a natural way:
Q1. Is there an algorithm to determine K(G) for an arbitrary CFG G?
Q2. Is there an algorithm to determine K(L(G)) for an arbitrary
CFG G?
Q3. Are there integers n such that it is decidable, for an arbitrary
CFG G, whether or not K(L(G)) = n?
Q4. Is it decidable, for an arbitrary CFG G, whether or not K(G) = K(L(G))?
(That is, whether G is the simplest grammar for L(G) with respect
to K.)
Q5. Is there an algorithm to construct, for an arbitrary CFG G, a CFG G'
such that L(G) = L(G') and K(G') = K(L(G))? (In other words, whether
there exists an algorithm to construct the simplest grammar for L(G).)
Remark 3.1. For the criteria K defined in Section 2, the answer to
question Q1 is yes, as is easy to see, but, as will be shown in this section,
the answers to Q2-Q5 are negative. To begin with, observe that a negative
answer to Q3 for a criterion K implies a negative answer to Q2 and, moreover,
if the answer to Q1 is positive, also a negative answer to Q5. (Given an algorithm
for Q5, one could determine K(L(G)) by applying the algorithm of Q1 to the constructed grammar G'.) Therefore, in
order to prove the negative answers to Q2-Q5 for the criteria K of Section 2,
it is sufficient to show the negative answers to Q3 and Q4. This will be done in
this section in a series of lemmas. Unfortunately, their proofs are quite
cumbersome, but we were not able to find more elegant ones.
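The positive answer to Q1 can be made concrete. Two variables are mutually dependent exactly when each is reachable from the other in the graph that links the left side of a rule to the variables on its right side, so the grammatical levels correspond to the classes of mutual reachability. The following is a minimal sketch under the assumption that a grammar is given as a dictionary from variables to lists of right sides, each right side a string of one-character symbols; this representation and the function name `criteria` are illustrative choices, not the paper's.

```python
def criteria(rules):
    """Compute Var, Lev, Depth, Lev_n and Rul for a grammar given as
    {variable: [right side, ...]}; symbols that are not keys are terminals."""
    variables = list(rules)
    # successor sets: B is a successor of A if B occurs on a right side of A
    succ = {A: {s for rhs in rules[A] for s in rhs if s in rules} for A in variables}

    # reflexive-transitive reachability from each variable
    reach = {}
    for A in variables:
        seen, stack = {A}, [A]
        while stack:
            for C in succ[stack.pop()]:
                if C not in seen:
                    seen.add(C)
                    stack.append(C)
        reach[A] = seen

    # a grammatical level is determined by a class of mutually dependent variables
    levels = {frozenset(B for B in reach[A] if A in reach[B]) for A in variables}
    depths = [len(level) for level in levels]
    return {
        "Var": len(variables),
        "Lev": len(levels),
        "Depth": max(depths),
        "Lev_n": sum(1 for d in depths if d > 1),
        "Rul": max(len(rhs) for rhss in rules.values() for rhs in rhss),
    }

# Grammar (1) of Lemma 4.1, with S standing for the initial symbol
# and the empty string for the empty word.
knuth = {"S": ["aAbS", "bBaS", ""], "A": ["aAbA", ""], "B": ["bBaB", ""]}
knuth_criteria = criteria(knuth)

# A hypothetical grammar with two mutually dependent variables A and B.
two_loop = {"S": ["Ad"], "A": ["eAa", "eBb", "ed"], "B": ["eBb", "eAa", "ed"]}
two_loop_criteria = criteria(two_loop)
```

Note that a level consisting of a single self-recursive variable has depth 1 and is therefore trivial in the sense of Section 2, which is why the first grammar has no nontrivial levels.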
LEMMA 3.2. For no integer n is it decidable for an arbitrary CFG G
whether or not $\mathrm{Var}(L(G)) = n$.
Proof. Let x and y be n-tuples of nonempty words in {a, b}*, c be a
symbol not in {a, b} and $L_{x,y}$ be the language defined by
$L_{x,y} = \{a, b, c\}^* - L(x, y) \cap L_s$.
The proofs of Lemmas 4.2.4 and 4.2.6 in Ginsburg's book (1966) give an
effective procedure to construct, given x and y, a linear CFG $G_{x,y}$ generating
the language $L_{x,y}$. For this language we will be able to show below that
(*) $\mathrm{Var}(L_{x,y}) = 1$ if and only if $L(x, y) \cap L_s = \emptyset$.
On the other hand, $L(x, y) \cap L_s \neq \emptyset$ if and only if $\mathcal{P}(x, y)$ holds. Thus,
from (*) and from Post's theorem it immediately follows that if n = 1, then
it is undecidable for an arbitrary grammar G whether or not $\mathrm{Var}(L(G)) = n$.
To show it for n > 1 we use the results of Gruska (1967). Consider the
languages
$L_2 = L_{x,y} \cup \{d\}$
and
$L_j = L_{x,y} \cup \{de^2\}^* \cup \cdots \cup \{de^{j-1}\}^*$
for $j > 2$,
where d and e are distinct symbols not in {a, b, c}. Clearly, there is an effective
procedure to construct a linear grammar for $L_j$, $j \ge 2$, whenever x, y and j
are given. By Gruska (1967), $\mathrm{Var}(\{de^2\}^* \cup \cdots \cup \{de^{j-1}\}^*) = j - 1$. Using
this fact, one can show easily that for $j \ge 2$, $\mathrm{Var}(L_j) = j$ if and only if
$\mathcal{P}(x, y)$ does not hold. Hence, by Post's theorem, the lemma follows if (*)
is true.
In order to prove (*) we proceed as follows. If $L(x, y) \cap L_s = \emptyset$, then
trivially $\mathrm{Var}(L_{x,y}) = 1$ and, therefore, let us assume that $L(x, y) \cap L_s \neq \emptyset$.
Then, there exists a sequence $i_1, i_2, \ldots, i_k$ of indices such that, if we denote
$I = b a^{i_1} \cdots b a^{i_k}$, $X = x_{i_k} \cdots x_{i_1}$, $J = I^R$, $Y = X^R$,
then $I^m c X^m c Y^m c J^m \in L_{x,y}$ for no integer $m \ge 1$. We want to show that in
this case $\mathrm{Var}(L_{x,y}) > 1$. The proof will be by contradiction, and we will make
use of the fact that the words of $L_{x,y}$ have a very regular structure, namely,
(**) for every word u (v) there is at most one word v (u) such that
$u c v c v^R c u^R$ is not in $L_{x,y}$.
To derive a contradiction, let G be a grammar with only one variable,
say $\sigma$, which generates $L_{x,y}$ and let d > Rul(G) be an integer. If i > d, then
clearly the word $w = I^i c X^{i+1} c Y^{i+1} c J^{i+1}$ is in $L_{x,y}$ and therefore there is an
$\alpha$ such that $\sigma \to \alpha \Rightarrow^* w$ and words $\alpha_0, \alpha_1, w_0, w_1$ such that $\alpha = \alpha_0 \sigma \alpha_1$,
$w = \alpha_0 w_0 w_1$, $\sigma \Rightarrow^* w_0$, $\alpha_1 \Rightarrow^* w_1$ and $\alpha_0 w_1 \neq \epsilon$. Since i > d, $\alpha_0$ does not
contain the symbol c and therefore there must exist words $z_0$ and $z_1$ such that
$z_0 z_1 = I$ and $\alpha_0 = I^{i_0} z_0$ for some $i_0 \ge 0$. Obviously, $\alpha_0 z_1 z_0 w_0 w_1 \notin L_{x,y}$. On
the other hand, since $\alpha_0 w_1 \neq \epsilon$ and (**) holds, $z_1 z_0 w_0 \in L_{x,y}$. But then,
$\sigma \Rightarrow \alpha_0 \sigma \alpha_1 \Rightarrow^* \alpha_0 z_1 z_0 w_0 w_1 \in L_{x,y}$, a contradiction. This completes the proof
of (*) and the lemma.
LEMMA 3.3. It is undecidable for an arbitrary CFG G whether or not
$\mathrm{Var}(G) = \mathrm{Var}(L(G))$.
Proof. Let x and y be again n-tuples of nonempty words in {a, b}* and
$G'_{x,y}$ be a grammar with the initial symbol $\sigma$ and with the rules²
$\sigma \to a\sigma a \mid b\sigma b \mid b\xi\rho b \mid a\xi\rho a \mid b\rho\xi b \mid a\rho\xi a \mid b\xi'\rho\xi' a \mid a\xi'\rho\xi' b$,
$\xi \to \xi a \mid \xi b \mid a \mid b$,
$\xi' \to \xi \mid \epsilon$,
$\rho \to c\sigma c \mid c\sigma' c \mid d\xi d$,
$\sigma' \to x_i \sigma' y_i^R \mid x_i\, d\xi d\, y_i^R$, $1 \le i \le n$.
The language $L'_{x,y}$ generated by this grammar has the form
$L'_{x,y} = \{w;\ w = u_1 c \cdots c u_k\, d w_0 d\, v_k c \cdots c v_1,\ k \ge 1$,
$u_i$ and $v_i$ are in $\{a, b\}^*$,
$u_j \neq v_j^R$ if $j < k$ and either $u_k \neq v_k^R$ or $u_k = x_{i_1} \cdots x_{i_l} = y_{i_1} \cdots y_{i_l} = v_k^R$ for some $i_1, i_2, \ldots, i_l\}$.
Note that the variables $\xi'$ and $\rho$ in the description of $G'_{x,y}$ are superfluous
and can be easily reduced. Thus, there is an effective procedure, say $\pi$, to
construct, given x and y, a grammar with three variables for $L'_{x,y}$. If $\mathcal{P}(x, y)$
does not hold, such a grammar is not the simplest one since in this case one
can also remove all rules with $\sigma'$. On the other hand, if $\mathcal{P}(x, y)$ holds, then,
as will be shown below, $\mathrm{Var}(L'_{x,y}) = 3$ and therefore the grammar obtained
by $\pi$ is the simplest one with respect to Var. Hence, by Post's theorem, the
lemma follows. It only remains to show that
$\mathrm{Var}(L'_{x,y}) = 3$ if $\mathcal{P}(x, y)$ holds.
In doing so we will often implicitly make use of the fact that each word of
$L'_{x,y}$ has exactly two d's, that there is no occurrence of c between these two
d's and that there is the same number of c's to the right and to the left of the d's.
If $\mathcal{P}(x, y)$ holds, then there must exist a sequence $i_1, i_2, \ldots, i_k$ of integers
such that $x_{i_1} x_{i_2} \cdots x_{i_k} = y_{i_1} y_{i_2} \cdots y_{i_k}$. Put $X = x_{i_1} x_{i_2} \cdots x_{i_k}$ and $Y = X^R$.
Now, let us assume that G is a grammar for $L'_{x,y}$ with a minimal number of
² Here and in the sequel we will mostly describe the rules of a grammar in an
abbreviated form, namely, we will write $A \to \alpha_1 \mid \alpha_2 \mid \cdots \mid \alpha_k$ instead of $A \to \alpha_1$,
$A \to \alpha_2, \ldots, A \to \alpha_k$.
variables and let m > Rul(G). It is easy to see that $\mathrm{Var}(G) \le 3$ and that for
each variable B of G all terminal words derived from B must have the same
number of d's. If all terminal words derived from a variable of G had two
d's, G would be linear and the only rules with d would be terminal ones
and the word $a d a^{m+1} d a^2 \in L'_{x,y}$ could not be derived in G. Thus G has to
have at least one variable generating words with less than two occurrences
of d's and we have Var(G) > 1.
Now assume that Var(G) = 2 and that $\sigma$ and A are the two variables of G,
$\sigma$ being the initial symbol of G. We start by observing that if $B = \sigma$ or
$B = A$, then all words derived from B have the same number of d's and c's.
To derive a contradiction we proceed as follows:
The words $w_1 = a d a d a^{m+1}$ and $w_2 = a c a^m d a^m d a^{m+1} c a^{m+1}$ are in $L'_{x,y}$.
If $\sigma \to \alpha_i \Rightarrow^* w_i$, $\sigma \neq \alpha_i$, i = 1, 2, then $\alpha_i$ does not contain $\sigma$. Otherwise $\alpha_i$
would have the form $a\sigma a^j$ for some $j \ge 0$ yielding $a^{j+1} d a d a^{j+1} \in L(G)$, a
contradiction. Thus, the $\alpha_i$ contain only symbols from {a, c, d, A} and it implies
that the words derived from A have no c and no d.
t
Consider now the word w ~ - a ~ e X ~ d a d Y ~ c a ~+t in L~, u . Let a ~--w0,
Wl ,..., w~ = w be a derivation of w in G. We can assume that this derivation
has already the property that if w~+t is obtained from Wl by using a rule
a --+ a then w~ contains no A. But it means that an io must exist such that
wi0 contains no c and wi0+~ = a~cuovca ~+~. Since w~0+~ ~ w, there must
exist u, w, v such that u => ~, a => ~, v ~ ~7 and w = a ~ c ~ c a ~+z with
no c in u-~. If g = gR, then ~ = UoC~CUoR for some u 0 ~ {a, b}* but this
t
cannot happen for a word in L~,~. If g v~ gn, then v7 ----u o d N d v o , u o ~ Vo~
and therefore also uocXdadXRcvo is in L'x,u. But then
amcguocXdadX~cvogea m+l = a m c X ~ c X d a d Y c Y ~ c a ~ + l
is inL'~.y.
Therefore the assumption Var(G) = 2 leads to a contradiction.
A detailed study of the proofs of the last two lemmas shows that we have
actually proved more. The next corollary summarizes what has been proved.
In doing so, the concepts of semilinear grammars and languages are used.
A semilinear grammar (see Gruska (1970) and Ginsburg and Spanier (1968),
where semilinear languages are called derivation bounded languages) is a
grammar all grammatical levels of which are linear. A language generated
by a semilinear grammar is called semilinear. For a semilinear language L,
let $\mathrm{Var}_s(L) = \min\{\mathrm{Var}(G);\ L(G) = L,\ G$ is a semilinear grammar$\}$ and for
a linear language L, let $\mathrm{Var}_l(L) = \min\{\mathrm{Var}(G);\ G$ is a linear grammar for $L\}$.
COROLLARY 3.4. (i) It is unsolvable to determine $\mathrm{Var}(L(G))$ ($\mathrm{Var}_l(L(G))$)
for an arbitrary linear grammar G.
(ii) For no integer n is it decidable for an arbitrary linear grammar G
whether or not $\mathrm{Var}(L(G)) = n$ ($\mathrm{Var}_l(L(G)) = n$).
(iii) It is unsolvable to construct for an arbitrary linear grammar G a
(linear) grammar G' generating L(G) and such that $\mathrm{Var}(G') = \mathrm{Var}(L(G))$
($\mathrm{Var}(G') = \mathrm{Var}_l(L(G))$).
(The statements (i)-(iii) remain true if the term "linear" is replaced by
"semilinear" and $\mathrm{Var}_l$ by $\mathrm{Var}_s$.)
(iv) It is undecidable for an arbitrary semilinear grammar G whether
or not $\mathrm{Var}(G) = \mathrm{Var}(L(G))$ ($\mathrm{Var}(G) = \mathrm{Var}_s(L(G))$).
Let us now proceed to study the criterion Lev.
LEMMA 3.5. (a) For no integer n is it decidable for an arbitrary CFG G
whether or not $\mathrm{Lev}(L(G)) = n$.
(b) It is undecidable for an arbitrary CFG G whether or not
$\mathrm{Lev}(G) = \mathrm{Lev}(L(G))$.
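Proof. Let x and y be n-tuples of nonempty words in {a, b}* and $G_{x,y}$
be the grammar with the initial symbol $\sigma$ and with the rules which arise from
the following rules by replacing $\tau'$ by $\tau$ or $\epsilon$ and $\rho$ by $c\sigma c$ or $d$ or $c\sigma' c$:
$\sigma \to a\sigma a \mid b\sigma b \mid a\tau\rho a \mid b\tau\rho b \mid a\rho\tau a \mid b\rho\tau b \mid a\tau'\rho\tau' b \mid b\tau'\rho\tau' a$,
$\tau \to \tau a \mid \tau b \mid a \mid b \mid e\sigma e$,
$\sigma' \to x_i \sigma' y_i^R \mid x_i c \sigma' c y_i^R \mid x_i d y_i^R$,
$1 \le i \le n$.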
The proof of the lemma follows easily once we have proved
(*) $\mathrm{Lev}(L(G_{x,y})) = 1$ if and only if $\mathcal{P}(x, y)$ does not hold.
Indeed, combining (*) with Post's theorem we get (b) and (a) for n = 1
and n = 2. To show (a) for n > 2, it is sufficient to consider the languages
$L_n = L(G_{x,y}) \cup L'_{n-2}$,
where
$L'_{n-2} = \{fg\}^* \cup \{f^2 g\}^* \cup \cdots \cup \{f^{n-2} g\}^*$
and f, g are symbols not in {a, b, c, d, e}. By Gruska (1969), $\mathrm{Lev}(L'_{n-2}) = n - 1$
if n > 2. Using this fact one can show easily that $\mathrm{Lev}(L_n) = n$ if and only if
$\mathcal{P}(x, y)$ holds. Hence, by Post's theorem, (a) follows for n > 2.
It remains only to prove (*). If $\mathcal{P}(x, y)$ does not hold, then we can remove
from $G_{x,y}$ all rules with $\sigma'$ without affecting the language generated by $G_{x,y}$.
Hence, $\mathrm{Lev}(L(G_{x,y})) = 1$. Put $L_{x,y} = L(G_{x,y})$.
If $\mathcal{P}(x, y)$ holds and $\mathrm{Lev}(L_{x,y}) = 1$, then there must exist a grammar G
for $L_{x,y}$ such that Lev(G) = 1. In order to finish the proof of (*) it is sufficient
to derive a contradiction from our last assumption. To do that we proceed
as follows.
We can assume without loss of generality that G is $\epsilon$-free and has no rules
of the form $A \to B$ with B being a variable. Let $\sigma$ be the initial symbol of G,
$n_0$ the number of variables, and m > Rul(G).
Since $\mathcal{P}(x, y)$ holds, there are indices $i_1, i_2, \ldots, i_k$ such that
$X = x_{i_1} x_{i_2} \cdots x_{i_k} = y_{i_1} y_{i_2} \cdots y_{i_k} = Y^R$.
Now, consider the word
$z = a (c X^m)^N d (Y^m c)^N a a$,
where $N > n_0 + 4$ and parentheses are used only to abbreviate the
description of z and are not symbols of z. We shall show that z cannot be
derived in G, which will give the desired contradiction because $z \in L_{x,y}$.
We start by proving
(**) If u, v are terminal words, A is a variable, and $\sigma \Rightarrow^* uAv \Rightarrow^* z$,
then either $|u| \le |a c X^m c|$ or $|v| \le |c Y^m c a a|$.³
Since Lev(G) = 1, if (**) were not true, the words ff and g would exist such
that acX~cffegcY~caa e L ( G ) which is impossible as one can easily verify
from the description of G~,v. Thus, (**) holds.
Now, let ¢ be a derivation tree for a derivation of z in G. For any node
of ¢, let z e be the subword of z derived from ~: in ¢. Let ~7 be the node in ~b
of the maximal order and such that z, - - udv for some u and v. By (**),
max(] u 1, [ v 1) >/ [ Y'~c I~T-1. This in turn implies the existence of a node
in ~b such that z e is the subword either (i) of (Y~c)n; or (ii) of (cXm) g and
]ze] ) [ y m c [ N - 1 m. Assume that (i) takes place. T h e case (ii) goes
through similarly.
Denote by ¢¢ the subtree of ¢ induced by ~:. A node ~ of Ce is called
external if z, is a subword of cP~caa, otherwise ~ is called internal. Because
of (**), in Ce there is only one path ~r which starts with se and contains only
internal nodes.
Since N > n o + 4, ~r contains j >~ n o @ 1 nodes ~:1, ~2 ,..., ~J such that
the rules applied in ¢~ at ~, have the form A t - ~ u , c v , B i % with u~v, terminal
and B, yielding an internal node. S i n c e j >~ n o + 1, there are 1 ~<j~ < J 2 ~<J
such that _//a~ = -//3~ • This implies the existence of terminal words g and ~7in
{a, b, c}* such that Aj~ * g a n g and ~ has at least one c. This in turn implies
³ For a word x, $|x|$ is the number of symbols in x.
that in G a word of the form $u_0 d v_0$ with no e in $u_0$ and $v_0$ and an unequal number
of c's can be derived. Words of such a form are not in $L_{x,y}$. Hence, the
assumption Lev(G) = 1 leads to a contradiction, proving (*) and thereby
the lemma.
LEMMA 3.6. Let K be one of the criteria $\mathrm{Lev}_n$ and Depth.
(a) For no integer n is it decidable for an arbitrary CFG G whether or not
K(L(G)) = n.
(b) It is undecidable for an arbitrary CFG G whether or not
K(G) = K(L(G)).
Proof. Let x and y be n-tuples of nonempty words in {a, b}*. As mentioned
in the proof of Lemma 3.2, given x and y, a grammar $G_{x,y}$ can be effectively
constructed such that $L(G_{x,y}) = \{a, b, c\}^* - L(x, y) \cap L_s$ and, moreover,
$G_{x,y}$ can be constructed in such a way that $\mathrm{Depth}(G_{x,y}) = 1$, $\mathrm{Lev}_n(G_{x,y}) = 0$.
Let a be the initial symbol of G~,v and let a0, A, B, d, e, ~, S be symbols not
used in G~.v. Let G~,y be the grammar arising from G~,v by adding the rules
% ~ Ad,
A -+ e A a S I eBbS[ e~d,
B - , eB~2:[ e A a ~ 1 e~d,
~ - ~ ~a l ~b l a[ b,
and choosing $\sigma_0$ to be the initial symbol of $G'_{x,y}$. Clearly, $\mathrm{Lev}_n(G'_{x,y}) = 1$
and $\mathrm{Depth}(G'_{x,y}) = 2$. By Ginsburg and Rose (1963), $\mathrm{Lev}_n(L(G'_{x,y})) = 0$ if and only
if $\mathcal{P}(x, y)$ does not hold. Moreover, $\mathrm{Depth}(L(G'_{x,y})) = 1$ if and only if
$\mathrm{Lev}_n(L(G'_{x,y})) = 0$. Hence, by Post's theorem, (b) follows and also (a) for
$K = \mathrm{Lev}_n$ and n = 0, 1 and for K = Depth and n = 1, 2. To show (a)
for other values of n we proceed as follows.
Given x and y as above and n > 1 we can effectively construct a grammar
$G_n$ generating the language
$L_n = L(G'_{x,y}) \cup L'_{n-1}$,
where $L'_{n-1}$ is the language generated by a grammar with the rules
cr---~ cyi ~
cri - ~ giaih I gihgn+iS~hg h,
Si ~ gn+iSih I hcrg I h2g 2,
l ~i~n--1,
and with $\sigma$ as the initial symbol. By Gruska (1969), $\mathrm{Lev}_n(L'_{n-1}) = n - 1$.
Using this fact one can show easily that $\mathrm{Lev}_n(L_n) = n$ if and only if $\mathcal{P}(x, y)$
does not hold. Thus, by Post's theorem, (a) follows for $\mathrm{Lev}_n$.
The detailed proof that for n > 2 it is undecidable for an arbitrary
grammar G whether or not Depth(L(G)) = n is quite tedious and only the basic
idea will be sketched here.
Let $G''_{x,y}$ be the grammar arising from $G'_{x,y}$ by adding the rules
$A \to p A_1 q$,
$A_i \to p^i A_i q \mid q p^{i+1} A_{i+1} q p$,
$1 \le i < n - 2$,
$A_{n-2} \to p^{n-2} A_{n-2} q \mid q p^{n-1} A q p$,
where p, q are symbols not in the alphabet of $G'_{x,y}$.
Using the technique of the proof of Lemma 2.1 in Ginsburg and Rose (1963) and
of Theorem 4.2 in Gruska (1969), one can show that $\mathrm{Depth}(L(G''_{x,y})) = n$
if and only if $\mathcal{P}(x, y)$ does not hold. Now the result follows at once from
Post's theorem.
Summarizing the results of the last four lemmas we have
THEOREM 3.7. For the criteria Var, Lev, $\mathrm{Lev}_n$ and Depth the answers to
questions Q2-Q5 are negative.
Remark 3.8. Using the standard technique we can show that the last
theorem remains valid also in the case that only grammars and languages
with at most two terminal symbols are considered.
4. UNAMBIGUITY AND COMPLEXITY
From a practical point of view it is often desirable to find for a given CFL
a grammar which is unambiguous and as simple as possible. In the present
section we will see that for each of the criteria of complexity K defined in
Section 2 there exists an unambiguous language L such that K(G) > K(L)
for every unambiguous grammar G for L. Therefore, unambiguity and simplicity
are in general in conflict.
LEMMA 4.1. There exists an unambiguous CFL L such that
Var(G) > Var(L) and Lev(G) > Lev(L)
for every unambiguous grammar G for L.
Proof. Let $L_K = \{x;\ x \in \{a, b\}^*$, x has the same number of occurrences of
a's and b's$\}$ be a (Knuth) language. By Knuth (1965), $L_K$ is an unambiguous
language and $L_K$ is generated by the unambiguous grammar
$\sigma \to aAb\sigma \mid bBa\sigma \mid \epsilon$,
$A \to aAbA \mid \epsilon$,
$B \to bBaB \mid \epsilon$. (1)
On the other hand, $L_K$ is also generated by the grammar G' with the rules
$\sigma \to a\sigma b\sigma \mid b\sigma a\sigma \mid \epsilon$
and therefore $\mathrm{Var}(L_K) = 1 = \mathrm{Lev}(L_K)$.
We want to show that Var(G) > 1 and Lev(G) > 1 for every unambiguous
grammar for $L_K$. To this end, let G be any grammar for $L_K$ which has only one
variable, say $\sigma$. If G is a linear grammar then all rules of G must be either
of the form $\sigma \to u\sigma v$ or of the form $\sigma \to w$ where u, v, w are in {a, b}* and uv
has the same number of a's and b's. But it means that if d > Rul(G), then
the word $a^d b^{2d} a^d$ cannot be derived in G, a contradiction. Thus, G cannot be
linear, and at least one of the rules of G must have the form $\sigma \to u\sigma w\sigma v$ for
some u, w, v in $\{a, b, \sigma\}^*$. Let x, y, z be words over {a, b} such that $u \Rightarrow^* x$,
$w \Rightarrow^* y$, $v \Rightarrow^* z$. Then the word $x\,ab\,y\,ab\,z\,x\,y\,ab\,z$ has two essentially different
derivations in G. (Both derivations start with the rule $\sigma \to u\sigma w\sigma v$
and in the first one (second one) $u \Rightarrow^* x$, $\sigma \Rightarrow^* ab$ ($\sigma \Rightarrow^* ab\,y\,ab\,z\,x$), $w \Rightarrow^* y$,
$\sigma \Rightarrow^* ab\,z\,x\,y\,ab$ ($\sigma \Rightarrow^* ab$), $v \Rightarrow^* z$.) Thus, G must be ambiguous.
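The ambiguity of the one-variable grammar for the language of balanced words can be observed directly by counting derivation trees. The sketch below assumes the one-variable grammar has the rules σ → aσbσ | bσaσ | ε (a reading of the damaged rule line above); the function name `parse_trees` is an illustrative choice. A word is in the language exactly when the count is positive, and the grammar is ambiguous as soon as some word has a count greater than one.

```python
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def parse_trees(w):
    """Number of derivation trees of w (a word over {a, b}) in the grammar
    sigma -> a sigma b sigma | b sigma a sigma | epsilon."""
    if w == "":
        return 1  # sigma -> epsilon
    first = w[0]
    partner = "b" if first == "a" else "a"
    total = 0
    # w = first + u + partner + v, with u and v both derived from sigma
    for j in range(1, len(w)):
        if w[j] == partner:
            total += parse_trees(w[1:j]) * parse_trees(w[j + 1:])
    return total

# Check, for all short words, that derivability coincides with having
# equally many a's and b's.
agrees = all(
    (parse_trees("".join(t)) > 0) == (t.count("a") == t.count("b"))
    for n in range(0, 7)
    for t in product("ab", repeat=n)
)
trees_abab = parse_trees("abab")
```

The word abab is the instance x = y = z = ε of the word used in the proof: it can be split as a·ε·b·(ab) or as a·(ba)·b·ε.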
Assume now that G' is a grammar for $L_K$ with only one grammatical
level. Let $\sigma$ be again the initial symbol of G'. If G' is not a linear grammar,
then there must again exist words u, w, v such that $\sigma \Rightarrow^* u\sigma w\sigma v$ and, similarly
as above, we get that G' must be ambiguous. Suppose therefore that G' is
linear. Because of the structure of the words of $L_K$, if $A \to uAv$ is a rule of
G', then uv has to have the same number of occurrences of a's and b's. Let $n_0$
be the number of variables of G', d > Rul(G'), and k be any integer such that
$k > 3dn_0$. We know that G' is ambiguous if $n_0 = 1$, and, therefore let
$n_0 \neq 1$. Consider the word $z = a^k b^{2k} a^k$ which is in $L_K$. The first $2n_0$ derivation
steps of a derivation of z in G' must have the form
$\sigma,\ u_0 A_0 v_0,\ u_0 u_1 A_1 v_1 v_0,\ \ldots,\ u_0 u_1 \cdots u_{2n_0-1} A_{2n_0-1} v_{2n_0-1} \cdots v_1 v_0$,
where $u_0 u_1 \cdots u_{2n_0-1}$ and $v_0 v_1 \cdots v_{2n_0-1}$ are words over the alphabet {a}. Since
$2n_0 - 1 \ge n_0$, there must exist $0 < i_1 < i_2 \le 2n_0 - 1$ such that $A_{i_1} = A_{i_2}$.
But it means that $A_{i_1} \Rightarrow^* a^{j_0} A_{i_1} a^{k_0}$ for some $j_0 + k_0 > 0$. This is impossible
if G' generates $L_K$. Hence, G' cannot be linear, completing the proof of the
lemma.
Remark. It is possible to show that Var(G) > 2 for every unambiguous
grammar G for $L_K$ and therefore the grammar (1) is the simplest unambiguous grammar for
$L_K$ with respect to Var.
LEMMA 4.2. There exists an unambiguous (linear) language L such that
$\mathrm{Depth}(G) > \mathrm{Depth}(L)$ and $\mathrm{Lev}_n(G) > \mathrm{Lev}_n(L)$ for every unambiguous
grammar G for L.
Proof. Let L be the language generated by the grammar $G_0$ with the rules
$\sigma \to a^2 \sigma b \mid a^3 \sigma b \mid c\sigma c \mid e$.
L is an unambiguous CFL because L can also be generated by an unambiguous
CFG $G_1$ with the rules
a 1 --+ aa~l b [ cac I e.
Clearly, $\mathrm{Depth}(L) = 1$, $\mathrm{Lev}_n(L) = 0$.
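That the one-variable grammar $G_0$ is itself ambiguous can be seen by counting derivation trees. The sketch below assumes that the rules of $G_0$ read σ → a²σb | a³σb | cσc | e, with e a terminal symbol (this reading of the damaged rule line is an assumption, chosen to agree with conditions (A) and (B) used later in the proof); the function name `trees_G0` is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def trees_G0(w):
    """Number of derivation trees of w in the assumed grammar
    sigma -> aa sigma b | aaa sigma b | c sigma c | e."""
    if w == "e":
        return 1
    total = 0
    if len(w) >= 4 and w.startswith("aa") and w.endswith("b"):
        total += trees_G0(w[2:-1])   # sigma -> aa sigma b
    if len(w) >= 5 and w.startswith("aaa") and w.endswith("b"):
        total += trees_G0(w[3:-1])   # sigma -> aaa sigma b
    if len(w) >= 3 and w[0] == "c" and w[-1] == "c":
        total += trees_G0(w[1:-1])   # sigma -> c sigma c
    return total

# a^5 e b^2 arises from the a^2-rule followed by the a^3-rule, or vice versa.
count = trees_G0("aaaaaebb")
```

The two trees differ only in the order in which the two a-rules are applied, which is the kind of freedom that conditions (A) and (B) forbid in an unambiguous grammar.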
Let us now assume that G is an unambiguous grammar for L with
Depth(G) = 1 and therefore $\mathrm{Lev}_n(G) = 0$. In such a case, by Gruska
(1966), there must exist an unambiguous CFG G' for L such that
Depth(G') = 1, $\mathrm{Lev}_n(G') = 0$, and for every variable A of G', with a
possible exception of the initial symbol, there is a rule $A \to uAv$ in G'. We
can assume without loss of generality that G has already this property.
G must be a linear grammar. Indeed, if G is not, then there are words u, v, w
and variables A, B such that $\sigma \Rightarrow^* uAwBv$. The words of L have the form
$u e v$, u and v in {a, b, c}*, and for every u (v) there are only finitely many v (u)
such that $u e v \in L(G)$. Therefore, if $\sigma \Rightarrow^* uAwBv$ would be possible, then,
since the sets of words which can be derived from A and B are infinite, it
would be possible in G to derive an infinite number of words of the form $u e v$
with the same u (or with the same v). Thus, G must be a linear grammar.
From the structure of words of L and from the assumption that G is
unambiguous it follows easily that
(A) if $A \to uAv$ is a rule of G and $u \in \{a\}^*$, then $v \in \{b\}^*$, and either
$|u| = 2|v|$ or $|u| = 3|v|$;
(B) for each variable A of G there is an integer $i_A$ such that whenever
$A \Rightarrow^* uAv$ holds in G, and $u \in \{a\}^*$, then $v \in \{b\}^*$ and $|u| = i_A |v|$.
Consider now the word
$z = \underbrace{a^{2N} c a^{3N} c a^{2N} \cdots c a^{3N}}_{2N}\, e\, \underbrace{b^N c b^N \cdots c b^N}_{2N}$,
where $N > dn_0$, $n_0$ being the number of variables of G, d > Rul(G).
Let
$\sigma = z_0, z_1, \ldots, z_s = z$ (1)
be a derivation of z in G, and
$A_i \to u_i A_{i+1} v_i$, $i = 0, 1, \ldots, s - 1$ (2)
be the rules involved in the derivation steps from $z_i$ to $z_{i+1}$. Since N > d,
and (B) holds, (2) cannot contain rules of the form $A_i \to u_i A_i v_i$ with c
in $u_i v_i$.
Therefore each time a new c is introduced in (1) the rule involved
in that derivation step must have the form
$A \to uBv$, $B \neq A$, $u \in \{a, b\}^* c \{a, b\}^*$. (3)
Since Depth(G) = 1, it may happen at most $n_0$ times that a rule in (2) has
the form (3). But $N > n_0$, and we get a contradiction to the fact that (1) is a
derivation of z.
Summarizing the last two lemmas, we get
THEOREM 4.3. Let K be one of the criteria Var, Lev, $\mathrm{Lev}_n$, Depth.
There exists an unambiguous CFL L such that K(G) > K(L) for every
unambiguous grammar G for L.
Using the results of Gruska (1969) and the same technique as in the
previous section one can show a little more.
THEOREM 4.4. Let K be one of the criteria Var, Lev, $\mathrm{Lev}_n$, Depth and n
an integer. There exists an unambiguous language $L_n$ such that $K(L_n) = n$
and $K(G) > K(L_n)$ for every unambiguous grammar G for $L_n$.
5. UNDECIDABILITY OF THE WEAK AMBIGUITY PROBLEM
It is well known that the ambiguity problem for CFG's, i.e., the problem
whether or not, given an arbitrary CFG G, L(G) is an unambiguous CFL, is
undecidable. The results of the previous section show that it may happen
that for an unambiguous CFL the "simplest" CFG is ambiguous. Naturally,
the following question arises: Given one of the criteria K defined in Section 2,
is it decidable, for an arbitrary grammar G, whether or not the simplest
grammar for L(G), with respect to K, must be ambiguous? It is proved in
this section that the answer is negative.
LEMMA 5.1. It is undecidable for an arbitrary CFG G whether or not
there is an unambiguous CFG G' such that L(G') = L(G) and Var(G') =
Var(L(G)).
Proof. Let x and y be n-tuples of nonempty words in {a, b}*. Let G be a
grammar with the rules
$\sigma \to x_i \sigma a^i b d \mid y_i \sigma d a^i b \mid x_i \sigma a^i b e \mid \sigma e \mid c \mid cd$,
$1 \le i \le n$.
By Greibach (1963), G is ambiguous if and only if $\mathcal{P}(x, y)$ holds. Because of
the undecidability of the Post correspondence problem, in order to prove the lemma
it is sufficient to show that if there is a sequence of indices $i_1, \ldots, i_k$ such that
$x_{i_1} x_{i_2} \cdots x_{i_k} = y_{i_1} y_{i_2} \cdots y_{i_k}$, (1)
then there is no unambiguous grammar for L(G) with only one variable.
Assume on the contrary that a sequence of indices satisfying (1) exists
and that also there exists an unambiguous grammar G' such that Var(G') = 1
and L(G') = L(G).
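Since each word of L(G) has exactly one occurrence of c, G' must be a linear
grammar. It is also easy to see, since c is in L(G), that all rules of G' which
have at most one e must have one of the following forms:
$\sigma \to \rho_{j_1} \cdots \rho_{j_l}\, \sigma\, \gamma_{j_l} \cdots \gamma_{j_1}$, (2a)
$\sigma \to \rho_{j_1} \cdots \rho_{j_l}\, \sigma\, \gamma_{j_l} \cdots \gamma_{j_1} e$, (2b)
$\sigma \to x_{j_0} \rho_{j_1} \cdots \rho_{j_l}\, \sigma\, \gamma_{j_l} \cdots \gamma_{j_1} a^{j_0} b e$, (2c)
$\sigma \to \rho_{j_1} \cdots \rho_{j_l}\, \eta\, \gamma_{j_l} \cdots \gamma_{j_1}$, (2d)
$\sigma \to \rho_{j_1} \cdots \rho_{j_l}\, \eta\, \gamma_{j_l} \cdots \gamma_{j_1} e$, (2e)
$\sigma \to x_{j_0} \rho_{j_1} \cdots \rho_{j_l}\, \eta\, \gamma_{j_l} \cdots \gamma_{j_1} a^{j_0} b e$, (2f)
where either $\gamma_{j_i} = a^{j_i} b d$ and $\rho_{j_i} = x_{j_i}$, or $\gamma_{j_i} = d a^{j_i} b$ and $\rho_{j_i} = y_{j_i}$, and
$\eta$ is either c or cd.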
Let m > Rul(G). Let $X = x_{i_1} \cdots x_{i_k}$, where $i_1, i_2, \ldots, i_k$ satisfy (1) and
$R = a^{i_k} b d a^{i_{k-1}} b d \cdots d a^{i_1} b$. Let $K = m|X|$. Consider the word
$w_0 = X^m c R (dR)^{m-1} e$
(where parentheses are used only to simplify the description of $w_0$ and are
not elements of $w_0$). Clearly, $w_0 \in L(G)$. Since the rules of G' have the forms (2a)
to (2f), the first rule used in a derivation of $w_0$ in G' must have the form
c~ ~
x h "'" x~,ya~mbd "'" ahbe,
(3)
where 11 --" lm is a n o n e m p t y initial part of the sequence i I ,..., ik, il ,..., ik.....
is ..... ik of K integers. Let/m+l ,..-, lk be the rest of this sequence. T h e word
xl~+l "'" x~Kcda tKb "'" a ~ + I b d
is in L ( G ) and, therefore, there exists a derivation of
X"~c(dR)~e
(4)
in G' which starts with the rule (3).
Consider now the word
w 1 = Xmcd(dR)~ne
which is in L ( G ) . I n this case, in view of the structure of the rules of G ' , the
last rule used in the derivation of w I in G ' must have the form
cr --~ y~ . . . . yZKcdda~Kb ... d d ~ b ;
but this means that in G ' we must have
a ~ y~ . . . . y ~ - i a d a ~ - l b
"" d a h b e = w 2
and the derivation of w2 must start with a rule of the form
a ~
y h ... yZ~aa~ b ... daZlbe
(5)
or with a rule
cr --~ ae.
(5a)
T h e word yl~+t ... y~Kcda~Kb ... daZ~+~b is in L ( G ) . But this means that for the
word (4) there are two derivations in G ' ; one which starts with the rule (3)
and the second which starts with a rule (5) or (5a). Thus G' is an ambiguous
grammar.
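The link between ambiguity of the grammar G of this proof and the Post correspondence problem can be tried out on small instances. The sketch below assumes that the rules of G read σ → x_i σ a^i b d | y_i σ d a^i b | x_i σ a^i b e | σe | c | cd (a reading of the damaged rule line, chosen to agree with the rule forms (2a) to (2f)); the function name `count_derivations` is hypothetical. Since G is linear, counting derivations amounts to peeling a matching prefix and suffix off the word.

```python
from functools import lru_cache

def count_derivations(x, y, w):
    """Number of derivations of w in the assumed grammar
    sigma -> x_i sigma a^i b d | y_i sigma d a^i b | x_i sigma a^i b e
             | sigma e | c | cd,   1 <= i <= n."""
    n = len(x)

    @lru_cache(maxsize=None)
    def cnt(u):
        total = int(u == "c") + int(u == "cd")      # terminal rules
        if u.endswith("e"):
            total += cnt(u[:-1])                     # sigma -> sigma e
        for i in range(1, n + 1):
            frames = (
                (x[i - 1], "a" * i + "bd"),          # x_i sigma a^i b d
                (y[i - 1], "d" + "a" * i + "b"),     # y_i sigma d a^i b
                (x[i - 1], "a" * i + "be"),          # x_i sigma a^i b e
            )
            for pre, suf in frames:
                if (len(u) > len(pre) + len(suf)
                        and u.startswith(pre) and u.endswith(suf)):
                    total += cnt(u[len(pre):len(u) - len(suf)])
        return total

    return cnt(w)

# Hypothetical instances: x_1 = y_1 has a solution, x_1 != y_1 has none.
with_solution = count_derivations(("a",), ("a",), "acdabe")
without_solution = count_derivations(("a",), ("b",), "acdabe")
```

In the first instance the word acdabe is derived once through the rule with e at the right end and the terminal rule cd, and once through σ → σe, the y-rule and the terminal rule c.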
LEMMA 5.2. It is undecidable, for an arbitrary CFG G, whether there is an
unambiguous CFG G' such that L(G') = L(G) and Lev(G') = Lev(L(G)).
Proof. Let x, y be n-tuples of nonempty words in {a, b}* and n be an
integer. Let G be the grammar with the initial symbol $\sigma$ and the rules
$\sigma \to A b \tau_1 \mid \tau_2 b Y b B$,
$A \to aAa \mid bXb$,
$B \to aBa \mid aba \mid ae\sigma ea$,
$X \to b a^i X x_i \mid c \mid e\sigma e$, $1 \le i \le n$,
$Y \to b a^i Y y_i \mid c \mid e\sigma e$, $1 \le i \le n$,
$\tau_1 \to a\tau_1 \mid a \mid e\sigma e$,
$\tau_2 \to \tau_2 a \mid a \mid e\sigma e$.
Clearly,
$L(G) \cap \{a, b, c\}^* = \bigcup_{i,j \ge 1} a^i b L(x) b a^i b a^j \ \cup\ \bigcup_{i,j \ge 1} a^i b L(y) b a^j b a^j$.
By Ginsburg (1966), $L(G) \cap \{a, b, c\}^*$ is an ambiguous language if $\mathcal{P}(x, y)$
holds and, therefore, L(G) is an ambiguous language if $\mathcal{P}(x, y)$ holds. If
$\mathcal{P}(x, y)$ does not hold then it is easy to verify that G is an unambiguous
grammar. Lev(G) = 1 in both cases and, therefore, an unambiguous
grammar G' such that Lev(G') = 1 and L(G') = L(G) exists if and only if
$\mathcal{P}(x, y)$ does not hold. Hence the lemma.
LEMMA 5.3. Let K be either Depth or $\mathrm{Lev}_n$. It is undecidable for an
arbitrary grammar G whether there is an unambiguous grammar G' such that
L(G') = L(G) and K(G') = K(L(G)).
Proof. Let x and y be as in the proof of the previous lemma and $G_0$ be
the grammar obtained from the grammar G of the previous proof by dropping
all rules with the symbol e. Clearly, $\mathrm{Depth}(G_0) = 1$, $\mathrm{Lev}_n(G_0) = 0$, and
$L(G_0) = L(G) \cap \{a, b, c\}^*$. Now the lemma follows by the same reasoning
as in the proof of the foregoing lemma.
Thus we have
THEOREM 5.4. Let K be one of the criteria Var, Lev, $\mathrm{Lev}_n$, Depth.
It is undecidable, for an arbitrary grammar G, whether there is an unambiguous
grammar G' such that L(G) = L(G'), K(G') = K(L(G)).
Using the results of Gruska (1969) and the same technique as in Section 2
we can prove
THEOREM 5.5. Let n be an integer and K one of the criteria Var, Lev, $\mathrm{Lev}_n$,
Depth. It is undecidable, for an arbitrary grammar G such that K(G) = n,
whether or not there is an unambiguous grammar G' such that L(G) = L(G')
and K(G') = K(L(G)).
RECEIVED: May 15, 1970; REVISED: December 28, 1970
REFERENCES
GINSBURG, S. AND ROSE, G. F. (1963), Some recursively unsolvable problems in
ALGOL-like languages, J. Assoc. Comput. Mach. 10, 29-47.
GINSBURG, S. (1966), "The Mathematical Theory of Context-Free Languages,"
McGraw-Hill, New York.
GINSBURG, S. AND SPANIER, E. H. (1968), "Derivation Bounded Languages," SDC
Document TM738/041/00.
GREIBACH, S. A. (1963), Undecidability of the ambiguity problem for minimal linear
grammars, Information and Control 6, 119-125.
GRUSKA, J. (1966), Two operations with formal languages and their influence upon
structural unambiguity, Mat. Casopis Sloven. Akad. Vied. 16, 58-65.
GRUSKA, J. (1967), On a classification of context-free grammars, Kybernetika 3, 22-29.
GRUSKA, J. (1969), Some classifications of context-free languages, Information and
Control 14, 152-173.
KNUTH, D. E. (1965), On the translation of languages from left to right, Information
and Control 8, 607-639.