
On the size of context-free grammars

1972, Kybernetika

In papers [2] and [3] four criteria of complexity of context-free grammars (CFG's), denoted by Var, Lev, $\mathrm{Lev}_n$, and Depth, have been studied. These criteria reflect the intrinsic complexity of CFG's and they induce criteria of complexity of context-free languages (CFL's) which reflect the intrinsic complexity of the description of CFL's by CFG's. The criterion $\mathrm{Prod}(G)$ = the number of rules of a CFG $G$, studied in [3], represents the size of CFG's. In the present paper one more criterion of complexity of CFG's, namely $\mathrm{Symb}(G)$ = the number of all occurrences of all symbols in the rules of $G$, is defined and some results concerning the criteria Prod and Symb are derived.

KYBERNETIKA, VOLUME 8 (1972), NUMBER 3

On the Size of Context-free Grammars

JOZEF GRUSKA

Two criteria of complexity of context-free grammars and languages are considered, the number of rules and the number of symbols, and a hierarchy of complexity classes, the undecidability of basic complexity problems, and a relation between complexity and unambiguity are established.

1. INTRODUCTION

In papers [2] and [3] four criteria of complexity of context-free grammars (CFG's), denoted by Var, Lev, $\mathrm{Lev}_n$, and Depth, have been studied. These criteria reflect the intrinsic complexity of CFG's and they induce criteria of complexity of context-free languages (CFL's) which reflect the intrinsic complexity of the description of CFL's by CFG's. The criterion $\mathrm{Prod}(G)$ = the number of rules of a CFG $G$, studied in [3], represents the size of CFG's. In the present paper one more criterion of complexity of CFG's, namely $\mathrm{Symb}(G)$ = the number of all occurrences of all symbols in the rules of $G$, is defined and some results concerning the criteria Prod and Symb are derived.

2. PRELIMINARIES

A CFG $G$ is a quadruple $G = \langle V, \Sigma, P, \sigma \rangle$ where $V$ is a finite set of symbols, $\Sigma \subset V$, and the elements of $\Sigma$ (of $V - \Sigma$) are called terminal symbols or terminals (nonterminal symbols or nonterminals); $P$ is a finite set of rules of the form $A \to \alpha$ where $A \in V - \Sigma$, $\alpha \in V^*$; $\sigma \in V - \Sigma$ is called the initial symbol of $G$. If $A \to \alpha$ is in $P$ and $\omega_1, \omega_2$ are in $V^*$, then we write $\omega_1 A \omega_2 \Rightarrow \omega_1 \alpha \omega_2$. Let $\Rightarrow^*$ be the transitive and reflexive closure of $\Rightarrow$ and let $L(G) = \{w;\ \sigma \Rightarrow^* w \in \Sigma^*\}$. A language $L$ is said to be context-free if $L = L(G)$ for a CFG $G$. The symbol $\varepsilon$ will denote the empty word.

For a CFG $G = \langle V, \Sigma, P, \sigma \rangle$ let $\mathrm{Prod}(G)$ be the number of rules of $G$ and $\mathrm{Symb}(G) = \sum_{p \in P} \mathrm{Symb}(p)$, where $\mathrm{Symb}(p)$ is the length of the right side of $p$ increased by 2. For a CFL $L$ and $K$ = Symb or Prod let

$K(L) = \min \{K(G);\ L(G) = L\}$

be the complexity of $L$ with respect to $K$.

3. HIERARCHY OF COMPLEXITY CLASSES

The criteria Prod and Symb induce infinite hierarchies of CFL's and, as the following theorem shows, there are no gaps in these hierarchies.

Theorem 1. For any integer $n$ ($n \geq 2$) there is a CFL $L_n \subset \{a\}^*$ ($L'_n \subset \{a\}^*$) such that $\mathrm{Prod}(L_n) = n$ ($\mathrm{Symb}(L'_n) = n$).

Proof. The existence of a language $L_n \subseteq \{a\}^*$ with $\mathrm{Prod}(L_n) = n$ was shown in [3] for any integer $n$, and the existence of $L'_n \subset \{a\}^*$, $n \geq 2$, follows immediately from (i) to (iii):

(i) $\mathrm{Symb}(\{\varepsilon\}) = 2$.
(ii) $\mathrm{Symb}(\{a^{j+1}\}) \leq \mathrm{Symb}(\{a^j\}) + 1$ for any $j \geq 0$.
(iii) For any $k$ there are only a finite number of $j$'s such that $\mathrm{Symb}(\{a^j\}) \leq k$.

Remark. It can be shown that $\mathrm{Symb}(\{a^{2^i}\}) = 3i$ for $i$ even and $3i + 1$ for $i$ odd.

4. UNDECIDABILITY OF SOME COMPLEXITY PROBLEMS

One can effectively determine $\mathrm{Prod}(G)$ and $\mathrm{Symb}(G)$, given an arbitrary CFG $G$. However, can one effectively determine $\mathrm{Prod}(L(G))$ and $\mathrm{Symb}(L(G))$? The negative answer to this question and the undecidability of some other complexity problems concerning the criteria Prod and Symb are shown in this section.

Theorem 2. If $n \geq 2$, then it is undecidable for an arbitrary CFG $G$ whether or not $\mathrm{Prod}(L(G)) = n$.

Proof. Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be $n$-tuples of nonempty words over $\{a, b\}$ and let $L(x)$, $L(y)$, $L(x, y)$, $L_s$ and $L_{a,b}$ be the languages defined by*

$L(x) = \{b a^{i_k} \cdots b a^{i_1} c\, x_{i_1} \cdots x_{i_k};\ 1 \leq i_j \leq n,\ 1 \leq j \leq k\}$,
$L(y) = \{b a^{i_k} \cdots b a^{i_1} c\, y_{i_1} \cdots y_{i_k};\ 1 \leq i_j \leq n,\ 1 \leq j \leq k\}$,
$L(x, y) = L(x)\, c\, L^R(y)$,
$L_s = \{w_1 c w_2 c w_2^R c w_1^R;\ w_1, w_2 \in \{a, b\}^*\}$,

and let $L_{a,b}$ be the language generated by the grammar with the two rules $\sigma \to \sigma a \sigma b$, $\sigma \to \varepsilon$.

* If $w$ is a word, then $w^R$ is the reverse of $w$, and for a language $L$, $L^R = \{w^R;\ w \in L\}$.

Let $\varphi$ be the homomorphism on $\{a, b, c\}^*$ defined by $\varphi(a) = ab$, $\varphi(b) = aabb$ and $\varphi(c) = aaabbb$. By [1], given $x$ and $y$, a CFG $G_{xy}$ generating the language $L_{xy} = \{a, b, c\}^* - L(x, y) \cap L_s$ can be effectively constructed.
From that it follows that, for given $x$ and $y$, CFG's $G'_{xy}$ and $G''_{xy}$ such that

$L(G'_{xy}) = L_{a,b} - \varphi(L(x, y) \cap L_s) = (L_{a,b} - \{ab, aabb, aaabbb\}^*) \cup \varphi(L_{xy})$,
$L(G''_{xy}) = \{a, b\}^* - \varphi(L(x, y) \cap L_s)$

can also be effectively constructed. It is easy to see that $\mathrm{Prod}(L(G'_{xy})) = 2$ ($\mathrm{Prod}(L(G''_{xy})) = 3$) if and only if $L(x, y) \cap L_s = \emptyset$. On the other hand, $L(x, y) \cap L_s = \emptyset$ if and only if the Post correspondence problem for $x$ and $y$ has no solution, and therefore the undecidability of the Post correspondence problem implies the Theorem for $n = 2$ and $n = 3$.

For $n > 3$ we proceed as follows. By [3], for $m = n - 3$ a CFG $G_m$ can be effectively constructed such that $L(G_m)$ is a finite subset of $\{d\}^*$ and $\mathrm{Prod}(L(G_m)) = m$. Combining $G_m$ and $G'_{xy}$ we get a grammar for $L(G_m) \cup L(G'_{xy})$. Clearly, $\mathrm{Prod}(L(G_m) \cup L(G'_{xy})) = n$ if and only if the Post correspondence problem for $x$ and $y$ has no solution, and therefore also for $n > 3$ the Theorem follows from the undecidability of the Post correspondence problem.

Corollary 3. There is no effective method to determine $\mathrm{Prod}(L(G))$ for an arbitrary CFG $G$.

Theorem 4. If $n \leq 7$ ($n \geq 8$), then it is decidable (undecidable) for an arbitrary CFG $G$ whether or not $\mathrm{Symb}(L(G)) = n$.

Proof. $\mathrm{Symb}(L(G)) \leq 7$ if and only if the language $L(G)$ is empty or has one of the following forms, where $a$ and $b$ are symbols:

$\{x\}$, $|x| \leq 5$; $\{x_1, x_2\}$, $|x_1| + |x_2| \leq 3$; $\{a\}^*$; $\{a\}^* b$; $b\{a\}^*$; $\{a^i b^i;\ i \geq 0\}$; $\{ab\}^*$.

Since each of these languages is bounded and, moreover (see [1]), given any bounded language $L_0$ it is decidable for an arbitrary CFG $G$ whether or not $L(G) = L_0$, the theorem holds for $n \leq 7$.

We will use the notation of the proof of Theorem 2 in order to prove the Theorem for $n \geq 8$. The proof is again based on the undecidability of the Post correspondence problem, from which it follows that it is undecidable for arbitrary $x$ and $y$ whether or not $L(x, y) \cap L_s = \emptyset$.
Now the proof can be reduced to determining $\mathrm{Symb}(L)$ for several simple languages, and in all cases this can be done very easily. First, we can see that $\mathrm{Symb}(L(G'_{xy})) = 8$ if and only if $L(x, y) \cap L_s = \emptyset$, and therefore the Theorem holds for $n = 8$. If now $\varphi_1$ and $\varphi_2$ are the homomorphisms on $\{a, b\}^*$ defined by $\varphi_1(a) = \varphi_2(a) = a$, $\varphi_1(b) = b^2$, $\varphi_2(b) = b^3$, then for $i = 1, 2$, $\mathrm{Symb}(\varphi_i(L_{a,b} - \varphi(L(x, y) \cap L_s))) = 8 + i$ if and only if $L(x, y) \cap L_s = \emptyset$, and the Theorem follows for $n = 9$ and $n = 10$. Moreover, $\mathrm{Symb}(\{a, b, c\}^* - L(x, y) \cap L_s) = 11$ if and only if $L(x, y) \cap L_s = \emptyset$, and we have the Theorem for $n = 11$.

In order to prove the Theorem for $n > 11$ we proceed as follows. By Theorem 1, there is a language $L_{n-9} = \{d^{i_n}\}$, where $i_n$ is an integer, such that $\mathrm{Symb}(L_{n-9}) = n - 9$. Now it is easy to show that $\mathrm{Symb}(L_{n-9} \cdot L(G'_{xy})) = n$ if and only if $L(x, y) \cap L_s = \emptyset$, and this completes the proof of the Theorem for $n \geq 8$.

Corollary 5. There is no algorithm to determine $\mathrm{Symb}(L(G))$ for an arbitrary CFG $G$.

Another question which it is natural to ask is whether or not one can effectively determine the simplest grammar for the language generated by a CFG. The answer follows immediately from Corollaries 3 and 5. (See also [5] for the first part of the corollary.)

Corollary 6. There is no effective method to construct, for an arbitrary CFG $G$, a new CFG $G'$ such that $L(G) = L(G')$ and $\mathrm{Prod}(G') = \mathrm{Prod}(L(G))$ ($\mathrm{Symb}(G') = \mathrm{Symb}(L(G))$).

We now know that there is no effective way to find the simplest grammar, but can we at least decide whether a given CFG is the simplest one?

Theorem 7. It is undecidable for an arbitrary CFG $G$ whether or not $\mathrm{Symb}(G) = \mathrm{Symb}(L(G))$.

Proof. If it were decidable, the following procedure would determine $\mathrm{Symb}(L(G))$ for an arbitrary CFG $G$.

(i) Decide if $G$ is the simplest grammar. If yes, $\mathrm{Symb}(L(G)) = \mathrm{Symb}(G)$. If not, go to step (ii).
(ii) Construct all CFG's which are simpler than $G$ with respect to the criterion Symb.
(There is only a finite number of such grammars (*) $G_1, G_2, \ldots, G_k$ if we do not distinguish grammars which differ only in the names of nonterminals.)
(iii) Remove from (*) all CFG's which are not the simplest CFG's with respect to Symb. Let (**) $G'_1, \ldots, G'_l$ be the resulting sequence of CFG's.
(iv) Starting with (**), do for $n = 1, 2, \ldots$ step (n), by which the sequence (**) is successively reduced, until $\mathrm{Symb}(G_1) = \mathrm{Symb}(G_2)$ for any two remaining CFG's $G_1$ and $G_2$. Then $\mathrm{Symb}(L(G)) = \mathrm{Symb}(G_1)$.
(n) For each grammar, say $G_0$, currently in (**) compare $\{x;\ x \in L(G_0),\ |x| \leq n\}$ and $\{x;\ x \in L(G),\ |x| \leq n\}$. If these two sets differ, remove $G_0$ from (**); otherwise leave $G_0$ in (**).

Now the Theorem follows from Corollary 6.

In the preceding Theorem only the criterion Symb is considered. We are convinced that the same is true for the criterion Prod but have no proof.

Open problem 1. Is it decidable for an arbitrary CFG $G$ whether or not $\mathrm{Prod}(G) = \mathrm{Prod}(L(G))$?

Open problem 2. Are the undecidability results of this section valid if only bounded CFG's and CFL's are considered?

5. COMPLEXITY AND UNAMBIGUITY

It was shown in [4], for the criteria Var, Lev, $\mathrm{Lev}_n$ and Depth, that complexity and unambiguity are, in general, in conflict. The same is true for the criteria Prod and Symb. Indeed, let $L_k$ be the language generated by the grammar

$\sigma \to a \sigma b \sigma$, $\sigma \to b \sigma a \sigma$, $\sigma \to \varepsilon$.

By [4], any unambiguous CFG for $L_k$ has at least two nonterminals, and from that it follows easily that $\mathrm{Prod}(G) > 3$ and $\mathrm{Symb}(G) > 14$ for any unambiguous grammar $G$ for $L_k$. By using the technique of the proofs of the foregoing section we can show even more.

Theorem 8. For any $n > 3$ ($n > 14$), there is an unambiguous CFL $L_n$ ($L'_n$) such that $\mathrm{Prod}(L_n) = n$ ($\mathrm{Symb}(L'_n) = n$) and $\mathrm{Prod}(G) > n$ ($\mathrm{Symb}(G) > n$) for any unambiguous CFG $G$ for $L_n$ (for $L'_n$).

Remark. The only case which causes a little trouble is the case $n = 15$ for the criterion Symb. In this case the language generated by the grammar $\sigma \to a^2 \sigma b$, $\sigma \to a^3 \sigma b$, $\sigma \to \varepsilon$ should be considered.
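As an illustration of the two size measures from the Preliminaries, the following minimal Python sketch computes Prod and Symb for two small grammars of this section. The representation of a rule as a pair (left side, list of right-side symbols), the function names `prod` and `symb`, and the use of "S" for the initial symbol are assumptions made for the example and are not part of the paper; the rule lists are one reading of the grammars given in this section.

```python
from typing import List, Tuple

# Hypothetical representation: a rule is (left side, right side as a list of symbols).
Rule = Tuple[str, List[str]]


def prod(rules: List[Rule]) -> int:
    """Prod(G): the number of rules of the grammar."""
    return len(rules)


def symb(rules: List[Rule]) -> int:
    """Symb(G): sum over all rules of (length of the right side + 2)."""
    return sum(len(rhs) + 2 for _, rhs in rules)


# "S" stands for the initial symbol; an empty list is the empty word.
# Three-rule ambiguous grammar: S -> aSbS, S -> bSaS, S -> empty word.
equal_ab: List[Rule] = [
    ("S", ["a", "S", "b", "S"]),
    ("S", ["b", "S", "a", "S"]),
    ("S", []),
]

# Grammar from the Remark: S -> aaSb, S -> aaaSb, S -> empty word.
remark_15: List[Rule] = [
    ("S", ["a", "a", "S", "b"]),
    ("S", ["a", "a", "a", "S", "b"]),
    ("S", []),
]

print(prod(equal_ab), symb(equal_ab))    # 3 14
print(prod(remark_15), symb(remark_15))  # 3 15
```

The values 3 and 14 for the first grammar match the bounds Prod (G) > 3 and Symb (G) > 14 stated for unambiguous grammars of the same language, and the value 15 for the second grammar matches the case n = 15 singled out in the Remark. Note that these functions measure a given grammar only; by Corollaries 3 and 5 no such procedure exists for the complexity of the generated language.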
(Received August 2, 1971.)

REFERENCES

[1] Ginsburg, S.: The mathematical theory of context-free languages. McGraw-Hill, New York 1966.
[2] Gruska, J.: On a classification of context-free grammars. Kybernetika 3 (1967), 1, 22-29.
[3] Gruska, J.: Some classifications of context-free languages. Information and Control 14 (1969), 152-179.
[4] Gruska, J.: Complexity and unambiguity of context-free grammars and languages. Information and Control 18 (1971), 502-519.
[5] Taniguchi, K., Kasami, T.: Reduction of context-free grammars. Information and Control 17 (1970), 92-108.

SUMMARY (translated from Slovak)

On the Size of Context-free Grammars

JOZEF GRUSKA

The paper investigates two criteria of complexity of context-free grammars and languages: the number of rules and the number of symbols. It is shown that both criteria induce infinite hierarchies of context-free languages. The undecidability of the basic problems concerning the investigated complexity criteria is proved. At the end of the paper it is shown that for some unambiguous context-free languages the unambiguous grammars are necessarily more complex than the ambiguous ones.

RNDr. Jozef Gruska, CSc., Matematický ústav SAV (Mathematical Institute, Slovak Academy of Sciences), Štefánikova 41, Bratislava.