Udostępnij za pośrednictwem


Every Program There Is, Part Seven

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

Things seem to be going so well. We have a sketch of a recursive solution for enumerating all strings of a particular grammar and it looks like the recursion is well-founded. Let’s consider one of our more complex grammars from a previous posting:

DECL: MODIFIER class ID BASE { }
MODIFIER: NIL | public | internal
BASE: NIL | : ID
ID: a | b | c | d

Consider the smallest string in this language class a { }. It’s derivation is in DECL[5]:

DECL
MODIFIER class ID BASE { }
NIL class ID BASE { }
class ID BASE { }
class a BASE { }
class a { }

(Remember that we are characterizing NIL as being just a fancy way of writing the empty string. Here we substitute NIL for BASE and it just goes away.)

How would we work out DECL[5] using our algorithm? Easy: just solve all these subproblems and combine the solutions:

MODIFIER[0] class[0] ID[0] BASE[0] {[0] }[4]
MODIFIER[0] class[0] ID[0] BASE[0] {[1] }[3]
MODIFIER[0] class[0] ID[0] BASE[0] {[2] }[2]
MODIFIER[0] class[0] ID[0] BASE[0] {[3] }[1]
… hundreds more…
MODIFIER[3] class[1] ID[0] BASE[0] {[0] }[0]
MODIFIER[4] class[0] ID[0] BASE[0] {[0] }[0]

Um… yuck. Even if we write optimizations in there so that we skip even considering X[0] on nonterminals and X[n>0] on terminals, it is still a total pain in the rear to write the code to do this. We didn’t have this problem in our earlier example because we only ever had one or two terms for a given substitution. If only we could have the simplicity of having only two terms per production.

Hmm.

Remember back when we had the grammar

S: N | S + N
N: 1 | 2 | 3

and we noticed that this was the same language as the grammar

S: N | S P
P: + N
N: 1 | 2 | 3

? Notice that what we did was we turned a grammar where there was a rule with three terms into a grammar where every rule is either one or two terms. Now, what if we also did this?

S: N NIL | S P
P: + N
N: 1 NIL | 2 NIL | 3 NIL

That’s clearly also the same language. We can take any grammar and turn it into an equivalent grammar that has exactly two terms per substitution. Either we add a NIL to a “single” production, or we make an n-ary production into an n-1-ary production by creating a new substitution rule.

Here’s an unambiguous grammar that lets us define public generic classes with base types and nested classes:

CLASSLIST: CLASSDECL CLASSLIST | NIL
CLASSDECL: public class ID PARAMSLIST BASE { CLASSLIST }
PARAMSLIST: NIL | < PARAMSLISTBODY >
PARAMSLISTBODY: ID | ID , PARAMSLISTBODY
BASE: NIL | : TYPE
TYPE: ID TYPEARGS | ID TYPEARGS . TYPE
TYPEARGS: NIL | < ARGLISTBODY >
ARGLISTBODY: TYPE | TYPE , ARGLISTBODY
ID: a | b | c | d

We can rewrite this into one of our special grammars that have two terms per production like this:

CLASSLIST: CLASSDECL CLASSLIST | NIL NIL
CLASSDECL: HEADER CLASSBODY
HEADER: CLSNAME BASE
CLSNAME: PUBCLS CLASSNAME
PUBCLS: public class
CLASSNAME: ID PARAMSLIST
PARAMSLIST: NIL NIL | < PARAMSLISTEND
PARAMSLISTEND: PARAMSLISTBODY >
PARAMSLISTBODY: ID PARAMSLISTREST
PARAMSLISTREST: NIL NIL | , PARAMSLISTBODY
BASE: NIL NIL | : TYPE
TYPE: NODOTTYPE NIL | TYPEDOT TYPE
NODOTTYPE: ID TYPEARGS
TYPEDOT: NODOTTYPE .
TYPEARGS: NIL NIL | < ARGLISTEND
ARGLISTEND: ARGLISTBODY >
ARGLISTBODY: TYPE ARGLISTREST
ARGLISTREST: NIL NIL | , ARGLISTBODY
CLASSBODY: { CLASSBODYEND
CLASSBODYEND: CLASSLIST }
ID: a NIL | b NIL | c NIL | d NIL

Harder to read of course, but defining the same language. And in fact, we could write a program that takes a grammar of the first form and automatically produces a grammar of the second form, though we’re not going to do that.

Next time: Looks like we’re actually ready to start writing code!

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

Comments

  • Anonymous
    June 01, 2010
    Isn't  the derivation of  class a { } actually in DECL[4] or DECL[6], rather than in DECL[5] ? DECL MODIFIER class ID BASE { }     NIL class ID BASE { }   <- If this step should be included in the derivation.... class ID BASE { } class a BASE { } class a { }  <- .... then a step class a NIL { } should be included before this step as well... ? Well, I know this is of no consequence to the argument, but just to be sure I understand....

  • Anonymous
    April 21, 2011
    Hey, haven't we got names for those concepts, established quite some time ago? Why not just use (or at least mention) them? "exactly two terms per substituition" boils down to what is usually referred to as Chomsky Normal Form, aka CNF; where NIL is usually named epsilon and directly allowed as right-hand-side, which, together with additionally allowing a single terminal as right-hand-side is more succinct, IMHO. The bin-tree encoding of arbitrary trees (and back) is implied by the fact that any context-free grammar can be turned into one in CNF. Yet there are things that I haven't sorted out: how to translate the "context-free" prerequisite to graph theory? I suppose that it's (got to do with) the monotonicity ("can grow only bigger or stay the same length") that is the equivalent of talking of trees only (instead of DAGs or even arbitrary graphs). On the other hand: it's perfectly fine to have "cycles" in a context-free grammar; not even that, to be anything but trivial this recursion is necessary! Well sure, it's about the derivations which will yield trees, possibly even ones with an infinite number of nodes. Imagine an endless stream of tokens in the formal-language view: isn't it, at any time, possible to tell whether the sequence of tokens seen so far (=finite prefix) is either itself already in the language or can(not anymore) be completed to a (possibly infinite) word in the language? Hmm, seems that I can't even properly formulate that question; as said: haven't sorted it out... Well, finally the link to Wikipedia on Greibach Normal Form (GNF): en.wikipedia.org/.../Greibach_normal_form ; you'll spot the links to other normal forms etc. at a glance.