link ichartp
August 3, 2000; Richard L. Goerwitz
Requires: co-expressions
This file is in the public domain.
General: Ichartp implements a simple chart parser - a slow but easy-to-implement strategy for parsing context free grammars (it has a cubic worst-case time factor). Chart parsers are flexible enough to handle a lot of natural language constructs. They also lack many of the troubles associated with empty and left-recursive derivations. To obtain a parse, just create a BNF file, obtain a line of input, and then invoke parse_sentence(sentence, bnf_filename, start-symbol). Parse_sentence suspends successive edge structures corresponding to possible parses of the input sentence. There is a routine called edge_2_tree() that converts these edges to a more standard form. See the stub main() procedure for an example of how to make use of all these facilities. ____________________________________________________________ Implementation details: The parser itself operates in bottom-up fashion, but it might just as well have been coded top-down, or for that matter as a combination bottom-up/top-down parser (chart parsers don't care). The parser operates in breadth-first fashion, rather than walking through each alternative until it is exhausted. As a result, there tends to be a pregnant pause before any results appear, but when they appear they come out in rapid succession. To use a depth-first strategy, just change the "put" in "put(ch.active, new_e)" to read "push." I haven't tried to do this, but it should be that simple to implement. BNFs are specified using the same notation used in Griswold & Griswold, and as described in the IPL program "pargen.icn," with the following difference: All metacharacters (space, tab, vertical slash, right/left parends, brackets and angle brackets) are converted to literals by prepending a backslash. Comments can be include along with BNFs using the same notation as for Icon code (i.e. #-sign). ____________________________________________________________ Gotchas: Pitfalls to be aware of include things like <L> ::= <L> | ha | () (a weak attempt at a laugh recognizer). This grammar will accept "ha," "ha ha," etc. but will suspend an infinite number of possible parses. The right way to do this sort of thing is <L> ::= ha <S> | ha, or if you really insist on having the empty string as a possibility, try things like: <S> ::= () | <LAUGHS> <LAUGHS> ::= ha <LAUGHS> | ha Of course, the whole problem of infinite parses can be avoided by simply invoking the parser in a context where it is not going to be resumed, or else one in which it will be resumed a finite number of times. ____________________________________________________________ Motivation: I was reading Byte Magazine (vol. 17:2 [February, 1992]), and ran into an article entitled "A Natural Solution" (pages 237-244) in which a standard chart parser was described in terms of its C++ implementation. The author remarked at how his optimizations made it possible to parse a 14-word sentence in only 32 seconds (versus 146 for a straight Gazdar-Mellish LISP chart parser). 32 seconds struck me as hardly anything to write home about, so I coded up a quick system in Icon to see how it compared. This library is the result. I'm quite sure that this code could be very much improved upon. As it stands, its performance seems as good as the C++ parser in BYTE, if not better. It's hard to tell, though, seeing as I have no idea what hardware the guy was using. I'd guess a 386 running DOS. On a 386 running Xenix the Icon version beats the BYTE times by a factor of about four. The Icon compiler creates an executable that (in the above environment) parses 14-15 word sentences in anywhere from 6 to 8 seconds. Once the BNF file is read, it does short sentences in a second or two. If I get around to writing it, I'll probably use the code here as the basic parsing engine for an adventure game my son wants me to write. ____________________________________________________________ Here's a sample BNF file (taken, modified, from the BYTE Magazine article mentioned above). Note again the conventions a) that nonterminals be enclosed in angle brackets & b) that overlong lines be continued by terminating the preceding line with a backslash. Although not illustrated below, the metacharacters <, >, (, ), and | can all be escaped (i.e. can all have their special meaning neutralized) with a backslash (e.g. \<). Comments can also be included using the Icon #-notation. Empty symbols are illegal, so if you want to specify a zero-derivation, use "()." There is an example of this usage below. <S> ::= <NP> <VP> | <S> <CONJ> <S> <VP> ::= <VP> <CONJ> <VP> | <IV> ( () | <PP> ) | \ <TV> ( <NP> | <NP> <PP> | <NP> <VP> | <REL> <S> ) <NP> ::= <DET> ( <NP> | <ADJ> <NP> | <ADJ> <NP> <PP> | <NP> <PP> ) | \ <ADJ> <NP> | <N> | <N> <CONJ> <N> | \ <NP> <CONJ> <NP> <PP> ::= <P> ( <NP> | <ADJ> <NP> ) | <PP> <CONJ> <PP> <ADJ> ::= <ADJ> <CONJ> <ADJ> <CONJ> ::= and <DET> ::= the | a | his | her <NP> ::= her | he | they <N> ::= nurse | nurses | book | books | travel | arrow | arrows | \ fortune | fortunes | report <ADJ> ::= outrageous | silly | blue | green | heavy | white | red | \ black | yellow <IV> ::= travel | travels | report | see | suffer <TV> ::= hear | see | suffer <P> ::= on | of <REL> ::= that ____________________________________________________________ Addendum: Sometimes, when writing BNFs, one finds oneself repeatedly writing the same things. In efforts to help eliminate the need for doing this, I've written a simple macro facility. It involves one reserved word: "define." Just make sure it begins a line. It takes two arguments. The first is the macro. The second is its expansion. The first argument must not contain any spaces. The second, however, may. Here's an example: define <silluq-clause> ( <silluq-phrase> | \ <tifcha-silluq-clause> | \ <zaqef-silluq-clause> \ )