Icon Program Library: procs/ichartp.icn

ichartp.icn: Procedures for a simple chart parser

link ichartp
August 3, 2000; Richard L. Goerwitz
Requires: co-expressions
This file is in the public domain.
General:

    Ichartp implements a simple chart parser - a slow but
easy-to-implement strategy for parsing context free grammars (it
has a cubic worst-case time factor).  Chart parsers are flexible
enough to handle a lot of natural language constructs.  They also
lack many of the troubles associated with empty and left-recursive
derivations.  To obtain a parse, just create a BNF file, obtain a
line of input, and then invoke parse_sentence(sentence,
bnf_filename, start-symbol).  Parse_sentence suspends successive
edge structures corresponding to possible parses of the input
sentence.  There is a routine called edge_2_tree() that converts
these edges to a more standard form.  See the stub main() procedure
for an example of how to make use of all these facilities.
____________________________________________________________

Implementation details:

    The parser itself operates in bottom-up fashion, but it might
just as well have been coded top-down, or for that matter as a
combination bottom-up/top-down parser (chart parsers don't care).
The parser operates in breadth-first fashion, rather than walking
through each alternative until it is exhausted.  As a result, there
tends to be a pregnant pause before any results appear, but when
they appear they come out in rapid succession.  To use a depth-first
strategy, just change the "put" in "put(ch.active, new_e)" to read
"push."  I haven't tried to do this, but it should be that simple
to implement.
    BNFs are specified using the same notation used in Griswold &
Griswold, and as described in the IPL program "pargen.icn," with
the following difference:  All metacharacters (space, tab, vertical
slash, right/left parends, brackets and angle brackets) are
converted to literals by prepending a backslash.  Comments can be
include along with BNFs using the same notation as for Icon code
(i.e. #-sign).
____________________________________________________________

Gotchas:

    Pitfalls to be aware of include things like <L> ::= <L> | ha |
() (a weak attempt at a laugh recognizer).  This grammar will
accept "ha," "ha ha," etc. but will suspend an infinite number of
possible parses.  The right way to do this sort of thing is <L> ::=
ha <S> | ha, or if you really insist on having the empty string as
a possibility, try things like:

        <S>      ::= () | <LAUGHS>
        <LAUGHS> ::= ha <LAUGHS> | ha

Of course, the whole problem of infinite parses can be avoided by
simply invoking the parser in a context where it is not going to
be resumed, or else one in which it will be resumed a finite number
of times.
____________________________________________________________

Motivation:

    I was reading Byte Magazine (vol. 17:2 [February, 1992]), and
ran into an article entitled "A Natural Solution" (pages 237-244)
in which a standard chart parser was described in terms of its C++
implementation.  The author remarked at how his optimizations made
it possible to parse a 14-word sentence in only 32 seconds (versus
146 for a straight Gazdar-Mellish LISP chart parser).  32 seconds
struck me as hardly anything to write home about, so I coded up a
quick system in Icon to see how it compared.  This library is the
result.
    I'm quite sure that this code could be very much improved upon.
As it stands, its performance seems as good as the C++ parser in
BYTE, if not better.  It's hard to tell, though, seeing as I have
no idea what hardware the guy was using.  I'd guess a 386 running
DOS.  On a 386 running Xenix the Icon version beats the BYTE times
by a factor of about four.  The Icon compiler creates an executable
that (in the above environment) parses 14-15 word sentences in
anywhere from 6 to 8 seconds.  Once the BNF file is read, it does
short sentences in a second or two.  If I get around to writing it,
I'll probably use the code here as the basic parsing engine for an
adventure game my son wants me to write.
____________________________________________________________

     Here's a sample BNF file (taken, modified, from the BYTE
Magazine article mentioned above).  Note again the conventions a)
that nonterminals be enclosed in angle brackets & b) that overlong
lines be continued by terminating the preceding line with a
backslash.  Although not illustrated below, the metacharacters <,
>, (, ), and | can all be escaped (i.e. can all have their special
meaning neutralized) with a backslash (e.g. \<).  Comments can also
be included using the Icon #-notation.  Empty symbols are illegal,
so if you want to specify a zero-derivation, use "()."  There is an
example of this usage below.

<S>    ::= <NP> <VP> | <S> <CONJ> <S>
<VP>   ::= <VP> <CONJ> <VP> | <IV> ( () | <PP> ) | \
        <TV> ( <NP> | <NP> <PP> | <NP> <VP> | <REL> <S> )
<NP>   ::= <DET> ( <NP> | <ADJ> <NP> | <ADJ> <NP> <PP> | <NP> <PP> ) | \
        <ADJ> <NP> | <N> | <N> <CONJ> <N> | \
        <NP> <CONJ> <NP>
<PP>   ::= <P> ( <NP> | <ADJ> <NP> ) | <PP> <CONJ> <PP>
<ADJ>  ::= <ADJ> <CONJ> <ADJ>
<CONJ> ::= and
<DET>  ::= the | a | his | her
<NP>   ::= her | he | they
<N>    ::= nurse | nurses | book | books | travel | arrow | arrows | \
       fortune | fortunes | report
<ADJ>  ::= outrageous | silly | blue | green | heavy | white | red | \
       black | yellow
<IV>   ::= travel | travels | report | see | suffer
<TV>   ::= hear | see | suffer
<P>    ::= on | of
<REL>  ::= that
____________________________________________________________

Addendum:

    Sometimes, when writing BNFs, one finds oneself repeatedly
writing the same things.  In efforts to help eliminate the need for
doing this, I've written a simple macro facility.  It involves one
reserved word:  "define."  Just make sure it begins a line.  It
takes two arguments.  The first is the macro.  The second is its
expansion.  The first argument must not contain any spaces.  The
second, however, may.  Here's an example:

    define <silluq-clause>    (   <silluq-phrase> | \
                                  <tifcha-silluq-clause> | \
                                  <zaqef-silluq-clause> \
                              )
Source code | Program Library Page | Icon Home Page