rsg.icn: Program to generate randomly selected sentences

March 26, 2002; Ralph E. Griswold
This file is in the public domain.
   This program generates randomly selected strings (``sen-
tences'') from a grammar specified by the user.  Grammars are
basically context-free and resemble BNF in form, although there
are a number of extensions.
____________________________________________________________

   The program works interactively, allowing the user to build,
test, modify, and save grammars. Input to rsg consists of various
kinds of specifications, which can be intermixed:

   Productions define nonterminal symbols in a syntax similar to
the rewriting rules of BNF with various alternatives consisting
of the concatenation of nonterminal and terminal symbols.  Gen-
eration specifications cause the generation of a specified number
of sentences from the language defined by a given nonterminal
symbol.  Grammar output specifications cause the definition of a
specified nonterminal or the entire current grammar to be written
to a given file.  Source specifications cause subsequent input to
be read from a specified file.

   In addition, any line beginning with # is considered to be a
comment, while any line beginning with = causes the rest of that
line to be used subsequently as a prompt to the user whenever rsg
is ready for input (there normally is no prompt). A line consist-
ing of a single = stops prompting.

Productions: Examples of productions are:

        <expr>::=<term>|<term>+<expr>
        <term>::=<elem>|<elem>*<term>
        <elem>::=x|y|z|(<expr>)

Productions may occur in any order. The definition for a nonter-
minal symbol can be changed by specifying a new production for
it.

   There are a number of special devices to facilitate the defin-
ition of grammars, including eight predefined, built-in nontermi-
nal symbols:
   symbol   definition
   <lb>     <
   <rb>     >
   <vb>     |
   <nl>     newline
   <>       empty string
   <&lcase> any single lowercase letter
   <&ucase> any single uppercase letter
   <&digit> any single digit

In addition, if the string between a < and a > begins and ends
with a single quotation mark, it stands for any single character
between the quotation marks. For example,

        <'xyz'>

is equivalent to

        x|y|z

Generation Specifications: A generation specification consists of
a nonterminal symbol followed by a nonnegative integer. An exam-
ple is

        <expr>10

which specifies the generation of 10 <expr>s. If the integer is
omitted, it is assumed to be 1. Generated sentences are written
to standard output.

Grammar Output Specifications: A grammar output specification
consists of a nonterminal symbol, followed by ->, followed by a
file name. Such a specification causes the current definition of
the nonterminal symbol to be written to the given file. If the
file is omitted, standard output is assumed. If the nonterminal
symbol is omitted, the entire grammar is written out. Thus,

        ->

causes the entire grammar to be written to standard output.

Source Specifications: A source specification consists of @ fol-
lowed by a file name.  Subsequent input is read from that file.
When an end of file is encountered, input reverts to the previous
file. Input files can be nested.

Options: The following options are available:

     -s n Set the seed for random generation to n.

     -r   In the absence of -s, set the seed to 0 for repeatable
          results.  Otherwise the seed is set to a different value
          for each run (as far as this is possible). -r is equivalent
          to -s 0.

     -l n Terminate generation if the number of symbols remaining
          to be processed exceeds n. The default is limit is 1000.

     -t   Trace the generation of sentences. Trace output goes to
          standard error output.

Diagnostics: Syntactically erroneous input lines are noted but
are otherwise ignored.  Specifications for a file that cannot be
opened are noted and treated as erroneous.

   If an undefined nonterminal symbol is encountered during gen-
eration, an error message that identifies the undefined symbol is
produced, followed by the partial sentence generated to that
point. Exceeding the limit of symbols remaining to be generated
as specified by the -l option is handled similarly.

Caveats: Generation may fail to terminate because of a loop in
the rewriting rules or, more seriously, because of the progres-
sive accumulation of nonterminal symbols. The latter problem can
be identified by using the -t option and controlled by using the
-l option. The problem often can be circumvented by duplicating
alternatives that lead to fewer rather than more nonterminal sym-
bols. For example, changing

        <term>::=<elem>|<elem>*<term>

to

        <term>::=<elem>|<elem>|<elem>*<term>

increases the probability of selecting <elem> from 1/2 to 2/3.

   There are many possible extensions to the program. One of the
most useful would be a way to specify the probability of select-
ing an alternative.

Source code | Program Library Page | Icon Home Page