CSc 453: A Grammar for a Simple Subset of HTML

Lexical Rules

A tag is a sequence of characters of the form <S>, where S is a sequence of printable characters not beginning with a whitespace character and not containing any ">" characters. Our grammar recognizes the following tags:
DOC_START : <html>
DOC_END </html>
HEAD_START : <head>
HEAD_END : </head>
BODY_START : <body>
BODY_END </body>
BF_START <b>
BF_END </b>
IT_START <i>
IT_END </i>
UL_START <ul>
UL_END </ul>
OL_START <ol>
OL_END </ol>
LI_START <li>
LI_END </li>
Additionally, the token TAG will match any tag that is not one of the tags listed above, and the token TEXT will match any (single) character that is not within a tag or comment; ; and the token SPACE will match any non-empty sequence of whitespace characters.

Syntax Rules

Syntax rules are made up of tokens and nonterminals. A token denotes one or more related strings that are matched by the scanner (e.g., "identifier", "integer constant"). A nonterminal denotes a set of strings with similar syntax structure (e.g., "declaration", "while loop"). In the rules below, tokens are written in teletype font, like this; nonterminals are written in italics, like this. The symbol epsilon denotes the empty sequence.

A syntax rule consists of a left hand side and a right hand side, separated by a colon ":". The left hand side is a nonterminal whose structure is defined by the rule. A right hand side consists of a set of alternatives, separated by "|". Each alternative is a sequence (possibly empty) of tokens and nonterminals.

Doc   :   Wspace   DOC_START   Wspace   Head   Wspace   Body   Wspace   DOC_END Wspace  

Head   :   HEAD_START   Html   HEAD_END

Body   :   BODY_START   Html   BODY_END

Wspace   :   SPACE
  |   epsilon

Html   :   Item Html
  |   epsilon

Item   :   BF_START  Html  BF_END
  |   IT_START  Html  IT_END
  |   List
  |   Other

List   :   UL_START   Wspace   ItemList  Wspace   UL_END
  |   OL_START   Wspace   ItemList  Wspace   OL_END

ItemList   :   ItemList  Wspace   OneItem
  |   OneItem

OneItem   :   LI_START  Html  LI_END

Other   :   TAG
  |   TEXT
  |   SPACE

Thus, the rule for the nonterminal Html above consists of two alternatives. The first says that one possible structure for Html is to have something with the structure of Item (which is then defined by its own rules), followed by something else which again has the structure of Html; the second says that Html can simply the the empty sequence. (For those of you who have unwound the recursion here in your head, this amounts to saying that Html consists of zero or more Items.)

The start symbol for the grammar is Doc.