University of Arizona, Department of Computer
			       Science

CSc 453: A Grammar for a Simple Subset of HTML

Lexical Rules

A tag is a sequence of characters of the form <S>, where S is a sequence of printable characters not beginning with a whitespace character and not containing any ">" characters. Our grammar recognizes the following tags:
DOC_START : <html>
DOC_END </html>
HEAD_START : <head>
HEAD_END : </head>
BODY_START : <body>
BODY_END </body>
BF_START <b>
BF_END </b>
IT_START <i>
IT_END </i>
UL_START <ul>
UL_END </ul>
OL_START <ol>
OL_END </ol>
LI_START <li>
LI_END </li>
Additionally, the token TAG will match any tag that is not one of the tags listed above; SPACE will match any nonempty sequence of whitespace characters (i.e., those for which isspace(3) is nonzero); and the token TEXT will match any (single) non-space character that is not within a tag.

Syntax Rules

Syntax rules are made up of tokens and nonterminals. A token denotes one or more related strings that are matched by the scanner (e.g., "identifier", "integer constant"). A nonterminal denotes a set of strings with similar syntax structure (e.g., "declaration", "while loop"). In the rules below, tokens are written in teletype font, like this; nonterminals are written in italics, like this. The symbol epsilon denotes the empty sequence.

A syntax rule consists of a left hand side and a right hand side, separated by a colon ":". The left hand side is a nonterminal whose structure is defined by the rule. A right hand side consists of a set of alternatives, separated by "|". Each alternative is a sequence (possibly empty) of tokens and nonterminals.

Doc   :   WSpace   DOC_START   WSpace   Head   WSpace   Body   WSpace   DOC_END   WSpace
Head   :   HEAD_START   Html   HEAD_END  
Body   :   BODY_START   Html   BODY_END  
WSpace   :   SPACE
  |   epsilon
Html   :   item   Html  
  |   epsilon
item   :   BF_START  Html  BF_END
  |   IT_START  Html  IT_END
  |   List
  |   Other
List   :   UL_START   WSpace   ItemList   WSpace   UL_END
  |   OL_START   WSpace   ItemList  WSpace   OL_END
ItemList   :   OneItem   WSpace   ItemList
  |   epsilon
OneItem   :   LI_START  Html  LI_END
Other   :   TAG
  |   TEXT
  |   WSpace
Thus, the rule for the nonterminal Html above consists of two alternatives. The first says that one possible structure for Html is to have something with the structure of item (which is then defined by its own rules), followed by something else which again has the structure of Html; the second says that Html can simply the the empty sequence. (For those of you who have unwound the recursion here in your head, this amounts to saying that Html consists of zero or more items.)

The start symbol for the grammar is Doc.