CSc 453: A Grammar for a Simple Subset of HTML

Lexical Rules

A tag is a sequence of characters of the form <S>, where S is a sequence of printable characters not beginning with a whitespace character and not containing any ">" characters. Our grammar recognizes the following tags:

DOC_START : <html>

DOC_END : </html>

HEAD_START : <head>

HEAD_END : </head>

BODY_START : <body>

BODY_END : </body>

BF_START : <b>

BF_END : </b>

IT_START : <i>

IT_END : </i>

UL_START : <ul>

UL_END : </ul>

OL_START : <ol>

OL_END : </ol>

LI_START : <li>

LI_END : </li>

Additionally, the token TAG will match any tag that is not one of the tags listed above, and the token TEXT will match any (single) character that is not within a tag or comment; ; and the token SPACE will match any non-empty sequence of whitespace characters.

Syntax Rules

Syntax rules are made up of tokens and nonterminals. A token denotes one or more related strings that are matched by the scanner (e.g., "identifier", "integer constant"). A nonterminal denotes a set of strings with similar syntax structure (e.g., "declaration", "while loop"). In the rules below, tokens are written in teletype font, like this; nonterminals are written in italics, like this. The symbol epsilon

denotes the empty sequence.

A syntax rule consists of a left hand side and a right hand side, separated by a colon ":". The left hand side is a nonterminal whose structure is defined by the rule. A right hand side consists of a set of alternatives, separated by "|". Each alternative is a sequence (possibly empty) of tokens and nonterminals.

Doc : Wspace DOC_START Wspace Head Wspace Body Wspace DOC_END Wspace

Head : HEAD_START Html HEAD_END

Body : BODY_START Html BODY_END

Wspace : SPACE

|

Html : Item Html

|

Item : BF_START Html BF_END

| IT_START Html IT_END

| List

| Other

List : UL_START Wspace ItemList Wspace UL_END

| OL_START Wspace ItemList Wspace OL_END

ItemList : ItemList Wspace OneItem

| OneItem

OneItem : LI_START Html LI_END

Other : TAG

| TEXT

| SPACE

Thus, the rule for the nonterminal Html above consists of two alternatives. The first says that one possible structure for Html is to have something with the structure of Item (which is then defined by its own rules), followed by something else which again has the structure of Html; the second says that Html can simply the the empty sequence. (For those of you who have unwound the recursion here in your head, this amounts to saying that Html consists of zero or more Items.)

The start symbol for the grammar is Doc.

`DOC_START`	:	`<html>`
`DOC_END`	:	`</html>`
`HEAD_START`	:	`<head>`
`HEAD_END`	:	`</head>`
`BODY_START`	:	`<body>`
`BODY_END`	:	`</body>`
`BF_START`	:	`<b>`
`BF_END`	:	`</b>`
`IT_START`	:	`<i>`
`IT_END`	:	`</i>`
`UL_START`	:	`<ul>`
`UL_END`	:	`</ul>`
`OL_START`	:	`<ol>`
`OL_END`	:	`</ol>`
`LI_START`	:	`<li>`
`LI_END`	:	`</li>`

Doc	: Wspace `DOC_START` Wspace Head Wspace Body Wspace `DOC_END` Wspace
Head	: `HEAD_START` Html `HEAD_END`
Body	: `BODY_START` Html `BODY_END`
Wspace	: SPACE
Wspace	\|
Html	: Item Html
Html	\|
Item	: `BF_START` Html `BF_END`
	\| `IT_START` Html `IT_END`
	\| List
	\| Other
List	: `UL_START` Wspace ItemList Wspace `UL_END`
List	\| `OL_START` Wspace ItemList Wspace `OL_END`
ItemList	: ItemList Wspace OneItem
ItemList	\| OneItem
OneItem	: `LI_START` Html `LI_END`
Other	: `TAG`
	\| `TEXT`
	\| `SPACE`