University of Arizona, Department of Computer Science

CSc 453: Programming Assignment 1 (html2txt): Part 2

Start Date: Mon Sept 6, 2010

Due Date: 11:59 PM, Sun Sept 12, 2010


1. General

This assignment involves writing a scanner and parser for HTML that enforce some simple grammar rules, e.g., that the <li> tag for list items can occur only within lists, or that tags specifying boldface <b>...</b> and italics <i>...</i> should be properly nested.

As with part 1 of this assignment, the goal is to get you sufficiently acquainted with lex and yacc that you can start the main compiler project. For this reason, the grammar rules used in this assignment cover only a very small part of the complete HTML syntax.

2.1. Functionality

Your program should read its input from stdin, ensure that the input follows the grammar rules for our subset of HTML, discard all HTML tags, and write the remaining text to stdout. Error messages (see below) should be written to stderr.
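
As an illustration of the expected input/output behavior only, the following C sketch drops everything between a '<' and the next '>' and keeps the remaining text. The helper name strip_tags is hypothetical, and the sketch checks none of the grammar rules; in the actual assignment this work is done by your lex and yacc specifications, not by hand-written code like this.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the assignment handout): copies `in`
 * to `out`, discarding every character from a '<' through the next '>'.
 * Writes at most out_size - 1 characters plus a terminating NUL and
 * returns the number of characters written. No grammar checking is done. */
size_t strip_tags(const char *in, char *out, size_t out_size)
{
    size_t n = 0;      /* characters written so far */
    int in_tag = 0;    /* nonzero while between '<' and '>' */

    if (out_size == 0)
        return 0;

    for (; *in != '\0' && n + 1 < out_size; in++) {
        if (in_tag) {
            if (*in == '>')
                in_tag = 0;        /* end of tag: resume copying */
        } else if (*in == '<') {
            in_tag = 1;            /* start of tag: stop copying */
        } else {
            out[n++] = *in;        /* ordinary text is kept */
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, given the input <b>Hello</b>, <i>world</i>! this helper keeps only the text Hello, world! with the tags removed.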

2.2. Conflicts

The HTML grammar provided generates many conflicts if you translate it directly into a YACC input file. These conflicts arise from the rules introduced to allow optional whitespace between HTML tags. For this assignment you are not required to remove these conflicts.

2.3. Syntax Errors

Your program will be expected to deal with errors in a "reasonable" way. Error messages should be printed to stderr. They should be specific and should contain enough information (with at least a line number) to allow the user to locate the problems.

For this assignment, you are not required to implement error recovery. In other words, your parser can exit after detecting and reporting the first error in the input. (This will not be true for future assignments, but right now the focus is on learning to use YACC.)
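
One possible way to produce such a message is sketched below in C. The helper names format_error and report_error, and the exact message layout, are assumptions made for illustration; the assignment only requires that the message be specific and include at least a line number.

```c
#include <stdio.h>

/* Hypothetical helper: builds a message of the form
 * "myhtml2txt: line N: description" in buf. The layout is an assumption;
 * the handout does not prescribe a format. Returns the value of snprintf. */
int format_error(char *buf, size_t size, int line, const char *msg)
{
    return snprintf(buf, size, "myhtml2txt: line %d: %s", line, msg);
}

/* Hypothetical helper: writes the formatted message to stderr, so that
 * it never mixes with the translated text on stdout. */
void report_error(int line, const char *msg)
{
    char buf[256];

    format_error(buf, sizeof buf, line, msg);
    fprintf(stderr, "%s\n", buf);
}
```

In a yacc-generated parser, the yyerror function that the parser calls on a syntax error could pass the scanner's current line number to report_error and then call exit(1), since error recovery is not required here. How the line number is tracked (for example, a counter incremented by the scanner on each newline) is left to your implementation.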

3. Invoking Your Program

Your executable program will be called myhtml2txt. It will read input from stdin and write its output to stdout. Thus, to translate an HTML file foo.html to a text file bar.txt, invoke your program as
myhtml2txt < foo.html > bar.txt

4. Turnin

Turn in your files on host lectura.cs.arizona.edu. You should turn in all of your source files, as well as a Makefile that supports the following targets:

clean
Executing the command make clean should delete the *.o files, as well as the executable myhtml2txt, from the current directory.

myhtml2txt
Executing the command make myhtml2txt should create, in the current directory, an executable file myhtml2txt that implements your HTML-to-text translator from scratch, by invoking the appropriate tools (lex/flex and yacc/bison) on the input specifications.

To turn in your files, use the command
turnin cs453f10-html2txt-part2 files
For more information on the turnin command, try man turnin.

Note: The turnin command copies the submitted files into another directory. Because of this, programs that compile and execute without problems in your directory may not work correctly once they are turned in, typically because of relative path names in include files and makefiles. Such problems are considered sloppiness inappropriate in an upper-division course and are liable to be penalized heavily.

The output of your program will be compared with our output using the diff utility (see diff(1)), so it is recommended that you follow the specification, and the instructions for turnin, closely.