CSc 453: Programming Assignment 1 (html2txt): Part 2
Start Date: Mon Sept 6, 2010
Due Date: 11:59 PM, Sun Sept 12, 2010
1. General
This assignment involves writing a scanner and parser for HTML that enforce
some simple grammar rules, e.g., that the <li> tag for list
items can occur only within lists, or that tags specifying boldface
<b>...</b> and italics
<i>...</i> should be properly nested.
As with part 1 of this assignment,
the goal is to get you sufficiently acquainted with lex
and yacc that you can start the main compiler project. For this
reason, the grammar rules used in this assignment cover only a very small
part of the complete HTML syntax.
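To make "properly nested" concrete, here is a minimal C sketch of the nesting rule for boldface and italics. It is not the yacc grammar the assignment asks for (a grammar enforces nesting through its productions); it only illustrates the property with an explicit stack. The function name nested_ok, the token codes, and the depth limit are all assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical token codes; a real scanner would define its own. */
enum tag { B_OPEN, B_CLOSE, I_OPEN, I_CLOSE };

/* Returns 1 if every closing tag matches the most recently opened,
 * still-unclosed tag and nothing is left open at the end; 0 otherwise. */
int nested_ok(const enum tag *toks, size_t n)
{
    enum tag stack[64];          /* arbitrary depth limit for this sketch */
    size_t depth = 0;

    for (size_t i = 0; i < n; i++) {
        switch (toks[i]) {
        case B_OPEN:
        case I_OPEN:
            if (depth == sizeof stack / sizeof stack[0])
                return 0;        /* deeper than this sketch supports */
            stack[depth++] = toks[i];
            break;
        case B_CLOSE:
            if (depth == 0 || stack[--depth] != B_OPEN)
                return 0;        /* </b> does not close a <b> */
            break;
        case I_CLOSE:
            if (depth == 0 || stack[--depth] != I_OPEN)
                return 0;        /* </i> does not close an <i> */
            break;
        }
    }
    return depth == 0;           /* everything opened was closed */
}
```

Under this check, the sequence <b><i>...</i></b> is accepted, while <b><i>...</b></i> is rejected because </b> arrives while <i> is still the innermost open tag.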
2.1. Functionality
Your program should read its input from stdin,
ensure that the input follows
the grammar rules for our subset of
HTML, discard all HTML tags, and write the remaining text to stdout.
Error messages (see below) should
be written to stderr.
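The "discard all HTML tags" step can be pictured with the following C sketch. It is a hand-written stand-in for what the lex scanner and yacc actions would do together, and it performs no grammar checking at all. The function name strip_tags is hypothetical, and the sketch assumes that a tag is simply everything from '<' through the next '>'; the actual token definitions for this assignment may differ.

```c
#include <stddef.h>

/* Copies src to dst, omitting everything from '<' through the next '>'.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t strip_tags(const char *src, char *dst)
{
    size_t out = 0;
    int in_tag = 0;              /* nonzero while inside <...> */

    for (; *src != '\0'; src++) {
        if (in_tag) {
            if (*src == '>')
                in_tag = 0;      /* tag ends; resume copying */
        } else if (*src == '<') {
            in_tag = 1;          /* tag starts; stop copying */
        } else {
            dst[out++] = *src;   /* ordinary text is kept */
        }
    }
    dst[out] = '\0';
    return out;
}
```

For example, calling strip_tags on the string "<b>bold</b> and <i>it</i>" leaves "bold and it" in the destination buffer.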
2.2. Conflicts
The HTML grammar provided produces
many conflicts if you translate it directly into a YACC input file.
These conflicts arise from the rules that allow optional
whitespace between HTML tags. For this assignment, you are not required
to remove these conflicts.
2.3. Syntax Errors
Your program will be expected to deal with errors in a
"reasonable" way. Error messages should be printed to stderr.
They should be specific and should contain
enough information (with at least a line number) to allow the user to locate
the problems.
For this assignment, you are not required to implement error recovery. In
other words, your parser can exit after detecting and reporting the first
error in the input. (This will not be true for future assignments, but
right now the focus is on learning to use YACC.)
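One way to meet these requirements is sketched below in C: build a message that carries a line number, print it to stderr, and exit. In a lex/yacc program the scanner would typically count newlines and the parser's error routine (conventionally named yyerror) would print the count. The function names format_error and report_and_exit and the "line N: message" layout are assumptions made for this sketch; the assignment does not prescribe a message format.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes "line N: MSG" into buf (at most size bytes, always terminated
 * when size > 0). Returns the value returned by snprintf. */
int format_error(char *buf, size_t size, int line, const char *msg)
{
    return snprintf(buf, size, "line %d: %s", line, msg);
}

/* Prints the formatted message to stderr and exits with a failure
 * status. No error recovery is attempted, which matches the
 * "report the first error and stop" behavior allowed here. */
void report_and_exit(int line, const char *msg)
{
    char buf[256];

    format_error(buf, sizeof buf, line, msg);
    fprintf(stderr, "%s\n", buf);
    exit(EXIT_FAILURE);
}
```

Keeping the formatting separate from the printing makes the message easy to check on its own, and routing everything through one function keeps error text off stdout.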
3. Invoking Your Program
Your executable program will be called myhtml2txt. It will read
input from stdin and write its output to stdout. Thus, to
translate an HTML file foo.html to a text file bar.txt,
invoke your program as
myhtml2txt < foo.html > bar.txt
4. Turnin
Turn in your files on host lectura.cs.arizona.edu. You should turn
in all of your source files, as well as a Makefile that supports the
following targets:
- clean
  Executing the command make clean should delete the *.o files,
  as well as the executable myhtml2txt, from the current directory.
- myhtml2txt
  Executing the command make myhtml2txt should create, in the current
  directory, an executable file myhtml2txt that implements your
  HTML-to-text translator from scratch, by invoking the appropriate tools
  (lex/flex and yacc/bison) on the input specifications.
To turn in your files, use the command
turnin cs453f10-html2txt-part2 files
For more information on the turnin command, try man turnin.
Note: The turnin command copies the files submitted
into another directory. Because of this, programs that compile and execute
without problems in your directory may not work correctly once they are
turned in, because of problems with relative path names in include files
and make files. Such problems are considered sloppiness inappropriate
in an upper-division course and are liable to be penalized heavily.
The output of your program will be compared with our output using
the diff utility (see diff(1)), so it is recommended that you follow
the specification, and instructions for turnin, closely.