University of Arizona, Department of Computer Science

CSc 453 : Programming Assignment 1 (html2txt)

Start Date: Fri Jan 15, 2016

Due Date: 11:59 PM, Sun Jan 24, 2016


1. General

This assignment involves writing a simple HTML-to-TXT translator. The primary goal of this assignment is to get you acquainted with the compiler front-end tools lex and yacc, which we'll be using for the rest of the project. A secondary goal is to point out that compiler ideas and tools are applicable to non-compiler problems as well. Since the main focus is on getting you started with these tools, we'll keep things simple and not try to handle all of the subtleties of HTML the way they should really be handled. This means that if you compare your output with the results of a commercial HTML-to-text translator, there may very well be some differences.

Documentation on flex/lex and yacc/bison is available here.

2. Functionality

Use flex (or lex) and yacc (or bison) to write a program that translates HTML to text. Your program should have the following functionality.

  1. It should read its input from stdin, discard all HTML tags (including comments: see below), recognize and handle a small set of "special entities" appropriately, and write the remaining text to stdout.
  2. It should enforce some simple grammar rules (e.g., that the <li> tag for list items can occur only within lists, or that tags specifying boldface <b>...</b> and italics <i>...</i> should be properly nested), giving an appropriate error message if any grammar rule is violated.
  3. The exit status of your program should be 0 if no errors are encountered during processing, and 1 if any errors are encountered at any point.
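
One way to meet the exit-status requirement in item 3 is to keep a "sticky" error flag that is set whenever an error is reported, and to consult it when the program exits. The following is only a minimal sketch under stated assumptions, not a required design: yyerror is the conventional error routine that yacc/bison-generated parsers expect the user to supply, while the names error_seen and exit_status are hypothetical and chosen purely for illustration.

```c
#include <stdio.h>

/* Hypothetical sticky flag: set once any error is reported, never cleared. */
static int error_seen = 0;

/* A yacc/bison-generated parser calls a user-supplied yyerror() when it
 * detects a syntax error. The same routine can also be called by hand,
 * for example from the scanner, so that every error sets the flag. */
void yyerror(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
    error_seen = 1;
}

/* Hypothetical helper: combines the parser's return value with the sticky
 * flag to produce the required exit status (0 = no errors, 1 = any error). */
int exit_status(int parse_result)
{
    return (parse_result != 0 || error_seen) ? 1 : 0;
}
```

In a complete program, main would typically end with something like return exit_status(yyparse()); since yyparse returns 0 on success and a nonzero value on failure. Keeping the flag separate from the parser's return value means that errors the parser recovers from still produce exit status 1.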

3. HTML Specification

The lexical and syntactic structure of our subset of HTML is given here.

4. Invoking Your Program

Your executable program will be called myhtml2txt. It will read input from stdin and write its output to stdout. Thus, to translate an HTML file foo.html to a text file bar.txt, invoke your program as
myhtml2txt < foo.html > bar.txt

5. Getting Started

To help you get started, I have placed the following files in the directory /home/cs453/spring16/assignments/html2txt on lectura:

6. Turnin

Turn in your files on host lectura.cs.arizona.edu. You should turn in all of your source files, as well as a Makefile that supports the following targets:

clean
Executing the command make clean should delete the *.o files, as well as the executable myhtml2txt, from the current directory.

myhtml2txt
Executing the command make myhtml2txt should create, in the current directory, an executable file myhtml2txt that implements your HTML-to-text translator from scratch, by invoking the appropriate tools (lex/flex and yacc/bison) on the input specifications.

To turn in your files, use the command
turnin cs453s16-html2txt file1 file2 ... filen
Please submit your files just as they are: do not submit a directory containing your files, or zip them up into a single file, or do anything else that requires additional manual intervention.

For more information on the turnin command, try man turnin. Note: The turnin command copies the submitted files into another directory. Because of this, programs that compile and execute without problems in your directory may not work correctly once they are turned in, because of problems with relative path names in include files and makefiles. Such problems are considered sloppiness inappropriate for an upper-division course, and are liable to be penalized heavily.

The output of your program will be compared with our output using the diff utility (see diff(1)), so it is recommended that you follow the specification, and the instructions for turnin, closely.