Icon Program Library: procs/html.icn

html.icn: Procedures for parsing HTML

procedure htchunks:        generate chunks of HTML file
procedure htrefs:          generate references from HTML file
procedure httag:           extract name of HTML tag
procedure htvals:          generate values in HTML tag
procedure urlmerge:        merge URLs
procedure canpath:         put path in canonical form

link html
April 26, 2005; Gregg M. Townsend
This file is in the public domain.

These procedures parse HTML files:

htchunks(f)     generates the basic chunks -- tags and text --
                that compose an HTML file.

htrefs(f)       generates the tagname/keyword/value combinations
                that reference other files.

These procedures process strings from HTML files:

httag(s)        extracts the name of a tag.

htvals(s)       generates the keyword/value pairs from a tag.

urlmerge(base,new) interprets a new URL in the context of a base.

canpath(s)      puts a path in canonical form.
____________________________________________________________

htchunks(f) generates the HTML chunks from file f.
It returns strings beginning with

        <!--            for unclosed comments (legal comments are deleted)
        <               for tags (will end with ">" unless unclosed at EOF)
        anything else   for text

At this level, entities such as &amp; are left unprocessed and all
whitespace is preserved, including newlines.
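
A minimal usage sketch, assuming the HTML file name is passed as the
first command-line argument. It sorts chunks by their leading characters
as described above. The printed labels and the usage message are
illustrative choices, not part of the library.

```icon
link html

procedure main(args)
   local f, chunk
   # Open the named file; stop with a message if none is given or it
   # cannot be opened.
   f := open(args[1]) | stop("usage: chunks file.html")
   every chunk := htchunks(f) do {
      if match("<!--", chunk) then
         write("unclosed comment: ", image(chunk))
      else if match("<", chunk) then
         write("tag:  ", image(chunk))
      else
         write("text: ", image(chunk))
      }
   close(f)
end
```

image() is used so that embedded newlines and other whitespace, which
htchunks preserves, show up as visible escapes.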
____________________________________________________________

htrefs(f) extracts file/URL references from within an HTML file
and generates a string of the form
        tagname keyword value
for each reference.

A single space character separates the three fields, but if no
value is supplied for the keyword, no space follows the keyword.
Tag and keyword names are always returned in upper case.

Quotation marks are stripped from the value, but note that the
value can contain spaces or other special characters (although
by strict HTML rules it probably shouldn't).

A table in the code determines which fields are references to
other files.  For example, with <IMG>, SRC= is a reference but
WIDTH= is not.  The table is based on the HTML 4.0 standard:
        http://www.w3.org/TR/REC-html40/
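
One way to split each generated string back into its three fields is
sketched below, assuming the file name arrives as the first command-line
argument. Everything after the second space is kept as the value, since
the value may itself contain spaces. The output layout is only an
illustration.

```icon
link html

procedure main(args)
   local f, ref, tagname, keyword
   f := open(args[1]) | stop("usage: refs file.html")
   every ref := htrefs(f) do
      ref ? {
         tagname := tab(upto(' '))        # tag name, up to the first space
         move(1)                          # skip the separating space
         keyword := tab(upto(' ') | 0)    # keyword; may end the string
         if move(1) then                  # a space here means a value follows
            write(tagname, " ", keyword, " -> ", image(tab(0)))
         else
            write(tagname, " ", keyword, " (no value)")
         }
   close(f)
end
```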
____________________________________________________________

httag(s) extracts and returns the tag name from within an HTML
tag string of the form "<tagname...>".   The tag name is returned
in upper case.
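
A short sketch of calling httag() on literal tag strings. The tags are
made-up samples. Per the description above, each name should come back
in upper case regardless of how it was written.

```icon
link html

procedure main()
   # Sample tags invented for this sketch.
   write(httag("<img src=\"logo.gif\" width=100>"))
   write(httag("<A HREF=index.html>"))
end
```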
____________________________________________________________

htvals(s) generates the tag values contained within an HTML tag
string of the form "<tagname kw=val kw=val ...>".   For each
keyword=value pair beyond the tagname, a string of the form

        keyword value

is generated.  One space follows the keyword, which is returned
in upper case, and quotation marks are stripped from the value.
The value itself can be an empty string.

For each keyword given without a value, the keyword is generated
in upper case with no following space.

Parsing is somewhat tolerant of errors.
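
The sketch below runs htvals() over a made-up tag that mixes a quoted
value, an empty value, and a keyword with no value. It tells the last
two cases apart by checking whether a space follows the keyword. The
sample tag and the output layout are assumptions for illustration.

```icon
link html

procedure main()
   local tag, kv, keyword
   # Sample tag invented for this sketch.
   tag := "<img src=\"logo.gif\" alt=\"\" ismap>"
   every kv := htvals(tag) do
      kv ? {
         keyword := tab(upto(' ') | 0)    # keyword ends at a space or at the end
         if move(1) then                  # space present: a value, possibly empty
            write(keyword, " = ", image(tab(0)))
         else
            write(keyword, " (no value)")
         }
end
```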
____________________________________________________________

urlmerge(base,new) interprets a full or partial new URL in the
context of a base URL, returning the combined URL.

Here are some examples of applying urlmerge() with a base value
of "http://www.vcu.edu/misc/sched.html" and a new value as given:

new             result
-------------   -------------------
#tuesday        http://www.vcu.edu/misc/sched.html#tuesday
bulletin.html   http://www.vcu.edu/misc/bulletin.html
./results.html  http://www.vcu.edu/misc/results.html
images/rs.gif   http://www.vcu.edu/misc/images/rs.gif
../             http://www.vcu.edu/
/greet.html     http://www.vcu.edu/greet.html
file:a.html     file:a.html
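
The table above can be reproduced with a short driver such as the
following sketch, which passes each sample value of new to urlmerge()
with the same base. The column width of 16 is an arbitrary formatting
choice.

```icon
link html

procedure main()
   local base, new
   base := "http://www.vcu.edu/misc/sched.html"
   # The same sample values of new as in the table above.
   every new := "#tuesday" | "bulletin.html" | "./results.html" |
                "images/rs.gif" | "../" | "/greet.html" | "file:a.html" do
      write(left(new, 16), urlmerge(base, new))
end
```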
____________________________________________________________

canpath(s) returns the canonical form of a file path by squeezing
out components such as "./" and "dir/../".
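
A small sketch for trying canpath() on a few paths that contain the
components mentioned above. The sample paths are invented for
illustration, and no particular output is claimed here.

```icon
link html

procedure main()
   local p
   # Sample paths invented for this sketch.
   every p := "./a/b.html" | "a/./b.html" | "a/x/../b.html" do
      write(left(p, 16), canpath(p))
end
```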
