procedure htchunks: generate chunks of HTML file
procedure htrefs: generate references from HTML file
procedure httag: extract name of HTML tag
procedure htvals: generate values in HTML tag
procedure urlmerge: merge URLs
procedure canpath: put path in canonical form
link html
April 26, 2005; Gregg M. Townsend
This file is in the public domain.
These procedures parse HTML files:

   htchunks(f)         generates the basic chunks -- tags and text --
                       that compose an HTML file.

   htrefs(f)           generates the tagname/keyword/value combinations
                       that reference other files.

These procedures process strings from HTML files:

   httag(s)            extracts the name of a tag.

   htvals(s)           generates the keyword/value pairs from a tag.

   urlmerge(base,new)  interprets a new URL in the context of a base.

   canpath(s)          puts a path in canonical form.

____________________________________________________________

htchunks(f) generates the HTML chunks from file f.
It returns strings beginning with

   <!--   for unclosed comments (legal comments are deleted)
   <      for tags (will end with ">" unless unclosed at EOF)
   anything else for text

At this level entities such as &amp; are left unprocessed and all
whitespace is preserved, including newlines.

____________________________________________________________

htrefs(f) extracts file/url references from within an HTML file
and generates a string of the form

   tagname keyword value

for each reference.

A single space character separates the three fields, but if no
value is supplied for the keyword, no space follows the keyword.
Tag and keyword names are always returned in upper case.

Quotation marks are stripped from the value, but note that the
value can contain spaces or other special characters (although
by strict HTML rules it probably shouldn't).

A table in the code determines which fields are references to
other files. For example, with <IMG>, SRC= is a reference but
WIDTH= is not. The table is based on the HTML 4.0 standard:

   http://www.w3.org/TR/REC-html40/

____________________________________________________________

httag(s) extracts and returns the tag name from within an HTML
tag string of the form "<tagname...>". The tag name is returned
in upper case.

____________________________________________________________

htvals(s) generates the tag values contained within an HTML tag
string of the form "<tagname kw=val kw=val ...>".

For each keyword=value pair beyond the tagname, a string of the form

   keyword value

is generated. One space follows the keyword, which is returned
in upper case, and quotation marks are stripped from the value.
The value itself can be an empty string.

For each keyword given without a value, the keyword is generated
in upper case with no following space.

Parsing is somewhat tolerant of errors.

____________________________________________________________

urlmerge(base,new) interprets a full or partial new URL in the
context of a base URL, returning the combined URL.

Here are some examples of applying urlmerge() with a base value
of "http://www.vcu.edu/misc/sched.html" and a new value as given:

   new              result
   -------------    -------------------
   #tuesday         http://www.vcu.edu/misc/sched.html#tuesday
   bulletin.html    http://www.vcu.edu/misc/bulletin.html
   ./results.html   http://www.vcu.edu/misc/results.html
   images/rs.gif    http://www.vcu.edu/misc/images/rs.gif
   ../              http://www.vcu.edu/
   /greet.html      http://www.vcu.edu/greet.html
   file:a.html      file:a.html

____________________________________________________________

canpath(s) returns the canonical form of a file path by squeezing
out components such as "./" and "dir/../".
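
The urlmerge() table and the canpath() description above can be
checked against an independent implementation. The following is a
minimal Python sketch, not the library's Icon code: it assumes that
the standard urllib.parse.urljoin and posixpath.normpath functions
are close analogues of urlmerge() and canpath(). The names merge_url
and canonical_path are hypothetical, introduced only for this
illustration.

```python
import posixpath
from urllib.parse import urljoin

BASE = "http://www.vcu.edu/misc/sched.html"

# (new, expected result) pairs copied from the urlmerge() table above.
EXAMPLES = [
    ("#tuesday", "http://www.vcu.edu/misc/sched.html#tuesday"),
    ("bulletin.html", "http://www.vcu.edu/misc/bulletin.html"),
    ("./results.html", "http://www.vcu.edu/misc/results.html"),
    ("images/rs.gif", "http://www.vcu.edu/misc/images/rs.gif"),
    ("../", "http://www.vcu.edu/"),
    ("/greet.html", "http://www.vcu.edu/greet.html"),
    ("file:a.html", "file:a.html"),
]


def merge_url(base, new):
    """Hypothetical stand-in for urlmerge(base, new).

    Assumption: urljoin resolves a relative reference against a base
    in the same way for the cases listed in the table.
    """
    return urljoin(base, new)


def canonical_path(path):
    """Hypothetical stand-in for canpath(s).

    Assumption: normpath squeezes out "./" and "dir/../" components
    as canpath() is described to do; other edge cases may differ.
    """
    return posixpath.normpath(path)


if __name__ == "__main__":
    # Print each new value next to the URL it resolves to.
    for new, expected in EXAMPLES:
        print(f"{new:<16} {merge_url(BASE, new)}")
    print(canonical_path("misc/./images/../rs.gif"))
```

The sketch covers only the cases documented above. Behavior on other
inputs, such as trailing slashes or a leading "../", is not specified
here, so normpath may well differ from canpath() on them.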