weblinks.icn: Program to check links in HTML files

October 6, 2010; Gregg M. Townsend
Requires: Unix, dynamic loading
This file is in the public domain.
____________________________________________________________

Weblinks is a program for checking links in a collection of HTML
files.  It is designed for use directly on the file structure
containing the HTML files.

Given one or more starting points, weblinks parses each file and
validates the HTTP: and FILE: links it finds.  Errors are reported
on standard output.  FILE: links, including relative links, can be
followed recursively.
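
The parse-and-classify step described above can be illustrated with a
short Python sketch.  This is not the Icon code that weblinks uses; the
LinkCollector class, the classify function, and the sample URLs are
hypothetical names chosen for illustration, and the sketch assumes that
only href and src attributes carry links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes (an assumed link set)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def classify(base, link):
    """Resolve a link against its referencing page and label its type."""
    absolute = urljoin(base, link)
    scheme = urlparse(absolute).scheme.lower()
    if scheme in ("http", "file"):
        return scheme, absolute
    return "other", absolute  # e.g. ftp: or mailto:, which are not validated


collector = LinkCollector()
collector.feed(
    '<a href="books.htm">Books</a> '
    '<img src="http://example.com/x.gif"> '
    '<a href="mailto:someone@example.com">Mail</a>'
)
results = [classify("file:///cs/www/icon/index.htm", link)
           for link in collector.links]
```

Relative links resolve against the referencing page, which is why a
bare "books.htm" ends up classified as a FILE: link here.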
____________________________________________________________

By design, only local files are scanned.  Only an existence check is
performed for HTTP: links.  Validation of HTTP: links is aided by
caching and subject to speed limits; see "vhttp.icn" for details.

Remote links are checked by sending an HTTP "HEAD" request.
Unfortunately, some sites respond with "Server Error" or even with
snide remarks like "Because I felt like it".  These are reported
as errors and must be inspected manually.

NOTE:  If the environment variable USER is set, as it usually is,
then "From: $USER@hostname" is sent as part of each remote inquiry
in order to identify the source.  This is standard etiquette for
automated checkers.  If USER is not set, but LOGNAME is, then
$LOGNAME is used.
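
As a rough illustration of the HEAD request and the From: header, here
is a Python sketch that builds such a request without sending it.  The
real checking is done by "vhttp.icn"; the function names from_header and
build_head_request, the example host name, and the choice to treat an
empty variable as unset are all assumptions made for this sketch.

```python
import urllib.request


def from_header(env, hostname):
    """Return "user@hostname" from USER, else LOGNAME, else None."""
    user = env.get("USER") or env.get("LOGNAME")  # empty counts as unset here
    return f"{user}@{hostname}" if user else None


def build_head_request(url, env, hostname):
    """Build (but do not send) a HEAD request identifying the sender."""
    request = urllib.request.Request(url, method="HEAD")
    sender = from_header(env, hostname)
    if sender:
        request.add_header("From", sender)
    return request


# Hypothetical environment in which only LOGNAME is set.
request = build_head_request(
    "http://example.com/", {"LOGNAME": "gmt"}, "host.example.com"
)
```

In real use the environment would come from os.environ and the host
name from something like socket.gethostname().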

Limitations:
   url(...) links within embedded stylesheets are not recognized.
   FTP:, MAILTO:, and other link types are not validated.
   Files are checked recursively only if named *.htm*.
   Proper file permission (for web export) is not checked.
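
The "*.htm*" rule can be pictured with a small Python sketch.  It
assumes the pattern is applied to the file name only (not the whole
path) and that matching is case-sensitive; neither detail is stated
above, and is_scannable is a hypothetical name.

```python
import posixpath
from fnmatch import fnmatchcase


def is_scannable(path):
    """True if the file name matches *.htm* (so .htm and .html qualify)."""
    return fnmatchcase(posixpath.basename(path), "*.htm*")


checks = {
    path: is_scannable(path)
    for path in ("/icon/books.htm", "/icon/index.html", "/icon/logo.gif")
}
```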

The common error of failing to put a trailing slash on a directory
specification results in a "453 Is A Directory" error message for a
local file or, typically, a "301 Moved Permanently" message for a
remote file.
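
The local half of this check might look like the following Python
sketch.  Only the "453 Is A Directory" message comes from the text
above; the "200 OK" and "404 Not Found" strings are placeholders
borrowed from HTTP, and accepting a directory whenever the link ends in
a slash is an assumption.  check_local is a hypothetical name.

```python
import os
import tempfile


def check_local(path):
    """Return a status string for a local file reference."""
    if os.path.isdir(path):
        # A directory reference is accepted only with a trailing slash.
        return "200 OK" if path.endswith("/") else "453 Is A Directory"
    if os.path.isfile(path):
        return "200 OK"            # placeholder status
    return "404 Not Found"         # placeholder status


with tempfile.TemporaryDirectory() as directory:
    no_slash = check_local(directory)
    with_slash = check_local(directory + "/")
    missing = check_local(os.path.join(directory, "nope.htm"))
```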
____________________________________________________________

usage:   weblinks [options] file...

-R      follow file links recursively
        (http links are never followed recursively)

-t      trace files as visited

-s      report successes as well as problems

-v      report tracing and successes, if selected, more verbosely

-i      invert output (sort by referencing page, not by status)

-r root
        specify starting point for file names beginning with "/"
        (e.g. -r /cs/www).  This is needed if such references are
        to be followed or checked.  If a root is specified it
        affects all file specifications including those on the
        command line.

-h home
        specify starting point for file names beginning with "/~".

-p prefix[,prefix...]
        prune (don't check) files beginning with a given prefix

-b prefix
        specify bounds for files scanned:  do not scan files
        that do not begin with prefix.  The default bound is the
        directory of the last file name.  For example,
                weblinks /foo/bar /foo/baz
        implies "-b /foo/".
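
The -b and -p prefix tests can be sketched in Python as simple string
comparisons.  This is an illustration under assumptions, not the
program's actual logic: default_bounds, in_bounds, and is_pruned are
hypothetical names, and the sketch assumes plain prefix matching on
path strings.

```python
import posixpath


def default_bounds(files):
    """Directory of the last file name, with a trailing slash."""
    directory = posixpath.dirname(files[-1])
    if directory and not directory.endswith("/"):
        directory += "/"
    return directory


def in_bounds(path, bounds):
    """-b: only files beginning with the bounds prefix are scanned."""
    return path.startswith(bounds)


def is_pruned(path, prune_option):
    """-p: prefix[,prefix...] names files that are not checked."""
    return any(path.startswith(prefix) for prefix in prune_option.split(","))


bounds = default_bounds(["/foo/bar", "/foo/baz"])
inside = in_bounds("/foo/sub/page.htm", bounds)
outside = in_bounds("/other/page.htm", bounds)
pruned = is_pruned("/icon/library/x.htm", "/icon/library,/icon/old")
```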

If the environment variable WEBLINKS_INIT is set, its whitespace-
separated words are prepended to the explicit command argument list.
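
In Python terms, the WEBLINKS_INIT handling amounts to something like
the sketch below.  The function name effective_args is hypothetical,
and the sketch assumes plain whitespace splitting with no quoting
rules.

```python
def effective_args(argv, env):
    """Prepend the words of WEBLINKS_INIT, if set, to the explicit arguments."""
    init = env.get("WEBLINKS_INIT", "")
    return init.split() + list(argv)


args = effective_args(
    ["-R", "-t", "/~gmt/"],
    {"WEBLINKS_INIT": "-r /cs/www -h /cs/www/people"},
)
```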
____________________________________________________________

Examples (all assuming a web area rooted at /cs/www)

        To check one new page:
        weblinks -r /cs/www  /icon/books.htm

        To check a personal hierarchy, with tracing:
        setenv WEBLINKS_INIT "-r /cs/www -h /cs/www/people"
        weblinks -R -t /~gmt/

        To check with pruning:
        weblinks -R -t -r /cs/www -p /icon/library /icon/index.htm
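
The effect of -r and -h in these examples can be pictured with a
Python sketch.  The mapping of "/~name" to a subdirectory "name" under
the home starting point is an assumption based on the examples, and
resolve is a hypothetical function, not part of weblinks.

```python
def resolve(name, root=None, home=None):
    """Map a file name to a file system path using -r root and -h home."""
    if name.startswith("/~") and home is not None:
        # Assumed: "/~gmt/" maps to "<home>/gmt/".
        return home.rstrip("/") + "/" + name[2:]
    if name.startswith("/") and root is not None:
        return root.rstrip("/") + name
    return name  # relative names are left alone


page = resolve("/icon/books.htm", root="/cs/www")
personal = resolve("/~gmt/", root="/cs/www", home="/cs/www/people")
relative = resolve("books.htm", root="/cs/www")
```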
