The following article appeared in TEXT Technology, Vol. 4, No. 1, Spring
1994, pp. 13-19.
Minor changes have been made to bring information up to date as of February
27, 1996.
Icon in the Humanities
Ralph E. Griswold
Department of Computer Science, The University of Arizona
In Susan Hockey's recent paper in TEXT Technology, "SNOBOL in the Humanities"
[1], she stated "Other programming languages are not good for humanities
computing" and went on to say that she thinks SNOBOL (SNOBOL4, to
be precise [2]) is the ideal tool for such work. In her article she compares
SNOBOL4 programs to those in some other programming languages and provides
examples in which SNOBOL4 excels. The main points she makes are that:
1. SNOBOL4 has many powerful features that are useful in problems that
arise in the humanities,
2. SNOBOL4 programs are short and easy to write, and
3. SNOBOL4 is easy for humanities students to learn.
As the principal architect of the SNOBOL languages, I confess to having
a warm feeling about Hockey's views. But I also feel, as I'm sure other
readers must, that her position is too strong.
You can argue about the merits of different programming languages from a
variety of positions. I can imagine, although I certainly would not undertake
it, an argument that C is better than SNOBOL4 (or any other language) for
humanities computing (or anything else). And there are other programming
languages for which better arguments can be made.
In this article I want to make an argument for Icon [3], which is, in some
ways, a successor to SNOBOL4. I will contend that Icon is at least as good
as SNOBOL4 for humanities computing and, more generally, for text processing.
A little history is in order. The first SNOBOL language was developed in
1962. A succession of improvements led to SNOBOL4 in 1968. Few changes have
been made to SNOBOL4 since then. Further language development led to Icon
in 1978. Icon has evolved through a series of versions, the latest of which
was released in 1993.
The design of Icon was motivated by the desire to recast some features of
SNOBOL4 in more general ways and to provide facilities that SNOBOL4 lacks.
Icon was intended to be good for the same kinds of applications as SNOBOL4.
Icon has many of the features of SNOBOL4, but in some respects it is quite
different from SNOBOL4.
The important similarities between SNOBOL4 and Icon are:
1. Both have extensive facilities for processing strings (text) and
structured data.
2. Both treat strings as "first-class" data values, not arrays
of characters.
3. Both support tables that, unlike arrays, can be subscripted by strings
and other kinds of values.
4. Both use the concept of success or failure of computations to control
program flow.
5. Neither has type or storage declarations.
6. Both have automatic storage management in which space for data created
during program execution is provided automatically, while garbage collection
recovers unused space for reuse.
The most obvious differences between SNOBOL4 and Icon are in appearance
and structure. A SNOBOL4 program consists of a sequence of statements that
are executed one after another with program flow being controlled by gotos.
Icon, on the other hand, is an expression-based language that resembles
Pascal and C. Icon supports a variety of control structures and can be
classified as a structured programming language.
The differences in structure are important. Individual SNOBOL4 statements
usually are simple, but in order to construct even the simplest loop in
SNOBOL4, it's necessary to label a statement and provide a goto to it. Icon,
on the other hand, supports conventional control structures like if-then-else
and while-do, which correspond to the logic of common programming tasks.
A short example, based on one in Hockey's paper, illustrates the difference
between SNOBOL4 and Icon in this regard -- and also the similarities between
the two languages. The problem is a simple one: Process a text file, replacing
all occurrences of uppercase characters by lowercase ones, while leaving
all other characters unchanged.
In SNOBOL4, a program to do this requires only one statement, although it's
too long to fit on a single line:
READ    OUTPUT = REPLACE(INPUT, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
+          'abcdefghijklmnopqrstuvwxyz')            :S(READ)
END
Reference to INPUT causes a line to be read. REPLACE performs a one-to-one
mapping as specified by its second and third arguments. Assignment to OUTPUT
causes the result to be written out. The goto at the end of the statement
transfers control back to the beginning of the statement, provided INPUT
succeeded in reading a line. When the end-of-file is reached, the statement
fails, and the program terminates by flowing into END.
Here's the corresponding Icon program:
procedure main()
   while write(map(read(), &ucase, &lcase))
end
The first line begins the program. read and write are functions that read
and write a line, respectively. map works like SNOBOL4's REPLACE. The
keywords &ucase and &lcase contain the upper- and lowercase characters,
saving a little keyboarding. The while loop repeats the computation until
read, like SNOBOL4's INPUT, fails when there are no more lines.
Both programs are short. The syntaxes differ, and some things are cast in
different ways, but the underlying semantics are very similar. It's arguable
as to which language is better for this problem. Your view probably reflects
the kinds of programming languages with which you are familiar. With respect
to Hockey's arguments about ease of programming and brevity, you might find
it instructive to write a corresponding program in BASIC, Pascal, or C.
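For readers who want to try that comparison in a language not mentioned above, here is a minimal Python sketch of the same character mapping. It is an illustration only, not part of Hockey's paper or of the original comparison; the names UPPER_TO_LOWER, lowercase_line, and lowercase_lines are hypothetical, and the input lines are held in memory rather than read from a file.

```python
import string

# Translation table pairing each uppercase ASCII letter with its lowercase
# counterpart, analogous to the second and third arguments of REPLACE and map.
# (Hypothetical name, chosen for this sketch.)
UPPER_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def lowercase_line(line):
    """Map uppercase ASCII letters to lowercase; leave other characters alone."""
    return line.translate(UPPER_TO_LOWER)


def lowercase_lines(lines):
    """Apply lowercase_line to each line of an iterable (for example, a file)."""
    return [lowercase_line(line) for line in lines]


# In-memory lines stand in for an input file in this sketch.
for converted in lowercase_lines(["SNOBOL4 and Icon", "TEXT Technology, 1994"]):
    print(converted)
```

As in the SNOBOL4 and Icon versions, the mapping is stated once as a pair of character strings and then applied to each line.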
I could go on with larger examples and detailed comparisons, and argue why
Icon's built-in control structures are superior to SNOBOL4's do-it-yourself
goto-construction kit. But there are more important differences between
the two programming languages. One is in high-level string analysis.
SNOBOL4 has an incredibly powerful pattern-matching facility -- no language
before or after it, including Icon, rivals it. SNOBOL4's pattern-matching
facility is large and complex; over 50 pages in the language reference manual
are devoted to it. I won't attempt to do more than scratch the surface here,
commenting only on the essential aspects of pattern matching.
In SNOBOL4, a pattern characterizes the properties of a set of strings.
In pattern matching, a pattern matches any string that it characterizes.
For example, LEN(10) is a pattern that matches any string that is 10
characters long, while ARB is a pattern that matches any string whatsoever.
Patterns are data values that can be combined in various ways to form other
patterns. For example, P1 | P2 is a pattern that matches anything that P1
or P2 matches. Pattern matching is done in a statement in which a pattern
is applied to a string. Like INPUT, pattern matching may succeed or fail --
that is, a pattern may or may not match a specific string.
The way pattern matching is used is illustrated by the following program
that breaks up lines of input into "words" (strings of consecutive
letters) and writes out each word:
        LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
        WORDPAT = BREAK(LETTERS) SPAN(LETTERS) . OUTPUT
READ    LINE = INPUT                                :F(END)
NEXT    LINE ? WORDPAT = ''                         :F(READ)S(NEXT)
END
WORDPAT is a pattern that first looks for a letter with BREAK(LETTERS). If
there is a letter, SPAN(LETTERS) matches a substring of consecutive letters.
OUTPUT is attached to this component of the pattern so that the result is
written. The processing loop is similar to the one for mapping characters.
A line is read in and assigned to LINE. If there is no more input, the
failure goto transfers control to the end of the program. In the statement
labeled NEXT, the pattern WORDPAT is applied to LINE. If the match succeeds,
a word is written, everything matched is deleted from LINE by replacing it
with an empty string, and control is returned to the beginning of the
statement. Otherwise, control is transferred to the statement labeled READ
to read another line. An important point, although this simple example does
not require it, is that WORDPAT is a data value constructed before the loop.
In a more complicated program, the same pattern might be used in several
statements.
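The match-and-delete loop just described can be approximated in Python with a regular expression. This is a rough sketch rather than a faithful translation: the regular expression only imitates BREAK(LETTERS) followed by SPAN(LETTERS), the result is collected in a list instead of being assigned to OUTPUT, and the names WORD_PATTERN and words_in_line are hypothetical.

```python
import re

# Built once, before any loop, and reusable wherever it is needed, much as
# WORDPAT is a data value constructed ahead of time.
# [^A-Za-z]* imitates BREAK(LETTERS): skip up to the first letter.
# ([A-Za-z]+) imitates SPAN(LETTERS): a run of one or more letters.
WORD_PATTERN = re.compile(r"[^A-Za-z]*([A-Za-z]+)")


def words_in_line(line):
    """Return the words in line by repeatedly matching and deleting."""
    found = []
    while True:
        match = WORD_PATTERN.match(line)
        if match is None:
            # No letters remain; the SNOBOL4 program would go read a new line.
            break
        found.append(match.group(1))
        # Drop everything matched so far, like assigning '' to the match.
        line = line[match.end():]
    return found


print(words_in_line("Icon, SNOBOL4 & C!"))
```

The digit in "SNOBOL4" is not a letter, so it ends the word rather than joining it, just as it would with the LETTERS string in the SNOBOL4 program.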
Icon takes a significantly different approach to high-level string analysis
in a facility called string scanning. Icon's approach was motivated by
fundamental weaknesses in SNOBOL4's pattern matching. Despite its power,
there is no way to extend its built-in repertoire and there is no way to
put even a simple loop inside a pattern.
In Icon's string scanning, matching functions are on a par with all other
functions. They can be augmented by programmer-defined functions and they
can be used in combination with control structures. The Icon version of
the program above is
procedure main()
   while line := read() do
      line ? {
         while tab(upto(&letters)) do
            write(tab(many(&letters)))
         }
end
The expression line ? { ... } performs string scanning on line.
tab(upto(&letters)) does what BREAK(LETTERS) does in SNOBOL4, while
tab(many(&letters)) does what SPAN(LETTERS) does. Icon uses two functions
for each so that locating a position and matching up to it can be separated
in situations where this is useful.
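To make the split between locating and matching concrete, here is a small Python model of a scanning position. It is a sketch under several assumptions and not a description of how Icon is implemented: the Scanner class and the scan_words function are hypothetical, positions are 0-based Python indices rather than Icon's positions, failure is represented by None, and upto returns only the first position where Icon's upto can generate more.

```python
import string

LETTERS = frozenset(string.ascii_letters)  # stand-in for Icon's &letters


class Scanner:
    """Hypothetical model of a subject string with a current position."""

    def __init__(self, subject):
        self.subject = subject
        self.pos = 0

    def upto(self, cset):
        """Locate: index of the first character in cset at or after pos."""
        for i in range(self.pos, len(self.subject)):
            if self.subject[i] in cset:
                return i
        return None  # models failure

    def many(self, cset):
        """Locate: index just past a run of cset characters starting at pos."""
        i = self.pos
        while i < len(self.subject) and self.subject[i] in cset:
            i += 1
        return i if i > self.pos else None  # fails if the run is empty

    def tab(self, target):
        """Match: move to target and return the substring passed over."""
        if target is None:
            return None  # failure of the argument propagates
        start, self.pos = self.pos, target
        return self.subject[min(start, target):max(start, target)]


def scan_words(line):
    """Collect words using the same loop shape as the Icon program."""
    scanner = Scanner(line)
    found = []
    while scanner.tab(scanner.upto(LETTERS)) is not None:
        found.append(scanner.tab(scanner.many(LETTERS)))
    return found


print(scan_words("Icon, SNOBOL4 & C!"))
```

Because upto and many only report positions, a program can inspect or test a position before deciding whether to move to it with tab.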
Two points are important in this comparison: (1) String analysis in Icon
is integrated with other kinds of computation, while pattern matching is
separate from the other facilities of SNOBOL4, and (2) string scanning is
more flexible than pattern matching but is done at a lower level.
Which is better? There are different opinions, but it's safe to say that
pattern matching stands out as a powerful facility that distinguishes SNOBOL4
from other programming languages and is largely responsible for SNOBOL4's
dedicated following.
In addition to the differences I've mentioned so far, it's worth noting
that both programming languages have powerful features that the other lacks.
For example, SNOBOL4 has the ability to turn any string into SNOBOL4 code
during program execution. Icon lacks this capability, but it has an expression-level
coroutine facility and extensive graphics capabilities.
But such features are not the arguments Hockey makes for using SNOBOL4 in
humanities computing. Her third point -- that SNOBOL4 is easy for humanities
students to learn -- deserves discussion.
She argues that SNOBOL4 requires no mathematical knowledge to use and that
SNOBOL4's syntactic simplicity allows students to write simple programs
without having to learn a lot of technical material first. Icon requires
no more mathematical knowledge than SNOBOL4 does. Icon does, however, require
mastering some aspects of program structure, including control structures
and the nesting of expressions. No doubt this is a burden for students with
no computer experience. But the situation in this regard is much different
from what it was 25 or even 10 years ago. Most young persons now are exposed
to computer-related technology at an early age and many learn to program
in BASIC or Pascal in high school, regardless of their intellectual orientation.
A student who has had even a casual exposure to one of these languages has
little trouble with the structure of Icon programs. They are more likely
to be puzzled by Icon's lack of type declarations. On the other hand, such
students often find SNOBOL4 totally alien. It may be interesting to them,
but it's hardly easier for them to learn than Icon.
I may not have convinced you that Icon is obviously better than SNOBOL4
for computing in the humanities, but I hope you'll agree I've refuted Hockey's
statement that languages other than SNOBOL4 are not good for humanities
computing.
On the other hand, the choice of a programming language usually is not
made entirely (and sometimes hardly at all) on its intrinsic merits. For
example, there is a considerable advantage in continuing to use a language
you already know, even if there's an obviously better alternative. Many
SNOBOL4 programmers have continued to use it at least in part for this reason.
Others are enthusiastic converts to Icon.
There are entirely different issues to consider. For all its merits, SNOBOL4
is an old language. Most books about it, including the language reference
manual, are out of print. There are public-domain implementations of SNOBOL4
for only a few platforms, and there are no implementations at all for many
computers. Documentation on Icon, on the other hand, is readily available
and there are public-domain implementations for the Amiga, the Atari ST,
CMS, the Macintosh, MS-DOS, MVS, OS/2, UNIX, and VMS.
If you're interested in learning more about Icon, contact
Icon Project
Department of Computer Science
The University of Arizona
Tucson, AZ 85721
(520)-621-6613 (voice)
(520)-621-4246 (fax)
icon-project@cs.arizona.edu
Public-domain implementations of Icon and a large collection of Icon programs
are available via anonymous FTP to ftp.cs.arizona.edu. After connecting,
cd /icon and get READ.ME for navigation instructions.
References
1. Hockey, Susan. "SNOBOL in the Humanities", TEXT Technology, March,
1993, pp. 7-15.
2. Griswold, Ralph E., Poage, James F., and Polonsky, Ivan P. The SNOBOL4
Programming Language, second edition. Prentice-Hall, Englewood Cliffs,
New Jersey, 1971.
3. Griswold, Ralph E. and Griswold, Madge T. The Icon Programming Language,
second edition. Prentice-Hall, Englewood Cliffs, New Jersey, 1990.