Icon in the Humanities

The following article appeared in TEXT Technology, Vol. 4, No. 1, Spring 1994, pp. 13-19.

Minor changes have been made to bring information up to date as of February 27, 1996.

Ralph E. Griswold

Department of Computer Science, The University of Arizona

In Susan Hockey's recent paper in TEXT Technology, "SNOBOL in the Humanities" [1], she stated "Other programming languages are not good for humanities computing" and went on to say that she thinks SNOBOL (SNOBOL4, to be precise [2]) is the ideal tool for such work. In her article she compares SNOBOL4 programs to those in some other programming languages and provides examples in which SNOBOL4 excels. The main points she makes are that:

1. SNOBOL4 has many powerful features that are useful in problems that arise in the humanities,

2. SNOBOL4 programs are short and easy to write, and

3. SNOBOL4 is easy for humanities students to learn.

As the principal architect of the SNOBOL languages, I confess to having a warm feeling about Hockey's views. But I also feel, as I'm sure other readers must, that her position is too strong.

You can argue about the merits of different programming languages from a variety of positions. I can imagine, although I certainly would not undertake it, an argument that C is better than SNOBOL4 (or any other language) for humanities computing (or anything else). And there are other programming languages for which better arguments can be made.

In this article I want to make an argument for Icon [3], which is, in some ways, a successor to SNOBOL4. I will contend that Icon is at least as good as SNOBOL4 for humanities computing and, more generally, for text processing.

A little history is in order. The first SNOBOL language was developed in 1962. A succession of improvements led to SNOBOL4 in 1968. Few changes have been made to SNOBOL4 since then. Further language development led to Icon in 1978. Icon has evolved through a series of versions, the latest of which was released in 1993.

The design of Icon was motivated by the desire to recast some features of SNOBOL4 in more general ways and to provide facilities that SNOBOL4 lacks. Icon was intended to be good for the same kinds of applications as SNOBOL4. Icon has many of the features of SNOBOL4, but in some respects it is quite different from SNOBOL4.

The important similarities between SNOBOL4 and Icon are:

1. Both have extensive facilities for processing strings (text) and structured data.

2. Both treat strings as "first-class" data values, not arrays of characters.

3. Both support tables that, unlike arrays, can be subscripted by strings and other kinds of values.

4. Both use the concept of success or failure of computations to control program flow.

5. Neither has type or storage declarations.

6. Both have automatic storage management in which space for data created during program execution is provided automatically, while garbage collection recovers unused space for reuse.
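Several of these points can be seen together in a small Icon program. The following is only an illustrative sketch, not something taken from Hockey's paper; the procedure name tally and the variable names are hypothetical, chosen for this example. It counts how often each distinct input line occurs, using a table subscripted by strings, failure of read to end a loop, and no type or storage declarations.

```icon
# Illustrative sketch: count how often each distinct line occurs.
# tally, count, lines, and pair are hypothetical names used only here.
procedure tally(lines)
   count := table(0)             # a table whose default value is 0
   every line := !lines do       # visit each element of the list
      count[line] +:= 1          # subscript the table by a string
   return count
end

procedure main()
   lines := []
   while put(lines, read())      # read() fails at end of input, ending the loop
   count := tally(lines)
   every pair := !sort(count) do # sort produces [key, value] pairs in key order
      write(pair[2], "\t", pair[1])
end
```

Space for the list, the table, and the strings is allocated automatically as the program runs; nothing is sized or freed by the programmer.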

The most obvious differences between SNOBOL4 and Icon are in appearance and structure. A SNOBOL4 program consists of a sequence of statements that are executed one after another with program flow being controlled by gotos. Icon, on the other hand, is an expression-based language that resembles Pascal and C. Icon supports a variety of control structures and qualifies as a structured programming language.

The differences in structure are important. Individual SNOBOL4 statements usually are simple, but in order to construct even the simplest loop in SNOBOL4, it's necessary to label a statement and provide a goto to it. Icon, on the other hand, supports conventional control structures like if-then-else and while-do, which correspond to the logic of common programming tasks.

A short example, based on one in Hockey's paper, illustrates the difference between SNOBOL4 and Icon in this regard -- and also the similarities between the two languages. The problem is a simple one: Process a text file, replacing all occurrences of uppercase characters by lowercase ones, while leaving all other characters unchanged.

In SNOBOL4, a program to do this requires only one statement, although it's too long to fit on a single line:

READ	OUTPUT = REPLACE(INPUT, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
+	'abcdefghijklmnopqrstuvwxyz')			:S(READ)
END

Reference to INPUT causes a line to be read. REPLACE performs a one-to-one mapping as specified by its second and third arguments. Assignment to OUTPUT causes the result to be written out. The goto at the end of the statement transfers control back to the beginning of the statement, provided INPUT succeeded in reading a line. When the end-of-file is reached, the statement fails, and the program terminates by flowing into END.

Here's the corresponding Icon program:

procedure main()
   while write(map(read(), &ucase, &lcase))
end

The first line begins the program. read and write are functions that read and write a line, respectively. map works like SNOBOL4's REPLACE. The keywords &ucase and &lcase contain the upper- and lowercase characters, saving a little keyboarding. The while loop repeats the computation until read, like SNOBOL4's INPUT, fails when there are no more lines.
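For readers who find the nested function calls dense, the same computation can be written out one step at a time. This is only a sketch for illustration; the variable names line and lowered are arbitrary choices, and the behavior is intended to be the same as that of the one-line loop.

```icon
# The same mapping program, unrolled into separate steps.
procedure main()
   while line := read() do {                 # read() fails at end of input, so the loop stops
      lowered := map(line, &ucase, &lcase)   # replace each uppercase letter by its lowercase partner
      write(lowered)                         # write the mapped line
      }
end
```

The compact form works because the failure of read is passed outward through map and write to the while loop, so no separate end-of-file test is needed.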

Both programs are short. The syntaxes differ, and some things are cast in different ways, but the underlying semantics are very similar. It's arguable as to which language is better for this problem. Your view probably reflects the kinds of programming languages with which you are familiar. With respect to Hockey's arguments about ease of programming and brevity, you might find it instructive to write a corresponding program in BASIC, Pascal, or C.

I could go on with larger examples, detailed comparisons, and argue why Icon's built-in control structures are superior to SNOBOL4's do-it-yourself goto-construction kit. But there are more important differences between the two programming languages. One is in high-level string analysis.

SNOBOL4 has an incredibly powerful pattern-matching facility -- no language before or after it, including Icon, rivals it. SNOBOL4's pattern-matching facility is large and complex; over 50 pages in the language reference manual are devoted to it. I won't attempt to do more than scratch the surface here, commenting only on the essential aspects of pattern matching.

In SNOBOL4, a pattern characterizes the properties of a set of strings. In pattern matching, a pattern matches any string that it characterizes. For example, LEN(10) is a pattern that matches any string that is 10 characters long, while ARB is a pattern that matches any string whatsoever. Patterns are data values that can be combined in various ways to form other patterns. For example, P1 | P2 is a pattern that matches anything that P1 or P2 matches. Pattern matching is done in a statement in which a pattern is applied to a string. Like INPUT, pattern matching may succeed or fail -- that is, a pattern may or may not match a specific string.

The way pattern matching is used is illustrated by the following program that breaks up lines of input into "words" (strings of consecutive letters) and writes out each word:

	LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
	WORDPAT = BREAK(LETTERS) SPAN(LETTERS) . OUTPUT
READ	LINE = INPUT	:F(END)
NEXT	LINE ? WORDPAT = ''	:F(READ)S(NEXT)
END

WORDPAT is a pattern that first looks for a letter with BREAK(LETTERS). If there is a letter, SPAN(LETTERS) matches a substring of consecutive letters. OUTPUT is attached to this component of the pattern so that the result is written. The processing loop is similar to the one for mapping characters. A line is read in and assigned to LINE. If there is no more input, the failure goto transfers control to the end of the program. In the statement labeled NEXT, the pattern WORDPAT is applied to LINE. If it is successful, a word is written and everything matched is deleted from LINE by assigning an empty string to it. If the match succeeds, the statement succeeds and control is returned to the beginning of the statement. Otherwise, control is transferred to the statement labeled READ to read another line. An important point, although this simple example does not require it, is that WORDPAT is a data value constructed before the loop. In a more complicated program, the same pattern might be used in several statements.

Icon takes a significantly different approach to high-level string analysis in a facility called string scanning. Icon's approach was motivated by fundamental weaknesses in SNOBOL4's pattern matching. Despite its power, pattern matching offers no way to extend its built-in repertoire and no way to put even a simple loop inside a pattern.

In Icon's string scanning, matching functions are on a par with all other functions. They can be augmented by programmer-defined functions and they can be used in combination with control structures. The Icon version of the program above is

procedure main()
   while line := read() do
      line ? {
         while tab(upto(&letters)) do
            write(tab(many(&letters)))
            }
end

The expression line ? { ... } performs string scanning on line. tab(upto(&letters)) does what BREAK(LETTERS) does in SNOBOL4, while tab(many(&letters)) does what SPAN(LETTERS) does. Icon uses two functions for each so that the location of a character and matching to it can be decomposed for situations in which this is useful.
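The remark that matching functions can be augmented by programmer-defined ones can be made concrete with a short sketch. The procedures word and words below are hypothetical names invented for this illustration; they are not part of Icon's built-in repertoire. word packages the two-step idiom from the program above as a single matching procedure, and words uses it inside an ordinary while loop.

```icon
# word() is a hypothetical programmer-defined matching procedure.
# It moves to the next letter in the subject, then matches and
# produces the run of letters found there. It fails if no letter remains.
procedure word()
   suspend tab(upto(&letters)) & tab(many(&letters))
end

# words(s) is a hypothetical helper that collects the words of s in a list.
procedure words(s)
   result := []
   s ? {
      while put(result, word())   # the loop ends when word() fails
      }
   return result
end

procedure main()
   while line := read() do
      every write(!words(line))   # write each word on its own line
end
```

Because word is an ordinary procedure, it can be called anywhere a built-in matching function can, and the loop around it is an ordinary control structure rather than something built into a pattern.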

Two points are important in this comparison: (1) String analysis in Icon is integrated with other kinds of computation, while pattern matching is separate from the other facilities of SNOBOL4, and (2) string scanning is more flexible than pattern matching but is done at a lower level.

Which is better? There are different opinions, but it's safe to say that pattern matching stands out as a powerful facility that distinguishes SNOBOL4 from other programming languages and is largely responsible for SNOBOL4's dedicated following.

In addition to the differences I've mentioned so far, it's worth noting that both programming languages have powerful features that the other lacks. For example, SNOBOL4 has the ability to turn any string into SNOBOL4 code during program execution. Icon lacks this capability, but it has an expression-level coroutine facility and extensive graphics capabilities.

But such features are not the arguments Hockey makes for using SNOBOL4 in humanities computing. Her third point -- that SNOBOL4 is easy for humanities students to learn -- deserves discussion.

She argues that SNOBOL4 requires no mathematical knowledge to use and that SNOBOL4's syntactic simplicity allows students to write simple programs without having to learn a lot of technical material first. Icon requires no more mathematical knowledge than SNOBOL4 does. Icon does, however, require mastering some aspects of program structure, including control structures and the nesting of expressions. No doubt this is a burden for students with no computer experience. But the situation in this regard is much different from what it was 25 or even 10 years ago. Most young persons now are exposed to computer-related technology at an early age and many learn to program in BASIC or Pascal in high school, regardless of their intellectual orientation. A student who has had even a casual exposure to one of these languages has little trouble with the structure of Icon programs. Such students are more likely to be puzzled by Icon's lack of type declarations. On the other hand, they often find SNOBOL4 totally alien. It may be interesting to them, but it's hardly easier for them to learn than Icon.

I may not have convinced you that Icon is obviously better than SNOBOL4 for computing in the humanities, but I hope you'll agree I've refuted Hockey's statement that languages other than SNOBOL4 are not good for humanities computing.

On the other hand, the choice of a programming language usually is not made entirely (and sometimes hardly at all) on its intrinsic merits. For example, there is a considerable advantage in continuing to use a language you already know, even if there's an obviously better alternative. Many SNOBOL4 programmers have continued to use it at least in part for this reason. Others are enthusiastic converts to Icon.

There are entirely different issues to consider. For all its merits, SNOBOL4 is an old language. Most books about it, including the language reference manual, are out of print. There are public-domain implementations of SNOBOL4 for only a few platforms, and there are no implementations at all for many computers. Documentation on Icon, on the other hand, is readily available and there are public-domain implementations for the Amiga, the Atari ST, CMS, the Macintosh, MS-DOS, MVS, OS/2, UNIX, and VMS.

If you're interested in learning more about Icon, contact

Icon Project
Department of Computer Science
The University of Arizona
Tucson, AZ 85721

(520)-621-6613 (voice)
(520)-621-4246 (fax)

icon-project@cs.arizona.edu

Public-domain implementations of Icon and a large collection of Icon programs are available via anonymous FTP to ftp.cs.arizona.edu. After connecting, cd /icon and get READ.ME for navigation instructions.

References

1. Hockey, Susan. "SNOBOL in the Humanities", TEXT Technology, March, 1993, pp. 7-15.

2. Griswold, Ralph E., Poage, James F., and Polonsky, Ivan P. The SNOBOL4 Programming Language, second edition. Prentice-Hall, Englewood Cliffs, New Jersey, 1971.

3. Griswold, Ralph E. and Griswold, Madge T. The Icon Programming Language, second edition. Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
