The Icon Programming Language

Ralph E. Griswold and Gregg M. Townsend

Department of Computer Science, The University of Arizona


key words: programming languages, string analysis, text processing, data structures, graphics, visual interfaces

Abstract

The Icon programming language, a descendant of the SNOBOL languages, has extensive facilities for string analysis, the representation of complex relationships among data, and graphics programming. This paper describes the main characteristics of Icon in an informal manner and by example.

Introduction

The humanities, like any field in which programming plays an important role, benefits from programming languages with features that are well suited for the subject matter. Computing in the humanities requires many capabilities, including: Research benefits from ease in programming so that ideas can be tested quickly and with minimum effort. A high-level language therefore offers a considerable advantage over a lower-level one.

This paper describes the Icon programming language, which is a high-level language that offers the facilities needed for computing in the humanities.

Background

Icon has its origins is the SNOBOL programming languages. The first SNOBOL language [1], which supported string analysis with high-level pattern matching, quickly captured the attention of researchers in the humanities. String analysis, although lying at the core of many programming problems in the humanities, is not enough. Largely in response to requests by researchers in the humanities, SNOBOL evolved through a series of versions to SNOBOL4 [2], which still is in use.

SNOBOL4 greatly extended the concept of pattern matching and added support for structures: arrays, tables with associative lookup, and programmer-defined data types.

SNOBOL4 is one of the most powerful programming languages ever designed. But its idiosyncratic syntax and lack of conventional control structures discourage many programmers, especially ones who have experience with more conventional languages. More fundamentally, the lack of integration of SNOBOL4's pattern-matching facilities with other language features makes it awkward to use for some programming tasks [3].

The Icon programming language [4] was developed to integrate and generalize SNOBOL4's facilities, as well as to provide a familiar syntax resembling Pascal and C. More recently, Icon has added graphics facilities to support the visual presentation of data and interactive user interfaces [5].

An Overview of Icon

Icon is a high-level programming language; its features are cast close to the problem domain instead of the hardware of the computers on which it runs. Icon favors ease of programming and dispenses with much of the tedious detail of lower-level languages.
Icon's most important capabilities are: The rest of this paper describes the major features of Icon in an informal manner. The examples used are designed to illustrate simple applications in the humanities, and the ease with which these can be carried out in Icon.

Expression Evaluation

Icon is an expression-based language. Computations are performed by evaluating expressions, much in the manner of Pascal and C. There are, however, important differences between expression evaluation in Icon and in most other programming languages.

Most programming languages base decision making on a logical model -- whether the result of a conditional computation, such as comparing the magnitude of two numbers, is true or false. Icon takes a different approach in which a conditional computation may succeed or fail. This offers two advantages over the logical model: (1) success and failure are natural concepts to persons with or without a technical background, and (2) a conditional expression in Icon can produce a useful value (other than true or false) if it succeeds.

Consider, for example, reading the lines of a text file. An attempt to read succeeds (and produces a string) as long as there are lines remaining in the file. When, however, the end of the file is reached, an attempt to read fails (and produces nothing). Here's how it's cast in Icon:
while line := read() do
   process(line)
where process() is a procedure that performs some computation on line. The loop continues, processing one line after another, until the end of the file is reached, in which case read() fails and the loop is terminated.

Just as success and failure are familiar in everyday life, so is the possibility of alternatives. For example, the word "student" occurs twice in the sentence "My student is lazier than even the average student.". In text processing, this might take the form of locating the substring "student" in the string
"My student is lazier than even the average student."
Most programming languages can't deal with such situations directly and can produce only one location in a given string, even though this kind of situation arises commonly in text analysis and other contexts. Icon allows a computation to have more than one value. An expression that can have more than one value is called a generator. A generator produces successive values as requested. The number of alternatives a generator produces depends on how many alternatives there are and how many of them are requested.

For example, find(s1, s2) generates the locations at which s1 occurs as a substring of s2. Continuing the example above, suppose the strings are
s1 := "student"
s2 := "My student is lazier than even the average student."
then find(s1, s2) has the alternative locations 4 and 44 (Icon numbers from 1 as human beings do, not from 0 as computers do).

In a situation where only one value is needed, the first, 4, is produced, so that
first := find(s1, s2)
assigns 4 to first.

All the alternatives of a generator can be obtained, as illustrated by the following expression, which writes the locations at which s1 occurs in s2:
every i := find(s1, s2) do
   write(i)
The every loop requests an alternative from find(). If find() has one, it suspends evaluation and returns the value. The value is assigned to i and then written, and every resumes find() for another alternative. This continues until find() has no more alternatives and fails. This loop can be written more compactly as
every write(find(s1, s2))
Generators are useful in many contexts. For example, i to j generates the integers from i to j, so that
every i := 1 to 10 do
   write(sqrt(i))
writes the square roots of the integers 1 through 10.

Alternatives also can be give explicitly, as in
"the" | "The" | "THE"
which generates three strings. Thus,
find("the" | "The" | "THE", text)
finds all the locations of the three strings in text.

String Scanning

Icon has a high-level facility for analyzing strings, called string scanning. String scanning is based on two observations about the nature of most string analysis:
1. It is typical for many analysis operations to be performed on the same string. Imagine parsing a sentence, for example. The parsing is likely to require many operations on the sentence to identify its components.

2. Many analysis operations occur at a particular place in a string, and the place typically advances as analysis continues. Again, think of parsing a sentence; parsing typically starts at the beginning of the sentence and progresses toward the end as components are identified. Of course, if an initial analysis proves to be incorrect later on, the analysis may go back to an earlier position and look for an alternative.
To simplify string analysis, string scanning maintains a subject on which analysis operations can be performed without explicitly mentioning the string being analyzed. String scanning also automatically maintains a position that serves as a focus of attention in the subject as the analysis proceeds.

A string scanning expression has the form
s ? expr
where s is the subject string and expr is a scanning expression that analyzes (scans) it. When a string-scanning expression is evaluated, the subject is set to s and the position is set to 1, the beginning of the subject.

There are two functions that change the position in the subject:
tab(i), which sets the position to i

move(i)
, which increments the position by i
Both of these functions produce the substring of the subject between the position prior to their evaluation and the position after their evaluation. Both of these functions fail and leave the position unchanged if the specified position is out of the range of the subject. This failure can be used to control loops, as in
text ? {
   while write(move(1)) do
      move(1)
   }
which writes the odd-numbered characters of text, one per line.

String analysis functions, which produce positions, are necessary for much string scanning. Two of the most useful analysis functions are:
upto(s), which returns the position at which a character of s occurs in the subject

many(s), which returns the position after a sequence of characters of s
These functions examine the part of the subject to the right of the current position. For example, upto("aeiou") produces the position of the first "vowel" to the right of the current position. Analysis functions fail if what's being looked for doesn't exist.

Analysis functions produce a position; they do not change the position -- tab() is used for this. For example, the "words" in a string can be written as follows:
text ? {
   while tab(upto(&letters)) do
      write(tab(many(&letters)))
   }
The keyword &letters contains the upper- and lowercase letters. The expression upto(&letters) produces the position of the first letter in the subject and provides the argument for tab(), which moves the position to that letter. Next tab(many(&letters)) moves the position to the end of the sequence of letters and produces that substring of the subject, which is written. (The definition of a "word" used here is overly simple, but it illustrates the general method of string scanning. It is easy to formulate more sophisticated definitions of a word that include, for example, compound words, possessives, and contractions. There is, of course, no algorithmic definition of "word" that fits all situations.)

Procedures

An Icon program consists of procedures that contain the program's computational components. Procedures can be generators, as in this example, which generates the words in a string:
procedure genword(text)
   text ? {
      while tab(upto(&letters)) do {
         word := tab(many(&letters))
         suspend word
         }
      }
   fail
end
The argument of genword() is text, which is the subject of string scanning. The expression suspend word suspends evaluation in the procedure and returns a word. If the procedure is resumed, the procedure continues evaluation, suspending with another word, if there is one. Otherwise, the procedure fails. Thus,
while line := read() do
   every write(genword(line))
writes all the words in the input file.

Structures

Structures allow collections of values to be organized and accessed in different ways. Icon provides data structures not commonly found in other programming languages.
Lists
A list consists of a sequence of values. The values can be given explicitly, as in
parties := ["Socialist", "Democratic", "Republican"]
which assigns a list of three strings to parties.

The values in a list can be accessed by their position, as in
write(parties[2])

which writes Democratic.

The operator ! generates the values in a list in order, so that
every write(!parties)
writes Socialist, Democratic, and Republican.

Lists also can be used as stacks and queues. For example,
put(parties, "Libertarian")
adds "Libertarian" to the right end of parties. Similarly,
put(parties, "Communist")
adds "Communist" to the left end of parties.

The function get(parties) removes and produces the left-most value in parties, while pull(parties) removes and produces the right-most value. Both get() and pull() fail if the list is empty.

These capabilities allow a list to be built without knowing its final size in advance. For example, to build a list of words in the input file using genword(), all that is needed is
word_list := []                           # empty list
while line := read() do
   every put(word_list, genword(line))    # append words
This works if there is only one word or if there are thousands. The only limitation is the amount of available memory.

The function sort() sorts the value in a list. For strings, sorting is lexical ("alphabetical"). Consequently, all that's needed to write the words in alphabetical order is
every write(!sort(word_list))
Sets
A set is a collection of distinct values. Unlike for a list, there is no concept of order for the values of a set, and a set cannot contain duplicate values. When a value is inserted into a set, if the value already is in the set, the set is not changed.

This property of sets is useful for eliminating duplicates. For example,
word_set := set()                        # empty set
while line := read() do
   every insert(word_set, genword(line)) # insert words
creates a set, word_set, of all distinct words in the input file.

A set can be sorted into a list, so that
every write(!sort(word_set))
writes all the distinct words in alphabetical order.

The function member(word_set, word) succeeds if word is in word_set but fails otherwise. It can be used to write the distinct words in order of first occurrence:
word_set := set()
while line := read() do
   every word := genword(line) do
      if not member(word_set, word) then {
      write(word)
      insert(word_set, word)
      }
The expression not inverts success and failure, so that if a word is not already in word_set, it is written (first occurrence) and then inserted into word_set so that it will not be written subsequently.

The standard operations on finite sets are provided. For example,
all_words := word_set1 ++ word_set2
assigns the union of word_set1 and word_set2 to all_words -- all the words that are in either word_set1 or word_set2. Similarly,
common_words := word_set1 ** word_set2
assigns the intersection of word_set1 and word_set2 to common_words -- only the words that are in both word_set1 and word_set2.
Tables
A table consists of pairs of values. Each pair consists of a key and a value that is associated with that key. Keys are like the members of a set: They are all distinct.

Since the keys are distinct but have associated values (which need not be distinct), tables are useful for tasks like counting word occurrences. For example, to create a table of word counts, all that is needed is
word_count := table(0)                     # empty table
while line := read() do
   every word := genword(line) do
      word_count[word] := word_count[word] + 1
The argument 0 used in creating a new, empty table, provides an initial value for newly added keys. The expression word_count[word] subscripts word_count with the key word, whose value is a string. The first time the string is used as a key, it has an initial value of 0 (zero occurrences so far), and 1 is added to it to make its count 1. If the same string is used as a key again, its count is incremented. The table grows automatically as it is subscripted with new keys.

Tables can be sorted to produce lists, but since tables consist of pairs of values, the resulting list contains both keys and their corresponding values. There are several options for sorting tables. Two of the most useful are sort(word_count, 3), which produces a list of alternating keys and values, sorted by key, and sort(word_count, 4), which produces a similar list but sorted by value.

A complete program to count words and list the results ordered alphabetically is
procedure main()

   word_count := table(0)                 # empty table

   while line := read() do
      every word := genword(line) do
         word_count[word] := word_count[word] + 1

   result := sort(word_count, 3)

   while write(left(get(result), 15), right(get(result), 8))

end

procedure genword(text)
   text ? {
      while tab(upto(&letters)) do {
         word := tab(many(&letters))
         suspend word
         }
      }
   fail
end
The functions left() and right() position the words and their counts in field of fixed length. The writing loop fails when all values have been removed from result.

Here is the output from this program for a portion of Edgar Allan Poe's poem, "The Bells".
Bells                 1
Future                1
Of                    3
On                    1
To                    2
and                   2
bells                10
chiming               1
how                   1
impells               1
it                    1
of                    1
rapture               1
rhyming               1
ringing               1
swinging              1
tells                 1
that                  1
the                   9
It is easy to modify the program to produce a list sorted by count or in reverse sorting order.

Graphics

Until fairly recently, computer graphics were available only to a few persons who had access to expensive, specialized equipment. The advent of personal computers has changed all that. Personal computer operating systems now provide visual interfaces and tools for the direct manipulation of objects portrayed on the screen. Almost all commercial computer applications now provide visual interfaces and many use graphics to display information in attractive and easily understood ways.

Writing such applications, however, usually still is a difficult and time consuming task that requires considerable specialized knowledge. Programming languages, most of which were designed and implemented before graphics facilities were widely available, have lagged. Most programming languages that support graphics do so through add-on libraries, and they do not integrate graphics with other programming language facilities.

Icon addressed this problem in its most recent version, adding high-level graphics facilities and integrating them with the rest of the language.
Graphics Facilities
Icon's graphic facilities include the following features: In addition, there is an application that supports interactive, intuitive construction of visual interfaces for Icon programs.

Graphics programming is too extensive a subject to describe in detail in this paper. The following example illustrates its use. The procedure histogram(tbl) displays a histogram of word counts given in the table tbl. Figure 1 shows typical output.
procedure histogram(tbl)

   # sort the table by key

   words := sort(tbl, 3)

   # set the display parameters

   line_height := WAttrib("leading")
   window_height := ((*words / 2) + 1) * line_height
   hist_offset := 100
   scale := 10

   # open a window

   window := WOpen("label=Histogram", "size=300," || window_height)

   # display the information

   y := 0
   while WWrite(window, " ", get(words)) do {
      FillRectangle(window, hist_offset, y + 2,
         scale * get(words), line_height - 2)
      y := y + line_height
      }

  # wait for user to dismiss the window

   until WQuit(window)

   WClose(window)

   return

end
[screen snapshot]
Figure 1: A Histogram
An Example Application
An example of an Icon application with a visual interface follows. This application expands on the simple word-manipulation examples given earlier and provides the user with a visual display of information about words in a variety of ways. The interface for this application is shown in Figure 2. The display shows an analysis of an Icon technical report.

[screen snapshot]

Figure 2: The Application Interface

A list of words appears at the left with a histogram in decreasing numerical order at the right. The display can be scrolled to view long lists.

The File menu provides options for opening text files for analysis, saving the analysis as a text file or image, and so on. The Definition menu provides for different definitions of words, a list of words to be excluded, and so on. The Display menu provides various options for the presentation of the results of word analysis: counts instead of a histogram, the font for text, scaling of the histogram, colors for highlighting information of interest, and so on. The Sort menu provides various sorting operations.

Dialogs are used for customizing the analysis, as shown in Figures 3 and 4.

[screen snapshot]

Figure 3: The Sorting Option Dialog

[screen snapshot]

Figure 4: The Word Definition Dialog

A customized definition can be given using a regular expression [6].

Conclusions

It might seem obvious that a programmer should use the programming language that is best suited for the task at hand. The actual situation is more complicated. No one language is best for all programming tasks. Even for a specific task, there is little agreement about the best language. Furthermore, programmers rarely are skilled in more than a few programming languages, and many programmers only use one, regardless of its appropriateness for a particular problem.

The reasons for this are well known. There are too many programming languages for one person to master them all. Developing proficiency even in one programming language requires a large amount of time and effort. Proficiency has its value even in an inappropriate language. Proficiency also becomes a vested interest and it's natural to proclaim the virtues of a language and see it as the best language to use even when it clearly isn't.

Since learning a new programming language requires such a large investment, there needs to be a good reason to do it. There are many possible reasons, including intellectual curiosity. The best reason, perhaps, is if the programming language will produce significantly better results than another. What is better depends on context. It may be faster execution speed, more capabilities, reduced programming effort, quicker results, programs that are easier to maintain and modify, and so forth.

One aspect of a language that often is underrated is ease of programming. It is not is just a matter of less effort and quicker results. It affects what is undertaken. Graphics provides one of the best examples. If graphics are tedious and tricky to use, there is a tendency not to use them and to design an application without them. Consider a word analysis application. Without graphics and a visual interface, the application might be designed to run from the command line with various options. However, the number of features and the way they interact is very limited in such a design. With graphics, many features can be cast in understandable and easily understood ways. Such an approach often suggests useful features that might never be imagined for an application designed to run from the command line.

For situations in which ease of programming, quick results, and graphics are important, Icon is a good choice. For computing in the humanities, its string-analysis facilities, ability to represent complex relationships among data, and graphics capabilities make Icon particularly appropriate.

Status and Availability

Icon has evolved through a series of versions. The current version is 9, which is available for MS-DOS, Macintosh/MPW, Windows, Windows NT, VAX/VMS, and many UNIX platforms. There also are earlier versions for several other platforms. Icon's graphics facilities are supported by the implementations for Windows, Windows NT, VAX/VMS, and UNIX. Implementations that support graphics are under way for other platforms. All implementations of Icon are in the public domain.

There is a large library of Icon applications and procedure packages that can be used in Icon programs. The library also provides many examples of programming in Icon.

Documentation on Icon is extensive. In addition to books on the language [4], its implementation [7], and graphics programming [8], there are two newsletters [9, 10] and many technical reports.

Documentation, implementations, and other material are available via the Internet. On the World Wide Web, Icon is located at
http://www.cs.arizona.edu/icon/
The address for anonymous FTP is
ftp.cs.arizona.edu
From there, cd /icon.

Information also is available from
Icon Project
Department of Computer Science
The University of Arizona
P.O. Box 210077
Tucson, AZ85721-0077
U.S.A.

(520) 621-6613 (voice)
(520) 621-4246 (fax)

icon-project@cs.arizona.edu

Acknowledgments

Many persons have contributed to the development of Icon. Clint Jeffery was instrumental in the addition of the graphics facilities. Mary Cameron wrote the first version of the visual interface builder.

References

1. "SNOBOL, A String Manipulation Language", David J. Farber, Ralph E. Griswold, and Ivan P. Polonsky, Journal of the ACM, Vol. 11, No. 1, January 1964, pp. 21-30.

2. The SNOBOL4 Programming Language, second edition, Ralph E. Griswold, James F. Poage, and Ivan P. Polonsky, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

3. "An Alternative to the Use of Patterns in String Processing", Ralph E. Griswold and David R. Hanson, ACM Transactions on Programming Languages and Systems, Vol. 2, No. 2, 1980, pp. 153-172.

4. The Icon Programming Language, second edition, Ralph E. Griswold and Madge T. Griswold, Prentice Hall, Englewood Cliffs, New Jersey, 1990.

5. Graphics Facilities for the Icon Programming Language; Version 9.1, Gregg M. Townsend, Ralph E. Griswold, and Clinton L. Jeffery, Technical Report IPD268, Department of Computer Science, The University of Arizona, 1995.

6. The Design and Analysis of Computer Algorithms, Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman, Addison-Wesley Publishing Company, Reading, Massachusetts, 1974.

7. The Implementation of the Icon Programming Language, Ralph E. Griswold and Madge T. Griswold, Princeton University Press, Princeton, New Jersey, 1986.

8. Graphics Programming in Icon, Ralph E. Griswold, Clinton L. Jeffery, and Gregg M. Townsend, forthcoming.

9. The Icon Newsletter, Ralph E. Griswold, Madge T. Griswold, and Gregg M. Townsend, editors, published three times a year.

10. The Icon Analyst, Ralph E. Griswold, Madge T. Griswold, and Gregg M. Townsend, editors, published six times a year.


Icon home page