The Icon Programming Language
Ralph E. Griswold and Gregg M. Townsend
Department of Computer Science, The University of Arizona
key words: programming languages, string analysis, text processing,
data structures, graphics, visual interfaces
Abstract
The Icon programming language, a descendant of the SNOBOL languages, has
extensive facilities for string analysis, the representation of complex
relationships among data, and graphics programming. This paper describes
the main characteristics of Icon in an informal manner and by example.
Introduction
The humanities, like any field in which programming plays an important role,
benefits from programming languages with features that are well suited for
the subject matter. Computing in the humanities requires many capabilities,
including:
- analysis of text
- representation of structural relationships
- ways of presenting results in an easily understood manner
Research benefits from ease in programming so that ideas can be tested quickly
and with minimum effort. A high-level language therefore offers a considerable
advantage over a lower-level one.
This paper describes the Icon programming language, which is a high-level
language that offers the facilities needed for computing in the humanities.
Background
Icon has its origins is the SNOBOL programming languages. The first SNOBOL
language [1], which supported string analysis with high-level pattern matching,
quickly captured the attention of researchers in the humanities. String
analysis, although lying at the core of many programming problems in the
humanities, is not enough. Largely in response to requests by researchers
in the humanities, SNOBOL evolved through a series of versions to SNOBOL4
[2], which still is in use.
SNOBOL4 greatly extended the concept of pattern matching and added support
for structures: arrays, tables with associative lookup, and programmer-defined
data types.
SNOBOL4 is one of the most powerful programming languages ever designed.
But its idiosyncratic syntax and lack of conventional control structures
discourage many programmers, especially ones who have experience with more
conventional languages. More fundamentally, the lack of integration of SNOBOL4's
pattern-matching facilities with other language features makes it awkward
to use for some programming tasks [3].
The Icon programming language [4] was developed to integrate and generalize
SNOBOL4's facilities, as well as to provide a familiar syntax resembling
Pascal and C. More recently, Icon has added graphics facilities to support
the visual presentation of data and interactive user interfaces [5].
An Overview of Icon
Icon is a high-level programming language; its features are cast close to
the problem domain instead of the hardware of the computers on which it
runs. Icon favors ease of programming and dispenses with much of the tedious
detail of lower-level languages.
Icon's most important capabilities are:
- support for strings of characters of arbitrary length
- high-level facilities for analyzing strings
- a powerful expression-evaluation mechanism
- sophisticated data structures for managing complex relationships
among data
- high-level graphics facilities
- automatic storage management
The rest of this paper describes the major features of Icon in an informal
manner. The examples used are designed to illustrate simple applications
in the humanities, and the ease with which these can be carried out in Icon.
Expression Evaluation
Icon is an expression-based language. Computations are performed by evaluating
expressions, much in the manner of Pascal and C. There are, however, important
differences between expression evaluation in Icon and in most other programming
languages.
Most programming languages base decision making on a logical model -- whether
the result of a conditional computation, such as comparing the magnitude
of two numbers, is true or false. Icon takes a different approach in which
a conditional computation may succeed or fail. This offers two advantages
over the logical model: (1) success and failure are natural concepts to
persons with or without a technical background, and (2) a conditional expression
in Icon can produce a useful value (other than true or false) if it succeeds.
Consider, for example, reading the lines of a text file. An attempt to read
succeeds (and produces a string) as long as there are lines remaining in
the file. When, however, the end of the file is reached, an attempt to read
fails (and produces nothing). Here's how it's cast in Icon:
while line := read() do
process(line)
where process()
is a procedure that performs some computation
on line. The loop continues, processing one line after another, until the
end of the file is reached, in which case read()
fails and
the loop is terminated.
Just as success and failure are familiar in everyday life, so is the possibility
of alternatives. For example, the word "student" occurs twice
in the sentence "My student is lazier than even the average student.".
In text processing, this might take the form of locating the substring "student"
in the string
"My student is lazier than even the average student."
Most programming languages can't deal with such situations directly and
can produce only one location in a given string, even though this kind of
situation arises commonly in text analysis and other contexts. Icon allows
a computation to have more than one value. An expression that can have more
than one value is called a generator. A generator produces successive values
as requested. The number of alternatives a generator produces depends on
how many alternatives there are and how many of them are requested.
For example, find(s1, s2)
generates the locations at which
s1
occurs as a substring of s2
. Continuing the
example above, suppose the strings are
s1 := "student"
s2 := "My student is lazier than even the average student."
then find(s1, s2)
has the alternative locations 4 and 44 (Icon
numbers from 1 as human beings do, not from 0 as computers do).
In a situation where only one value is needed, the first, 4, is produced,
so that
first := find(s1, s2)
assigns 4 to first.
All the alternatives of a generator can be obtained, as illustrated by the
following expression, which writes the locations at which s1
occurs in s2
:
every i := find(s1, s2) do
write(i)
The every
loop requests an alternative from find()
.
If find()
has one, it suspends evaluation and returns the value.
The value is assigned to i
and then written, and every resumes
find()
for another alternative. This continues until find()
has no more alternatives and fails. This loop can be written more compactly
as
every write(find(s1, s2))
Generators are useful in many contexts. For example, i to j
generates the integers from i
to j
, so that
every i := 1 to 10 do
write(sqrt(i))
writes the square roots of the integers 1 through 10.
Alternatives also can be give explicitly, as in
"the" | "The" | "THE"
which generates three strings. Thus,
find("the" | "The" | "THE", text)
finds all the locations of the three strings in text
.
String Scanning
Icon has a high-level facility for analyzing strings, called string scanning.
String scanning is based on two observations about the nature of most string
analysis:
1. It is typical for many analysis operations to be performed
on the same string. Imagine parsing a sentence, for example. The parsing
is likely to require many operations on the sentence to identify its components.
2. Many analysis operations occur at a particular place in a string,
and the place typically advances as analysis continues. Again, think of
parsing a sentence; parsing typically starts at the beginning of the sentence
and progresses toward the end as components are identified. Of course, if
an initial analysis proves to be incorrect later on, the analysis may go
back to an earlier position and look for an alternative.
To simplify string analysis, string scanning maintains a subject on which
analysis operations can be performed without explicitly mentioning the string
being analyzed. String scanning also automatically maintains a position
that serves as a focus of attention in the subject as the analysis proceeds.
A string scanning expression has the form
s ? expr
where s
is the subject string and expr is a scanning expression
that analyzes (scans) it. When a string-scanning expression is evaluated,
the subject is set to s
and the position is set to 1, the beginning
of the subject.
There are two functions that change the position in the subject:
tab(i)
, which sets the position to i
move(i)
, which increments the position by i
Both of these functions produce the substring of the subject between the
position prior to their evaluation and the position after their evaluation.
Both of these functions fail and leave the position unchanged if the specified
position is out of the range of the subject. This failure can be used to
control loops, as in
text ? {
while write(move(1)) do
move(1)
}
which writes the odd-numbered characters of text
, one per line.
String analysis functions, which produce positions, are necessary for much
string scanning. Two of the most useful analysis functions are:
upto(s)
, which returns the position at which a
character of s occurs in the subject
many(s)
, which returns the position after a sequence of characters
of s
These functions examine the part of the subject to the right of the current
position. For example, upto("aeiou")
produces the
position of the first "vowel" to the right of the current position.
Analysis functions fail if what's being looked for doesn't exist.
Analysis functions produce a position; they do not change the position --
tab()
is used for this. For example, the "words"
in a string can be written as follows:
text ? {
while tab(upto(&letters)) do
write(tab(many(&letters)))
}
The keyword &letters
contains the upper- and lowercase
letters. The expression upto(&letters)
produces the position
of the first letter in the subject and provides the argument for tab()
,
which moves the position to that letter. Next tab(many(&letters))
moves the position to the end of the sequence of letters and produces that
substring of the subject, which is written. (The definition of a "word"
used here is overly simple, but it illustrates the general method of string
scanning. It is easy to formulate more sophisticated definitions of a word
that include, for example, compound words, possessives, and contractions.
There is, of course, no algorithmic definition of "word" that
fits all situations.)
Procedures
An Icon program consists of procedures that contain the program's computational
components. Procedures can be generators, as in this example, which generates
the words in a string:
procedure genword(text)
text ? {
while tab(upto(&letters)) do {
word := tab(many(&letters))
suspend word
}
}
fail
end
The argument of genword()
is text
, which is the
subject of string scanning. The expression suspend word suspends evaluation
in the procedure and returns a word. If the procedure is resumed, the procedure
continues evaluation, suspending with another word, if there is one. Otherwise,
the procedure fails. Thus,
while line := read() do
every write(genword(line))
writes all the words in the input file.
Structures
Structures allow collections of values to be organized and accessed in different
ways. Icon provides data structures not commonly found in other programming
languages.
Lists
A list consists of a sequence of values. The values can be given explicitly,
as in
parties := ["Socialist", "Democratic", "Republican"]
which assigns a list of three strings to parties.
The values in a list can be accessed by their position, as in
write(parties[2])
which writes Democratic
.
The operator !
generates the values in a list in order, so
that
every write(!parties)
writes Socialist
, Democratic
, and Republican
.
Lists also can be used as stacks and queues. For example,
put(parties, "Libertarian")
adds "Libertarian"
to the right end of parties. Similarly,
put(parties, "Communist")
adds "Communist"
to the left end of parties.
The function get(parties)
removes and produces the left-most
value in parties, while pull(parties)
removes and produces
the right-most value. Both get()
and pull()
fail
if the list is empty.
These capabilities allow a list to be built without knowing its final size
in advance. For example, to build a list of words in the input file using
genword()
, all that is needed is
word_list := [] # empty list
while line := read() do
every put(word_list, genword(line)) # append words
This works if there is only one word or if there are thousands. The only
limitation is the amount of available memory.
The function sort()
sorts the value in a list. For strings,
sorting is lexical ("alphabetical"). Consequently, all that's
needed to write the words in alphabetical order is
every write(!sort(word_list))
Sets
A set is a collection of distinct values. Unlike for a list, there is no
concept of order for the values of a set, and a set cannot contain duplicate
values. When a value is inserted into a set, if the value already is in
the set, the set is not changed.
This property of sets is useful for eliminating duplicates. For example,
word_set := set() # empty set
while line := read() do
every insert(word_set, genword(line)) # insert words
creates a set, word_set
, of all distinct words in the input
file.
A set can be sorted into a list, so that
every write(!sort(word_set))
writes all the distinct words in alphabetical order.
The function member(word_set, word)
succeeds if word
is in word_set
but fails otherwise. It can be used to write
the distinct words in order of first occurrence:
word_set := set()
while line := read() do
every word := genword(line) do
if not member(word_set, word) then {
write(word)
insert(word_set, word)
}
The expression not
inverts success and failure, so that if
a word is not already in word_set
, it is written (first occurrence)
and then inserted into word_set
so that it will not be written
subsequently.
The standard operations on finite sets are provided. For example,
all_words := word_set1 ++ word_set2
assigns the union of word_set1
and word_set2
to
all_words
-- all the words that are in either word_set1
or word_set2
. Similarly,
common_words := word_set1 ** word_set2
assigns the intersection of word_set1
and word_set2
to common_words
-- only the words that are in both word_set1
and word_set2
.
Tables
A table consists of pairs of values. Each pair consists of a key and a value
that is associated with that key. Keys are like the members of a set: They
are all distinct.
Since the keys are distinct but have associated values (which need not be
distinct), tables are useful for tasks like counting word occurrences. For
example, to create a table of word counts, all that is needed is
word_count := table(0) # empty table
while line := read() do
every word := genword(line) do
word_count[word] := word_count[word] + 1
The argument 0 used in creating a new, empty table, provides an initial
value for newly added keys. The expression word_count[word]
subscripts word_count
with the key word,
whose
value is a string. The first time the string is used as a key, it has an
initial value of 0 (zero occurrences so far), and 1 is added to it to make
its count 1. If the same string is used as a key again, its count is incremented.
The table grows automatically as it is subscripted with new keys.
Tables can be sorted to produce lists, but since tables consist of pairs
of values, the resulting list contains both keys and their corresponding
values. There are several options for sorting tables. Two of the most useful
are sort(word_count, 3
), which produces a list of alternating
keys and values, sorted by key, and sort(word_count, 4)
, which
produces a similar list but sorted by value.
A complete program to count words and list the results ordered alphabetically
is
procedure main()
word_count := table(0) # empty table
while line := read() do
every word := genword(line) do
word_count[word] := word_count[word] + 1
result := sort(word_count, 3)
while write(left(get(result), 15), right(get(result), 8))
end
procedure genword(text)
text ? {
while tab(upto(&letters)) do {
word := tab(many(&letters))
suspend word
}
}
fail
end
The functions left()
and right()
position the
words and their counts in field of fixed length. The writing loop fails
when all values have been removed from result.
Here is the output from this program for a portion of Edgar Allan Poe's
poem, "The Bells".
Bells 1
Future 1
Of 3
On 1
To 2
and 2
bells 10
chiming 1
how 1
impells 1
it 1
of 1
rapture 1
rhyming 1
ringing 1
swinging 1
tells 1
that 1
the 9
It is easy to modify the program to produce a list sorted by count or in
reverse sorting order.
Graphics
Until fairly recently, computer graphics were available only to a few persons
who had access to expensive, specialized equipment. The advent of personal
computers has changed all that. Personal computer operating systems now
provide visual interfaces and tools for the direct manipulation of objects
portrayed on the screen. Almost all commercial computer applications now
provide visual interfaces and many use graphics to display information in
attractive and easily understood ways.
Writing such applications, however, usually still is a difficult and time
consuming task that requires considerable specialized knowledge. Programming
languages, most of which were designed and implemented before graphics facilities
were widely available, have lagged. Most programming languages that support
graphics do so through add-on libraries, and they do not integrate graphics
with other programming language facilities.
Icon addressed this problem in its most recent version, adding high-level
graphics facilities and integrating them with the rest of the language.
Graphics Facilities
Icon's graphic facilities include the following features:
- Windows can be opened and closed as desired.
- Text in a variety of type faces and sizes, including proportional-width
faces, can be written to awindow in the same manner as text is written to
a file.
- Characters from the keyboard can be processed as they are typed, without
waiting for a return character.
- Points, lines, polygons, circles, arcs, and smooth curves can be freely
intermixed with text.
- Color can be specified by numeric value or descriptive phrase.
- Images can be read from files and written to files.
- Buttons, sliders, and other interface tools are available.
In addition, there is an application that supports interactive, intuitive
construction of visual interfaces for Icon programs.
Graphics programming is too extensive a subject to describe in detail in
this paper. The following example illustrates its use. The procedure histogram(tbl)
displays a histogram of word counts given in the table tbl
.
Figure 1 shows typical output.
procedure histogram(tbl)
# sort the table by key
words := sort(tbl, 3)
# set the display parameters
line_height := WAttrib("leading")
window_height := ((*words / 2) + 1) * line_height
hist_offset := 100
scale := 10
# open a window
window := WOpen("label=Histogram", "size=300," || window_height)
# display the information
y := 0
while WWrite(window, " ", get(words)) do {
FillRectangle(window, hist_offset, y + 2,
scale * get(words), line_height - 2)
y := y + line_height
}
# wait for user to dismiss the window
until WQuit(window)
WClose(window)
return
end
Figure 1: A Histogram
An Example Application
An example of an Icon application with a visual interface follows. This
application expands on the simple word-manipulation examples given earlier
and provides the user with a visual display of information about words in
a variety of ways. The interface for this application is shown in Figure
2. The display shows an analysis of an Icon technical report.
Figure 2: The Application Interface
A list of words appears at the left with a histogram in decreasing numerical
order at the right. The display can be scrolled to view long lists.
The File
menu provides options for opening text files for analysis,
saving the analysis as a text file or image, and so on. The Definition
menu provides for different definitions of words, a list of words to be
excluded, and so on. The Display
menu provides various options
for the presentation of the results of word analysis: counts instead of
a histogram, the font for text, scaling of the histogram, colors for highlighting
information of interest, and so on. The Sort
menu provides
various sorting operations.
Dialogs are used for customizing the analysis, as shown in Figures 3 and
4.
Figure 3: The Sorting Option Dialog
Figure 4: The Word Definition Dialog
A customized definition can be given using a regular expression [6].
Conclusions
It might seem obvious that a programmer should use the programming language
that is best suited for the task at hand. The actual situation is more complicated.
No one language is best for all programming tasks. Even for a specific task,
there is little agreement about the best language. Furthermore, programmers
rarely are skilled in more than a few programming languages, and many programmers
only use one, regardless of its appropriateness for a particular problem.
The reasons for this are well known. There are too many programming languages
for one person to master them all. Developing proficiency even in one programming
language requires a large amount of time and effort. Proficiency has its
value even in an inappropriate language. Proficiency also becomes a vested
interest and it's natural to proclaim the virtues of a language and see
it as the best language to use even when it clearly isn't.
Since learning a new programming language requires such a large investment,
there needs to be a good reason to do it. There are many possible reasons,
including intellectual curiosity. The best reason, perhaps, is if the programming
language will produce significantly better results than another. What is
better depends on context. It may be faster execution speed, more capabilities,
reduced programming effort, quicker results, programs that are easier to
maintain and modify, and so forth.
One aspect of a language that often is underrated is ease of programming.
It is not is just a matter of less effort and quicker results. It affects
what is undertaken. Graphics provides one of the best examples. If graphics
are tedious and tricky to use, there is a tendency not to use them and to
design an application without them. Consider a word analysis application.
Without graphics and a visual interface, the application might be designed
to run from the command line with various options. However, the number of
features and the way they interact is very limited in such a design. With
graphics, many features can be cast in understandable and easily understood
ways. Such an approach often suggests useful features that might never be
imagined for an application designed to run from the command line.
For situations in which ease of programming, quick results, and graphics
are important, Icon is a good choice. For computing in the humanities, its
string-analysis facilities, ability to represent complex relationships among
data, and graphics capabilities make Icon particularly appropriate.
Status and Availability
Icon has evolved through a series of versions. The current version is 9,
which is available for MS-DOS, Macintosh/MPW, Windows, Windows NT, VAX/VMS,
and many UNIX platforms. There also are earlier versions for several other
platforms. Icon's graphics facilities are supported by the implementations
for Windows, Windows NT, VAX/VMS, and UNIX. Implementations that support
graphics are under way for other platforms. All implementations of Icon
are in the public domain.
There is a large library of Icon applications and procedure packages that
can be used in Icon programs. The library also provides many examples of
programming in Icon.
Documentation on Icon is extensive. In addition to books on the language
[4], its implementation [7], and graphics programming [8], there are two
newsletters [9, 10] and many technical reports.
Documentation, implementations, and other material are available via the
Internet. On the World Wide Web, Icon is located at
http://www.cs.arizona.edu/icon/
The address for anonymous FTP is
ftp.cs.arizona.edu
From there, cd /icon
.
Information also is available from
Icon Project
Department of Computer Science
The University of Arizona
P.O. Box 210077
Tucson, AZ85721-0077
U.S.A.
(520) 621-6613 (voice)
(520) 621-4246 (fax)
icon-project@cs.arizona.edu
Acknowledgments
Many persons have contributed to the development of Icon. Clint Jeffery
was instrumental in the addition of the graphics facilities. Mary Cameron
wrote the first version of the visual interface builder.
References
1.
"SNOBOL, A String Manipulation Language",
David J. Farber, Ralph E. Griswold, and Ivan P. Polonsky,
Journal
of the ACM, Vol. 11, No. 1, January 1964, pp. 21-30.
2.
The SNOBOL4 Programming Language, second edition,
Ralph E. Griswold, James F. Poage, and Ivan P. Polonsky, Prentice-Hall,
Inc., Englewood Cliffs, New Jersey, 1971.
3.
"An Alternative to the Use of Patterns in String
Processing", Ralph E. Griswold and David R. Hanson,
ACM
Transactions on Programming Languages and Systems, Vol. 2, No. 2, 1980,
pp. 153-172.
4.
The Icon Programming Language, second edition,
Ralph E. Griswold and Madge T. Griswold, Prentice Hall, Englewood Cliffs,
New Jersey, 1990.
5.
Graphics Facilities for the Icon Programming Language;
Version 9.1, Gregg M. Townsend, Ralph E. Griswold, and Clinton L. Jeffery,
Technical Report IPD268, Department of Computer Science, The University
of Arizona, 1995.
6.
The Design and Analysis of Computer Algorithms,
Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman, Addison-Wesley Publishing
Company, Reading, Massachusetts, 1974.
7.
The Implementation of the Icon Programming Language,
Ralph E. Griswold and Madge T. Griswold, Princeton University Press, Princeton,
New Jersey, 1986.
8.
Graphics Programming in Icon, Ralph E. Griswold,
Clinton L. Jeffery, and Gregg M. Townsend, forthcoming.
9.
The Icon Newsletter, Ralph E. Griswold, Madge
T. Griswold, and Gregg M. Townsend, editors, published three times a year.
10.
The Icon Analyst, Ralph E. Griswold, Madge
T. Griswold, and Gregg M. Townsend, editors, published six times a year.
Icon home page