Programming Corner from Icon Newsletter 40

December 21, 1992; Icon Version 8

Sets, in the generality found in Icon, are rare in programming languages, and persons who are used to programming in a language like Pascal or C tend to overlook the usefulness of sets in Icon.

The idea of a set is very simple: A set is simply a collection of distinct values. The word distinct is significant; a value can appear only once in a set, as opposed to a list, which can contain many instances of the same value. And, unlike lists, sets are unordered. Consequently, sets embody the concept of membership and are useful for holding values that share a common property, such as a set of all the (distinct) words in a file.

The nice thing about sets is that they are easy to use. A set starts out empty, with no values. Values can be inserted into a set or deleted from a set without having to worry about the space they occupy; sets grow and shrink automatically.

To see how easy it is to use sets, suppose you have a procedure genword() that generates successive words from the input file. If you're willing to accept the definition of a word as a string of consecutive letters, something as simple as the following will do:

procedure genword()

   while line := read() do
      line ? {
         while tab(upto(&letters)) do
            suspend(tab(many(&letters))) \ 1
         }

end

Most text files contain multiple instances of the same words. To find all the distinct words, it's necessary to keep track of the words that have occurred. The distinct nature of values in sets takes care of this automatically. The way a value x is added to a set S is by

insert(S, x)

If x is not in S, it is added to S. If x already is in S, it is not added. It's that simple.

So to create a set of the distinct words produced by genword(), all that's needed is

words := set()		# start empty
every insert(words, genword())

The every control structure keeps resuming genword(), causing it to produce all the words in the input file. Each new word is added to words, while words that have occurred before are ignored. The end result is that words contains all the distinct words in the input file.

The resulting set could be used in many ways. For example, to write a list of all the distinct words, all that's necessary is

every write(!words)

The element-generation operator ! generates the values of words and every causes all of them to be produced.

That's fine if you don't care about the order in which the words are listed. Since a set is (conceptually) unordered, the words may come out in any order. In theory, the last word generated by genword() might be the first one to be listed. Of course, there's some order to the set inside the computer; it's just that it's not one you can predict or that is useful.

It's simple enough, however, to list the words in alphabetical order. Sorting a set produces a list with the set values in order. In the case of strings, this is alphabetical order.

The element-generation operator works on lists as well as sets, but for lists it starts at the beginning and progresses to the end. Consequently,

every write(!sort(words))

lists the words in alphabetical order.

We've taken a bit of a shortcut here. The function sort() produces a list and we've applied the element-generation operation to this list without bothering to save the list. We could have written

wordlist := sort(words)
every write(!wordlist)

In fact, that would have been the thing to do if we needed the sorted list after listing its contents.

There's another order for listing the distinct words that you might want: in order of first occurrence. This is just a little more difficult.

What's necessary is to know when a new word is being added to words. The function insert(S, x) doesn't indicate this, but there's another function, member(S, x), that does. It succeeds if x is in S but fails if it is not.

To get the first-occurrence order of listing, this will do:

every word := genword() do {
   if not member(words, word) then {
      write(word)
      insert(words, word)
      }
   }

Thus, if a word is not in words, it is written and inserted into words.

If you need to keep a list of words in order of first occurrence, rather than just list them as they occur, you can build a list, using Icon's capability to put values on to the end of a list:

wordlist := []          # start empty

every word := genword() do {
   if not member(words, word) then {
      put(wordlist, word)
      insert(words, word)
      }
   }

When this is complete, wordlist contains all the distinct words in order of first occurrence. A listing then can be produced by

every write(!wordlist)

or wordlist can be used in any other way that is needed.

There are many other things you can do with sets. Some are simple, like forming the union and intersection of two sets. For example,

all_words := words1 ++ words2

creates a new set, all_words, that contains all the words in words1 and words2. Similarly,

common_words := words1 ** words2

creates a new set, common_words, that contains only the words that are in both words1 and words2.

In our examples, we've used words and hence all the values in the sets are strings. Sets are more versatile than this. Their values can be anything -- strings, integers, real numbers, even structures.

And a set can contain values of different types. This allows sets to be used for purposes far afield from keeping track of distinct words. But using them in such ways is more complicated.

Icon home page