Programming Corner from Icon Newsletter 16

November 12, 1984 ; Version 5

Old Business

Consider the program at the end of the last Newsletter. If line is not empty, tally(line) is called and line is written. However, since tally has no explicit return, it fails by flowing off the end of the procedure body. Since the call fails, the argument

("" ~=== line) | "<empty line>")

is resumed, resulting in the call tally("<empty line>"). Consequently, <empty line> is written after every nonempty line, as well as in place of empty lines.

The cure is simple -- insert a return at the end of the procedure body for tally. (What is another way of fixing the problem?) The lesson is a more general one -- be careful to provide a return at the end of a procedure body unless calls of the procedure are supposed to fail.

Chosing Programming Techniques in Icon

There often are several ways of doing the same thing in Icon. While this is true of most programming languages, it is exaggerated in Icon, since its expression evaluation mechanism is more general than the expression evaluation mechanisms of "Algol-like" languages, such as Pascal and C. Thus, in Icon, there is often an Algol-like solution and also a solution that makes use of generators. The fact that Icon has both low-level string operations and higher-level string scanning complicates the situation.

The novice Icon programmer (and even the more advanced one) is faced with choices that may be confusing or even bewildering. Most programmers develop a fixed set of techniques and often fail to use the full potential of the language.

In the discussion that follows, a simple text processing problem is approached from a variety of ways to illustrate and compare different programming techniques in Icon. The problem is to count the number of times the string s occurs in the file f, which is formulated in terms of a procedure scount(s,f).

The first attempt at such a procedure might be

# scount1

procedure scount(s,f)
   count := 0
   while line := read(f) do
      while i := find(s,line) do {
         count +:= 1 line := line[i + 1:0]
         }
   return count
end

This solution is very Algol-like in nature and explicitly examines successive portions of each input line. It does not take advantage of the third argument of find, which allows the starting position for the examination to be specified. Using this feature, the procedure becomes

# scount2

procedure scount(s,f)
   count := 0
   while line := read(f) do {
      i := 1
      while i := find(s,line,i) + 1 do
         count +:= 1
      }
   return count
end

While this solution is shorter than the previous one, it is still Algol-like and uses only low-level string processing. Using string scanning, an alternative solution is

# scount3

procedure scount(s,f)
   count := 0
   while line := read(f) do
      line ? while tab(find(s) + 1) do
         count +:= 1
   return count
end

This approach eliminates the auxiliary identifier i and uses scanning to move through the string. None of the solutions above uses the capacity of find to generate the positions of successive instances of s, however. If this capacity is used, it is not necessary to tab through the string, and the following solution will do:

# scount4

procedure scount(s,f)
   count := 0
   while line := read(f) do
      line ? every find(s) do
         count +:= 1
   return count
end

At this point, it becomes clear that string scanning provides no advantage and it can be eliminated in favor of the following solution:

# scount5

procedure scount(s,f)
   count := 0
   while line := read(f) do
      every find(s,line) do
         count +:= 1
   return count
end

Finally, the hard-core Icon programmer may want to get rid of the while loop, making use of the fact that the expression !f generates the input lines from f. The solution then becomes

# scount6

procedure scount(s,f)
   count := 0
   every find(s,!f) do
      count +:= 1
   return count
end

The relative merits of these different solutions are arguable on stylistic grounds. Certainly the last (scount6) is the most concise, but scount5 is probably easier to understand.

But what about efficiency. Is scount6 more efficient than scount5? In fact, how much difference in performance is there among all the solutions?

The relative efficiency varies considerably, depending on the data -- how many lines there are in f, how long the lines are, how many times s occurs in f, and so forth. The following figures are typical, however. The figures are normalized so that the fastest solution has the value 1.

scount1 2.96
scount2 2.14
scount3 2.03
scount4 1.14
scount5 1.04
scount6 1.00

The fact that the last solution is the fastest may not be surprising -- it is the shortest and uses the features of Icon that are most effective in internalizing computation. Nor should it be surprising that scount2 is significantly faster than scount1, since scount2 avoids the formation of substrings. (It is worth noting, however, that substring formation is relatively efficient in Icon -- no new strings are constructed, only pointers to portions of old ones.)

It might be surprising, however, to discover that the use of string scanning in scount3 provides a significant advantage over scount2. Evidently, the internalization of the string and position that scanning provides more than overcomes the fact that scount3 produces substrings (by tab).

The real gain in efficiency comes with the use of generators in scount4, where the state of the computation is maintained in find for all the positions in any one line.

Getting rid of string scanning, which serves no useful purpose in scount5, produces an expected improvement, although perhaps not as much as might be expected. The last step of reducing the nested loops to a single loop in scount6 also produces a slight improvement in performance.

What might be learned from these examples is that there may be a very substantial difference in performance in Icon, depending on the technique used -- a factor of nearly 3 between the naive solution of scount1 and the sophisticated one of scount6. The value of using the capabilities of generators is also evident, both in performance and in the conciseness of the solutions. It is notable that string scanning is not as expensive as one might imagine. It can be used without the fear that it will degrade performance substantially.

Different Ways of Looking at Things

Steve Wampler contributes the following interesting note on programming in Icon:

How do you test to see if the value of i is not between 1 and the length of the string s? I would write:

if not (1 <= i <= *s) then ...
but my students wrote:

if *s < i < 1 then ...

Icon home page