String Allocation in Icon

Ralph E. Griswold

Department of Computer Science
The University of Arizona
Tucson, Arizona

IPD277
May 12, 1996
http://www.cs.arizona.edu/icon/docs/ipd275.htm


Note: This report is an adaptation of an article that appeared in Issue 9 of the Icon Analyst in 1991.

Background

Strings that are created during the execution of an Icon program are stored at a place in memory called the string region. Strings in the string region are allocated contiguously, starting at the beginning. Thus, the string region is divided into two portions: an allocated portion and a free portion. Whenever a new string is created, it is added to the end of the allocated portion of the string region, decreasing the size of the free portion. The string region can be depicted as a long sequence of characters:



where free identifies the boundary between the allocated space and the free space. Of course, there are many more characters in the string region than suggested by this diagram -- typically 65K of them.

If the free portion is not large enough to hold a newly created string, a garbage collection is performed, discarding strings that are no longer needed and compressing the allocated part, making more room in the free part. See Reference 1 for more information about the details of allocation and garbage collection.

Strings created during program execution, called allocated strings, are not null-terminated as they are in C and some other programming languages. Null-termination, which adds a character with code zero to the end of a string, is used as a means of finding the end of a string. Icon accomplishes this by keeping both a pointer to the first character of a string and its length in an Icon string value.

This method makes computation of the length of a string fast, and it also allows null characters to appear in Icon strings. Perhaps more important, a substring of an existing string can be formed by changing only the pointer to the first character and the length, whereas with null termination, it's generally necessary to copy a portion of the existing string, since a terminating null character would overwrite a character in the string of which it's a part. A substring operation in Icon such as
s[i:j]
does not allocate any space in the string region. Several Icon operations do allocate space in the string region, the most notable being reading and concatenation.

Concatenation

Concatenation works by copying its two argument strings to the end of the allocated portion of the string region. The result of
s1 || s2
can be depicted as follows:



In general, such a concatenation allocates space for
*s1 + *s2
characters and copies that many characters. One of the problems with concatenating in this fashion is illustrated by the following program segment that builds up a string piece by piece:
result := ""
while s := read() do
   result := result || s
In the loop, space is allocated for the string that is read, which then is assigned to s. The concatenation then allocates space for result and s, and the resulting string is assigned back to result. Note that all new string data comes from reading. If the successive strings read are indicated by subscripts, the pattern of allocation is
s1 s1 s2 s1 s2 s3 s1 s2 s3 s4 s1 s2 s3 s4 ...
Every previously read string is copied to perform the next concatenation. Although the previous values of result are discarded when a garbage collection occurs, the amount of space allocated is clearly much larger than is needed for the final result. In addition, garbage collection can be expensive, and the cost of garbage collection may depend on other things that have gone on during program execution, not just on the concatenation loop.

If you look at the final result of the concatenation loop, you'll see it's just
 s1 s2 s3 s4 ...
The rest is "garbage". Is all the intermediate allocation necessary? That's the kind of question that the implementation of Icon attempts to address.

Optimizations

Short of changing the way concatenation, and hence string allocation, is done, the question becomes a more general one: "Are there situations in which it's not necessary to allocate all of the space needed in the most general case?"

One situation is when one or both of the arguments of concatenation is the empty string, in which case no concatenation or allocation is necessary. We'll come back to this case later.

There are two other situations that lend themselves to optimizations.

Optimization 1: The code segment discussed above illustrates a situation in which an optimization is possible. As the loop is evaluated repeatedly, result is the last string allocated when another string is read in and assigned to s by the next iteration of the loop. After reading, but before concatenating, the last two allocated strings are result and s. But that's exactly what's needed for the concatenation!

In other words, because of the order in which strings are allocated, result and s are already concatenated. And it doesn't take much to check for this situation, since an Icon string value contains a pointer to its first character and the length. For the concatenation
s1 || s2
pseudo-code for the check looks like this:
if loc(s1) + len(s1) = loc(s2) then ... # done
If this test succeeds, no allocation is done, the new value points to s1, and its length is the sum of the lengths of s1 and s2. With this optimization, the pattern of allocation for the loop given earlier is
s1 s2 s3 s4 ... 
which is exactly what's needed for the final result, with no extra allocation. Note that the test above is more general, and applies anywhere in the allocated portion of the string region, although the chance of its succeeding anywhere but for the last two allocated strings is small.

Optimization 2: There's another situation in which part of the allocation for concatenation can be avoided -- when the first argument of the concatenation is the last allocated string and hence at the end of the allocated portion of the string region:
if loc(s1) + len(s1) = free then ... # don't copy
In this case, it's only necessary to append the second argument of the concatenation to the end of the string region. A situation in which Optimization 2 applies is shown by
result := ""
while s := read() do
   result := s || result
In the absence of Optimization 2, the pattern of allocation for the loop is
s1 s1 s2 s2 s1 s3 s3 s2 s1 s4 s4 s3 s2 s1 ... 
With Optimization 2, the allocation pattern is
s1 s2 s1 s3 s2 s1 s4 s3 s2 s1 ...
The savings aren't as great as for Optimization 1; you can't expect to reverse the order of strings, which is what is happening here, without some copying.

Optimization 2 has a long standing. It was first used, to our knowledge, in the SPITBOL implementation of SNOBOL4 [4] and was incorporated in the first implementation of Icon.

Other Optimizations

Earlier we skipped over the situation in which one or both of the arguments of concatenation is the empty string. This sounds like a situation worth checking, since when it occurs, no allocation is necessary.

There's a kicker, however: The test has to be applied for every concatenation. It turns out that the cumulative cost of checking for this situation takes more time on the average than it saves (which is not true for Optimizations 1 and 2).

There's a moral here: An optimization may sound good, but the cost of testing for a special case may outweigh the savings when it does apply.

Some interesting examples of this occur in an implementation of SNOBOL4 that included some clever "heuristics" that turned out to be unfortunate in practice [2].

The trouble is that optimizations usually can't be evaluated analytically. And it may be very time-consuming and expensive to evaluate them in practice. The natural tendency is to rely on an intuitive feeling of the usefulness of an optimization. Such intuition often is faulty.

Other Allocation Strategies

If you think about how Icon allocates string space, you'll notice that the same string may occur in many different places in Icon's allocated string region.

Occasionally someone suggests that before allocating a new string, there should be a search to see if it already exists. This certainly is a bad idea -- it's very expensive to perform character comparisons just on the chance of finding a copy of a string that can be "re-used".

A different idea would be to put all strings in a hash table [3], so that each different string would be allocated only once. Although there are some fast and clever hashing techniques, they eventually come down to character comparison, which is quite expensive compared to an occasional garbage collection to remove unused strings. Furthermore, in a hashing scheme, the sharing of characters among substrings is not possible.

But again, the only way to be sure about this is to actually try it (or possibly simulate it, although a simulation in a case like this is difficult and error-prone). This would be a major effort, and one hardly worth undertaking, especially when the expectation clearly is negative.
.

Taking Advantage of the Optimizations

In most cases, the optimizations for string concatenation, like other aspects of the implementation of Icon, are "just there" and you needn't think about them when you program.

On the other hand, if you like to hone your programs for maximum performance, you may want to give a little thought to taking advantage of the concatenation optimizations.

One thing to watch out for is intermediate allocation that defeats the optimizations. For example, in
result := ""
while s := read() do {
      write(repl("=", *s))
      result := result || s
      }
the allocation for
repl("=", *s)
follows the allocation for
read()
and defeats Optimization 1 that otherwise would apply. On the other hand, other strings can be appended to the concatenation without additional allocation, as in
result := ""
while s := read() do
      result := result || s || ","
Here, Optimization 1 applies to the first (left) concatenation, while Optimzation 2 applies to the second (right) concatenation. It's worth knowing that literal strings are contained in the code produced by compiling a program, and space for using them is not allocated in the string region unless they have to be copied in concatenation or some other operation that allocates space in the string region [5]. For example,
marker := "+"
does not allocate space, but
marker := marker || "+"
does. One situation in which Optimization 1 may apply at a place other than at the end of the string region occurs in string scanning. For example, in
text ? {
   if t := tab(many(&letters)) then
      t := t || tab(upto(' ') | 0)
   }
tab(many(&letters)) and tab(upto(' ') | 0) are adjacent in text, which need not be at the end of the allocated portion of the string region.

Output as a Weak Form of Concatenation

Jim Gimpel likes to call output a "weak form of concatenation". His point is that when a string is written to a file sequentially, it is automatically appended to (concatenated onto) the last string written.

As shown above, the concatenation of strings in a program requires the allocation of space (which may eventually result in a garbage collection) and also the copying of characters. This is costly compared to, say, arithmetic.

In many programs, most strings that are built up by concatenation are eventually written to a file. If you can arrange to write the strings as they are computed and in the order they need to be output, you can save the costs involved in concatenation.

As a very simple example, it's considerably more efficient to use
write(s1, s2)
than to use
write(s1 || s2)
It's often possible to avoid actual concatenation altogether in programs that transform input data. Sometimes all it takes to use output instead of concatenation is looking at the problem in the right way. In fact, it can be fun to see how far you can carry this.

References

1. "Memory Monitoring", The Icon Analyst 2, October 1990, pp. 5-9.

2. "Performance of Storage Management in an Implementation of SNOBOL4", David G. Ripley, Ralph E. Griswold, and David R. Hanson, IEEE Transactions on Software Engineering, Vol. SE-4, No. 2 (1978), pp. 130-137.

3. The Macro Implementation of SNOBOL4; A Case Study of Machine-Independent Software Development, Ralph E. Griswold, W. H. Freeman, San Francisco, California, 1972.

4. "MACRO SPITBOL -- A SNOBOL4 Compiler", Robert B. K. Dewar and Anthony P. McCann, Software -- Practice & Experience, Vol. 7 (1977), pp. 95-113.

5. "An Imaginary Icon Computer", The Icon Analyst 8, October 1991, pp. 2-6.


Icon home page