Programming Corner from Icon Newsletter 20

January 24, 1986 ; Icon Version 5

String Scanning

The string scanning expression in Icon,

expr1 ? expr2

is often regarded by programmers as somewhat of a mystery. This expression is technically a control structure, since it cannot be cast as a procedure call. The reason for this is that actions must be taken after expr1 is evaluated but before expr2 is evaluated, while in a procedure call, all arguments are evaluated before the procedure gains control.

The actions taken between the evaluation of expr1 and expr2 relate to the global variables &subject and &pos. These "state variables" for scanning are set to the string value of expr1 and 1, respectively. If these variables were not set before expr2 was evaluated, expr2 could not "operate on" the value of expr1.

As mentioned above, &subject and &pos are global variables and their values are accessible throughout the program. They are not affected by procedure calls. If they were, it would not be possible to write "matching procedures" to extend the built-in repertoire of matching functions (tab and move).

A string scanning expression, however, cannot just set the values of these variables. In order for nested scanning and scanning in mutual evaluation to work properly, the current values of the state variables must be saved before new values are set and then restored when the scanning expression is complete. Thus, the scope of the scanning variables can be thought of as being dynamic with respect to scanning expressions.

Although the string scanning expression cannot be cast as a procedure call, it can be cast as nested procedure calls:

expr1 ? expr2 -> Escan(Bscan(expr1),expr2)

In the nested call, expr1 is evaluated first. Bscan is then called before expr2 is evaluated. Thus, Bscan can manipulate the state variables before expr2 is evaluated, first saving the current "outer" values, then setting the new "inner" values for expr2 to operate on. If expr2 produces a result, Escan is called. It restores the outer values before producing the result provided by expr2. On the other hand, if expr2 fails, Bscan is resumed and can restore the outer values before it, too, fails.

Note that Escan must have access to the outer values saved by Bscan. This is accomplished by passing these values as the result produced by Bscan. Since there are two values, a record can be used. Procedures to model the string scanning expression in this fashion are:

record ScanEnvir(subject,pos)
procedure Bscan(e1)
   local OuterEnvir

   OuterEnvir := ScanEnvir(&subject,&pos)
   &subject := e1
   &pos := 1
   suspend OuterEnvir
   &subject := OuterEnvir.subject &pos := OuterEnvir.pos
   fail
end
procedure Escan(OuterEnvir,e2)
   local InnerEnvir

   InnerEnvir := ScanEnvir(&subject,&pos)
   &subject := OuterEnvir.subject 
   &pos := OuterEnvir.pos
   suspend e2
   &subject := InnerEnvir.subject
   &pos := InnerEnvir.pos 
   fail
end

Note that if expr2 produces a result, Escan suspends. If it is resumed, Escan restores the inner values and fails, forcing expr2 to be resumed. This allows expr2 to produce a sequence of results.

This is all there is to the string scanning expression -- the maintenance of state variables. All of string analysis and "pattern matching" comes from matching functions in expr2 that examine &subject and change &pos. These functions are simple. For example, tab(i) can be modeled as a procedure as follows:

procedure tab(i)
   suspend .&subject[.&pos:&pos <- i]
end

This leaves the question of where all of the power of string scanning comes from. It derives from the expression evaluation mechanism of Icon: generators and goal-directed evaluation.

Exercises:

1. There is a slight flaw in the procedures given above as a model for string scanning. What is it? Hint: It has nothing to do with the maintenance of state variables.

2. Write a model of string scanning that uses co-expressions and a programmer-defined control operation Scan{expr1,expr2} in place of

expr1
? expr2

.

3. Why are the explicit dereferencing operations necessary in the procedure for tab?

A Programming Idiom

Version 5.10 of Icon has a new sort option for tables that produces a single list of alternating entry and assigned values. For example,

a := sort(t,3)

assigns such a list to a. This option is more convenient and much more efficient in many cases than the standard sort option

a := sort(t,1)

that assigns to a a list of two-element lists.

Consider the problem of writing the entry and assigned values of a table, side-by-side in two columns, using the new option. The obvious approach is

a := sort(t,3)
every i := 1 to *a - 1 by 2 do
   write(a[i],"\t",a[i + 1])

Students in our class on string and list processing techniques came up with a different approach:

a := sort(t,3)
while write(get(a),"\t",get(a))

To be sure, this approach "destroys" the list a, but that normally makes no difference. And not only is this approach simpler than the "obvious" one, it is faster too!

Exercise: The elements produced by sort(t,3) are in increasing order of the entry values. How could the approach above be modified to write the output in descending order of the entry values?

Teasers

Steve Wampler provides two more teasers. Suppose that

t := table([])

Also suppose that getword is a procedure that produces a word from a line of text taken from a file in which lineno is the current line number.

1. Why does the following program segment never increase the size of the table t?

while word := getword() do put(t[word]),lineno)

2. Why does the following program segment increase the size of the table when the previous one does not?

while word := getword() do t[word] |||:= [lineno]

Other Exercises

1. What is the upper bound on the number of results that find(s1,s2) can produce?

2. The two code segments

return x

and

suspend x
fail

are often thought of as being operationally equivalent. Is this really true? If not, explain the difference.

3. It is frequently claimed that the outcome of looping expressions such as

while expr1 do expr2

is failure -- that is, that the loop itself produces no results. Is this always true? If not, give a counter example.

Trivia Corner

Write the shortest possible Icon program whose translation produces at least one instance of every different ucode instruction in Icon's intermediate language. Interpret "shortest possible" to mean the fewest number of characters. (See the Icon implementation book for a discussion of ucode.)

Icon home page