findre.icn: Procedure to find regular expression

link findre
March 3, 1996; Richard L. Goerwitz
This file is in the public domain.

DESCRIPTION:  findre() is like the Icon builtin function find(),
except that it takes, as its first argument, a regular expression
pretty much like the ones the Unix egrep command uses (the few
minor differences are listed below).  Its syntax is the same as
find's (i.e. findre(s1,s2,i,j)), with the exception that a no-
argument invocation wipes out all static structures utilized by
findre, and then forces a garbage collection.
____________________________________________________________

(For those not familiar with regular expressions and the Unix egrep
command: findre() offers a simple and compact wildcard-based search
system.  If you do a lot of searches through text files, or write
programs which do searches based on user input, then findre is a
utility you might want to look over.)

IMPORTANT DIFFERENCES between find and findre:  As noted above,
findre() is just a find() function that takes a regular expression
as its first argument.  One major problem with this setup is that
it leaves the user with no easy way to tab past a matched
substring, as with

     s ? write(tab(find("hello")+5))

In order to remedy this intrinsic deficiency, findre() sets the
global variable __endpoint to the first position after any given
match occurs.  Use this variable with great care, preferably
assigning its value to some other variable immediately after the
match (for example, findre("hello [.?!]*",s) & tmp := __endpoint).
Otherwise, you will certainly run into trouble.  (See the example
below for an illustration of how __endpoint is used).

IMPORTANT DIFFERENCES between egrep and findre:  findre utilizes
the same basic language as egrep.  The only big difference is that
findre uses intrinsic Icon data structures and escaping conven-
tions rather than those of any particular Unix variant.  Be care-
ful!  If you put findre("\(hello\)",s) into your source file,
findre will treat it just like findre("(hello)",s).  If, however,
you enter '\(hello\)' at run-time (via, say, findre(!&input,s)),
what Icon receives will depend on your operating system (most
likely, a trace will show "\\(hello\\)").
____________________________________________________________

BUGS:  Space has essentially been conserved at the expense of time
in the automata produced by findre().  The algorithm, in other
words, will produce the equivalent of a pushdown automaton under
certain circumstances, rather than strive (at the expense of space)
for full determinism.  I tried to make up a nfa -> dfa converter
that would only create that portion of the dfa it needed to accept
or reject a string, but the resulting automaton was actually quite
slow (if anyone can think of a way to do this in Icon, and keep it
small and fast, please let us all know about it).  Note that under
version 8 of Icon, findre takes up negligible storage space, due to
the much improved hashing algorithm.  I have not tested it under
version 7, but I would expect it to use up quite a bit more space
in that environment.

IMPORTANT NOTE:  Findre takes a shortest-possible-match approach
to regular expressions.  In other words, if you look for "a*",
findre will not even bother looking for an "a."  It will just match
the empty string.  Without this feature, findre would perform a bit
more slowly.  The problem with such an approach is that often the
user will want to tab past the longest possible string of matched
characters (say tab((findre("a*|b*"), __endpoint)).  In circumstan-
ces like this, please just use something like:

    s ? {
        tab(find("a")) &  # or use Arb() from the IPL (patterns.icn)
        tab(many('a'))
        tab(many('b'))
    }

or else use some combination of findre and the above.
____________________________________________________________

REGULAR EXPRESSION SYNTAX: Regular expression syntax is complex,
and yet simple.  It is simple in the sense that most of its power
is concentrated in about a dozen easy-to-learn symbols.  It is
complex in the sense that, by combining these symbols with
characters, you can represent very intricate patterns.

I make no pretense here of offering a full explanation of regular
expressions, their usage, and the deeper nuances of their syntax.
As noted above, this should be gleaned from a Unix manual.  For
quick reference, however, I have included a brief summary of all
the special symbols used, accompanied by an explanation of what
they mean, and, in some cases, of how they are used (most of this
is taken from the comments prepended to Jerry Nowlin's Icon-grep
command, as posted a couple of years ago):

   ^   -  matches if the following pattern is at the beginning
          of a line (i.e. ^# matches lines beginning with "#")
   $   -  matches if the preceding pattern is at the end of a line
   .   -  matches any single character
   +   -  matches from 1 to any number of occurrences of the
          previous expression (i.e. a character, or set of paren-
          thesized/bracketed characters)
   *   -  matches from 0 to any number of occurrences of the previous
          expression
   \   -  removes the special meaning of any special characters
          recognized by this program (i.e if you want to match lines
          beginning with a "[", write ^\[, and not ^[)
   |   -  matches either the pattern before it, or the one after
          it (i.e. abc|cde matches either abc or cde)
   []  -  matches any member of the enclosed character set, or,
          if ^ is the first character, any nonmember of the
          enclosed character set (i.e. [^ab] matches any character
          _except_ a and b).
   ()  -  used for grouping (e.g. ^(abc|cde)$ matches lines consist-
          ing of either "abc" or "cde," while ^abc|cde$ matches
          lines either beginning with "abc" or ending in "cde")
____________________________________________________________

EXAMPLE program:

procedure main(a)
    while line := !&input do {
        token_list := tokenize_line(line,a[1])
        every write(!token_list)
    }
end

procedure tokenize_line(s,sep)
    tmp_lst := []
    s ? {
        while field := tab(findre(sep)|0) &
        mark := __endpoint
        do {
            put(tmp_lst,"" ~== field)
            if pos(0) then break
            else tab(mark)
        }
    }
    return tmp_lst
end

The above program would be compiled with findre (e.g. "icont
test_prg.icn findre.icn") to produce a single executable which
tokenizes each line of input based on a user-specified delimiter.
Note how __endpoint is set soon after findre() succeeds.  Note
also how empty fields are excluded with "" ~==, etc.  Finally, note
that the temporary list, tmp_lst, is not needed.  It is included
here merely to illustrate one way in which tokens might be stored.

Tokenizing is, of course, only one of many uses one might put
findre to.  It is very helpful in allowing the user to construct
automata at run-time.  If, say, you want to write a program that
searches text files for patterns given by the user, findre would be
a perfect utility to use.  Findre in general permits more compact
expression of patterns than one can obtain using intrinsic Icon
scanning facilities.  Its near complete compatibility with the Unix
regexp library, moreover, makes for greater ease of porting,
especially in cases where Icon is being used to prototype C code.

Source code | Program Library Page | Icon Home Page