link findre
March 3, 1996; Richard L. Goerwitz
This file is in the public domain.
DESCRIPTION: findre() is like the Icon builtin function find(), except that it takes, as its first argument, a regular expression pretty much like the ones the Unix egrep command uses (the few minor differences are listed below). Its syntax is the same as find's (i.e. findre(s1,s2,i,j)), with the exception that a no- argument invocation wipes out all static structures utilized by findre, and then forces a garbage collection. ____________________________________________________________ (For those not familiar with regular expressions and the Unix egrep command: findre() offers a simple and compact wildcard-based search system. If you do a lot of searches through text files, or write programs which do searches based on user input, then findre is a utility you might want to look over.) IMPORTANT DIFFERENCES between find and findre: As noted above, findre() is just a find() function that takes a regular expression as its first argument. One major problem with this setup is that it leaves the user with no easy way to tab past a matched substring, as with s ? write(tab(find("hello")+5)) In order to remedy this intrinsic deficiency, findre() sets the global variable __endpoint to the first position after any given match occurs. Use this variable with great care, preferably assigning its value to some other variable immediately after the match (for example, findre("hello [.?!]*",s) & tmp := __endpoint). Otherwise, you will certainly run into trouble. (See the example below for an illustration of how __endpoint is used). IMPORTANT DIFFERENCES between egrep and findre: findre utilizes the same basic language as egrep. The only big difference is that findre uses intrinsic Icon data structures and escaping conven- tions rather than those of any particular Unix variant. Be care- ful! If you put findre("\(hello\)",s) into your source file, findre will treat it just like findre("(hello)",s). If, however, you enter '\(hello\)' at run-time (via, say, findre(!&input,s)), what Icon receives will depend on your operating system (most likely, a trace will show "\\(hello\\)"). ____________________________________________________________ BUGS: Space has essentially been conserved at the expense of time in the automata produced by findre(). The algorithm, in other words, will produce the equivalent of a pushdown automaton under certain circumstances, rather than strive (at the expense of space) for full determinism. I tried to make up a nfa -> dfa converter that would only create that portion of the dfa it needed to accept or reject a string, but the resulting automaton was actually quite slow (if anyone can think of a way to do this in Icon, and keep it small and fast, please let us all know about it). Note that under version 8 of Icon, findre takes up negligible storage space, due to the much improved hashing algorithm. I have not tested it under version 7, but I would expect it to use up quite a bit more space in that environment. IMPORTANT NOTE: Findre takes a shortest-possible-match approach to regular expressions. In other words, if you look for "a*", findre will not even bother looking for an "a." It will just match the empty string. Without this feature, findre would perform a bit more slowly. The problem with such an approach is that often the user will want to tab past the longest possible string of matched characters (say tab((findre("a*|b*"), __endpoint)). In circumstan- ces like this, please just use something like: s ? { tab(find("a")) & # or use Arb() from the IPL (patterns.icn) tab(many('a')) tab(many('b')) } or else use some combination of findre and the above. ____________________________________________________________ REGULAR EXPRESSION SYNTAX: Regular expression syntax is complex, and yet simple. It is simple in the sense that most of its power is concentrated in about a dozen easy-to-learn symbols. It is complex in the sense that, by combining these symbols with characters, you can represent very intricate patterns. I make no pretense here of offering a full explanation of regular expressions, their usage, and the deeper nuances of their syntax. As noted above, this should be gleaned from a Unix manual. For quick reference, however, I have included a brief summary of all the special symbols used, accompanied by an explanation of what they mean, and, in some cases, of how they are used (most of this is taken from the comments prepended to Jerry Nowlin's Icon-grep command, as posted a couple of years ago): ^ - matches if the following pattern is at the beginning of a line (i.e. ^# matches lines beginning with "#") $ - matches if the preceding pattern is at the end of a line . - matches any single character + - matches from 1 to any number of occurrences of the previous expression (i.e. a character, or set of paren- thesized/bracketed characters) * - matches from 0 to any number of occurrences of the previous expression \ - removes the special meaning of any special characters recognized by this program (i.e if you want to match lines beginning with a "[", write ^\[, and not ^[) | - matches either the pattern before it, or the one after it (i.e. abc|cde matches either abc or cde) [] - matches any member of the enclosed character set, or, if ^ is the first character, any nonmember of the enclosed character set (i.e. [^ab] matches any character _except_ a and b). () - used for grouping (e.g. ^(abc|cde)$ matches lines consist- ing of either "abc" or "cde," while ^abc|cde$ matches lines either beginning with "abc" or ending in "cde") ____________________________________________________________ EXAMPLE program: procedure main(a) while line := !&input do { token_list := tokenize_line(line,a[1]) every write(!token_list) } end procedure tokenize_line(s,sep) tmp_lst := [] s ? { while field := tab(findre(sep)|0) & mark := __endpoint do { put(tmp_lst,"" ~== field) if pos(0) then break else tab(mark) } } return tmp_lst end The above program would be compiled with findre (e.g. "icont test_prg.icn findre.icn") to produce a single executable which tokenizes each line of input based on a user-specified delimiter. Note how __endpoint is set soon after findre() succeeds. Note also how empty fields are excluded with "" ~==, etc. Finally, note that the temporary list, tmp_lst, is not needed. It is included here merely to illustrate one way in which tokens might be stored. Tokenizing is, of course, only one of many uses one might put findre to. It is very helpful in allowing the user to construct automata at run-time. If, say, you want to write a program that searches text files for patterns given by the user, findre would be a perfect utility to use. Findre in general permits more compact expression of patterns than one can obtain using intrinsic Icon scanning facilities. Its near complete compatibility with the Unix regexp library, moreover, makes for greater ease of porting, especially in cases where Icon is being used to prototype C code.