procedure RePat: regular expression pattern list procedure ReMatch: position past regular expression matched procedure ReFind: position where regular expression matched
link regexp
May 19, 1996; Robert J. Alexander
This file is in the public domain.
This is a kit of procedures to deal with UNIX-like regular expression patterns. These procedures are interesting partly because of the "recursive suspension" (or "suspensive recursion" :-) technique used to simulate conjunction of an arbitrary number of computed expressions (see notes, below). The public procedures are: ReMatch(pattern,s,i1,i2) : i3,i4,...,iN ReFind(pattern,s,i1,i2) : i3,i4,...,iN RePat(s) : pattern list ____________________________________________________________ ReMatch() produces the sequence of positions in "s" past a substring starting at "i1" that matches "pattern", but fails if there is no such position. Similar to match(), but is capable of generating multiple positions. ReFind() produces the sequence of positions in "s" where substrings begin that match "pattern", but fails if there is no such position. Similar to find(). Each position is produced only once, even if several possible matches are possible at that position. "pattern" can be either a string or a pattern list -- see RePat(), below. Default values of s, i1, and i2 are handled as for Icon's built-in string scanning procedures such as match(). ____________________________________________________________ RePat(s) : L Creates a pattern element list from pattern string "s", but fails if the pattern string is not syntactically correct. ReMatch() and ReFind() will automatically convert a pattern string to a pattern list, but it is faster to do the conversion explicitly if multiple operations are done using the same pattern. An additional advantage to compiling the pattern separately is avoiding ambiguity of failure caused by an incorrect pattern and failure to match a correct pattern. ____________________________________________________________ ReCaseIndependent() : n ReCaseDependent() : n Set mode for case-independent or case-dependent matching. The initial mode is case-dependent. ____________________________________________________________ Accessible Global Variables After a match, the strings matched by parenthesized regular expressions are left in list "Re_ParenGroups", and can be accessed by subscripting in using the same number as the \N construct. If it is desired that regular expression format be similar to UNIX filename generation patterns but still retain the power of full regular expressions, make the following assignments prior to compiling the pattern string: Re_ArbString := "*" # Defaults to ".*" The sets of characters (csets) that define a word, digits, and white space can be modified. The following assignments can be made before compiling the pattern string. The character sets are captured when the pattern is compiled, so changing them after pattern compilation will not alter the behavior of matches unless the pattern string is recompiled. Re_WordChars := 'whatever you like' # Defaults to &letters ++ &digits ++ "_" Re_Digits := &digits ++ 'ABCDEFabcdef' # Defaults to &digits Re_Space := 'whatever you like' # Defaults to ' \t\v\n\r\f' These globals are normally not initialized until the first call to RePat(), and then only if they are null. They can be explicitly initialized to their defaults (if they are null) by calling Re_Default(). ____________________________________________________________ Characters compiled into patterns can be passed through a user-supplied filter procedure, provided in global variable Re_Filter. The filtering is done before the characters are bound into the pattern. The filter proc is passed one argument, the string to filter, and it must return the filtered string as its result. If the filter proc fails, the string will be used unfiltered. The filter proc is called with an argument of either type string (for characters in the pattern) or cset (for character classes [...]). Filtering is done only as the pattern is compiled. Any filtering of strings to be matched must be explicitly done. ____________________________________________________________ By default, individual pattern elements are matched in a "leftmost- longest-first" sequence, which is the order observed by perl, egrep, and most other regular expression matchers. If the order of matching is not important a performance improvement might be seen if pattern elements are matched in "shortest-first" order. The following global variable setting causes the matcher to operate in leftmost-shortest- first order. Re_LeftmostShortest := 1 ____________________________________________________________ In the case of patterns containing alternation, ReFind() will generally not produce positions in increasing order, but will produce all positions from the first term of the alternation (in increasing order) followed by all positions from the second (in increasing order). If it is necessary that the positions be generated in strictly increasing order, with no duplicates, assign any non-null value to Re_Ordered: Re_Ordered := 1 If the Re_Ordered option is chosen, there is a *small* penalty in efficiency in some cases, and the co-expression facility is required in your Icon implementation. ____________________________________________________________ Regular Expression Characters and Features Supported The regular expression format supported by procedures in this file model very closely those supported by the UNIX "egrep" program, with modifications as described in the Perl programming language definition. Following is a brief description of the special characters used in regular expressions. In the description, the abbreviation RE means regular expression. c An ordinary character (not one of the special characters discussed below) is a one-character RE that matches that character. \c A backslash followed by any special character is a one- character RE that matches the special character itself. Note that backslash escape sequences representing non-graphic characters are not supported directly by these procedures. Of course, strings coded in an Icon program will have such escapes handled by the Icon translator. If such escapes must be supported in strings read from the run-time environment (e.g. files), they will have to be converted by other means, such as the Icon Program Library procedure "escape()". . A period is a one-character RE that matches any character. [string] A non-empty string enclosed in square brackets is a one- character RE that matches any *one* character of that string. If, the first character is "^" (circumflex), the RE matches any character not in the remaining characters of the string. The "-" (minus), when between two other characters, may be used to indicate a range of consecutive ASCII characters (e.g. [0-9] is equivalent to [0123456789]). Other special characters stand for themselves in a bracketed string. * Matches zero or more occurrences of the RE to its left. + Matches one or more occurrences of the RE to its left. ? Matches zero or one occurrences of the RE to its left. {N} Matches exactly N occurrences of the RE to its left. {N,} Matches at least N occurrences of the RE to its left. {N,M} Matches at least N occurrences but at most M occurrences of the RE to its left. ^ A caret at the beginning of an entire RE constrains that RE to match an initial substring of the subject string. $ A currency symbol at the end of an entire RE constrains that RE to match a final substring of the subject string. | Alternation: two REs separated by "|" match either a match for the first or a match for the second. () A RE enclosed in parentheses matches a match for the regular expression (parenthesized groups are used for grouping, and for accessing the matched string subsequently in the match using the \N expression). \N Where N is a digit in the range 1-9, matches the same string of characters as was matched by a parenthesized RE to the left in the same RE. The sub-expression specified is that beginning with the Nth occurrence of "(" counting from the left. E.g., ^(.*)\1$ matches a string consisting of two consecutive occurrences of the same string. ____________________________________________________________ Extensions beyond UNIX egrep The following extensions to UNIX REs, as specified in the Perl programming language, are supported. \w Matches any alphanumeric (including "_"). \W Matches any non-alphanumeric. \b Matches only at a word-boundary (word defined as a string of alphanumerics as in \w). \B Matches only non-word-boundaries. \s Matches any white-space character. \S Matches any non-white-space character. \d Matches any digit [0-9]. \D Matches any non-digit. \w, \W, \s, \S, \d, \D can be used within [string] REs. ____________________________________________________________ Notes on computed conjunction expressions by "suspensive recursion" A conjunction expression of an arbitrary number of terms can be computed in a looping fashion by the following recursive technique: procedure Conjunct(v) if <there is another term to be appended to the conjunction> then suspend Conjunct(<the next term expression>) else suspend v end The argument "v" is needed for producing the value of the last term as the value of the conjunction expression, accurately modeling Icon conjunction. If the value of the conjunction is not needed, the technique can be slightly simplified by eliminating "v": procedure ConjunctAndProduceNull() if <there is another term to be appended to the conjunction> then suspend ConjunctAndProduceNull(<the next term expression>) else suspend end Note that <the next term expression> must still remain in the suspend expression to test for failure of the term, although its value is not passed to the recursive invocation. This could have been coded as suspend <the next term expression> & ConjunctAndProduceNull() but wouldn't have been as provocative. Since the computed conjunctions in this program are evaluated only for their side effects, the second technique is used in two situations: (1) To compute the conjunction of all of the elements in the regular expression pattern list (Re_match1()). (2) To evaluate the "exactly N times" and "N to M times" control operations (Re_NTimes()).