procedure RePat: regular expression pattern list procedure ReMatch: position past regular expression matched procedure ReFind: position where regular expression matched
link regexp
May 19, 1996; Robert J. Alexander
This file is in the public domain.
This is a kit of procedures to deal with UNIX-like regular expression
patterns.
These procedures are interesting partly because of the "recursive
suspension" (or "suspensive recursion" :-) technique used to simulate
conjunction of an arbitrary number of computed expressions (see
notes, below).
The public procedures are:
ReMatch(pattern,s,i1,i2) : i3,i4,...,iN
ReFind(pattern,s,i1,i2) : i3,i4,...,iN
RePat(s) : pattern list
____________________________________________________________
ReMatch() produces the sequence of positions in "s" past a substring
starting at "i1" that matches "pattern", but fails if there is no
such position. Similar to match(), but is capable of generating
multiple positions.
ReFind() produces the sequence of positions in "s" where substrings
begin that match "pattern", but fails if there is no such position.
Similar to find(). Each position is produced only once, even if
several possible matches are possible at that position.
"pattern" can be either a string or a pattern list -- see RePat(),
below.
Default values of s, i1, and i2 are handled as for Icon's built-in
string scanning procedures such as match().
____________________________________________________________
RePat(s) : L
Creates a pattern element list from pattern string "s", but fails if
the pattern string is not syntactically correct. ReMatch() and
ReFind() will automatically convert a pattern string to a pattern
list, but it is faster to do the conversion explicitly if multiple
operations are done using the same pattern. An additional advantage
to compiling the pattern separately is avoiding ambiguity of failure
caused by an incorrect pattern and failure to match a correct pattern.
____________________________________________________________
ReCaseIndependent() : n
ReCaseDependent() : n
Set mode for case-independent or case-dependent matching. The initial
mode is case-dependent.
____________________________________________________________
Accessible Global Variables
After a match, the strings matched by parenthesized regular
expressions are left in list "Re_ParenGroups", and can be accessed by
subscripting in using the same number as the \N construct.
If it is desired that regular expression format be similar to UNIX
filename generation patterns but still retain the power of full
regular expressions, make the following assignments prior to
compiling the pattern string:
Re_ArbString := "*" # Defaults to ".*"
The sets of characters (csets) that define a word, digits, and white
space can be modified. The following assignments can be made before
compiling the pattern string. The character sets are captured when
the pattern is compiled, so changing them after pattern compilation
will not alter the behavior of matches unless the pattern string is
recompiled.
Re_WordChars := 'whatever you like'
# Defaults to &letters ++ &digits ++ "_"
Re_Digits := &digits ++ 'ABCDEFabcdef'
# Defaults to &digits
Re_Space := 'whatever you like'
# Defaults to ' \t\v\n\r\f'
These globals are normally not initialized until the first call to
RePat(), and then only if they are null. They can be explicitly
initialized to their defaults (if they are null) by calling
Re_Default().
____________________________________________________________
Characters compiled into patterns can be passed through a
user-supplied filter procedure, provided in global variable
Re_Filter. The filtering is done before the characters are bound
into the pattern. The filter proc is passed one argument, the string
to filter, and it must return the filtered string as its result. If
the filter proc fails, the string will be used unfiltered. The
filter proc is called with an argument of either type string (for
characters in the pattern) or cset (for character classes [...]).
Filtering is done only as the pattern is compiled. Any filtering of
strings to be matched must be explicitly done.
____________________________________________________________
By default, individual pattern elements are matched in a "leftmost-
longest-first" sequence, which is the order observed by perl, egrep,
and most other regular expression matchers. If the order of matching
is not important a performance improvement might be seen if pattern
elements are matched in "shortest-first" order. The following global
variable setting causes the matcher to operate in leftmost-shortest-
first order.
Re_LeftmostShortest := 1
____________________________________________________________
In the case of patterns containing alternation, ReFind() will
generally not produce positions in increasing order, but will produce
all positions from the first term of the alternation (in increasing
order) followed by all positions from the second (in increasing
order). If it is necessary that the positions be generated in
strictly increasing order, with no duplicates, assign any non-null
value to Re_Ordered:
Re_Ordered := 1
If the Re_Ordered option is chosen, there is a *small* penalty in
efficiency in some cases, and the co-expression facility is required
in your Icon implementation.
____________________________________________________________
Regular Expression Characters and Features Supported
The regular expression format supported by procedures in this file
model very closely those supported by the UNIX "egrep" program, with
modifications as described in the Perl programming language
definition. Following is a brief description of the special
characters used in regular expressions. In the description, the
abbreviation RE means regular expression.
c An ordinary character (not one of the special characters
discussed below) is a one-character RE that matches that
character.
\c A backslash followed by any special character is a one-
character RE that matches the special character itself.
Note that backslash escape sequences representing
non-graphic characters are not supported directly
by these procedures. Of course, strings coded in an
Icon program will have such escapes handled by the
Icon translator. If such escapes must be supported
in strings read from the run-time environment (e.g.
files), they will have to be converted by other means,
such as the Icon Program Library procedure "escape()".
. A period is a one-character RE that matches any
character.
[string] A non-empty string enclosed in square brackets is a one-
character RE that matches any *one* character of that
string. If, the first character is "^" (circumflex),
the RE matches any character not in the remaining
characters of the string. The "-" (minus), when between
two other characters, may be used to indicate a range of
consecutive ASCII characters (e.g. [0-9] is equivalent to
[0123456789]). Other special characters stand for
themselves in a bracketed string.
* Matches zero or more occurrences of the RE to its left.
+ Matches one or more occurrences of the RE to its left.
? Matches zero or one occurrences of the RE to its left.
{N} Matches exactly N occurrences of the RE to its left.
{N,} Matches at least N occurrences of the RE to its left.
{N,M} Matches at least N occurrences but at most M occurrences
of the RE to its left.
^ A caret at the beginning of an entire RE constrains
that RE to match an initial substring of the subject
string.
$ A currency symbol at the end of an entire RE constrains
that RE to match a final substring of the subject string.
| Alternation: two REs separated by "|" match either a
match for the first or a match for the second.
() A RE enclosed in parentheses matches a match for the
regular expression (parenthesized groups are used
for grouping, and for accessing the matched string
subsequently in the match using the \N expression).
\N Where N is a digit in the range 1-9, matches the same
string of characters as was matched by a parenthesized
RE to the left in the same RE. The sub-expression
specified is that beginning with the Nth occurrence
of "(" counting from the left. E.g., ^(.*)\1$ matches
a string consisting of two consecutive occurrences of
the same string.
____________________________________________________________
Extensions beyond UNIX egrep
The following extensions to UNIX REs, as specified in the Perl
programming language, are supported.
\w Matches any alphanumeric (including "_").
\W Matches any non-alphanumeric.
\b Matches only at a word-boundary (word defined as a string
of alphanumerics as in \w).
\B Matches only non-word-boundaries.
\s Matches any white-space character.
\S Matches any non-white-space character.
\d Matches any digit [0-9].
\D Matches any non-digit.
\w, \W, \s, \S, \d, \D can be used within [string] REs.
____________________________________________________________
Notes on computed conjunction expressions by "suspensive recursion"
A conjunction expression of an arbitrary number of terms can be
computed in a looping fashion by the following recursive technique:
procedure Conjunct(v)
if <there is another term to be appended to the conjunction> then
suspend Conjunct(<the next term expression>)
else
suspend v
end
The argument "v" is needed for producing the value of the last term
as the value of the conjunction expression, accurately modeling Icon
conjunction. If the value of the conjunction is not needed, the
technique can be slightly simplified by eliminating "v":
procedure ConjunctAndProduceNull()
if <there is another term to be appended to the conjunction> then
suspend ConjunctAndProduceNull(<the next term expression>)
else
suspend
end
Note that <the next term expression> must still remain in the suspend
expression to test for failure of the term, although its value is not
passed to the recursive invocation. This could have been coded as
suspend <the next term expression> & ConjunctAndProduceNull()
but wouldn't have been as provocative.
Since the computed conjunctions in this program are evaluated only for
their side effects, the second technique is used in two situations:
(1) To compute the conjunction of all of the elements in the
regular expression pattern list (Re_match1()).
(2) To evaluate the "exactly N times" and "N to M times"
control operations (Re_NTimes()).