Icon Program Library: procs/regexp.icn

regexp.icn: Procedure for regular-expression pattern matching

procedure RePat:           regular expression pattern list
procedure ReMatch:         position past regular expression matched
procedure ReFind:          position where regular expression matched

link regexp
May 19, 1996; Robert J. Alexander
This file is in the public domain.

This is a kit of procedures to deal with UNIX-like regular expression
patterns.

These procedures are interesting partly because of the "recursive
suspension" (or "suspensive recursion" :-) technique used to simulate
conjunction of an arbitrary number of computed expressions (see
notes, below).

The public procedures are:

ReMatch(pattern,s,i1,i2) : i3,i4,...,iN
ReFind(pattern,s,i1,i2) : i3,i4,...,iN
RePat(s) : pattern list
____________________________________________________________

ReMatch() produces the sequence of positions in "s" past a substring
starting at "i1" that matches "pattern", but fails if there is no
such position.  Similar to match(), but is capable of generating
multiple positions.

ReFind() produces the sequence of positions in "s" where substrings
begin that match "pattern", but fails if there is no such position.
Similar to find().  Each position is produced only once, even if
several possible matches are possible at that position.

"pattern" can be either a string or a pattern list -- see RePat(),
below.

Default values of s, i1, and i2 are handled as for Icon's built-in
string scanning procedures such as match().
____________________________________________________________

RePat(s) : L

Creates a pattern element list from pattern string "s", but fails if
the pattern string is not syntactically correct.  ReMatch() and
ReFind() will automatically convert a pattern string to a pattern
list, but it is faster to do the conversion explicitly if multiple
operations are done using the same pattern.  An additional advantage
to compiling the pattern separately is avoiding ambiguity of failure
caused by an incorrect pattern and failure to match a correct pattern.
____________________________________________________________

ReCaseIndependent() : n
ReCaseDependent() : n

Set mode for case-independent or case-dependent matching.  The initial
mode is case-dependent.
____________________________________________________________

Accessible Global Variables

After a match, the strings matched by parenthesized regular
expressions are left in list "Re_ParenGroups", and can be accessed by
subscripting in using the same number as the \N construct.

If it is desired that regular expression format be similar to UNIX
filename generation patterns but still retain the power of full
regular expressions, make the following assignments prior to
compiling the pattern string:

     Re_ArbString := "*"     # Defaults to ".*"

The sets of characters (csets) that define a word, digits, and white
space can be modified.  The following assignments can be made before
compiling the pattern string.  The character sets are captured when
the pattern is compiled, so changing them after pattern compilation
will not alter the behavior of matches unless the pattern string is
recompiled.

     Re_WordChars := 'whatever you like'
                     # Defaults to &letters ++ &digits ++ "_"
     Re_Digits := &digits ++ 'ABCDEFabcdef'
                     # Defaults to &digits
     Re_Space := 'whatever you like'
                     # Defaults to ' \t\v\n\r\f'

These globals are normally not initialized until the first call to
RePat(), and then only if they are null.  They can be explicitly
initialized to their defaults (if they are null) by calling
Re_Default().
____________________________________________________________

Characters compiled into patterns can be passed through a
user-supplied filter procedure, provided in global variable
Re_Filter.  The filtering is done before the characters are bound
into the pattern.  The filter proc is passed one argument, the string
to filter, and it must return the filtered string as its result.  If
the filter proc fails, the string will be used unfiltered.  The
filter proc is called with an argument of either type string (for
characters in the pattern) or cset (for character classes [...]).

Filtering is done only as the pattern is compiled.  Any filtering of
strings to be matched must be explicitly done.
____________________________________________________________

By default, individual pattern elements are matched in a "leftmost-
longest-first" sequence, which is the order observed by perl, egrep,
and most other regular expression matchers.  If the order of matching
is not important a performance improvement might be seen if pattern
elements are matched in "shortest-first" order.  The following global
variable setting causes the matcher to operate in leftmost-shortest-
first order.

 Re_LeftmostShortest := 1
____________________________________________________________

In the case of patterns containing alternation, ReFind() will
generally not produce positions in increasing order, but will produce
all positions from the first term of the alternation (in increasing
order) followed by all positions from the second (in increasing
order).  If it is necessary that the positions be generated in
strictly increasing order, with no duplicates, assign any non-null
value to Re_Ordered:

     Re_Ordered := 1

If the Re_Ordered option is chosen, there is a *small* penalty in
efficiency in some cases, and the co-expression facility is required
in your Icon implementation.
____________________________________________________________

Regular Expression Characters and Features Supported

The regular expression format supported by procedures in this file
model very closely those supported by the UNIX "egrep" program, with
modifications as described in the Perl programming language
definition.  Following is a brief description of the special
characters used in regular expressions.  In the description, the
abbreviation RE means regular expression.

c            An ordinary character (not one of the special characters
             discussed below) is a one-character RE that matches that
             character.

\c           A backslash followed by any special character is a one-
             character RE that matches the special character itself.

             Note that backslash escape sequences representing
             non-graphic characters are not supported directly
             by these procedures.  Of course, strings coded in an
             Icon program will have such escapes handled by the
             Icon translator.  If such escapes must be supported
             in strings read from the run-time environment (e.g.
             files), they will have to be converted by other means,
             such as the Icon Program Library procedure "escape()".

.            A period is a one-character RE that matches any
             character.

[string]     A non-empty string enclosed in square brackets is a one-
             character RE that matches any *one* character of that
             string.  If, the first character is "^" (circumflex),
             the RE matches any character not in the remaining
             characters of the string.  The "-" (minus), when between
             two other characters, may be used to indicate a range of
             consecutive ASCII characters (e.g. [0-9] is equivalent to
             [0123456789]).  Other special characters stand for
             themselves in a bracketed string.

*            Matches zero or more occurrences of the RE to its left.

+            Matches one or more occurrences of the RE to its left.

?            Matches zero or one occurrences of the RE to its left.

{N}          Matches exactly N occurrences of the RE to its left.

{N,}         Matches at least N occurrences of the RE to its left.

{N,M}        Matches at least N occurrences but at most M occurrences
             of the RE to its left.

^            A caret at the beginning of an entire RE constrains
             that RE to match an initial substring of the subject
             string.

$            A currency symbol at the end of an entire RE constrains
             that RE to match a final substring of the subject string.

|            Alternation: two REs separated by "|" match either a
             match for the first or a match for the second.

()           A RE enclosed in parentheses matches a match for the
             regular expression (parenthesized groups are used
             for grouping, and for accessing the matched string
             subsequently in the match using the \N expression).

\N           Where N is a digit in the range 1-9, matches the same
             string of characters as was matched by a parenthesized
             RE to the left in the same RE.  The sub-expression
             specified is that beginning with the Nth occurrence
             of "(" counting from the left.  E.g., ^(.*)\1$ matches
             a string consisting of two consecutive occurrences of
             the same string.
____________________________________________________________

Extensions beyond UNIX egrep

The following extensions to UNIX REs, as specified in the Perl
programming language, are supported.

\w           Matches any alphanumeric (including "_").
\W           Matches any non-alphanumeric.

\b           Matches only at a word-boundary (word defined as a string
             of alphanumerics as in \w).
\B           Matches only non-word-boundaries.

\s           Matches any white-space character.
\S           Matches any non-white-space character.

\d           Matches any digit [0-9].
\D           Matches any non-digit.

\w, \W, \s, \S, \d, \D can be used within [string] REs.
____________________________________________________________

Notes on computed conjunction expressions by "suspensive recursion"

A conjunction expression of an arbitrary number of terms can be
computed in a looping fashion by the following recursive technique:

     procedure Conjunct(v)
        if <there is another term to be appended to the conjunction> then
           suspend Conjunct(<the next term expression>)
        else
           suspend v
     end

The argument "v" is needed for producing the value of the last term
as the value of the conjunction expression, accurately modeling Icon
conjunction.  If the value of the conjunction is not needed, the
technique can be slightly simplified by eliminating "v":

     procedure ConjunctAndProduceNull()
        if <there is another term to be appended to the conjunction> then
           suspend ConjunctAndProduceNull(<the next term expression>)
        else
           suspend
     end

Note that <the next term expression> must still remain in the suspend
expression to test for failure of the term, although its value is not
passed to the recursive invocation.  This could have been coded as

           suspend <the next term expression> & ConjunctAndProduceNull()

but wouldn't have been as provocative.

Since the computed conjunctions in this program are evaluated only for
their side effects, the second technique is used in two situations:

     (1)     To compute the conjunction of all of the elements in the
             regular expression pattern list (Re_match1()).

     (2)     To evaluate the "exactly N times" and "N to M times"
             control operations (Re_NTimes()).

Source code | Program Library Page | Icon Home Page