LispKit Regexp

Library (lispkit regexp) provides an API for defining regular expressions and applying them to strings. Both matching and search/replace are supported.

Regular expressions

The regular expression syntax supported by this library corresponds to that of NSRegularExpression in Apple's Foundation framework. The documentation in this section also originates from there.

Meta-characters

\a : Match a bell (\u0007)
\A : Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b : Outside of a [Set], match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. Inside of a [Set], match a backspace (\u0008).
\B : Match if the current position is not a word boundary.
\cX : Match a control-X character.
\d : Match any character with the Unicode general category Nd (Number, Decimal Digit).
\D : Match any character that is not a decimal digit.
\e : Match an escape (\u001B).
\E : Terminates a \Q ... \E quoted sequence.
\f : Match a form feed (\u000C).
\G : Match if the current position is at the end of the previous match.
\n : Match a line feed (\u000A).
\N{unicode character} : Match the named character.
\p{unicode property} : Match any character with the specified unicode property.
\P{unicode property} : Match any character not having the specified unicode property.
\Q : Quotes all following characters until \E.
\r : Match a carriage return (\u000D).
\s : Match a whitespace character. Whitespace is defined as [\t\n\f\r\p{Z}].
\S : Match a non-whitespace character.
\t : Match a horizontal tabulation (\u0009).
\uhhhh : Match the character with the hex value hhhh.
\Uhhhhhhhh : Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w : Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W : Match a non-word character.
\x{hhhh} : Match the character with hex value hhhh. From one to six hex digits may be supplied.
\xhh : Match the character with two digit hex value hh.
\X : Match a grapheme cluster.
\Z : Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z : Match if the current position is at the end of input.
\n : Back Reference. Match whatever the n-th capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern.
\0ooo : Match an octal character. ooo is from one to three octal digits. 0377 is the largest allowed octal character. The leading zero is required and distinguishes octal constants from back references.
[pattern] : Match any one character from the pattern.
. : Match any character.
^ : Match at the beginning of a line.
$ : Match at the end of a line.
\ : Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /.
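
As a brief illustration of a few of these meta-characters, the following sketch uses regexp, regexp-extract and regexp-matches? as described in the API section of this document. The input strings are made up for the example. Note that every backslash of a pattern has to be doubled inside a Scheme string literal.

```scheme
(import (lispkit base) (lispkit regexp))

; \d matches decimal digits
(regexp-extract (regexp "\\d+") "Order 66 shipped in 2024")
; ⟹ ("66" "2024")

; \b anchors the match at word boundaries, so the "cat" inside "concat" is skipped
(regexp-extract (regexp "\\bcat\\b") "concat cat")
; ⟹ ("cat")

; \Q ... \E quotes meta-characters, so "." is matched literally
(regexp-matches? (regexp "\\Q1.5\\E") "1.5")   ; ⟹ #t
(regexp-matches? (regexp "\\Q1.5\\E") "1x5")   ; ⟹ #f
```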

Regular expression operators

| : Alternation. A|B matches either A or B.
* : Match 0 or more times, as many times as possible.
+ : Match 1 or more times, as many times as possible.
? : Match zero or one times, preferring one time if possible.
{n} : Match exactly n times.
{n,} : Match at least n times, as many times as possible.
{n,m} : Match between n and m times, as many times as possible, but not more than m times.
*? : Match zero or more times, as few times as possible.
+? : Match one or more times, as few times as possible.
?? : Match zero or one times, preferring zero.
{n}? : Match exactly n times.
{n,}? : Match at least n times, but no more than required for an overall pattern match.
{n,m}? : Match between n and m times, as few times as possible, but not less than n.
*+ : Match zero or more times, as many times as possible when first encountered, do not retry with fewer even if overall match fails (possessive match).
++ : Match one or more times (possessive match).
?+ : Match zero or one times (possessive match).
{n}+ : Match exactly n times.
{n,}+ : Match at least n times (possessive match).
{n,m}+ : Match between n and m times (possessive match).
(...) : Capturing parentheses; the range of input that matched the parenthesized subexpression is available after the match.
(?:...) : Non-capturing parentheses; groups the included pattern, but does not provide capturing of matching text (more efficient than capturing parentheses).
(?>...) : Atomic-match parentheses; first match of the parenthesized subexpression is the only one tried. If it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>".
(?# ... ) : Free-format comment (?# comment).
(?= ... ) : Look-ahead assertion. True, if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ... ) : Negative look-ahead assertion. True, if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ... ) : Look-behind assertion. True, if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<! ... ) : Negative look-behind assertion. True, if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx: ... ) : Flag settings. Evaluate the parenthesized expression with the specified flags enabled or disabled.
(?ismwx-ismwx) : Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
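
The difference between greedy and reluctant quantifiers, as well as the effect of a look-ahead assertion, is easiest to see on small inputs. The following is a minimal sketch using regexp-extract and regexp-matches? from the API section of this document; the patterns and input strings are chosen purely for illustration.

```scheme
(import (lispkit base) (lispkit regexp))

; Greedy: .+ takes as much as possible, so both tags end up in one match
(regexp-extract (regexp "<.+>") "<a><b>")
; ⟹ ("<a><b>")

; Reluctant: .+? takes as little as possible, giving one match per tag
(regexp-extract (regexp "<.+?>") "<a><b>")
; ⟹ ("<a>" "<b>")

; Look-ahead: require a following ":" without making it part of the match
(regexp-extract (regexp "\\w+(?=:)") "key: value")
; ⟹ ("key")

; Alternation inside non-capturing parentheses
(regexp-matches? (regexp "(?:cat|dog)s?") "dogs")
; ⟹ #t
```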

Template Matching

$n : The text of capture group n will be substituted for $n. n must be ≥ 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, i.e. $.
\ : Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for $ and \, but may be used on any other character.
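
A short sketch of template substitution follows. It assumes that the subst argument of regexp-replace (see the API section of this document) is interpreted as such a template; the date pattern and the input strings are invented for the example.

```scheme
(import (lispkit base) (lispkit regexp))

(define date (regexp "(\\d{4})-(\\d{2})-(\\d{2})"))

; $1, $2 and $3 refer to the three capture groups
(regexp-replace date "Released on 2024-03-15" "$3.$2.$1")
; ⟹ "Released on 15.03.2024"

; A backslash in the template turns the following $ into a literal character
(regexp-replace date "Due 2024-03-15" "\\$0")
; ⟹ "Due $0"
```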

Flag options

The following flags control various aspects of regular expression matching. These flags get specified within the pattern using the (?ismwx-ismwx) pattern options.

i : If set, matching will take place in a case-insensitive manner.
x : If set, allow use of white space and #comments within patterns.
s : If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return/line-feed pair in text behaves as a single line terminator, and will match a single "." in a regular expression pattern.
m : Control the behavior of ^ and $ in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, ^ and $ will also match at the start and end of each line within the input text.
w : Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either word or non-word, which approximates traditional regular expression behavior.
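
The following sketch shows how some of these flags behave when they are embedded directly in a pattern. It relies on regexp, regexp-matches? and regexp-extract from the API section of this document; the inputs are illustrative only.

```scheme
(import (lispkit base) (lispkit regexp))

; (?i) switches to case-insensitive matching for the rest of the pattern
(regexp-matches? (regexp "(?i)hello") "HeLLo")               ; ⟹ #t

; (?i: ... ) limits the flag to the parenthesized part
(regexp-matches? (regexp "(?i:hello) World") "HELLO World")  ; ⟹ #t
(regexp-matches? (regexp "(?i:hello) World") "HELLO WORLD")  ; ⟹ #f

; (?m) lets ^ match at the start of every line
(regexp-extract (regexp "(?m)^\\w+") "one 1\ntwo 2")
; ⟹ ("one" "two")

; (?s) lets . match a line terminator as well
(regexp-matches? (regexp "a.b") "a\nb")                      ; ⟹ #f
(regexp-matches? (regexp "(?s)a.b") "a\nb")                  ; ⟹ #t
```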

API

(regexp? obj) [procedure]

Returns #t if obj is a regular expression object; otherwise #f is returned.

(regexp str) [procedure]
(regexp str opt ...)

Returns a new regular expression object for the given regular expression pattern str and the matching options opt ... . str is a string; the matching options opt are symbols. The following matching options are supported:

  • case-insensitive: Match letters in the regular expression independent of their case.
  • allow-comments: Ignore whitespace and #-prefixed comments in the regular expression pattern.
  • ignore-meta: Treat the entire regular expression pattern as a literal string.
  • dot-matches-line-separator: Allow . to match any character, including line separators.
  • anchors-match-lines: Allow ^ and $ to match the start and end of lines.
  • unix-only-line-separators: Treat only \n as a line separator; otherwise, all standard line separators are used.
  • unicode-words: Use Unicode TR#29 to specify word boundaries; otherwise, all traditional regular expression word boundaries are used.
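
A minimal sketch of how matching options are passed to regexp is given below. Options are quoted symbols following the pattern; the patterns and inputs are made up for the example, and regexp-matches? and regexp-extract are the procedures documented later in this section.

```scheme
(import (lispkit base) (lispkit regexp))

; A single option
(define greeting (regexp "hello world" 'case-insensitive))
(regexp-matches? greeting "Hello World")              ; ⟹ #t

; ignore-meta: "." is an ordinary character, not a wildcard
(define literal (regexp "a.c" 'ignore-meta))
(regexp-matches? literal "a.c")                       ; ⟹ #t
(regexp-matches? literal "abc")                       ; ⟹ #f

; Several options can be combined
(define heading
  (regexp "^chapter \\d+$" 'case-insensitive 'anchors-match-lines))
(regexp-extract heading "Intro\nChapter 1\nText\nCHAPTER 2")
; ⟹ ("Chapter 1" "CHAPTER 2")
```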

(regexp-pattern regexp) [procedure]

Returns the regular expression pattern for the given regular expression object regexp. A regular expression pattern is a string matching the regular expression syntax supported by library (lispkit regexp).

(regexp-capture-groups regexp) [procedure]

Returns the number of capture groups of the given regular expression object regexp.
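
Both accessors can be tried on a single regular expression object. The pattern below is a variation of the series pattern used elsewhere in this document and is only meant as an illustration.

```scheme
(import (lispkit base) (lispkit regexp))

(define series
  (regexp "Season\\s+(\\d+)\\s+(?:Episode|Ep\\.)\\s+(\\d+)"))

; The pattern string is returned as it was passed to regexp
(regexp-pattern series)
; ⟹ "Season\\s+(\\d+)\\s+(?:Episode|Ep\\.)\\s+(\\d+)"

; Only capturing parentheses are counted; (?: ... ) is not
(regexp-capture-groups series)
; ⟹ 2
```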

(escape-regexp-pattern str) [procedure]

Returns a regular expression pattern string by adding backslash escapes to pattern str as necessary to protect any characters that would match as pattern meta-characters.


(escape-regexp-pattern "(home/objecthub)")
⟹ "\\(home\\/objecthub\\)"

(escape-regexp-template str) [procedure]

Returns a regular expression pattern template string by adding backslash escapes to pattern template str as necessary to protect any characters that would otherwise be interpreted as template meta-characters (see section "Template Matching").
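
One way to see the effect is to use the escaped string as replacement text. The sketch below assumes that regexp-replace (documented later in this section) treats its subst argument as a template; the placeholder PRICE and the price string are invented for the example.

```scheme
(import (lispkit base) (lispkit regexp))

; Unescaped, "$1" would be read as a reference to capture group 1
(define price "$1.00")

; Escaping the template first makes the replacement text literal
(regexp-replace (regexp "PRICE")
                "Total: PRICE"
                (escape-regexp-template price))
; ⟹ "Total: $1.00"
```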

(regexp-matches regexp str) [procedure]
(regexp-matches regexp str start)
(regexp-matches regexp str start end)

Returns a matching spec if the regular expression object regexp successfully matches the entire string str from position start (inclusive) to end (exclusive); otherwise, #f is returned. The default for start is 0; the default for end is the length of the string.

A matching spec returned by regexp-matches consists of pairs of fixnum positions (startpos . endpos) in str. The first pair always represents the full match (i.e. for the default start and end, startpos is 0 and endpos is the length of str); all other pairs represent the positions of the matching capture groups of regexp.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-matches email "matthias@objecthub.net")
⟹ ((0 . 22))
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-matches series "Season 3  Episode 12")
⟹ ((0 . 20) (7 . 8) (18 . 20))

(regexp-matches? regexp str) [procedure]
(regexp-matches? regexp str start)
(regexp-matches? regexp str start end)

Returns #t if the regular expression object regexp successfully matches the entire string str from position start (inclusive) to end (exclusive); otherwise, #f is returned. The default for start is 0; the default for end is the length of the string.
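
The following sketch illustrates that regexp-matches? only succeeds if the whole string, or the whole range given by start and end, is matched. The expected results follow from the description above; the inputs are made up.

```scheme
(import (lispkit base) (lispkit regexp))

(define digits (regexp "\\d+"))

(regexp-matches? digits "12345")        ; ⟹ #t
; The match must cover the whole string, not just a part of it
(regexp-matches? digits "123abc")       ; ⟹ #f
; Restricting the range to positions 0 to 3 leaves only "123"
(regexp-matches? digits "123abc" 0 3)   ; ⟹ #t
```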

(regexp-search regexp str) [procedure]
(regexp-search regexp str start)
(regexp-search regexp str start end)

Returns a matching spec for the first match of the regular expression regexp with a part of string str between position start (inclusive) and end (exclusive). If regexp does not match any part of str between start and end, #f is returned. The default for start is 0; the default for end is the length of the string.

A matching spec returned by regexp-search consists of pairs of fixnum positions (startpos . endpos) in str. The first pair always represents the full match of the pattern; all other pairs represent the positions of the matching capture groups of regexp.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-search email "Contact matthias@objecthub.net or foo@bar.org")
⟹ ((8 . 30))
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-search series "New Season 3 Episode 12: Pilot")
⟹ ((4 . 23) (11 . 12) (21 . 23))
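
Since a matching spec only contains positions, the matched text has to be extracted from str separately. A minimal sketch of this is shown below; match->strings is a hypothetical helper defined here and not part of the library, and it assumes that every pair in the spec holds valid positions.

```scheme
(import (lispkit base) (lispkit regexp))

; Hypothetical helper: map every (startpos . endpos) pair to its substring
(define (match->strings str spec)
  (map (lambda (p) (substring str (car p) (cdr p))) spec))

(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(define title "New Season 3 Episode 12: Pilot")

; regexp-search returns #f if there is no match, so check before mapping
(let ((spec (regexp-search series title)))
  (if spec
      (match->strings title spec)
      '()))
; ⟹ ("Season 3 Episode 12" "3" "12")
```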

(regexp-search-all regexp str) [procedure]
(regexp-search-all regexp str start)
(regexp-search-all regexp str start end)

Returns a list of all matching specs for matches of the regular expression regexp with parts of string str between position start (inclusive) and end (exclusive). If regexp does not match any part of str between start and end, the empty list is returned. The default for start is 0; the default for end is the length of the string.

Each matching spec returned by regexp-search-all consists of pairs of fixnum positions (startpos . endpos) in str. The first pair always represents the full match of the pattern; all other pairs represent the positions of the matching capture groups of regexp.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-search-all email "Contact matthias@objecthub.net or foo@bar.org")
⟹ (((8 . 30)) ((34 . 45)))
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-search-all series "New Season 3 Episode 12: Pilot")
⟹ (((4 . 23) (11 . 12) (21 . 23)))

(regexp-extract regexp str) [procedure]
(regexp-extract regexp str start)
(regexp-extract regexp str start end)

Returns a list of substrings from str which all represent full matches of the regular expression regexp with parts of string str between position start (inclusive) and end (exclusive). If regexp does not match any part of str between start and end, the empty list is returned. The default for start is 0; the default for end is the length of the string.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-extract email "Contact matthias@objecthub.net or foo@bar.org" 10)
⟹ ("tthias@objecthub.net" "foo@bar.org")
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-extract series "New Season 3 Episode 12: Pilot")
⟹ ("Season 3 Episode 12")

(regexp-split regexp str) [procedure]
(regexp-split regexp str start)
(regexp-split regexp str start end)

Splits string str into a list of possibly empty substrings separated by non-empty matches of regular expression regexp within position start (inclusive) and end (exclusive). If regexp does not match any part of str between start and end, a list with str as its only element is returned. The default for start is 0; the default for end is the length of the string.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-split email "Contact matthias@objecthub.net or foo@bar.org" 10)
⟹ ("Contact ma" " or " "")
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-split series "New Season 3 Episode 12: Pilot")
⟹ ("New " ": Pilot")

(regexp-partition regexp str) [procedure]
(regexp-partition regexp str start)
(regexp-partition regexp str start end)

Partitions string str into a list of non-empty strings matching regular expression regexp within position start (inclusive) and end (exclusive), interspersed with the unmatched portions of the whole string. The first and every odd element is an unmatched substring, which will be the empty string if regexp matches at the beginning of the string or end of the previous match. The second and every even element will be a substring fully matching regexp. If str is the empty string or if there is no match at all, the result is a list with str as its only element.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-partition email "Contact matthias@objecthub.net or foo@bar.org" 10)
⟹ ("Contact ma" "tthias@objecthub.net" " or " "foo@bar.org" "")
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-partition series "New Season 3 Episode 12: Pilot")
⟹ ("New " "Season 3 Episode 12" ": Pilot")

(regexp-replace regexp str subst) [procedure]
(regexp-replace regexp str subst start)
(regexp-replace regexp str subst start end)

Returns a new string replacing all matches of regular expression regexp in string str within position start (inclusive) and end (exclusive) with string subst. regexp-replace will always return a new string, even if there are no matches and replacements.

The optional parameters start and end restrict both the matching and the substitution to the given positions, such that the result is equivalent to omitting these parameters and replacing on (substring str start end).


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(regexp-replace email "Contact matthias@objecthub.net or foo@bar.org" "" 10)
⟹ "Contact ma or "
(define series (regexp "Season\\s+(\\d+)\\s+Episode\\s+(\\d+)"))
(regexp-replace series "New Season 3 Episode 12: Pilot" "Series")
⟹ "New Series: Pilot"

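(regexp-replace! regexp str subst) [procedure]
(regexp-replace! regexp str subst start)
(regexp-replace! regexp str subst start end)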

Mutates string str by replacing all matches of regular expression regexp within position start (inclusive) and end (exclusive) with string subst. The optional parameters start and end restrict both the matching and the substitution. regexp-replace! returns the number of replacements that were applied.


(define email (regexp "[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}"))
(define str "Contact matthias@objecthub.net or foo@bar.org")
(regexp-replace! email str "" 10) ⟹ 2
str ⟹ "Contact ma or "

(regexp-fold regexp kons knil str) [procedure]
(regexp-fold regexp kons knil str finish)
(regexp-fold regexp kons knil str finish start)
(regexp-fold regexp kons knil str finish start end)

regexp-fold is the most fundamental and generic regular expression matching iterator. It repeatedly searches string str for the regular expression regexp so long as a match can be found. On each successful match, it applies (kons i regexp-match str acc) where i is the index since the last match (beginning with start), regexp-match is the resulting matching spec, and acc is the result of the previous kons application, beginning with knil. When no more matches can be found, regexp-fold calls finish with the same arguments, except that regexp-match is #f. By default, finish just returns acc.


(regexp-fold (regexp "(\\w+)")
             (lambda (i m str acc)
               (let ((s (substring str (caar m) (cdar m))))
                 (if (zero? i) s (string-append acc "-" s))))
             ""
             "to  be  or  not  to  be")
⟹ "to-be-or-not-to-be"
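
The optional finish procedure is useful for post-processing the accumulated value. The following sketch collects all numbers of a string into a list and uses finish to restore their original order; the input string is made up for the example.

```scheme
(import (lispkit base) (lispkit regexp))

(regexp-fold (regexp "\\d+")
             ; kons: prepend the number denoted by the current full match
             (lambda (i m str acc)
               (cons (string->number (substring str (caar m) (cdar m))) acc))
             '()
             "3 apples, 14 pears, 15 plums"
             ; finish: called once with m being #f; reverse the accumulated list
             (lambda (i m str acc) (reverse acc)))
; ⟹ (3 14 15)
```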