Regex Pattern Info

From the expecco Wiki (Version 2.x)
Current revision as of 26 October 2023, 11:09

Regex vs. GLOB Patterns

There are two common ways to specify string match patterns: Regex (regular expression) and GLOB patterns.

Regex patterns are more powerful, but also more complex and harder to use than GLOB patterns. Details follow below.

GLOB patterns are much simpler and are much like those used, e.g., by the shell to match filenames. Thus, many users will already be familiar with GLOB patterns but new to Regex patterns.

GLOB Patterns

The special match characters are:

  • '*' to match any sequence of characters, including the empty string
  • '#' to match any single character
  • '[...]' to match a set or range of characters
e.g. '[abxyz]' to match any of 'a', 'b', 'x', 'y' or 'z'
e.g. '[a-z]' to match any lowercase letter
  • '\x' to match the character x literally (the backslash escapes special match characters)

Thus the GLOB pattern 'foo*' will match any of 'foo', 'footer', 'foo1' or 'foo bar', whereas the pattern 'foo\*' will ONLY match the string 'foo*'.

Note the main difference: a GLOB '*' matches any sequence of characters by itself, whereas a Regex '*' repeats the preceding expression.
To match the 'foo' prefix (as in the above GLOB example) with regex, you'd have to use a regex pattern like 'foo.*'.
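
The difference can be checked side by side with the selectors described in the API sections of this page. The following is only a minimal sketch, meant to be evaluated line by line with print-it; the sample string is chosen for illustration:

```smalltalk
"GLOB: the '*' alone stands for any sequence of characters"
'footer' matches: 'foo*'.             "true"

"Regex: the '*' only repeats the preceding 'o', so 'ter' is left unmatched"
'footer' matchesRegex: 'foo*'.        "false"

"Regex: '.' matches a single character and '*' repeats it"
'footer' matchesRegex: 'foo.*'.       "true"
```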

Programmer's GLOB API

Expecco's standard library includes action blocks for both GLOB and Regex matching. However, you may also use the matching operations below inside elementary JavaScript or Smalltalk coded actions.

  • aString matches: aPattern
returns true if aString matches the given GLOB pattern; case sensitive
  • aString matches: aPattern caseSensitive: aBoolean
returns true if aString matches the given GLOB pattern; case sensitivity is controlled by aBoolean
  • multiPattern compoundMatch: aString
returns true if aString matches any of the GLOB patterns in multiPattern, which consists of multiple GLOB patterns separated by semicolons; case sensitive
  • multiPattern compoundMatch: aString caseSensitive: aBoolean
ditto, with case sensitivity controlled by aBoolean
  • aString findMatchString: aPattern
returns the index of the first matching substring within aString, or 0 if there is no match; case sensitive
  • aString findMatchString: aPattern caseSensitive: aBoolean
ditto, with case sensitivity controlled by aBoolean

GLOB Examples

'hello' matches: 'h*'                     -> true (any string starting with 'h' is OK)
'Hello' matches: 'h*'                     -> false (due to case sensitivity)
'Hello' matches: 'h*' caseSensitive:false -> true
'hello' matches: 'h###'                   -> false (only 3 chars after the 'h' match)
'h'     matches: 'h#*'                    -> false (at least 1 character must be after the 'h')
'h123'  matches: 'h#*'                    -> true
'123'   matches: '[0-9][0-9]'             -> false (only 2-digit strings are matched)

'*.png;*.gif' compoundMatch:'bar.jpg'     -> false      
'*.png;*.gif' compoundMatch:'bar.gif'     -> true

'hello world' findMatchString: 'w*'       -> 7 ('world' starts at position 7)
'hello world' findMatchString: 'x*'       -> 0 (not found)

Regex Patterns

The following text was extracted from the online Smalltalk class documentation (it is found in the RxParser class's documentation string).
The code fragments below are all in Smalltalk; the API for JavaScript is similar. The pattern info is also valid if you use the regex matching actions from the standard library.

Programmer's Regex API

There are two kinds of usage:

  • queries (i.e. "does a string match a pattern")
  • extraction (i.e. "extract matches from a string")

Queries

The APIs for this are:

  • aString matchesRegex: aPattern
returns true if aString matches the given regex pattern; case sensitive
  • aString matchesRegex: aPattern caseSensitive: aBoolean
returns true if aString matches the given regex pattern; case sensitivity is controlled by aBoolean
  • aString hasAnyRegexMatches: aPattern
returns true if aString contains a substring which matches the given regex pattern; case sensitive
  • aString hasAnyRegexMatches: aPattern caseSensitive: aBoolean
returns true if aString contains a substring which matches the given regex pattern; case sensitivity is controlled by aBoolean
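
The two query selectors differ in how much of the string has to match. The following short sketch is based only on the descriptions above; the sample strings are made up for illustration, and each line can be evaluated with print-it:

```smalltalk
"matchesRegex: requires the whole string to match the pattern"
'hello world' matchesRegex: 'world'.                   "false"
'hello world' matchesRegex: 'hello world'.             "true"

"hasAnyRegexMatches: only requires a matching substring"
'hello world' hasAnyRegexMatches: 'world'.             "true"
'hello world' hasAnyRegexMatches: 'planet'.            "false"

"the caseSensitive: variant with false ignores case"
'Hello' matchesRegex: 'hello'.                         "false"
'Hello' matchesRegex: 'hello' caseSensitive: false.    "true"
```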

Extraction

The API is:

  • aString allRegexMatches: aPattern
returns a collection of matching substrings

Syntax

The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:

'a' matchesRegex: 'a'                   -> true
'foobar' matchesRegex: 'foobar'         -> true
'blorple' matchesRegex: 'foobar'        -> false

The above paragraph introduced a primitive regular expression (a character), and an operator (sequencing). Operators are applied to regular expressions to produce more complex regular expressions. Sequencing (placing expressions one after another) as an operator is, in a certain sense, `invisible'--yet it is arguably the most common.

A more `visible' operator is Kleene closure, more often simply referred to as `a star'. A regular expression followed by an asterisk matches any number (including 0) of matches of the original expression. For example:

'ab' matchesRegex: 'a*b'                -> true
'aaaaab' matchesRegex: 'a*b'            -> true
'b' matchesRegex: 'a*b'                 -> true
'aac' matchesRegex: 'a*b'               -> false: b does not match

A star's precedence is higher than that of sequencing. A star applies to the shortest possible subexpression that precedes it. For example, 'ab*' means `a followed by zero or more occurrences of b', not `zero or more occurrences of ab':

'abbb' matchesRegex: 'ab*'              -> true
'abab' matchesRegex: 'ab*'              -> false

To actually make a regex matching `zero or more occurrences of ab', `ab' is enclosed in parentheses:

'abab' matchesRegex: '(ab)*'            -> true
'abcab' matchesRegex: '(ab)*'           -> false: c spoils the fun

Two other operators similar to `*' are `+' and `?'. `+' (positive closure, or simply `plus') matches one or more occurrences of the original expression. `?' (`optional') matches zero or one, but never more, occurrences.

'ac' matchesRegex: 'ab*c'               -> true
'ac' matchesRegex: 'ab+c'               -> false: need at least one b
'abbc' matchesRegex: 'ab+c'             -> true
'abbc' matchesRegex: 'ab?c'             -> false: too many b's

As we have seen, characters `*', `+', `?', `(', and `)' have special meaning in regular expressions. If one of them is to be used literally, it should be quoted: preceded with a backslash. (Thus, backslash is also a special character, and needs to be quoted for a literal match--as well as any other special character described further).

'ab*' matchesRegex: 'ab*'                -> false: star in the right string is special
'ab*' matchesRegex: 'ab\*'               -> true
'a\c' matchesRegex: 'a\\c'               -> true

The last operator is `|' meaning `or'. It is placed between two regular expressions, and the resulting expression matches if one of the expressions matches. It has the lowest possible precedence (lower than sequencing). For example, `ab*|ba*' means `a followed by any number of b's, or b followed by any number of a's':

'abb' matchesRegex: 'ab*|ba*'           -> true
'baa' matchesRegex: 'ab*|ba*'           -> true
'baab' matchesRegex: 'ab*|ba*'          -> false

A slightly more complex example is the following expression, matching the name of any of the Lisp-style `car', `cdr', `caar', `cadr', ... functions:

c(a|d)+r
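
As a quick check of this expression, the following sketch applies it to a few sample names (the strings are chosen for illustration only):

```smalltalk
'car'   matchesRegex: 'c(a|d)+r'.     "true"
'cdr'   matchesRegex: 'c(a|d)+r'.     "true"
'caddr' matchesRegex: 'c(a|d)+r'.     "true"
'cr'    matchesRegex: 'c(a|d)+r'.     "false: at least one a or d is required"
'cons'  matchesRegex: 'c(a|d)+r'.     "false"
```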

It is possible to write an expression matching an empty string, for example: `a|'. However, it is an error to apply `*', `+', or `?' to such expressions: `(a|)*' is an invalid expression.

So far, we have used only characters as the 'smallest' components of regular expressions. There are other, more `interesting', components.

A character set is a string of characters enclosed in square brackets. It matches any single character if it appears between the brackets. For example, '[01]' matches either `0' or `1':

'0' matchesRegex: '[01]'                 -> true
'3' matchesRegex: '[01]'                 -> false
'11' matchesRegex: '[01]'                -> false: a set matches only one character

Using the plus operator '+', we can build the following binary number recognizer:

'10010100' matchesRegex: '[01]+'         -> true
'10001210' matchesRegex: '[01]+'         -> false

If the first character after the opening bracket is `^', the set is inverted: it matches any single character *not* appearing between the brackets:

'0' matchesRegex: '[^01]'                -> false
'3' matchesRegex: '[^01]'                -> true

For convenience, a set may include ranges: pairs of characters separated with `-'. This is equivalent to listing all characters between them: `[0-9]' is the same as `[0123456789]'.
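
Several ranges can be combined in one set. As an illustrative sketch (the hexadecimal-digit pattern is an example made up here, not taken from the class documentation):

```smalltalk
"three ranges in one set: decimal digits, a-f and A-F"
'1aF9' matchesRegex: '[0-9a-fA-F]+'.     "true: only hexadecimal digits"
'1aG9' matchesRegex: '[0-9a-fA-F]+'.     "false: G is outside all three ranges"
```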

Special characters within a set are `^', `-', and `]' that closes the set. Below are the examples of how to literally use them in a set:

[01^]           -- put the caret anywhere except the beginning
[01-]           -- put the dash as the last character
[]01]           -- put the closing bracket as the first character
[^]01]             (thus, empty and universal sets cannot be specified)

Regular expressions can also include the following backslash escapes to refer to popular classes of characters:

\w      any word constituent character (same as [a-zA-Z0-9_])
\W      any character but a word constituent
\d      a digit (same as [0-9])
\D      anything but a digit
\s      a whitespace character
\S      anything but a whitespace character

These escapes are also allowed in character classes: '[\w+-]' means 'any character that is either a word constituent, or a plus, or a minus'.
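
A few sample evaluations may make these escapes more concrete. This is only a sketch derived from the table above; the strings are illustrative:

```smalltalk
'foo_bar1' matchesRegex: '\w+'.          "true: letters, digits and underscore"
'foo bar'  matchesRegex: '\w+'.          "false: the space is not a word constituent"
'foo bar'  matchesRegex: '\w+\s\w+'.     "true: \s matches the space"
'+-x'      matchesRegex: '[\w+-]+'.      "true: word constituent, plus or minus"
```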

Also supported is \xXX to allow for single byte hex characters:

\xXX    the character with codePoint XX (hex)

Character classes can also include the following grep(1)-compatible elements:

[:alnum:]               any alphanumeric, i.e., a word constituent, character
[:alpha:]               any alphabetic character
[:blank:]               space or tab.
[:cntrl:]               any control character. In this version, it means any character whose code is < 32.
[:digit:]               any decimal digit.
[:graph:]               any graphical character. In this version, this means any character with the code >= 32.
[:lower:]               any lowercase character
[:print:]               any printable character. In this version, this is the same as [:cntrl:]
[:punct:]               any punctuation character.
[:space:]               any whitespace character.
[:upper:]               any uppercase character.
[:xdigit:]              any hexadecimal character.

Note that these elements are components of the character classes, i.e. they have to be enclosed in an extra set of square brackets to form a valid regular expression. For example, a non-empty string of digits would be represented as '[[:digit:]]+'.
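
The extra pair of brackets is easy to forget, so here is a small sketch with the element placed inside a character set (sample strings are illustrative):

```smalltalk
"the class element sits inside the brackets of a character set"
'2023'  matchesRegex: '[[:digit:]]+'.                 "true"
'20x3'  matchesRegex: '[[:digit:]]+'.                 "false: x is not a digit"
'abc42' matchesRegex: '[[:alpha:]]+[[:digit:]]+'.     "true: letters followed by digits"
```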

The above primitive expressions and operators are common to many implementations of regular expressions. The next primitive expression is unique to this Smalltalk implementation.

A sequence of characters between colons is treated as a unary selector which is supposed to be understood by Characters. A character matches such an expression if it answers true to a message with that selector. This allows a more readable and efficient way of specifying character classes. For example, `[0-9]' is equivalent to `:isDigit:', but the latter is more efficient. Analogously to character sets, character classes can be negated: `:^isDigit:' matches a Character that answers false to #isDigit, and is therefore equivalent to `[^0-9]'.
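
A brief sketch of the selector-based form and its negation, following the description above (the sample strings are illustrative):

```smalltalk
'12345' matchesRegex: ':isDigit:+'.      "true: every character answers true to isDigit"
'abc'   matchesRegex: ':^isDigit:+'.     "true: no character answers true to isDigit"
'ab1'   matchesRegex: ':^isDigit:+'.     "false: the trailing 1 is a digit"
```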

As an example, so far we have seen the following equivalent ways to write a regular expression that matches a non-empty string of digits:

'[0-9]+'
'\d+'
'[\d]+'
'[[:digit:]]+'
':isDigit:+'

The last group of special primitive expressions includes:

.       matching any character except a newline;
^       matching an empty string at the beginning of a line;
$       matching an empty string at the end of a line.
\b      an empty string at a word boundary
\B      an empty string not at a word boundary
\<      an empty string at the beginning of a word
\>      an empty string at the end of a word

'axyzb' matchesRegex: 'a.+b'            -> true
'ax zb' matchesRegex: 'a.+b'            -> false (space is not matched by `.')

Again, the characters `.', `^' and `$' above are special and should be quoted to be matched literally.

\.      to match the character '.'
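
The effect of quoting the dot can be seen in this small sketch (sample strings chosen for illustration):

```smalltalk
'axb' matchesRegex: 'a.b'.      "true: the unquoted dot matches any character"
'a.b' matchesRegex: 'a\.b'.     "true: the quoted dot matches a literal dot"
'axb' matchesRegex: 'a\.b'.     "false: x is not a dot"
```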

Regex Examples

As the introduction said, a great use for regular expressions is user input validation. Following are a few examples of regular expressions that might be handy in checking input entered by the user in an input field. Try them out by entering something between the quotes and using print-it. (Also, try to imagine the Smalltalk code that each validation would require if coded by hand.) Most example expressions could have been written in alternative ways.

Checking if aString may represent a nonnegative integer number

'' matchesRegex: ':isDigit:+'

or

'' matchesRegex: '[0-9]+'

or

'' matchesRegex: '\d+'

Checking if aString may represent an integer number with an optional sign in front

'' matchesRegex: '(\+|-)?\d+'

Checking if aString is a fixed-point number, with at least one digit after a dot

'' matchesRegex: '(\+|-)?\d+(\.\d+)?'

The same, but allow notation like `123.':

'' matchesRegex: '(\+|-)?\d+(\.\d*)?'
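
To see the difference between the two fixed-point variants, here is a sketch with the empty quotes filled in by sample inputs (the inputs are illustrative):

```smalltalk
"strict variant: a dot must be followed by at least one digit"
'3.14' matchesRegex: '(\+|-)?\d+(\.\d+)?'.     "true"
'-42'  matchesRegex: '(\+|-)?\d+(\.\d+)?'.     "true: the fraction is optional"
'123.' matchesRegex: '(\+|-)?\d+(\.\d+)?'.     "false: no digit after the dot"

"relaxed variant: a trailing dot is accepted"
'123.' matchesRegex: '(\+|-)?\d+(\.\d*)?'.     "true"
'.5'   matchesRegex: '(\+|-)?\d+(\.\d*)?'.     "false: digits are required before the dot"
```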
Recognizer for a string that might be a name: one word with first capital letter, no blanks, no digits

More traditional:

'' matchesRegex: '[A-Z][A-Za-z]*'

more Smalltalkish:

'' matchesRegex: ':isUppercase::isAlphabetic:*'

A date in format MMM DD, YYYY with any number of spaces in between, in XX century

'' matchesRegex: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'

Note parentheses around some components of the expression above. As `Usage' section shows, they will allow us to obtain the actual strings that have matched them (i.e. month name, day number, and year number).
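
The date recognizer can be tried on a few sample inputs. In this sketch the variable name datePattern and the sample strings are only illustrative; evaluate the fragment as a whole in a workspace:

```smalltalk
| datePattern |

"the date expression from above, kept in a variable for reuse"
datePattern := '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'.

'Jan 5, 1999'     matchesRegex: datePattern.    "true"
'Dec  24 ,1987'   matchesRegex: datePattern.    "true: extra blanks are allowed"
'Jan 5, 2001'     matchesRegex: datePattern.    "false: the year must start with 19"
'January 5, 1999' matchesRegex: datePattern.    "false: only three-letter month names"
```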

Coming back to numbers: here is a recognizer for a general number format: anything like 999, or 999.999, or -999.999e+21.

'' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
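
A sketch of the general number recognizer applied to sample inputs. The variable name numberPattern and the inputs are illustrative; evaluate the fragment as a whole in a workspace:

```smalltalk
| numberPattern |

"the general number expression from above"
numberPattern := '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'.

'999'          matchesRegex: numberPattern.    "true"
'999.999'      matchesRegex: numberPattern.    "true"
'-999.999e+21' matchesRegex: numberPattern.    "true"
'1e'           matchesRegex: numberPattern.    "false: an exponent needs at least one digit"
'e10'          matchesRegex: numberPattern.    "false: digits are required before the exponent"
```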

To extract all integer numbers from a string

'1234 abcd 3456 defg' allRegexMatches:'[0-9]+'

will return a collection containing ('1234' '3456')
(reminder: '[0-9]' will match any digit-character; '+' will match repetitions of the previous pattern with at least 1 occurrence)

'1234 abcd -3456 defg' allRegexMatches:'(-)?[0-9]+'

also handles an optional sign and returns ('1234' '-3456')
(reminder: '(-)' will match the '-' character; '?' will match an optional match of the previous pattern (0 or 1 occurrences of the previous pattern))

To convert the extracted strings into real numbers, use the collect operation:

('1234 abcd -3456 defg' allRegexMatches:'(-)?[0-9]+')
    collect:[:eachString | eachString asNumber]

or even shorter:

('1234 abcd -3456 defg' allRegexMatches:'(-)?[0-9]+') collect:#asNumber

(reminder: the collection enumeration operations which expect a one-argument-block (lambda) can also be given the name of the operation to be applied to each element)

To extract non-numbers from a string

'1234 abcd 3456 defg' allRegexMatches:'[a-z]+'

will return a collection containing ('abcd' 'defg')

'1234 Abcd 3456 defg' allRegexMatches:'[a-z]+'  

will return a collection containing ('bcd' 'defg')

'1234 Abcd 3456 defg' allRegexMatches:'[a-z]+' caseSensitive:false  

will return a collection containing ('Abcd' 'defg')

'1234 Abcd 3456 defg' allRegexMatches:'[a-zA-Z]+'  

will likewise return a collection containing ('Abcd' 'defg')



Copyright © 2014-2024 eXept Software AG