Regex Pattern Info

From the expecco Wiki (Version 2.x)

Regex vs. GLOB Patterns

There are two common ways to specify string match patterns: Regex (regular expression) patterns and GLOB patterns.

Regex patterns are more powerful, but also more complex and harder to use than GLOB patterns. Details follow below.

GLOB patterns are much simpler and closely resemble the patterns used e.g. by the shell to match filenames. Thus, many users will already be familiar with GLOB patterns but new to Regex patterns.

GLOB Patterns

The special match characters are:

  • '*' to match any sequence of characters, including the empty string
  • '#' to match any single character
  • '[...]' to match a set or range of characters
e.g. '[abxyz]' to match any of 'a', 'b', 'x', 'y' or 'z'
e.g. '[a-z]' to match any lowercase letter
  • '\x' to match the character x literally (the backslash escapes special match characters)

Thus the GLOB pattern 'foo*' will match any of 'foo', 'footer', 'foo1' or 'foo bar', whereas the pattern 'foo\*' will ONLY match the string 'foo*'.

Note the main difference: a GLOB '*' matches any sequence of characters by itself, whereas a Regex '*' repeats the preceding expression.
To match the 'foo' prefix (as in the above GLOB example) with regex, you'd have to use a regex pattern like 'foo.*'.
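
The difference can be sketched side by side, using only the matches: and matchesRegex: selectors that this page describes. The expected results in the comments follow from the rules stated above and are not taken from an actual run.

```smalltalk
"GLOB: the '*' stands for any sequence of characters by itself"
'footer' matches: 'foo*'.            "expected: true"

"Regex: 'foo*' means 'fo' followed by zero or more 'o'"
'fooo'   matchesRegex: 'foo*'.       "expected: true"
'footer' matchesRegex: 'foo*'.       "expected: false"

"Regex: '.*' means any character, repeated zero or more times"
'footer' matchesRegex: 'foo.*'.      "expected: true"
```

Each statement can be evaluated on its own in a workspace.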

Programmer's GLOB API

Expecco's standard library includes action blocks for both GLOB and Regex matching. However, you may also use them inside elementary JavaScript or Smalltalk coded actions.

  • aString matches: aPattern
returns true if aString matches the given GLOB pattern; case sensitive
  • aString matches: aPattern caseSensitive: aBoolean
returns true if aString matches the given GLOB pattern; aBoolean controls case sensitivity
  • multiPattern compoundMatch: aString
returns true if aString matches any of the GLOB patterns in multiPattern, which consists of multiple GLOB patterns separated by semicolons; case sensitive
  • multiPattern compoundMatch: aString caseSensitive: aBoolean
ditto, with aBoolean controlling case sensitivity
  • aString findMatchString: aPattern
returns the index of the first matching substring within aString, or 0 if there is no match; case sensitive
  • aString findMatchString: aPattern caseSensitive: aBoolean
ditto, with aBoolean controlling case sensitivity

GLOB Examples

'hello' matches: 'h*'                     -> true (any string starting with 'h' is OK)
'Hello' matches: 'h*'                     -> false (due to case sensitivity)
'Hello' matches: 'h*' caseSensitive:false -> true
'hello' matches: 'h###'                   -> false (the pattern allows exactly 3 characters after the 'h')
'h'     matches: 'h#*'                    -> false (at least 1 character must be after the 'h')
'h123'  matches: 'h#*'                    -> true
'123'   matches: '[0-9][0-9]'             -> false (only 2-digit strings are matched)

'*.png;*.gif' compoundMatch:'bar.jpg'     -> false      
'*.png;*.gif' compoundMatch:'bar.gif'     -> true

'hello world' findMatchString: 'w*'       -> 7 ('world' starts at position 7)
'hello world' findMatchString: 'x*'       -> 0 (not found)

Regex Patterns

The following text was extracted from the online Smalltalk class documentation (it is found in the RxParser class's documentation string).
The code fragments below are all in Smalltalk; the API for JavaScript is similar. The pattern info is also valid if you use the regex matching actions from the standard library.

Programmer's Regex API

There are two kinds of usage:

  • queries (i.e. "does a string match a pattern")
  • extraction (i.e. "extract matches from a string")

Queries

The APIs for this are:

  • aString matchesRegex: aPattern
returns true if aString as a whole matches the given regex pattern; case sensitive
  • aString matchesRegex: aPattern caseSensitive: aBoolean
returns true if aString as a whole matches the given regex pattern; aBoolean controls case sensitivity
  • aString hasAnyRegexMatches: aPattern
returns true if aString contains a substring which matches the given regex pattern; case sensitive
  • aString hasAnyRegexMatches: aPattern caseSensitive: aBoolean
ditto, with aBoolean controlling case sensitivity
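
A minimal sketch of how the two kinds of query differ, using only the selectors listed above. The input strings are made up for illustration, and the expected results in the comments are derived from the descriptions above, not from an actual run.

```smalltalk
"matchesRegex: requires the whole string to match"
'order 4711' matchesRegex: '[0-9]+'.                   "expected: false"
'4711'       matchesRegex: '[0-9]+'.                   "expected: true"

"hasAnyRegexMatches: only requires some substring to match"
'order 4711' hasAnyRegexMatches: '[0-9]+'.             "expected: true"
'order'      hasAnyRegexMatches: '[0-9]+'.             "expected: false"

"the boolean argument controls case sensitivity"
'HELLO' matchesRegex: 'hello'.                         "expected: false"
'HELLO' matchesRegex: 'hello' caseSensitive: false.    "expected: true"
```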

Extraction

The API is:

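  • aString allRegexMatches: aPattern
returns a collection of matching substrings; case sensitive
  • aString allRegexMatches: aPattern caseSensitive: aBoolean
ditto, with aBoolean controlling case sensitivity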

Syntax

The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:

'a' matchesRegex: 'a'                   -> true
'foobar' matchesRegex: 'foobar'         -> true
'blorple' matchesRegex: 'foobar'        -> false

The above paragraph introduced a primitive regular expression (a character), and an operator (sequencing). Operators are applied to regular expressions to produce more complex regular expressions. Sequencing (placing expressions one after another) as an operator is, in a certain sense, `invisible'--yet it is arguably the most common.

A more `visible' operator is Kleene closure, more often simply referred to as `a star'. A regular expression followed by an asterisk matches any number (including 0) of matches of the original expression. For example:

'ab' matchesRegex: 'a*b'                -> true
'aaaaab' matchesRegex: 'a*b'            -> true
'b' matchesRegex: 'a*b'                 -> true
'aac' matchesRegex: 'a*b'               -> false: b does not match

A star's precedence is higher than that of sequencing. A star applies to the shortest possible subexpression that precedes it. For example, 'ab*' means `a followed by zero or more occurrences of b', not `zero or more occurrences of ab':

'abbb' matchesRegex: 'ab*'              -> true
'abab' matchesRegex: 'ab*'              -> false

To actually make a regex matching `zero or more occurrences of ab', `ab' is enclosed in parentheses:

'abab' matchesRegex: '(ab)*'            -> true
'abcab' matchesRegex: '(ab)*'           -> false: c spoils the fun

Two other operators similar to `*' are `+' and `?'. `+' (positive closure, or simply `plus') matches one or more occurrences of the original expression. `?' (`optional') matches zero or one, but never more, occurrences.

'ac' matchesRegex: 'ab*c'               -> true
'ac' matchesRegex: 'ab+c'               -> false: need at least one b
'abbc' matchesRegex: 'ab+c'             -> true
'abbc' matchesRegex: 'ab?c'             -> false: too many b's

As we have seen, characters `*', `+', `?', `(', and `)' have special meaning in regular expressions. If one of them is to be used literally, it should be quoted: preceded with a backslash. (Thus, backslash is also a special character, and needs to be quoted for a literal match--as well as any other special character described further).

'ab*' matchesRegex: 'ab*'                -> false: star in the right string is special
'ab*' matchesRegex: 'ab\*'               -> true
'a\c' matchesRegex: 'a\\c'               -> true

The last operator is `|' meaning `or'. It is placed between two regular expressions, and the resulting expression matches if one of the expressions matches. It has the lowest possible precedence (lower than sequencing). For example, `ab*|ba*' means `a followed by any number of b's, or b followed by any number of a's':

'abb' matchesRegex: 'ab*|ba*'           -> true
'baa' matchesRegex: 'ab*|ba*'           -> true
'baab' matchesRegex: 'ab*|ba*'          -> false

A slightly more complex example is the following expression, matching the name of any of the Lisp-style `car', `cdr', `caar', `cadr', ... functions:

c(a|d)+r
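
Applied to a few sample names, the expression would be expected to behave as sketched below. The results in the comments follow from the operator rules described so far and are not taken from an actual run.

```smalltalk
'car'   matchesRegex: 'c(a|d)+r'.    "expected: true"
'cdr'   matchesRegex: 'c(a|d)+r'.    "expected: true"
'caddr' matchesRegex: 'c(a|d)+r'.    "expected: true"
'cr'    matchesRegex: 'c(a|d)+r'.    "expected: false: '+' needs at least one a or d"
'cxr'   matchesRegex: 'c(a|d)+r'.    "expected: false: x is neither a nor d"
```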

It is possible to write an expression matching an empty string, for example: `a|'. However, it is an error to apply `*', `+', or `?' to such expressions: `(a|)*' is an invalid expression.

So far, we have used only characters as the 'smallest' components of regular expressions. There are other, more `interesting', components.

A character set is a string of characters enclosed in square brackets. It matches any single character if it appears between the brackets. For example, '[01]' matches either `0' or `1':

'0' matchesRegex: '[01]'                 -> true
'3' matchesRegex: '[01]'                 -> false
'11' matchesRegex: '[01]'                -> false: a set matches only one character

Using the plus operator '+', we can build the following binary number recognizer:

'10010100' matchesRegex: '[01]+'         -> true
'10001210' matchesRegex: '[01]+'         -> false

If the first character after the opening bracket is `^', the set is inverted: it matches any single character *not* appearing between the brackets:

'0' matchesRegex: '[^01]'                -> false
'3' matchesRegex: '[^01]'                -> true

For convenience, a set may include ranges: pairs of characters separated with `-'. This is equivalent to listing all characters between them: `[0-9]' is the same as `[0123456789]'.
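
A short sketch of ranges in use; the sample strings are made up for illustration, and the expected results in the comments are derived from the set rules above rather than from an actual run.

```smalltalk
'7'      matchesRegex: '[0-9]'.           "expected: true"
'x'      matchesRegex: '[0-9]'.           "expected: false"

"several ranges may be combined in one set"
'c0ffee' matchesRegex: '[0-9a-f]+'.       "expected: true"
'C0FFEE' matchesRegex: '[0-9a-f]+'.       "expected: false: uppercase letters are not in the set"
'C0FFEE' matchesRegex: '[0-9a-fA-F]+'.    "expected: true"
```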

Special characters within a set are `^', `-', and `]' that closes the set. Below are the examples of how to literally use them in a set:

[01^]           -- put the caret anywhere except the beginning
[01-]           -- put the dash as the last character
[]01]           -- put the closing bracket as the first character
[^]01]          -- in an inverted set, put the closing bracket right after the caret
                   (thus, empty and universal sets cannot be specified)
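
How these four sets would be expected to behave on single characters is sketched below. The results in the comments are derived from the placement rules above and are not taken from an actual run.

```smalltalk
'^' matchesRegex: '[01^]'.     "expected: true: the caret is literal when not first"
'-' matchesRegex: '[01-]'.     "expected: true: the dash is literal when last"
']' matchesRegex: '[]01]'.     "expected: true: the bracket is literal when first"
']' matchesRegex: '[^]01]'.    "expected: false: the inverted set excludes ], 0 and 1"
'x' matchesRegex: '[^]01]'.    "expected: true"
```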

Regular expressions can also include the following backslash escapes to refer to popular classes of characters:

\w      any word constituent character (same as [a-zA-Z0-9_])
\W      any character but a word constituent
\d      a digit (same as [0-9])
\D      anything but a digit
\s      a whitespace character
\S      anything but a whitespace character

These escapes are also allowed in character classes: '[\w+-]' means 'any character that is either a word constituent, or a plus, or a minus'.
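
A few illustrative uses of these escapes follow. The sample strings are made up, and the expected results in the comments are derived from the definitions in the table above, not from an actual run.

```smalltalk
'foo_bar1' matchesRegex: '\w+'.         "expected: true"
'foo bar'  matchesRegex: '\w+'.         "expected: false: a space is not a word constituent"
'12 34'    matchesRegex: '\d+\s\d+'.    "expected: true"

"escapes inside a character set"
'a+b-c'    matchesRegex: '[\w+-]+'.     "expected: true"
'a*b'      matchesRegex: '[\w+-]+'.     "expected: false: * is not in the set"
```

Note that the backslash needs no doubling here, because it is not an escape character inside a Smalltalk string literal.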

Also supported is \xXX to allow for single byte hex characters:

\xXX    the character with codePoint XX (hex)

Character classes can also include the following grep(1)-compatible elements to refer to classes of characters:

[:alnum:]               any alphanumeric, i.e., a word constituent, character
[:alpha:]               any alphabetic character
[:blank:]               space or tab.
[:cntrl:]               any control character. In this version, it means any character whose code is < 32.
[:digit:]               any decimal digit.
[:graph:]               any graphical character. In this version, this means any character with the code >= 32.
[:lower:]               any lowercase character
[:print:]               any printable character. In this version, this is the same as [:graph:]
[:punct:]               any punctuation character.
[:space:]               any whitespace character.
[:upper:]               any uppercase character.
[:xdigit:]              any hexadecimal character.

Note that these elements are components of the character classes, i.e. they have to be enclosed in an extra set of square brackets to form a valid regular expression. For example, a non-empty string of digits would be represented as '[[:digit:]]+'.
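
The extra pair of brackets is easy to forget, so here is a brief sketch. The sample strings are made up, and the expected results in the comments follow from the element descriptions above rather than from an actual run.

```smalltalk
"outer brackets form the set, the inner [:name:] is an element of it"
'12345' matchesRegex: '[[:digit:]]+'.               "expected: true"
'12a45' matchesRegex: '[[:digit:]]+'.               "expected: false"
'abc'   matchesRegex: '[[:alpha:]]+'.               "expected: true"
'Hello' matchesRegex: '[[:upper:]][[:lower:]]+'.    "expected: true"
```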

The above primitive expressions and operators are common to many implementations of regular expressions. The next primitive expression is unique to this Smalltalk implementation.

A sequence of characters between colons is treated as a unary selector which is supposed to be understood by Characters. A character matches such an expression if it answers true to a message with that selector. This allows a more readable and efficient way of specifying character classes. For example, `[0-9]' is equivalent to `:isDigit:', but the latter is more efficient. Analogously to character sets, character classes can be negated: `:^isDigit:' matches a Character that answers false to #isDigit, and is therefore equivalent to `[^0-9]'.

As an example, so far we have seen the following equivalent ways to write a regular expression that matches a non-empty string of digits:

'[0-9]+'
'\d+'
'[\d]+'
'[[:digit:]]+'
':isDigit:+'

The last group of special primitive expressions includes:

.       matching any character except a newline;
^       matching an empty string at the beginning of a line;
$       matching an empty string at the end of a line.
\b      an empty string at a word boundary
\B      an empty string not at a word boundary
\<      an empty string at the beginning of a word
\>      an empty string at the end of a word

For example:

'axyzb' matchesRegex: 'a.+b'            -> true
'ax zb' matchesRegex: 'a.+b'            -> false (space is not matched by `.')

Again, the characters `.', `^', and `$' are special and should be quoted to be matched literally, for example:

\.      to match the character '.'
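
The effect of quoting the dot can be sketched as follows. The sample strings are made up, and the expected results in the comments are derived from the rules above, not from an actual run.

```smalltalk
'1.5' matchesRegex: '\d\.\d'.    "expected: true"
'1x5' matchesRegex: '\d\.\d'.    "expected: false: the quoted dot only matches a literal dot"
'1x5' matchesRegex: '\d.\d'.     "expected: true: the unquoted dot matches any character"
```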

Regex Examples

A great use for regular expressions is user input validation. The following are a few examples of regular expressions that might be handy for checking input entered by the user in an input field. Try them out by entering something between the quotes and printing the result. (Also, try to imagine the Smalltalk code that each validation would require if coded by hand.) Most example expressions could have been written in alternative ways.

Checking if aString may represent a nonnegative integer number

'' matchesRegex: ':isDigit:+'

or

'' matchesRegex: '[0-9]+'

or

'' matchesRegex: '\d+'

Checking if aString may represent an integer number with an optional sign in front

'' matchesRegex: '(\+|-)?\d+'

Checking if aString is a fixed-point number, with at least one digit after a dot

'' matchesRegex: '(\+|-)?\d+(\.\d+)?'

The same, but allow notation like `123.':

'' matchesRegex: '(\+|-)?\d+(\.\d*)?'

Recognizer for a string that might be a name: one word with first capital letter, no blanks, no digits

More traditional:

'' matchesRegex: '[A-Z][A-Za-z]*'

more Smalltalkish:

'' matchesRegex: ':isUppercase::isAlphabetic:*'

A date in the format MMM DD, YYYY with any number of spaces in between, in the 20th century

'' matchesRegex: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'

Note the parentheses around some components of the expression above. They mark subexpressions, which makes it possible to obtain the actual strings that have matched them (i.e. month name, day number, and year number).

Coming back to numbers: here is a recognizer for a general number format: anything like 999, or 999.999, or -999.999e+21.

'' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'

To extract all integer numbers from a string

'1234 abcd 3456 defg' allRegexMatches:'[0-9]+'

will return a collection containing ('1234' '3456')
(reminder: '[0-9]' will match any digit-character; '+' will match repetitions of the previous pattern with at least 1 occurrence)

'1234 abcd -3456 defg' allRegexMatches:'(-)?[0-9]+'

also handles an optional sign and returns ('1234' '-3456')
(reminder: '(-)' will match the '-' character; '?' makes the previous pattern optional, i.e. it matches 0 or 1 occurrences of it)

To convert the extracted strings into real numbers, use the collect operation:

('1234 abcd -3456 defg' allRegexMatches:'(-)?[0-9]+')
    collect:[:eachString | eachString asNumber]

or even shorter:

('1234 abcd -3456 defg' allRegexMatches:'(-)?[0-9]+') collect:#asNumber

(reminder: the collection enumeration operations which expect a one-argument-block (lambda) can also be given the name of the operation to be applied to each element)

To extract non-numbers from a string

'1234 abcd 3456 defg' allRegexMatches:'[a-z]+'

will return a collection containing ('abcd' 'defg')

'1234 Abcd 3456 defg' allRegexMatches:'[a-z]+'  

will return a collection containing ('bcd' 'defg')

'1234 Abcd 3456 defg' allRegexMatches:'[a-z]+' caseSensitive:false  

will return a collection containing ('Abcd' 'defg')

'1234 Abcd 3456 defg' allRegexMatches:'[a-zA-Z]+'  

likewise returns a collection containing ('Abcd' 'defg')



Copyright © 2014-2024 eXept Software AG