String API Functions

From expecco Wiki (Version 25.x)

This document lists the most useful (and most frequently needed) string functions. Be aware that there are many more, which can be found in the class references or via the built-in class browser.

See also "String Handling" in Expecco_API.
Reference: String inherits from: CharacterArray Collection

Notice: unless written otherwise, all indices are 1-based. Valid indices range from 1 to the string's size.

Literals (i.e. Constant Strings)

Literals are constants in the source code or in expressions entered into a workspace for immediate execution. Literal objects are created at compile time and are immutable. This means that an exception is raised if the program tries to modify the contents of a literal string at execution time.

Traditionally, Smalltalk supported only one type of literal string constant, which does not support special character escapes (i.e. it consistently followed the philosophy of "no surprises" and "what you see is exactly what you get"). This has some charm and might be easier for beginners to understand, but at times it makes a programmer's life hard (think of newlines or tab characters inside a string).
For this reason, the underlying Smalltalk/X has been extended to support C-like escaped strings, strings with embedded expressions and international strings which are automatically translated to a national language. Syntactically, extended string literals have a prefix such as 'c', 'e' or 'i'.

Notice that the Python language provides similar mechanisms, but has the C style as its default, and offers the untranslated literal as a so-called "raw" string literal, prefixed by an 'r'.

Traditional String Literals (what you see is what you get)
'...'
Smalltalk style string (as is; no escapes for special characters). Be aware that this is inconvenient if newlines, tabs or other non-graphical characters are to be in the string. Notice that Smalltalk's plain strings are similar to Python's "raw strings".
C-Style String Literals (with C escapes)
c'...'
C style string (supports the usual C-escapes, such as "\n" for newline, "\t" for tab or "\xHH" and "\uHHHH" for hex codes). Backslashes and single quotes must be prefixed by a backslash (which is especially needed when a string represents a Windows pathname). For example, the literal c'abc\u2021def' represents the characters "abc‡def".
Expression Strings (with sliced in expressions)
e'...{expr}...'
C style with embedded expressions. In addition to C-style escapes, expressions in braces are embedded. These expressions are evaluated (at execution time) and their string representation (i.e. the printString, which is Smalltalk's equivalent of toString in JS) sliced into the string.
Example:
a := 1234.
b := true.
c := 'hello'.
foo := e'{c}; this string contains embedded exprs.\nYes, {b} and {a} are there.'
foo
   => 'hello; this string contains embedded exprs.
       Yes, true and 1234 are there.'
Strictly speaking, these are not literals. Instead, the Smalltalk compiler generates code which expands these at execution time.
Semantically, they are equal to a "bindWith:" message (see String class), sent to a string with "%" markers.
(Don't worry if that is too technical.)
National Language Strings
i'...{expr}...'
C style with embedded expressions and automatic nationalization. First, it is expanded as a C-style string. Then a translation is looked up in the currently loaded language resources. The template searched has successive expression positions replaced by "%i".
For example, the literal "i'Hello {name}'" will look for a translation of "Hello %1" in the language pack, and may find one of "Hallo %1", "Hola %1", etc. depending on your current language setting. Then the variable "name" is converted to a string and sliced in, finally generating "Hallo Tom" or "Hola Tom" etc.
Translations are loaded from resource files in the "resources" folder.
Regex Literals
r'...'
These are implemented, but not yet officially released to the public. Do not use them for now; instead, use the normal regex messages as described below.
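For readers coming from Python, the comparison drawn at the start of this section can be made concrete. The following is a rough Python analogue of the plain, C-style and expression literals; it is not expecco code, the variable names simply mirror the Smalltalk example, and the mapping is only approximate (Python prints its boolean as "True", for instance).

```python
# Rough Python counterparts of the Smalltalk/X literal kinds (approximate mapping).
a = 1234
b = True
c = "hello"

plain = r"no\nescapes"     # like '...': the backslash stays a backslash
escaped = "tab\there"      # like c'...': \t becomes a single tab character
embedded = f"{c}; yes, {b} and {a} are there."   # like e'...{expr}...'

print(len(plain))      # 11: the backslash and the n are two separate characters
print(len(escaped))    # 8: the tab counts as one character
print(embedded)        # hello; yes, True and 1234 are there.
```

Python has no built-in counterpart of the i'...' literal; there, translation is usually delegated to a library such as the standard gettext module.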

Protocol (String API)

String provides a huge amount of functionality, both in the String class itself and through inheritance from a number of Collection classes. Be reminded that a string is a sequenceable collection of characters; therefore, all of the API implemented for collections which support an integer index, as well as the API implemented for any generic collection, is also available. Worth mentioning are the enumeration, search and substring methods.

Use a SystemBrowser and/or the MethodFinder to get a feeling. There is usually an existing solution for most needs.

Queries

aString size
aString.size() [JS]
Returns the number of characters in the string (i.e. the string's length).
Example:
'hello world' size
   => 11
aString bitsPerCharacter
Returns one of {8, 16, 32}.
aString isWideString
Same as "aString bitsPerCharacter > 8"
aString containsNon7BitAscii
True if aString is either a wide string or contains any character above the 7-bit ASCII range (i.e. a national character in the ISO8859 charset).

Accessing

The general access methods for any collection in Smalltalk are "at:index" and "at:index put:anObject" (i.e. the at: and at:put: messages).
However, for your convenience, the Smalltalk compiler also supports square brackets as used in most other languages. Semantically, these are the same, as the compiler generates the corresponding accessor calls for bracket indexes.

aString at: index
aString [ index ]   [ST/X] 1)
aString [ index ]   [JS]
Returns the character at an index (1-based; valid indices are [1 .. size]).
Example:
 'hello world' at:2
     => $e
aString at: index put: char
aString [ index ] := char   [ST/X] 1)
aString [ index ] = char   [JS]
Changes the character at an index (1-based; valid indices are [1 .. size]).
Notice that string constants (and freeze values) are immutable (i.e. readonly strings). Only strings which have been created dynamically can be modified.
1) this non-ANSI-Smalltalk syntax is supported by Smalltalk/X and thus also valid in expecco.
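The 1-based indexing described in this section differs from the 0-based indexing of Python and most C-like languages. As an illustration only, the following Python sketch emulates the semantics of at: and at:put: with two hypothetical helpers (at and at_put are names invented here, not part of any API). Since Python strings are always immutable, a list of characters stands in for a dynamically created, mutable string.

```python
def at(text: str, index: int) -> str:
    """Return the character at a 1-based index (hypothetical helper)."""
    if not 1 <= index <= len(text):
        raise IndexError(f"index {index} is outside 1..{len(text)}")
    return text[index - 1]          # Python itself counts from 0


def at_put(chars: list, index: int, char: str) -> None:
    """Store a character at a 1-based index in a mutable list of characters."""
    if not 1 <= index <= len(chars):
        raise IndexError(f"index {index} is outside 1..{len(chars)}")
    chars[index - 1] = char


print(at("hello world", 2))         # e

buffer = list("hello")              # mutable stand-in for a copied string
at_put(buffer, 1, "j")
print("".join(buffer))              # jello
```

Note that index 0, which is valid in Python, is rejected here, just as it lies outside the valid range [1 .. size] stated above.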

Copying

aString copy
Generates a mutable copy.
Example:
 'hello world' copy at:1 
     => $h
aString copyFrom: startIndex to: endIndex
aString.copyFrom_to( startIndex, endIndex )   [JS]
Copies characters from a start index to an end index (1-based; valid indices are [1 .. size]).
Example:
 'hello world' copyFrom:1 to:5. 
     => 'hello'
aString copyFrom: startIndex count: numChars
aString.copyFrom_count( startIndex, numChars )   [JS]
Copies a number of characters starting at the given index.
Example:
 'hello world' copyFrom:7 count:3. 
     => 'wor'
aString copyFrom: startIndex
aString.copyFrom( startIndex )   [JS]
Copies from the given index to the end.
Example:
 'hello world' copyFrom:6. 
     => ' world'
aString copyTo: endIndex
aString.copyTo( endIndex )   [JS]
Copies from the start to the given index.
Example:
 'hello world' copyTo:6. 
     => 'hello '
aString copyLast: count
aString.copyLast( count )   [JS]
Copies the last count characters.
Example:
 'hello world' copyLast:4. 
     => 'orld'
aString copyButFirst: count
aString.copyButFirst( count )   [JS]
Copies except for the first count characters.
Example:
 'hello world' copyButFirst:4. 
     => 'o world'
aString copyButLast: count
aString.copyButLast( count )   [JS]
Copies except for the last count characters.
Example:
 'hello world' copyButLast:4. 
     => 'hello w'
aString copyBetween: leftString and: rightString caseSensitive: boolean
aString.copyBetween_and_caseSensitive( leftString, rightString, boolean )   [JS]
Finds two substrings and copies the part between them.
Example:
 'hello small world' copyBetween:'hello' and:'world' caseSensitive:true
     => ' small '

 'helloworld' copyBetween:'hello' and:'world' caseSensitive:true
     => '' (an empty string)

 'hello small World' copyBetween:'hello' and:'world' caseSensitive:true
     => nil

 'hello small World' copyBetween:'hello' and:'world' caseSensitive:false
     => ' small '
aString copyReplaceString: oldString withString: newString
aString.copyReplaceString_withString( oldString, newString )   [JS]
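Returns a copy in which oldString is replaced by newString. If oldString does not occur, the copy has the same contents as the original.
Example:
 'hello small world' copyReplaceString:'small' withString:'big' 
     => 'hello big world'

 'hello small world' copyReplaceString:'big' withString:'bigger' 
     => 'hello small world'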
aString withoutPrefix: prefixString [ caseSensitive: boolean ]
aString.withoutPrefix[_caseSensitive]( prefixString [, boolean ])   [JS]
If string starts with a prefix-string, return a copy without it. Otherwise return the original string.
Example:
 'hello small world' withoutPrefix:'hello ' 
     => 'small world'

 'small world' withoutPrefix:'hello '
     => 'small world'
aString withoutSuffix: suffixString [ caseSensitive: boolean ]
aString.withoutSuffix[_caseSensitive]( suffixString [, boolean ])   [JS]
If string ends with a suffix-string, return a copy without it. Otherwise return the original string.
Example:
 'hello small world' withoutSuffix:' world' 
     => 'hello small'

 'hello small' withoutSuffix:' world'
     => 'hello small'
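Because the copy functions in this section take 1-based indices with an inclusive end index, they do not map one-to-one onto the 0-based, end-exclusive slices of Python. The following Python sketch shows the index arithmetic for three of them; the helper names are invented for this illustration, and range checking is deliberately left out.

```python
def copy_from_to(text: str, start: int, end: int) -> str:
    """Emulate copyFrom:to: (1-based start and end, both inclusive)."""
    return text[start - 1:end]


def copy_from_count(text: str, start: int, count: int) -> str:
    """Emulate copyFrom:count: (1-based start, then count characters)."""
    return text[start - 1:start - 1 + count]


def copy_but_last(text: str, count: int) -> str:
    """Emulate copyButLast: (everything except the last count characters)."""
    return text[:max(len(text) - count, 0)]


print(copy_from_to("hello world", 1, 5))      # hello
print(copy_from_count("hello world", 7, 3))   # wor
print(copy_but_last("hello world", 4))        # hello w
```

Only the start index needs the "- 1" adjustment: an inclusive 1-based end index and an exclusive 0-based end index happen to be the same number.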

Concatenation

string1 , string2
string1 + string2 [JS]
The Smalltalk comma operator 1) and the JavaScript plus operator concatenate two strings.
Example (Smalltalk):
 'hello' , ' ' , 'world'
     => 'hello world'
Example (JavaScript):
 "hello" + " " + "world"
     => "hello world"
string ,* integer
the Smalltalk comma-star operator generates a new string by repeating the first string a number of times
Example:
 'abc' ,* 5
     => 'abcabcabcabcabc'

1) Although in theory the plus operator could have been overloaded in Smalltalk as well, this was not done, for readability and debuggability reasons: using a comma makes it clear that the intention is string concatenation and not addition. Also notice that the JavaScript semantics can be considered unclean, as it concatenates strings if either operand is a string. Thus, you have to be careful if addition is intended and either incoming operand could possibly be a string.
In contrast, Smalltalk will report an error if you try to concatenate or add strings and numbers.
This is not an inconvenience; it is a feature that protects you from stupid and hard-to-find errors!
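The strict behaviour described in this footnote is not unique to Smalltalk. As a point of comparison, Python also refuses to mix strings and numbers with its plus operator. The sketch below is plain Python, not expecco code, and try_concat is a helper name invented for the illustration.

```python
def try_concat(text: str, value):
    """Return text + value, or None when Python rejects the operand types."""
    try:
        return text + value
    except TypeError:
        return None


print(try_concat("total: ", "5"))   # total: 5
print(try_concat("total: ", 5))     # None: str + int is refused
print("total: " + str(5))           # total: 5 (explicit conversion)
```

In both languages, the explicit conversion makes the intention visible in the source code.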

Splitting

aString splitBy: aCharacter
aString.splitBy( aCharacter ) [JS]
Splits a string into pieces, given a splitting character. The result is a collection containing the parts.
Examples:
 'hello world here are six words' splitBy: $      <- trailing invisible space here
     => #( 'hello' 'world' 'here' 'are' 'six' 'words')

 'hello world here are six words' splitBy: (Character space) <- no invisible space
     => #( 'hello' 'world' 'here' 'are' 'six' 'words')

 'hello-world here are six-words' splitBy: $-
     => #( 'hello' 'world here are six' 'words')
aString splitOn: aCharacter
string1 splitOn: string2
string1 splitOn: regex
aString splitOn: [:char | <condition-expression on char> ]
aString splitOn: #<name of condition-expression on char>
aString.splitOn( aCharacter )   [JS]
string1.splitOn( string2 )   [JS]
aString.splitOn( (char) => <condition-expression on char> )   [JS]
Splits a string into pieces, given a splitter.
The splitter may be a single character, a string, a regular expression or a block/the name of an element-method, which returns true to split.
This is a more general version of the above "splitBy:", for complex splits.
Examples:
 'hello here are five words' splitOn: $      <- trailing space here
     => #( 'hello' 'here' 'are' 'five' 'words')

 'hello here are five words' splitOn:(Character space)      <- the same, but readable
     => #( 'hello' 'here' 'are' 'five' 'words')

 'hello-world here are six-words' splitOn: $-
     => #( 'hello' 'world here are six' 'words')

 'hello world and goodbye world' splitOn: ' and '
     => #( 'hello world' 'goodbye world')

 'hello, commas and semis; here' splitOn: [:ch | (ch == $,) or:[ ch == $; ]]
     => #( 'hello' 'commas and semis' 'here')

 'hello, commas and semis; here' splitOn: [:ch | ',;' includes:ch ] <- same functionality as above
     => #( 'hello' 'commas and semis' 'here')

 'aWordWithCamelCase' splitOn: [:ch | ch isUppercase ]
     => #( 'a' 'Word' 'With' 'Camel' 'Case')

 'aWordWithCamelCase' splitOn: #isUppercase
     => #( 'a' 'Word' 'With' 'Camel' 'Case')

 c'some text\twith\ndifferent separators' splitOn: #isSeparator
     => #( 'some' 'text' 'with' 'different' 'separators')

 '123abc456def789' splitOn:'[a-z]*' asRegex 
     => #('123' '456' '789')

JS examples:
 "hello here are five words".splitOn( $' ' )    <- space character here
     => [ "hello" , "here" , "are" , "five" , "words" ]

 "hello here are five words".splitOn( ' ' )    <- string with a space here
     => [ "hello" , "here" , "are" , "five" , "words" ]

 "hello world and goodbye world".splitOn(" and ")
     => [ "hello world" , "goodbye world"]

 "123abc456def789".splitOn("[a-z]*".asRegex()) 
     => [ "123" , "456" , "789" ]

 "hello-world, commas and semis; here".splitOn( (ch) => (ch == $',') || ( ch == $';') )
     => [ "hello-world" , "commas and semis" , "here"]

Case Conversion

aString asLowercase
aString asUppercase
aString asUppercaseFirst
aString.asUppercase()   [JS]
aString.asLowercase()   [JS]
aString.asUppercaseFirst()   [JS]
Convert to lowercase, to uppercase, or to a string with an uppercase first character.
Example:
 'HELLO' asLowercase
     => 'hello'

 'hello' asUppercase
     => 'HELLO'

 'hello' asUppercaseFirst
     => 'Hello'
Be aware that these functions do not handle some special cases in the Unicode character set, where a single character has to be replaced by two characters, or by a different character depending on its position (most programmers, even native speakers, do not know about them!).
The German "ß" (usually up-cased as "SS") is an example of the first kind; the Greek sigma, which has to be down-cased differently depending on its position within a string, is an example of the second kind.
If your test depends on such specifics, it must handle them itself (the 2008 Unicode standard does, however, include an upper-case eszett, which is handled by those functions).
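To show what these special cases look like, Python's built-in case conversion can serve as an illustration, because it happens to apply the Unicode special-casing rules. This is plain Python and says nothing about what the Smalltalk functions return; the characters are written as escapes so the example does not depend on the editor's encoding.

```python
# First kind: one character becomes two.
eszett_upper = "\u00df".upper()            # "ß" -> "SS"

# The capital eszett (U+1E9E) maps back to the ordinary one.
capital_eszett_lower = "\u1e9e".lower()    # -> "ß" (U+00DF)

# Second kind: the result depends on the position within the word.
word = "\u039f\u0394\u039f\u03a3"          # Greek "ΟΔΟΣ", ends in capital sigma
word_lower = word.lower()                  # final sigma becomes U+03C2
leading_lower = "\u03a3\u0391".lower()     # leading sigma becomes U+03C3

print(eszett_upper, len(eszett_upper))
print(word_lower, leading_lower)
```

A test which must compare such strings independently of case therefore has to decide which of these mappings it expects, and normalize both sides accordingly.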

Comparing

Be reminded that there are multiple possible definitions of "being equal" w.r.t. strings: identity, equality and case-insensitive equality.
Identity is about two objects being the identical object (instance), and is handled by the "==" operator in Smalltalk.
Equality means being the same type of object and having the same contents, and is handled by the "=" operator.

string1 = string2
string1 == string2   [JS]
Compares two strings for equal contents, NOT ignoring case.
Notice that "==" in Smalltalk is an identity compare, which corresponds to "===" in JavaScript,
and "=" in Smalltalk is a "same contents" compare, whereas in JS it is the assignment operator.
Keep that in mind to avoid hard-to-find bugs.
Example:
 'HELLO' = 'HeLlo'
     => false

 'hello' = 'hello' copy
     => true

 'hello' == 'hello' copy
     => false

 'hello' = 'hello' asUnicode16String
     => true
string1 sameAs: string2
string1.sameAs( string2 )   [JS]
Compares two strings, ignoring case.
"caselessEqual:" is an alias which performs the same operation (and may be more self describing).
Example:
 'HELLO' sameAs: 'HeLlo'
     => true

 'HELxO' sameAs: 'HeLlo'
     => false

 'HELLO' caselessEqual: 'HeLlo'
     => true
string1 caselessBefore: string2
string1 caselessAfter: string2
string1 caselessEqual: string2
string1.caselessBefore( string2 )   [JS]
string1.caselessAfter( string2 )   [JS]
Compare two strings, ignoring case.
Return true if string1 is smaller than, equal to, or larger than string2, respectively.
National characters get no special treatment (i.e. the comparison is based on the Unicode code points).
Example:
 'HELLO' caselessBefore: 'world'
     => true

 'bce' caselessBefore: 'abc'
     => false
string1 startsWith: prefixString
string1 startsWith: prefixString caseSensitive: boolean
string1.startsWith( prefixString )   [JS]
string1.startsWith_caseSensitive( prefixString, boolean )   [JS]
Checks if a string starts with another string.
Comes in two variants, one being strict, the other with optional case-insensitivity.
Example:
 'hello' startsWith: 'hel'
     => true

 'Hello' startsWith: 'hel'
     => false

 'Hello' startsWith: 'hel' caseSensitive: false
     => true

 'Hexlo' startsWith: 'hel' caseSensitive: false
     => false
string1 endsWith: suffixString
string1 endsWith: suffixString caseSensitive: boolean
Checks if a string ends with another string.
Comes in two variants, one being strict, the other with optional case-insensitivity.
Example:
 'hello' endsWith: 'lo'
     => true

 'Hello' endsWith: 'Lo'
     => false

 'Hello' endsWith: 'Lo' caseSensitive: false
     => true

 'Hexlo' endsWith: 'LX' caseSensitive: false
     => false

Searching

aString indexOf: aCharacter
aString lastIndexOf: aCharacter
Returns the first/last index of an element (a character). The index is 1-based; returns 0 if not found.
Example:
 'HELLO' indexOf: $L
     => 3

 'HELLO' indexOf: $x
     => 0

 'HELLO' lastIndexOf: $L
     => 4
aString indexOf: aCharacter startingAt: startIndex
aString lastIndexOf: aCharacter startingAt: startIndex
Returns the next/previous index of an element (a character) given a search start index .
Returns 0 if not found.
Example:
 'HELLO WORLD' indexOf: $O startingAt: 6
     => 8

 'HELLO WORLD' indexOf: $x startingAt: 6
     => 0

 'HELLO WORLD' indexOf: $L startingAt: 6
     => 10

 'HELLO WORLD' lastIndexOf: $O startingAt: 7
     => 5
aString indexOfAny: aCollectionOfCharacters [ startingAt: startIndex ]
aString lastIndexOfAny: aCollectionOfCharacters [ startingAt: startIndex ]
Similar to the above, but searches for any element in the given argument collection. This may be a string (of characters) or an array or any other collection of characters.
Returns 0 if not found.
Example:
 'HELLO, WORLD' indexOfAny: ',;'
     => 6

 'HELLO; WORLD' indexOfAny: ',;'
     => 6

 'HELLO; WORLD' indexOfAny: #( $, $; )
     => 6

 'HELLO. WORLD' indexOfAny: #( $, $; )
     => 0
aString indexOfSeparator
aString indexOfSeparatorStartingAt: startIndex
aString lastIndexOfSeparator
aString lastIndexOfSeparatorStartingAt: startIndex
Searches for the first/last whitespace character (Space, Tab, CR or NL).
Returns 0 if not found.
Example:
 'HELLO WORLD' indexOfSeparator
     => 6

 'HELLO WORLD' indexOfSeparatorStartingAt: 7
     => 0

 'HELLO abc World' lastIndexOfSeparator
     => 10

 'HELLO abc World' lastIndexOfSeparatorStartingAt:10
     => 10

 'HELLO abc World' lastIndexOfSeparatorStartingAt:9
     => 6
aString indexOfString: aSubstring [ startingAt: startIndex ]
aString lastIndexOfString: aSubstring [ startingAt: startIndex ]
Returns the first/last index at which the substring is found. Returns 0 if not found.
Example:
 'HELLO' indexOfString:'LL'
     => 3

 'HELLO' indexOfString: 'LX'
     => 0

 'HELLO BELLO' lastIndexOfString: 'LL'
     => 9
aString includesString: aSubstring [ caseSensitive: boolean ]
Returns true if aString contains the sub-string aSubstring; false otherwise.
Example:
 'HELLO' includesString:'LL'
     => true

 'HELLO' includesString: 'll'
     => false

 'HELLO' includesString: 'll' caseSensitive: false 
     => true
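The search functions in this section return 1-based indices and use 0 for "not found", whereas Python's str.find returns 0-based indices and -1 for "not found". Adding 1 to Python's result converts both conventions at once, which the following sketch illustrates. The helper names are invented here, and the reading of the backward search's startingAt: argument (the search starts at that index and moves towards the beginning) is an assumption taken from the examples above.

```python
def index_of(text: str, char: str, starting_at: int = 1) -> int:
    """1-based index of char at or after starting_at; 0 if not found."""
    # find() yields -1 when nothing is found, so "+ 1" turns that into 0.
    return text.find(char, starting_at - 1) + 1


def last_index_of(text: str, char: str, starting_at=None) -> int:
    """1-based index of the last char at or before starting_at; 0 if not found."""
    end = len(text) if starting_at is None else starting_at
    return text.rfind(char, 0, end) + 1


print(index_of("HELLO", "L"))                # 3
print(index_of("HELLO", "x"))                # 0
print(last_index_of("HELLO", "L"))           # 4
print(index_of("HELLO WORLD", "O", 6))       # 8
print(last_index_of("HELLO WORLD", "O", 7))  # 5
```

A practical consequence of the 0 convention is that a search result can be tested directly against zero before it is used as an index.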

Pattern Matching

Be reminded that there are two match algorithms available: the simpler GLOB pattern matcher (which resembles the way the Unix shell and cmd.exe do filename matching), and the more complicated but also much more powerful REGEX matcher. The most obvious difference is that in GLOB, the star "*" matches any string, including the empty string, whereas in REGEX, the star matches the PREVIOUS pattern zero or more times.
Thus, the simple GLOB pattern "*.c" (to match C-filenames) has to be written as ".*\.c" in REGEX (because in regex, "." matches ANY character, the star repeats it, and the next "." has to be escaped because it would otherwise stand for "any character").
If you are new to these concepts, please read the documentation on GLOB and REGEX patterns.
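The difference between the two pattern languages can be tried out in any environment that offers both. The following sketch uses Python's standard fnmatch and re modules as stand-ins for the GLOB and REGEX matchers; the file names are made up, and the Python functions are not the expecco API.

```python
import fnmatch
import re

names = ["main.c", "main.cpp", "README", "x.c", "abc"]

# GLOB: the star alone stands for "any string".
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.c")]

# REGEX: ".*" stands for "any string", and the literal dot is escaped.
regex_hits = [n for n in names if re.fullmatch(r".*\.c", n)]

# Forgetting the escape lets the dot match any character, so "abc" slips in.
unescaped_hits = [n for n in names if re.fullmatch(r".*.c", n)]

# Using the GLOB pattern as a REGEX fails: the star has nothing to repeat.
try:
    re.compile("*.c")
    glob_pattern_is_valid_regex = True
except re.error:
    glob_pattern_is_valid_regex = False

print(glob_hits)                    # ['main.c', 'x.c']
print(regex_hits)                   # ['main.c', 'x.c']
print(unescaped_hits)               # ['main.c', 'x.c', 'abc']
print(glob_pattern_is_valid_regex)  # False
```

The third result shows why the escape matters: an unescaped dot silently accepts names that were never meant to match.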

The following are the most useful API entries (but the browser will show a few more):

GLOB Pattern Matching

aString matches: patternString [ caseSensitive: boolean ]
True if a string matches a GLOB match pattern.
Example:
 'HELLO world' matches: 'HE*'
     => true

 'HELLO world' matches: 'he*'
     => false

 'HELLO world' matches: 'he*' caseSensitive: false
     => true
patternStrings compoundMatch: aString [ caseSensitive: boolean ]
True if a string matches any pattern in patternStrings, which contains multiple GLOB match patterns separated by semicolons.
Useful e.g. to verify a string against multiple file extensions (e.g. '*.txt;*.doc;*.pdf' compoundMatch: aFilename).
Example:
 'f*;b*' compoundMatch:'foo'
     => true

 'f*;b*' compoundMatch:'bar'
     => true

 'f*;b*' compoundMatch:'xxx'
     => false

REGEX Pattern Matching

aString matchesRegex: patternString [ caseSensitive: boolean ]
True if a string matches a regex pattern.
Example:
 'HELLO world' matchesRegex: 'H.*O'
     => true

 'HELLO world' matchesRegex: 'h.*o'
     => false

 'HELLO world' matchesRegex: 'h.*o' caseSensitive: false
     => true
 ('myFile.txt' asFilename contents) select:[:line |
    line matchesRegex:'[0-9]+\:.*' 
 ]
     => all lines from the file which start with a number followed by a colon.
patternString regexMatchesIn: aString
Returns a collection containing all regex matching substrings.
Example:
 '[0-9]+' regexMatchesIn: '1234 abcd 3456 defg'
     => OrderedCollection('1234' '3456')
aString subExpressionsInRegex: patternString caseSensitive: boolean
Returns a collection containing the individual partial matches. Partials are regex subexpressions in parentheses.
Example:
 '1234abc3456' subExpressionsInRegex: '([0-9]+)abc([0-9]+)' caseSensitive:false
     => OrderedCollection('1234' '3456')

 '123 abc xyz :bla' subExpressionsInRegex: '([0-9]+).*(\:.*)' caseSensitive:false
     => OrderedCollection('123' ':bla')
Notice that ':' is a special character in the pattern and must therefore be escaped with a backslash.
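The two ideas behind regexMatchesIn: and subExpressionsInRegex:, namely "all matching substrings" versus "the parenthesized parts of one match", exist in most regex libraries. As a cross-check, the following Python sketch reproduces two of the examples above with the standard re module; it is an analogue for illustration, not the expecco API.

```python
import re

# All substrings matching the pattern (compare regexMatchesIn:).
all_numbers = re.findall(r"[0-9]+", "1234 abcd 3456 defg")

# The parenthesized subexpressions of a single match over the whole string
# (compare subExpressionsInRegex:caseSensitive:).
match = re.fullmatch(r"([0-9]+)abc([0-9]+)", "1234abc3456")
parts = list(match.groups()) if match else []

print(all_numbers)   # ['1234', '3456']
print(parts)         # ['1234', '3456']
```

Both calls happen to return the same list here, but for different reasons: the first collects two independent matches, the second takes one match apart.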

Converting

aString asByteArray
Returns a byte array containing the codePoints (Unicode codes).
The characters should be within the ISO8859 range 0x00 .. 0xFF.
If they are not, a byte array containing the wide characters is returned (to be exact: in the machine's native byte order, which is low byte first on x86 cpus).
Example:
 'abc123' asByteArray
     => #[97 98 99 49 50 51]

 'äöü' asByteArray
     => #[228 246 252]

 c'abc\u2022123' asByteArray
     => #[97 0 98 0 99 0 34 32 49 0 50 0 51 0]
aString asFilename
Returns a filename instance, which provides functions to operate on files and directories.
See "Filename protocol" for its functions.
Example:
 'c:\data.txt' asFilename modificationTime
     => 2017-07-28 10:31:23

 '/etc' asFilename exists
     => true
aStringWithMultipleLines asCollectionOfLines
Returns a collection of strings, each containing one line from the original string. Handles any combination of CR, LF or CRLF as line separator. The resulting line-collection can then be further processed using functions from the Collection protocol.
Example:
 'data.txt' asFilename contents asCollectionOfLines
     => #( 'line1' 'line2' 'line3' ... )
aString asCollectionOfWords
Returns a collection of strings, each containing one word from the original string. Words are separated by whitespace (space, tab or CR/LF).
Example:
 'hello bla bla world' asCollectionOfWords
     => #( 'hello' 'bla' 'bla' 'world' )

Encoding / Decoding

aString utf8Encoded
aString utf8Decoded
Encode / decode into/from utf8 encoding.
Example:
 'äöü' utf8Encoded
     => 'Ã¤Ã¶Ã¼' (six characters, one per UTF-8 byte)

 'äöü' utf8Encoded asByteArray
     => #[195 164 195 182 195 188]

 'äöü' utf8Encoded utf8Decoded
     => 'äöü'
aString utf16Encoded
Encodes into a UTF-16 encoded two-byte string (i.e. still a vector of characters), taking care of characters above codepoint 0xFFFF. This can then be further converted to a vector of bytes in BE (big-endian) or LE (little-endian) order (see below).
Example:
 'hello äöü' utf16Encoded encodeInto:#utf16LE
     => #[104 0 101 0 108 0 108 0 111 0 32 0 228 0 246 0 252 0]
aString encodeInto: nameOfEncoding
aString decodeFrom: nameOfEncoding
Similar to the above, but nameOfEncoding gives the encoding's name (as symbol). Supported encodings are: #utf, #utf16BE, #utf16LE, #iso8859-1 to #iso8859-8, #iso8859-15, #iso8859-16, #koi7, #cp437, #cp850, #cp1250, #cp1252, #jis7, #gb, #big5, #ebcdic.
aString encodeFrom: nameOfEncoding1 into: nameOfEncoding2
Transcodes from one encoding into another. E.g. "someString encodeFrom:#cp850 into:#koi7" would recode from the Windows codepage 850 into cyrillic koi7.
aString base64Encoded
aString base64Decoded
aString base64DecodedString
Encode / decode into/from base64 encoding.
Example:
 'äöü' base64Encoded
     => '5Pb8'

 'äöü' base64Encoded asByteArray
     => #[53 80 98 56]

 'äöü' base64Encoded base64Decoded
     => #[228 246 252]

 'äöü' base64Encoded base64DecodedString
     => 'äöü'

 'äöü' base64Encoded base64Decoded asString
     => 'äöü'
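The byte values shown in the examples of this section and of the "Converting" section can be cross-checked in any language with codec support. The following Python sketch does so with the standard library; it assumes that a one-byte string corresponds to ISO 8859-1 (Latin-1), which is consistent with the asByteArray example above.

```python
import base64

text = "\u00e4\u00f6\u00fc"                      # the three characters ä ö ü

latin1_bytes = list(text.encode("latin-1"))      # one byte per character
utf8_bytes = list(text.encode("utf-8"))          # two bytes per character here
utf16le_bytes = list(("hello " + text).encode("utf-16-le"))

# Base64 works on bytes, so the text is first turned into Latin-1 bytes.
b64_text = base64.b64encode(text.encode("latin-1")).decode("ascii")
round_trip = base64.b64decode(b64_text).decode("latin-1")

print(latin1_bytes)    # [228, 246, 252]
print(utf8_bytes)      # [195, 164, 195, 182, 195, 188]
print(utf16le_bytes[-6:])   # [228, 0, 246, 0, 252, 0]
print(b64_text)        # 5Pb8
```

The low byte comes first in each UTF-16 pair, which is what "LE" stands for.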

Hashing

See Cryptographic API Functions.



Copyright © 2014-2024 eXept Software AG