Core Function Find
Find( <string>, <pattern>, <offset>, <plain> )
Description
Find the first occurrence of the pattern in the string passed.
Parameters
string
The string to evaluate.
pattern
The pattern to search for. See Remarks to learn about patterns.
offset
Optional; The starting position of the search. A negative offset counts back from the end of the string.
Default: 0
plain
Optional; Flag to indicate if the operations should use patterns or not.
true = use the pattern as a regular expression
false = treat the pattern as plain text
Default: true
Return Value
Success: Returns an array whose first two elements are the start and end positions of the match, followed by any captured substrings.
Failure: Returns NULL.
Remarks
This function behaves much like Lua's string.find(); however, it returns a start position counted from 0 (Lua's find() counts from 1) and lowers the end position by 1 to match.
This is because in Sputnik the characters in a string are indexed from 0, not 1.
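The index shift described above can be sketched in a few lines. This is a Python model of the rule only, not Sputnik code; the helper name `lua_to_zero_based` is hypothetical.

```python
def lua_to_zero_based(start, end):
    """Shift a 1-based (start, end) pair, as Lua reports it, down by one."""
    return start - 1, end - 1


# Lua's string.find("Hello cat world!", "cat") reports (7, 9);
# the same match counted from 0 is (6, 8).
print(lua_to_zero_based(7, 9))  # (6, 8)
```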
Patterns
Pattern Item:
A pattern item can be
- a single character class, which matches any single character in the class;
- a single character class followed by '*', which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
- a single character class followed by '+', which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
- a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
- a single character class followed by '?', which matches 0 or 1 occurrence of a character in the class;
- %n, for n between 1 and 9; such item matches a substring equal to the n-th captured string (see below);
- %bxy, where x and y are two distinct characters; such item matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, the item %b() matches expressions with balanced parentheses.
Pattern:
A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the empty capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
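To see where the numbers 3 and 5 come from, the same idea can be tried with Python's `re` module, where an empty group also records a position. This is only an analogy: Python counts from 0, so the positions are shifted up by one to reproduce the 1-based values quoted above. Whether Sputnik shifts position captures the way it shifts the start and end values is not stated here.

```python
import re

# Empty groups around "aa" record where the match of "aa" begins and ends.
m = re.search(r"()aa()", "flaaap")

zero_based = (m.start(1), m.start(2))
one_based = tuple(p + 1 for p in zero_based)

print(zero_based)  # (2, 4)
print(one_based)   # (3, 5)
```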
A pattern cannot contain embedded zeros. Use %z instead.
The standard patterns you can search for are:
.  --- (a dot) represents all characters.
%a --- all letters.
%c --- all control characters.
%d --- all digits.
%l --- all lowercase letters.
%p --- all punctuation characters.
%s --- all space characters.
%u --- all uppercase letters.
%w --- all alphanumeric characters.
%x --- all hexadecimal digits.
%z --- the character with hex representation 0x00 (null).
%% --- a single '%' character.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).
For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.
The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.
Important! The uppercase versions of the above represent the complement of the class, e.g. %U represents everything except uppercase letters and %D represents everything except digits.
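The complement rule can be sketched by translating a class into a Python `re` character set. The table below is a hypothetical ASCII-only approximation that ignores locale, and `class_to_regex` is an illustrative helper, not part of Sputnik.

```python
import re

# ASCII approximations of some single-letter classes (assumed, locale ignored).
CLASSES = {
    "a": "A-Za-z",
    "d": "0-9",
    "l": "a-z",
    "u": "A-Z",
    "w": "A-Za-z0-9",
    "x": "0-9A-Fa-f",
    "s": r" \t\n\r\f\v",
}


def class_to_regex(item):
    """Translate one class such as '%d' or '%D' into a Python regex set."""
    letter = item[1]
    body = CLASSES[letter.lower()]
    # An uppercase letter selects the complement of the class.
    negate = "^" if letter.isupper() else ""
    return f"[{negate}{body}]"


print(class_to_regex("%d"))  # [0-9]
print(class_to_regex("%D"))  # [^0-9]
print(re.findall(class_to_regex("%U") + "+", "ABCdefGHI"))  # ['def']
```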
There are some "magic characters" (such as %) that have special meanings. These are:
^ $ ( ) % . [ ] * + - ?
If you want to use those in a pattern (as themselves) you must precede them by a % symbol.
e.g. %% matches a single %
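When the text to search for comes from user input, escaping can be automated. The following is a minimal Python sketch of that idea; `escape_magic` is a hypothetical helper, and it assumes the magic characters are exactly the ones listed above.

```python
# The magic characters listed above.
MAGIC = set("^$()%.[]*+-?")


def escape_magic(text):
    """Prefix every magic character with '%' so that it matches itself."""
    return "".join("%" + ch if ch in MAGIC else ch for ch in text)


print(escape_magic("100% (sure)"))  # 100%% %(sure%)
```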
As with normal regular expressions you can build your own pattern classes by using square brackets, eg.
[abc] ---> matches a, b or c
[a-z] ---> matches lowercase letters (same as %l)
[^abc] ---> matches anything except a, b or c
[%a%d] ---> matches all letters and digits
[%a%d_] ---> matches all letters, digits and underscore
[%[%]] ---> matches square brackets (had to escape them with %)
The repetition characters are:
+ ---> 1 or more repetitions (greedy)
* ---> 0 or more repetitions (greedy)
- ---> 0 or more repetitions (non greedy)
? ---> 0 or 1 repetition only
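The difference between greedy and non-greedy repetition is easiest to see side by side. The sketch below uses Python's `re` module as a stand-in: Python spells the non-greedy form `*?`, which plays the role that `-` plays in the patterns described here.

```python
import re

text = "<b>bold</b>"

# Greedy: '.*' grabs as much as possible before the final '>'.
greedy = re.search(r"<.*>", text).group(0)

# Non-greedy: '.*?' stops at the first '>' it can.
lazy = re.search(r"<.*?>", text).group(0)

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```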
The standard "anchor" characters apply:
^ ---> anchor to start of subject string
$ ---> anchor to end of subject string
You can also use round brackets to specify "captures", similar to normal regular expressions:
You see (.*) here
Here, whatever matches (.*) becomes the first capture.
You can also refer to matched substrings (captures) later on in an expression:
    printr find ("You see dogs and dogs", "You see (.*) and %1"); // 0 20 dogs
    printr find ("You see dogs and cats", "You see (.*) and %1"); // NULL
This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).
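The same back-reference behaviour can be modelled in Python, where `\1` stands in for `%1`. The helper `find_span` is hypothetical; it returns a 0-based start, an inclusive end, and the captures, which mirrors the shape of the result described in Return Value.

```python
import re


def find_span(text, pattern):
    """Return (start, inclusive_end, *captures), or None if nothing matches."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.start(), m.end() - 1) + m.groups()


# The second "dogs" must equal whatever the group captured.
print(find_span("You see dogs and dogs", r"You see (.*) and \1"))  # (0, 20, 'dogs')

# "cats" does not repeat the captured word, so there is no match.
print(find_span("You see dogs and cats", r"You see (.*) and \1"))  # None
```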
As a special case, an empty capture () returns its own position in the string as the captured value, e.g.
    printr find ("You see dogs and cats", "You .* ()dogs .*"); // 0 20 9
What this is saying is that the word "dogs" starts at column 9.
Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:
    printr find ("I see a (big fish (swimming) in the pond) here", "%b()"); // 8 40
After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".
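The counting rule behind %b can be written out as a short routine. This is a Python sketch of the idea, not the actual implementation; `find_balanced` is a hypothetical name, and it reports a 0-based start and an inclusive end.

```python
def find_balanced(text, open_ch, close_ch):
    """Return (start, end) of the first balanced run, or None if there is none."""
    start = text.find(open_ch)
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == open_ch:
                depth += 1          # +1 for every opening character
            elif text[i] == close_ch:
                depth -= 1          # -1 for every closing character
                if depth == 0:      # back at the top level: match ends here
                    return start, i
        # This opener was never closed; try the next one.
        start = text.find(open_ch, start + 1)
    return None


s = "I see a (big fish (swimming) in the pond) here"
span = find_balanced(s, "(", ")")
print(span)                       # (8, 40)
print(s[span[0]:span[1] + 1])     # (big fish (swimming) in the pond)
```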
Example
Search for raw text in a string (No patterns)
    my $Test = "Hello cat world!";
    printr Find($Test, "cat", 0, true);
    // Prints
    // Array
    // (
    //     [0] => 6
    //     [1] => 8
    // )
Search for raw text in a string but handle the result manually
    my $Test = "the quick brown fox";
    my List ($Pos, $PosEnd) = Find($Test, "brown", 0, true);
    say "Position: $Pos";
    say "EndPosition: $PosEnd";
    // Prints
    // Position: 10
    // EndPosition: 14
Use a pattern to find it. Note that only the start and end positions are returned when the pattern contains no capture groups.
    my $Test = "Hello cat world!";
    printr Find($Test, "cat");
    // Prints
    // Array
    // (
    //     [0] => 6
    //     [1] => 8
    // )
Another pattern match with no groups
    my $Test = "the quick brown fox";
    printr Find($Test, "quick");
    // Prints
    // Array
    // (
    //     [0] => 4
    //     [1] => 8
    // )
A group capture pattern this time
    my $Test = "the quick brown fox";
    printr Find($Test, "(%a+)");
    // Prints
    // Array
    // (
    //     [0] => 0
    //     [1] => 2
    //     [2] => the
    // )
Another group capture pattern this time
    my $Test = "the quick brown fox";
    printr Find($Test, "(%a+)", 10);
    // Prints
    // Array
    // (
    //     [0] => 10
    //     [1] => 14
    //     [2] => brown
    // )
Another pattern, but this time we extract the matched text ourselves
    my $Test = "the quick brown fox";
    my List ($Pos, $PosEnd) = Find($Test, "(%a+)", 10);
    say "Position: $Pos";
    say "PosEnd: $PosEnd";
    // The end position is inclusive, so the length is $PosEnd - $Pos + 1
    say "String: " . substr($Test, $Pos, $PosEnd - $Pos + 1);
    // Prints
    // Position: 10
    // PosEnd: 14
    // String: brown
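Because the end position is inclusive, the length of a match is end - start + 1. The arithmetic can be checked with a small Python model; `matched_text` is a hypothetical helper and Python slicing stands in for substr.

```python
def matched_text(text, start, end):
    """Extract a match given a 0-based start and an inclusive end position."""
    length = end - start + 1
    return text[start:start + length]


print(matched_text("the quick brown fox", 10, 14))  # brown
print(matched_text("the quick brown fox", 0, 2))    # the
```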
What happens when no match is found with regular expressions
    my $Test = "the quick brown fox";
    printr vardump(Find($Test, "fruit"));
    // Prints
    // NULL
What happens when no match is found without regular expressions
    my $Test = "the quick brown fox";
    printr vardump(Find($Test, "fruit", 0, true));
    // Prints
    // NULL
More examples
    my $Test = "You see dogs and dogs";
    printr Find($Test, "You see (.*)");
    // Prints
    // Array
    // (
    //     [0] => 0
    //     [1] => 20
    //     [2] => dogs and dogs
    // )
You can also refer to matched substrings (captures) later on in an expression:
    my $Test = "You see dogs and dogs";
    printr Find($Test, "You see (.*) and %1");
    // Prints
    // Array
    // (
    //     [0] => 0
    //     [1] => 20
    //     [2] => dogs
    // )
As shown here, NULL is returned when the back-referenced substring is not found
    my $Test = "You see dogs and cats";
    printr vardump(Find($Test, "You see (.*) and %1"));
    // Prints
    // NULL
Another example of referring to matched substrings (captures) later on in an expression:
    my $Test = "You sir see dogs and dogs = sir";
    printr Find($Test, "You (.*) see (.*) and %2 = %1");
    // Prints
    // Array
    // (
    //     [0] => 0
    //     [1] => 30
    //     [2] => sir
    //     [3] => dogs
    // )
Examples using the Offset
    say "Offset -1";
    printr Find("abc", "(%a+)", -1); // 2 2 c
    say "Offset -2";
    printr Find("abc", "(%a+)", -2); // 1 2 bc
    say "Offset -3";
    printr Find("abc", "(%a+)", -3); // 0 2 abc
    say "Offset -4";
    // A negative Offset going beyond the string's length
    // will be converted to the string's length so
    // it does a full search.
    // This is contrary to a positive Offset out of bounds,
    // which will return NULL.
    printr Find("abc", "(%a+)", -4); // 0 2 abc
    say "Offset -100";
    // Even if it is -100 etc.
    printr Find("abc", "(%a+)", -100); // 0 2 abc
    say "Offset 0";
    printr Find("abc", "(%a+)", 0); // 0 2 abc
    say "Offset 1";
    printr Find("abc", "(%a+)", 1); // 1 2 bc
    say "Offset 2";
    printr Find("abc", "(%a+)", 2); // 2 2 c
    say "Offset 3";
    // Out of bounds positive Offset will return NULL
    printr Find("abc", "(%a+)", 3); // NULL
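The offset handling shown in these examples can be summarised as a small rule. The Python sketch below is inferred only from the outputs above; `normalize_offset` is a hypothetical helper, and it models where the search starts rather than the search itself.

```python
def normalize_offset(offset, length):
    """Return the 0-based position where the search starts, or None."""
    if offset < 0:
        # Negative offsets count back from the end, clamped to the start.
        return max(length + offset, 0)
    if offset >= length:
        # A positive offset past the end means nothing can be found.
        return None
    return offset


for off in (-1, -2, -3, -4, -100, 0, 1, 2, 3):
    print(off, normalize_offset(off, len("abc")))
```

For "abc" this yields start positions 2, 1, 0, 0, 0, 0, 1, 2 and None, matching the first value of each result listed above.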