Core Function Find

From Sputnik Wiki


Find( <string>, <pattern>, <offset>, <plain> )

Description

Find the first occurrence of the pattern in the string passed.

Parameters

string

The string to evaluate.

pattern

The pattern to search for. See Remarks to learn about patterns.

offset

Optional; The starting position of the search. A negative offset counts back from the end of the string.

Default: 0

plain

Optional; Flag to indicate whether the pattern should be treated as plain text.

true = treat the pattern as plain text
false = interpret the pattern as a pattern (see Remarks)

Default: false

Return Value

Success: Returns an array whose first two elements are the start and end positions of the match (both 0-based and inclusive), followed by the text of any captures.

Failure: Returns NULL.

Remarks

This function is much the same as Lua's string.find(); however, the start and end positions it returns are 0-based (Lua's are 1-based), so both are 1 lower than Lua would report. The offset likewise begins at 0 here, whereas in Lua it begins at 1.

This is because in Sputnik chars in a string start at 0 not 1.
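
The 0-based convention can be illustrated with a minimal sketch that reuses the 'banana' call from the tutorial on this page. The substr(string, start, length) argument order is an assumption taken from how substr is used in the examples further down.

```sputnik
my $Text = "banana";
my List ($Start, $End) = Find($Text, "an");
say "Start: $Start";
say "End: $End";
// Both positions are inclusive, so the match length is End - Start + 1
say "Text: " . substr($Text, $Start, $End - $Start + 1);
// Prints
// Start: 1
// End: 2
// Text: an
```

The same search in Lua would report 2 and 3.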

Patterns

Pattern Item

A pattern item can be

a single character class, which matches any single character in the class;
a single character class followed by '*', which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
a single character class followed by '+', which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
a single character class followed by '?', which matches 0 or 1 occurrence of a character in the class;
%n, for n between 1 and 9; such item matches a substring equal to the n-th captured string (see below);
%bxy, where x and y are two distinct characters; such item matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, the item %b() matches expressions with balanced parentheses.
%f[set], a frontier pattern; such item matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set. The set set is interpreted as previously described. The beginning and the end of the subject are handled as if they were the character '\0'.

Pattern

A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.

Captures

A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3.

As a special case, the empty capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
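
As a small sketch of how captures are numbered by their left parentheses, the call below nests two captures inside an outer one. The input string is made up for illustration, and the output layout assumes printr formats the result the same way as in the match() examples on this page.

```sputnik
// Capture 1 is the outer group, captures 2 and 3 are the inner groups
printr match("key = value", '((%a+)%s*=%s*(%a+))');
// Prints
// Array
// (
//     [0] => key = value
//     [1] => key
//     [2] => value
// )
```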

Standard patterns

The standard patterns you can search for are:

 . --- (a dot) represents all characters. 
%a --- all letters. 
%c --- all control characters. 
%d --- all digits. 
%l --- all lowercase letters. 
%p --- all punctuation characters. 
%s --- all space characters. 
%u --- all uppercase letters. 
%w --- all alphanumeric characters. 
%x --- all hexadecimal digits. 
%z --- the character with hex representation 0x00 (null). 
%% --- a single '%' character.
%bxy --- matches x, y, and everything in between; allows for the nesting of balanced pairs as well.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).

A pattern cannot contain embedded zeros. Use %z instead.

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.

The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.

Important! - the uppercase versions of the above represent the complement of the class. eg. %U represents everything except uppercase letters, %D represents everything except digits.

There are some "magic characters" (such as %) that have special meanings. These are:

^ $ ( ) % . [ ] * + - ?

If you want to use those in a pattern (as themselves) you must precede them by a % symbol.

eg. %% would match a single %

As with normal regular expressions you can build your own pattern classes by using square brackets, eg.

[abc]    ---> matches a, b or c
[a-z]    ---> matches lowercase letters (same as %l)
[^abc]   ---> matches anything except a, b or c
[%a%d]   ---> matches all letters and digits
[%a%d_]  ---> matches all letters, digits and underscore
[%[%]]   ---> matches square brackets (had to escape them with %)
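
A short sketch of a custom class and an escaped magic character, using match() as it is used elsewhere on this page. The input strings are made up for illustration.

```sputnik
// [%a%d_]+ matches a run of letters, digits and underscores
printr match("=> user_42!", '[%a%d_]+');
// Prints
// user_42
 
// %. matches a literal dot, whereas a bare . would match any character
printr match("price: 9.99", '%d+%.%d+');
// Prints
// 9.99
```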

Repetition characters

The repetition characters are:

+  ---> 1 or more repetitions (greedy)
*  ---> 0 or more repetitions (greedy)
-  ---> 0 or more repetitions (non greedy)
?  ---> 0 or 1 repetition only

Anchors

The standard "anchor" characters apply:

^  ---> anchor to start of subject string
$  ---> anchor to end of subject string

Practical

You can also use round brackets to specify "captures", similar to normal regular expressions:

You see (.*) here

Here, whatever matches (.*) becomes the first capture.

You can also refer to matched substrings (captures) later on in an expression:

printr find ("You see dogs and dogs", "You see (.*) and %1"); // 0    20    dogs
printr find ("You see dogs and cats", "You see (.*) and %1"); // NULL

This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).

As a special case, an empty capture returns its own position in the string as the captured value, e.g.

printr find ("You see dogs and cats", "You .* ()dogs .*"); // 1    21    9

What this is saying is that the word "dogs" starts at column 9.

Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:

printr find ("I see a (big fish (swimming) in the pond) here", "%b()"); // 8    40

After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".
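
To see the counting rule on a string with two separate groups, here is a minimal sketch using match(); the input string is made up for illustration.

```sputnik
// The first ( opens the pair; the nested (b) raises and lowers the count,
// so the match ends at the ) that brings the count back to 0
printr match("f(a(b)c) + g(d)", '%b()');
// Prints
// (a(b)c)
```

The second group "(d)" is not part of the result because the match stops as soon as the first group is balanced.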

Frontier Pattern Demo

The "frontier" pattern %f is little known, yet it is very powerful.

This section demonstrates how it can be used.

Let's consider a fairly straightforward task: to find all words in upper-case in a string.

First attempt: %u+

gsub ("the QUICK brown fox", "%u+", function($w) { say $w;});
// QUICK

That looks OK, found a word in all caps. But look at this:

gsub ("the QUICK BROwn fox", "%u+", function($w) { say $w;});
// QUICK
// BRO

We also found a word which was partially capitalised.

Second attempt: %u+%A

gsub ("the QUICK BROwn fox", "%u+%A", function($w) { say $w;});
// QUICK

The detection of non-letters correctly excluded the partially capitalised word. But wait! How about this:

gsub ("the QUICK brOWN fox", "%u+%A", function($w) { say $w;});
// QUICK
// OWN

We also have a second problem:

gsub ("the QUICK. brown fox", "%u+%A", function($w) { say $w;});
// QUICK.

The punctuation after the word is now part of the captured string, which is not wanted.

Third attempt: %A%u+%A

gsub ("the QUICK brOWN FOx jumps", "%A%u+%A", function($w) { say $w;});
//  QUICK

This correctly excludes the two partially capitalised words, but still leaves the punctuation in, like this:

gsub ("the (QUICK) brOWN FOx jumps", "%A%u+%A", function($w) { say $w;});
// (QUICK)

Also, there is another problem, apart from capturing the non-letters at the sides. Look at this:

gsub ("THE (QUICK) brOWN FOx JUMPS", "%A%u+%A", function($w) { say $w;});
// (QUICK)

The correctly capitalised words at the start and end of the string are not detected.

The solution: The Frontier pattern: %f

gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", function($w) { say $w;});
// THE
// QUICK
// JUMPS

The frontier pattern %f followed by a set detects the transition from "not in set" to "in set". The source string boundary qualifies as "not in set" so it also matches the word at the very start of the string to be matched.

The second frontier pattern is also matched at the end of the string, so our final word is also captured.
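
The same frontier pattern also works with Find when only the position of a word is wanted. This is a sketch under the assumption that Find handles %f exactly as gsub does above; the positions follow from the 0-based convention described in the Remarks.

```sputnik
my $Text = "THE (QUICK) brOWN FOx JUMPS";
// Without an offset the first fully capitalised word is "THE"
printr Find($Text, "%f[%a]%u+%f[%A]");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 2
// )
 
// Starting the search at offset 3 skips "THE" and finds "QUICK"
printr Find($Text, "%f[%a]%u+%f[%A]", 3);
// Prints
// Array
// (
//     [0] => 5
//     [1] => 9
// )
```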

Alternatives without the frontier pattern

Without the frontier pattern, one might resort to madness like this:

$s = "THE (QUICK) brOWN FOx JUMPS";
$s = ("\0$s\0")->>gsub("(%A)(%u)", "%1\0%2")
               ->>gsub("(%u)(%A)", "%1\0%2");
$s = $s->>gsub("%z(%u+)%z", function($w) { say $w;});
// THE
// QUICK
// JUMPS

Limitations

Especially if you're used to other languages with regular expressions, you might expect to be able to do stuff like this:

'(foo)+'     ---> match the string "foo" repeated one or more times
'(foo|bar)'  ---> match either the string "foo" or the string "bar"

These patterns do not support this: only single characters or character classes can be repeated or chosen between, not sub-patterns or strings. The solution is either to use multiple patterns with some custom logic, or to use Sputnik's regex features such as m// and s//, which are very similar to the ones found in Perl.

If you have something really complex to parse and neither these patterns nor regex cut it, you can always try Sputnik's parser engine, which supports complex grammars.

Patterns Tutorial

Patterns can match sequences of characters, where each character can be optional or repeated multiple times. If you're used to other languages that have regular expressions to match text, remember that these functions provide pattern matching that is not the same: it is more limited and has a different syntax. It is not as robust as Sputnik's regular expression library or its parser generator; however, it is a nice subset.

Introduction to patterns

First we will use the find() function, which finds the first occurrence of a pattern in a string and returns the start and end indices of the first and last characters that matched:

// find 1st occurrence of 'an' (letters are matched literally)
printr find('banana', 'an');
// Prints
// Array
// (
//     [0] => 1
//     [1] => 2
// )
 
// 'sputnik' will not be found so NULL is returned
printr find('banana', 'sputnik');

But literally matching text isn't that useful, so patterns have the concept of character classes. A character class is a pattern that matches one of a set of characters. For example, . is a character class that matches any character:

printr find("abcdefg", 'b..');
// Prints
// Array
// (
//     [0] => 1
//     [1] => 3
// )

We can now use these indices to get the matched text, but there's a better way: the match() function. It returns the matched text, or null if the pattern is not found. (find() also returns the text of any captures, but only after the indices; match() returns only the text.)

Patterns have a few pre-defined classes, use them as "%x", where "x" is the letter identifying the class:

// %d matches a digit
printr match("foo 123 bar", '%d%d%d');
// Prints
// 123
 
// %u matches an uppercase letter
printr match("text with an Uppercase letter", '%u'); 
// Prints
// U

Making the letter after the % uppercase inverts the class, so %D will match all non-digit characters. See the Patterns (above) for a list of all pre-defined classes.

You can also create your own classes by wrapping a group of characters in square brackets. This will match one of the characters. If the first character inside the brackets is ^, then it will match a character not in the group.

printr match("abcd", '[bc][bc]');
// Prints
// bc
printr match("abcd", '[^ad]');
// Prints
// b
 
// you can specify a range of characters using -
printr match("123", '[0-9]');
// Prints
// 1

Repetition

Even with character classes this is still very limiting, because we can only match strings with a fixed length. To solve this, patterns support these four repetition operators:

* Match the previous character (or class) zero or more times, as many times as possible.
+ Match the previous character (or class) one or more times, as many times as possible.
- Match the previous character (or class) zero or more times, as few times as possible.
? Make the previous character (or class) optional.

We'll start with ?, since it's the simplest:

printr match("examples", 'examples?');
// Prints
// examples
printr match("example", 'examples?');
// Prints
// example
printr match("example", 'examples');
// null

Now an example of +. Note how it's used with a class, so it can match a sequence of different characters:

printr match("this is some text with a number 12345 in it", '%d+');
// Prints
// 12345

Unlike +, * can match nothing:

printr match("one |two| three", '|.*|');
// Prints
// |two|
printr match("one || three", '|.*|');
// Prints
// ||
printr match("one || three", '|.+|');
// null

A common mistake with + and * is not realizing that they match as much as possible, which may not be the desired result. One way to fix this is using -:

printr match("one |two| three |four| five", '|.*|');
// Prints
// |two| three |four|
printr match("one |two| three |four| five", '|.-|');
// Prints
// |two|
 
// another solution can be to not let the contents match the delimiter
printr match("one |two| three |four| five", '|[^|]*|');
// |two|

When using -, you need to remember to "anchor" it from both sides, otherwise it will match nothing (since it tries to match as little as possible):

printr match("abc", 'a.*');
// Prints
// abc
printr match("abc", 'a.-'); // the .- part matches nothing
// Prints
// a
printr match("abc", 'a.-$'); // the $ matches the end of the string
// Prints
// abc
printr match("abc", '^.-b'); // the ^ matches the start of the string
// Prints
// ab

Here we also introduced ^ and $, which match the start and end of the string. They're not just for use with -: prepend ^ to a pattern to make it match at the start, append $ to make it match at the end, or use both to make it match the whole string.
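
A minimal sketch of using both anchors to check that a whole string consists of digits; the input strings are made up for illustration.

```sputnik
// ^ and $ together force the pattern to cover the whole string
printr match("12345", '^%d+$');
// Prints
// 12345
printr match("123a5", '^%d+$');
// null
```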

Finally, you might be thinking how to match all these special characters literally. The solution is to prepend them with a % character:

printr match("%*^", '%%%*%^');
// Prints
// %*^

Captures

What if you want to get certain pieces out of a string of text? This can be done by wrapping parts of a pattern in ( ), and the contents of each of these captures will be returned from match().

// %a: letter %s: whitespace
printr match("foo: 123 bar: 456", '(%a+):%s*(%d+)%s+(%a+):%s*(%d+)');
// Prints
// Array
// (
//     [0] => foo
//     [1] => 123
//     [2] => bar
//     [3] => 456
// )

Each capture is returned as a separate result, so this is useful for splitting out values:

my $date = "04/19/64";
my List ( $m, $d, $y ) = match($date, "(%d+)/(%d+)/(%d+)");
say "Month: $m";
say "Day: $d";
say "Year: $y";
print("19" . $y);
// Prints
// Month: 04
// Day: 19
// Year: 64
// 1964

Example

Example of using %b, the balanced match

// The %b[] will match everything starting from [
// and stopping at the ] that balances it
// it will also include the [] in the return
$line = "int[4547] height";
printr match($line, '%b[]');
// Prints:
// [4547]
 
// Of course you can set the start and end to be anything with %b
$line = "int z4547q height";
printr match($line, '%bzq');
// Prints:
// z4547q

Search for raw text in a string (No patterns)

my $Test = "Hello cat world!";
printr Find($Test, "cat", 0, true);
// Prints
// Array
// (
//     [0] => 6
//     [1] => 8
// )
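
To see how raw-text mode differs from pattern mode, here is a small sketch that searches for a dot. It assumes, as the example above does, that passing true as the fourth argument disables pattern handling; the input string is made up for illustration.

```sputnik
my $Test = "a.b";
// As a pattern, . matches any character, so the first character is found
printr Find($Test, "."); //    0    0
 
// As raw text, . only matches a literal dot
printr Find($Test, ".", 0, true); //    1    1
 
// The same result in pattern mode needs the dot escaped with %
printr Find($Test, "%."); //    1    1
```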

Search for raw text in a string but handle the result manually

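my $Test = "the quick brown fox";
my List ($Pos, $PosEnd) = Find($Test, "brown", 0, true);
say "Position: $Pos";
say "EndPosition: $PosEnd";
// Both positions are inclusive, so the length is PosEnd - Pos + 1
say "String: " . substr($Test, $Pos, $PosEnd - $Pos + 1);
// Prints
// Position: 10
// EndPosition: 14
// String: brown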

Use a pattern to find it. Note that only the start and end positions are returned when the pattern contains no capture groups.

my $Test = "Hello cat world!";
printr Find($Test, "cat");
// Prints
// Array
// (
//     [0] => 6
//     [1] => 8
// )

Another pattern match with no groups

my $Test = "the quick brown fox";
printr Find($Test, "quick");
// Prints
// Array
// (
//     [0] => 4
//     [1] => 8
// )

A group capture pattern this time

my $Test = "the quick brown fox";
printr Find($Test, "(%a+)");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 2
//     [2] => the
// )

Another group capture pattern this time

my $Test = "the quick brown fox";
printr Find($Test, "(%a+)", 10);
// Prints
// Array
// (
//     [0] => 10
//     [1] => 14
//     [2] => brown
// )

Another pattern, but this time we will handle the result ourselves

my $Test = "the quick brown fox";
my List ($Pos, $PosEnd) = Find($Test, "(%a+)", 10);
say "Position: $Pos";
say "PosEnd: $PosEnd";
// Both positions are inclusive, so the length is PosEnd - Pos + 1
say "String: " . substr($Test, $Pos, $PosEnd - $Pos + 1);
// Prints
// Position: 10
// PosEnd: 14
// String: brown

What happens when no match is found with regular expressions

my $Test = "the quick brown fox";
printr vardump(Find($Test, "fruit"));
// Prints
// NULL

What happens when no match is found without regular expressions

my $Test = "the quick brown fox";
printr vardump(Find($Test, "fruit", 0, true));
// Prints
// NULL

More examples

my $Test = "You see dogs and dogs";
printr Find($Test, "You see (.*)");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 20
//     [2] => dogs and dogs
// )

You can also refer to matched substrings (captures) later on in an expression:

my $Test = "You see dogs and dogs";
printr Find($Test, "You see (.*) and %1");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 20
//     [2] => dogs
// )

As shown here when the matched substring is not found NULL is returned

my $Test = "You see dogs and cats";
printr vardump(Find($Test, "You see (.*) and %1"));
// Prints
// NULL

Another example of referring to matched substrings (captures) later on in an expression:

my $Test = "You sir see dogs and dogs = sir";
printr Find($Test, "You (.*) see (.*) and %2 = %1");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 30
//     [2] => sir
//     [3] => dogs
// )

Examples using the Offset

say "Offset -1";
printr Find("abc", "(%a+)", -1); //    2    2    c
 
say "Offset -2";
printr Find("abc", "(%a+)", -2); //    1    2    bc
 
say "Offset -3";
printr Find("abc", "(%a+)", -3); //    0    2    abc
 
 
say "Offset -4";
// A negative Offset that reaches back beyond the start of the string
// is clamped to the start of the string, so a full search is done.
// This is unlike a positive Offset that is out of bounds,
// which returns NULL
printr Find("abc", "(%a+)", -4); //    0    2    abc
 
say "Offset -100";
// Even if it is -100 etc
printr Find("abc", "(%a+)", -100); //    0    2    abc
 
say "Offset 0";
printr Find("abc", "(%a+)", 0); //    0    2    abc
 
say "Offset 1";
printr Find("abc", "(%a+)", 1); //    1    2    bc
 
say "Offset 2";
printr Find("abc", "(%a+)", 2); //    2    2    c
 
say "Offset 3";
// Out of bounds positive Offset will return NULL
printr Find("abc", "(%a+)", 3); // NULL
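
Because the returned positions are relative to the whole string even when an offset is given, the end of one match can be used to resume the search. This is a minimal sketch with a made-up input string and hypothetical variable names.

```sputnik
my $Text = "cat, cat, cat";
my List ($FirstStart, $FirstEnd) = Find($Text, "cat");
// Resume one position past the end of the previous match
my List ($NextStart, $NextEnd) = Find($Text, "cat", $FirstEnd + 1);
say "First: $FirstStart to $FirstEnd";
say "Next: $NextStart to $NextEnd";
// Prints
// First: 0 to 2
// Next: 5 to 7
```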
