Core Function Find

From Sputnik Wiki

Revision as of 15:56, 6 August 2014 by UberFoX

Find( <string>, <pattern>, <offset>, <plain> )

Pattern Item

A pattern item may be:

a single character class, which matches any single character in the class;
a single character class followed by '*', which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
a single character class followed by '+', which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
a single character class followed by '?', which matches 0 or 1 occurrence of a character in the class;
%n, for n between 1 and 9; such item matches a substring equal to the n-th captured string (see below);
%bxy, where x and y are two distinct characters; such item matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, the item %b() matches expressions with balanced parentheses.
%f[set], a frontier pattern; such item matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set. The set set is interpreted as previously described. The beginning and the end of the subject are handled as if they were the character '\0'.

Pattern

A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.

Captures

A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3.

As a special case, the empty capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
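The numbering rule can be tried out in any engine that numbers groups by their opening parenthesis. The sketch below uses Python's re module as a stand-in, not Sputnik itself. It assumes the classes are written \w and \s instead of %w and %s. Python has no position capture, so the helper position_captures (a hypothetical name) converts match offsets to 1-based positions by hand to mirror the "()aa()" example.

```python
import re

def numbered_captures(pattern, subject):
    """Return the captures of the first match, ordered by left parenthesis."""
    m = re.search(pattern, subject)
    return list(m.groups()) if m else None

def position_captures(literal, subject):
    """Emulate "()literal()": 1-based positions just before and just after the match."""
    m = re.search(re.escape(literal), subject)
    if m is None:
        return None
    return (m.start() + 1, m.end() + 1)

# Capture 1 is the whole outer group, capture 2 the ".", capture 3 the spaces.
print(numbered_captures(r"(a*(.)\w(\s*))", "aaxy  z"))  # ['aaxy  ', 'x', '  ']
print(position_captures("aa", "flaaap"))                # (3, 5)
```

The outer parenthesis opens first, so it gets number 1 even though it closes last.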

Standard patterns

The standard patterns you can search for are:

 . --- (a dot) represents all characters. 
%a --- all letters. 
%c --- all control characters. 
%d --- all digits. 
%l --- all lowercase letters. 
%p --- all punctuation characters. 
%s --- all space characters. 
%u --- all uppercase letters. 
%w --- all alphanumeric characters. 
%x --- all hexadecimal digits. 
%z --- the character with hex representation 0x00 (null). 
%% --- a single '%' character.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).

A pattern cannot contain embedded zeros. Use %z instead.

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.

The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.

Important: the uppercase versions of the classes above represent the complement of the class, e.g. %U represents everything except uppercase letters and %D represents everything except digits.

There are some "magic characters" (such as %) that have special meanings. These are:

^ $ ( ) % . [ ] * + - ?

To use any of these in a pattern as a literal character, precede it with a % symbol.

e.g. %% matches a single %, and %. matches a literal dot.
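When the text to search for comes from user input, the escaping can be automated. The helper below is a minimal Python sketch. The name escape_pattern is hypothetical and not part of Sputnik. It simply prefixes every magic character listed above with %.

```python
# The magic characters listed above.
MAGIC = "^$()%.[]*+-?"

def escape_pattern(text):
    """Return text with every magic character preceded by '%'."""
    return "".join("%" + ch if ch in MAGIC else ch for ch in text)

print(escape_pattern("100% (sure)?"))  # 100%% %(sure%)%?
print(escape_pattern("a-b.c"))         # a%-b%.c
```

For a purely literal search, passing true as the plain parameter of Find avoids the need for escaping altogether.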

As with normal regular expressions you can build your own pattern classes by using square brackets, eg.

[abc]     ---> matches a, b or c
[a-z]     ---> matches lowercase letters (usually the same as %l, subject to the locale note above)
[^abc]    ---> matches anything except a, b or c
[%a%d]    ---> matches all letters and digits
[%a%d_]   ---> matches all letters, digits and underscore
[%[%]]    ---> matches square brackets (had to escape them with %)
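For readers more familiar with conventional regular expressions, the same sets can be tried in Python's re module. This is only an analogy under stated assumptions: the Python classes are written out for ASCII, while %a and %d follow the current locale, and Python escapes brackets with a backslash instead of %.

```python
import re

# [%a%d_] written as an explicit ASCII set
identifiers = re.findall(r"[A-Za-z0-9_]+", "foo_1 + bar2")

# [^abc] is spelled the same way in both syntaxes
not_abc = re.findall(r"[^abc]", "aXbYc")

# [%[%]] becomes [\[\]] in Python
brackets = re.findall(r"[\[\]]", "a[1]")

print(identifiers)  # ['foo_1', 'bar2']
print(not_abc)      # ['X', 'Y']
print(brackets)     # ['[', ']']
```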

Repetition characters

The repetition characters are:

+  ---> 1 or more repetitions (greedy)
*  ---> 0 or more repetitions (greedy)
-  ---> 0 or more repetitions (non greedy)
?  ---> 0 or 1 repetition only
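The difference between greedy and non-greedy repetition is easiest to see side by side. The sketch below uses Python's re module as a stand-in, on the assumption that Python's lazy quantifier *? behaves like the '-' item described here. It is not Sputnik code.

```python
import re

subject = "<a><b>"

# Greedy: '*' takes the longest possible run, so the match spans both tags.
greedy = re.search(r"<.*>", subject).group(0)

# Non-greedy: Python's '*?' plays the role of '-', stopping at the first '>'.
lazy = re.search(r"<.*?>", subject).group(0)

# '?' allows zero or one occurrence of the preceding character.
optional = re.findall(r"colou?r", "color colour")

# '+' requires at least one occurrence.
digits = re.findall(r"[0-9]+", "a1 b22 c")

print(greedy)    # <a><b>
print(lazy)      # <a>
print(optional)  # ['color', 'colour']
print(digits)    # ['1', '22']
```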

Anchors

The standard "anchor" characters apply:

^  ---> anchor to start of subject string
$  ---> anchor to end of subject string

Practical

You can also use round brackets to specify "captures", similar to normal regular expressions:

You see (.*) here

Here, whatever matches (.*) becomes the first capture.

You can also refer to matched substrings (captures) later on in an expression:

printr find ("You see dogs and dogs", "You see (.*) and %1"); // 0    20    dogs
printr find ("You see dogs and cats", "You see (.*) and %1"); // NULL

This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).
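The same back-reference idea exists in most regex engines, which makes it easy to experiment with. The following Python sketch is an analogue, not Sputnik code: \1 stands in for %1, and the hypothetical helper find_with_backref reports a 0-based start and an inclusive end, the convention used by the Find examples in the Example section of this page.

```python
import re

def find_with_backref(subject):
    """Match 'You see X and X' and return (start, inclusive end, X), or None."""
    m = re.search(r"You see (.*) and \1", subject)
    if m is None:
        return None
    # m.end() is exclusive in Python, so subtract 1 for an inclusive end.
    return (m.start(), m.end() - 1, m.group(1))

print(find_with_backref("You see dogs and dogs"))  # (0, 20, 'dogs')
print(find_with_backref("You see dogs and cats"))  # None
```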

As a special case, an empty capture, (), captures its own position in the string instead of a substring, e.g.

printr find ("You see dogs and cats", "You .* ()dogs .*"); // 1    21    9

What this is saying is that the word "dogs" starts at column 9.

Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:

printr find ("I see a (big fish (swimming) in the pond) here", "%b()"); // 9    41

After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".
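The counting rule behind %b can be written out by hand. The function below is a minimal Python sketch of that rule, not Sputnik's actual implementation. The name find_balanced is hypothetical, and the positions it returns use Python's 0-based indexing with an inclusive end.

```python
def find_balanced(subject, open_ch, close_ch):
    """Find the first balanced open_ch...close_ch span.

    Returns (start, inclusive end, text) or None. open_ch and close_ch
    are assumed to be two distinct characters.
    """
    for start, ch in enumerate(subject):
        if ch != open_ch:
            continue
        depth = 0
        for end in range(start, len(subject)):
            if subject[end] == open_ch:
                depth += 1          # +1 for every opening character
            elif subject[end] == close_ch:
                depth -= 1          # -1 for every closing character
                if depth == 0:      # back at the top level: match complete
                    return (start, end, subject[start:end + 1])
        # This opener was never closed; try the next one.
    return None

print(find_balanced("I see a (big fish (swimming) in the pond) here", "(", ")"))
# (8, 40, '(big fish (swimming) in the pond)')
```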

Frontier Pattern Demo

The "frontier" pattern %f is little known, yet it is very powerful.

This section demonstrates its usefulness through a worked example.

Let's consider a fairly straightforward task: to find all words in upper-case in a string.

First attempt: %u+

gsub ("the QUICK brown fox", "%u+", function($w) { say $w;});
// QUICK

That looks OK, found a word in all caps. But look at this:

gsub ("the QUICK BROwn fox", "%u+", function($w) { say $w;});
// QUICK
// BRO

We also found a word which was partially capitalised.

Second attempt: %u+%A

gsub ("the QUICK BROwn fox", "%u+%A", function($w) { say $w;});
// QUICK

The detection of non-letters correctly excluded the partially capitalised word. But wait! How about this:

gsub ("the QUICK brOWN fox", "%u+%A", function($w) { say $w;});
// QUICK
// OWN

We also have a second problem:

gsub ("the QUICK. brown fox", "%u+%A", function($w) { say $w;});
// QUICK.

The punctuation after the word is now part of the captured string, which is not wanted.

Third attempt: %A%u+%A

gsub ("the QUICK brOWN FOx jumps", "%A%u+%A", function($w) { say $w;});
//  QUICK

This correctly excludes the two partially capitalised words, but still leaves the punctuation in, like this:

gsub ("the (QUICK) brOWN FOx jumps", "%A%u+%A", function($w) { say $w;});
// (QUICK)

Also, there is another problem, apart from capturing the non-letters at the sides. Look at this:

gsub ("THE (QUICK) brOWN FOx JUMPS", "%A%u+%A", function($w) { say $w;});
// (QUICK)

The correctly capitalised words at the start and end of the string are not detected.

The solution: The Frontier pattern: %f

gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", function($w) { say $w;});
// THE
// QUICK
// JUMPS

The frontier pattern %f followed by a set detects the transition from "not in set" to "in set". The source string boundary qualifies as "not in set" so it also matches the word at the very start of the string to be matched.

The second frontier pattern is also matched at the end of the string, so our final word is also captured.
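Engines without %f usually express the same "transition" test with lookarounds, which may help readers coming from Perl-style regex. The Python sketch below is an analogue of %f[%a]%u+%f[%A], assuming ASCII letters only (the %a and %u classes follow the locale). The name upper_case_words is hypothetical.

```python
import re

# (?<![A-Za-z])  previous character is not a letter  (like %f[%a])
# [A-Z]+         one or more uppercase letters       (like %u+)
# (?![A-Za-z])   next character is not a letter      (like %f[%A])
ALL_CAPS_WORD = re.compile(r"(?<![A-Za-z])[A-Z]+(?![A-Za-z])")

def upper_case_words(text):
    """Return every word written entirely in uppercase ASCII letters."""
    return ALL_CAPS_WORD.findall(text)

print(upper_case_words("THE (QUICK) brOWN FOx JUMPS"))  # ['THE', 'QUICK', 'JUMPS']
print(upper_case_words("the QUICK. brown fox"))         # ['QUICK']
```

Like the frontier pattern, both lookarounds match an empty string, so the surrounding punctuation never becomes part of the result.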

Alternatives without the frontier pattern

Without the frontier pattern, one might resort to madness like this:

$s = "THE (QUICK) brOWN FOx JUMPS";
$s = ("\0$s\0")->>gsub("(%A)(%u)", "%1\0%2")
               ->>gsub("(%u)(%A)", "%1\0%2");
$s = $s->>gsub("%z(%u+)%z", function($w) { say $w;});
// THE
// QUICK
// JUMPS

Limitations

Especially if you're used to other languages with regular expressions, you might expect to be able to do stuff like this:

'(foo)+'    -- match the string "foo" repeated one or more times
'(foo|bar)' -- match either the string "foo" or the string "bar"

These patterns do not support this: only single character classes can be repeated or chosen between, not sub-patterns or strings. The solution is either to use multiple patterns with some custom logic, or to use Sputnik's Regex features such as m// and s//, which are very similar to the ones found in Perl.

If you have something really complex to parse and neither these patterns nor regex are enough, you can try Sputnik's parser engine, which supports complex grammars.
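The "multiple patterns plus custom logic" route might look like the following. This is a Python sketch of the idea only. The helper names find_first_of and count_repeats are hypothetical, and the searches are plain substring searches, not pattern matches.

```python
def find_first_of(subject, alternatives, offset=0):
    """Emulate '(foo|bar)': return (start, inclusive end, text) of the
    earliest alternative found at or after offset, or None."""
    best = None
    for alt in alternatives:
        pos = subject.find(alt, offset)
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, pos + len(alt) - 1, alt)
    return best

def count_repeats(subject, word, start=0):
    """Emulate '(foo)+': count back-to-back copies of word beginning at start."""
    if not word:
        return 0
    count = 0
    while subject.startswith(word, start + count * len(word)):
        count += 1
    return count

print(find_first_of("a bar then foo", ["foo", "bar"]))  # (2, 4, 'bar')
print(count_repeats("foofoofoobar", "foo"))             # 3
```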

Example

Search for raw text in a string (No patterns)

my $Test = "Hello cat world!";
printr Find($Test, "cat", 0, true);
// Prints
// Array
// (
//     [0] => 6
//     [1] => 8
// )
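The return shape of a plain search (start position, then inclusive end position, both 0-based) can be reproduced with an ordinary substring search. The following Python sketch mirrors the output above. The name plain_find is hypothetical and this is not how Sputnik implements Find.

```python
def plain_find(subject, needle, offset=0):
    """Return [start, inclusive end] of needle in subject, or None if absent."""
    start = subject.find(needle, offset)
    if start == -1:
        return None
    return [start, start + len(needle) - 1]

print(plain_find("Hello cat world!", "cat"))       # [6, 8]
print(plain_find("the quick brown fox", "fruit"))  # None
```

Because the end position is inclusive, the length of the match is end - start + 1.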

Search for raw text in a string but handle the result manually

my $Test = "the quick brown fox";
my List ($Pos, $PosEnd) = Find($Test, "brown", 0, true);
say "Position: $Pos";
say "EndPosition: $PosEnd";
// The end position is inclusive, so the length is $PosEnd - $Pos + 1
say "String: " . substr($Test, $Pos, $PosEnd - $Pos + 1);
// Prints
// Position: 10
// EndPosition: 14
// String: brown

Use a pattern to find it. Note that only the start and end positions are returned when the pattern contains no capture groups.

my $Test = "Hello cat world!";
printr Find($Test, "cat");
// Prints
// Array
// (
//     [0] => 6
//     [1] => 8
// )

Another pattern match with no groups

my $Test = "the quick brown fox";
printr Find($Test, "quick");
// Prints
// Array
// (
//     [0] => 4
//     [1] => 8
// )

A group capture pattern this time

my $Test = "the quick brown fox";
printr Find($Test, "(%a+)");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 2
//     [2] => the
// )

Another group capture pattern, this time starting the search at offset 10

my $Test = "the quick brown fox";
printr Find($Test, "(%a+)", 10);
// Prints
// Array
// (
//     [0] => 10
//     [1] => 14
//     [2] => brown
// )

Another pattern, but this time we handle the capture ourselves

my $Test = "the quick brown fox";
my List ($Pos, $PosEnd) = Find($Test, "(%a+)", 10);
say "Position: $Pos";
say "PosEnd: $PosEnd";
// The end position is inclusive, so the length is $PosEnd - $Pos + 1
say "String: " . substr($Test, $Pos, $PosEnd - $Pos + 1);
// Prints
// Position: 10
// PosEnd: 14
// String: brown

What happens when no match is found using a pattern

my $Test = "the quick brown fox";
printr vardump(Find($Test, "fruit"));
// Prints
// NULL

What happens when no match is found using a plain search (no patterns)

my $Test = "the quick brown fox";
printr vardump(Find($Test, "fruit", 0, true));
// Prints
// NULL

More examples

my $Test = "You see dogs and dogs";
printr Find($Test, "You see (.*)");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 20
//     [2] => dogs and dogs
// )

You can also refer to matched substrings (captures) later on in an expression:

my $Test = "You see dogs and dogs";
printr Find($Test, "You see (.*) and %1");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 20
//     [2] => dogs
// )

As shown here, NULL is returned when the back-referenced capture does not match

my $Test = "You see dogs and cats";
printr vardump(Find($Test, "You see (.*) and %1"));
// Prints
// NULL

Another example of referring to matched substrings (captures) later on in an expression:

my $Test = "You sir see dogs and dogs = sir";
printr Find($Test, "You (.*) see (.*) and %2 = %1");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 30
//     [2] => sir
//     [3] => dogs
// )

Examples using the Offset

say "Offset -1";
printr Find("abc", "(%a+)", -1); //    2    2    c
 
say "Offset -2";
printr Find("abc", "(%a+)", -2); //    1    2    bc
 
say "Offset -3";
printr Find("abc", "(%a+)", -3); //    0    2    abc
 
 
say "Offset -4";
// A negative Offset that reaches back past the start of the string
// is clamped to the start of the string, so a full search is done.
// This is unlike a positive Offset that is out of bounds,
// which returns NULL
printr Find("abc", "(%a+)", -4); //    0    2    abc
 
say "Offset -100";
// Even if it is -100 etc
printr Find("abc", "(%a+)", -100); //    0    2    abc
 
say "Offset 0";
printr Find("abc", "(%a+)", 0); //    0    2    abc
 
say "Offset 1";
printr Find("abc", "(%a+)", 1); //    1    2    bc
 
say "Offset 2";
printr Find("abc", "(%a+)", 2); //    2    2    c
 
say "Offset 3";
// Out of bounds positive Offset will return NULL
printr Find("abc", "(%a+)", 3); // NULL
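The offset rules shown above can be summarised in a few lines of code. The function below is a Python emulation written from the documented outputs, not Sputnik's implementation. The name find_from is hypothetical, and [A-Za-z]+ is used as an ASCII-only stand-in for %a+.

```python
import re

def find_from(subject, pattern, offset=0):
    """Return (start, inclusive end, captures...) or None, following the offset rules."""
    if offset < 0:
        # Negative offsets count back from the end and are clamped to the start.
        offset = max(len(subject) + offset, 0)
    elif offset >= len(subject):
        # A positive offset that is out of bounds gives no match.
        return None
    m = re.compile(pattern).search(subject, offset)
    if m is None:
        return None
    return (m.start(), m.end() - 1) + m.groups()

for off in (-1, -2, -3, -4, -100, 0, 1, 2, 3):
    print(off, find_from("abc", r"([A-Za-z]+)", off))
```

Running the loop reproduces the results listed in the comments above, including NULL (None) for offset 3.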
