Core Function Find

From Sputnik Wiki

Revision as of 09:30, 5 August 2014

Find( <string>, <needle>, <offset>, <plain> )

Pattern Item:

A pattern item can be

a single character class, which matches any single character in the class;
a single character class followed by '*', which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
a single character class followed by '+', which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
a single character class followed by '?', which matches 0 or 1 occurrence of a character in the class;
%n, for n between 1 and 9; such item matches a substring equal to the n-th captured string (see below);
%bxy, where x and y are two distinct characters; such item matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, the item %b() matches expressions with balanced parentheses.
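
The counting rule for %bxy can be illustrated outside Sputnik. The following is a minimal Python sketch of that rule, not Sputnik's implementation; the function name match_balanced and its 0-based, inclusive return values are assumptions made for illustration.

```python
def match_balanced(text, open_ch, close_ch, start=0):
    """Model of %bxy: return (first, last) 0-based inclusive indices of the
    first balanced run at or after start, or None if there is none."""
    first = text.find(open_ch, start)
    while first != -1:
        depth = 0
        for i in range(first, len(text)):
            ch = text[i]
            if ch == open_ch:
                depth += 1          # count +1 for an x
            elif ch == close_ch:
                depth -= 1          # count -1 for a y
                if depth == 0:      # first y where the count reaches 0
                    return first, i
        # This opener is never closed; retry from the next opener.
        first = text.find(open_ch, first + 1)
    return None


text = "I see a (big fish (swimming) in the pond) here"
first, last = match_balanced(text, "(", ")")
print(text[first:last + 1])  # (big fish (swimming) in the pond)
```

The nested "(swimming)" raises the count to 2 and lowers it back to 1, so the match only ends at the final ")".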

Pattern:

A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.

Captures:

A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3.

As a special case, the empty capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
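
For readers more familiar with conventional regular expressions, the two capture rules above can be approximated with Python's re module. This is only an analogy: re has no position captures, so the hypothetical helper position_captures derives them from the match boundaries, and it uses 1-based numbering purely to reproduce the "3 and 5" quoted above.

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# just like the captures in "(a*(.)%w(%s*))".
nested = re.search(r"(a*(.)\w(\s*))", "aab1  x")
groups = nested.groups()  # ("aab1  ", "b", "  ")


def position_captures(text, literal):
    """Approximate "()literal()": return the 1-based positions just
    before and just after the first occurrence of literal, or None."""
    m = re.search(re.escape(literal), text)
    if m is None:
        return None
    return m.start() + 1, m.end() + 1


print(position_captures("flaaap", "aa"))  # (3, 5)
```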

A pattern cannot contain embedded zeros. Use %z instead.

The standard patterns you can search for are:

 . --- (a dot) represents all characters. 
%a --- all letters. 
%c --- all control characters. 
%d --- all digits. 
%l --- all lowercase letters. 
%p --- all punctuation characters. 
%s --- all space characters. 
%u --- all uppercase letters. 
%w --- all alphanumeric characters. 
%x --- all hexadecimal digits. 
%z --- the character with hex representation 0x00 (null). 
%% --- a single '%' character.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.

The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.

Important! The uppercase versions of the above represent the complement of the class, e.g. %U represents everything except uppercase letters and %D represents everything except digits.

There are some "magic characters" (such as %) that have special meanings. These are:

^ $ ( ) % . [ ] * + - ?

If you want to use any of these in a pattern as a literal character, you must precede it with a % symbol.

e.g. %% matches a single %
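
When a pattern is assembled from arbitrary text, the escaping rule can be applied mechanically. The sketch below is a hypothetical Python helper (escape_pattern is not a Sputnik function) that prefixes each magic character listed above with %. If the whole needle is literal text, passing true for the plain parameter, as in the first example further down, avoids the need for escaping altogether.

```python
MAGIC = "^$()%.[]*+-?"  # the magic characters listed above


def escape_pattern(literal):
    """Prefix every magic character with % so that it matches itself."""
    return "".join("%" + ch if ch in MAGIC else ch for ch in literal)


print(escape_pattern("100%"))   # 100%%
print(escape_pattern("[x-y]"))  # %[x%-y%]
```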

As with normal regular expressions, you can build your own character classes using square brackets, e.g.

[abc] ---> matches a, b or c
[a-z] ---> matches lowercase letters (usually the same as %l, but this depends on the locale)
[^abc] ---> matches anything except a, b or c
[%a%d] ---> matches all letters and digits
[%a%d_] ---> matches all letters, digits and underscore
[%[%]] ---> matches square brackets (had to escape them with %)

The repetition characters are:

+  ---> 1 or more repetitions (greedy)
*  ---> 0 or more repetitions (greedy)
-  ---> 0 or more repetitions (non greedy)
?  ---> 0 or 1 repetition only
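
The difference between the greedy '*' and the non-greedy '-' is easiest to see side by side. The sketch below uses Python's re module as a stand-in, where the non-greedy form is written *? instead of -. The patterns correspond to "%(.*%)" and "%(.-%)" in the syntax described here.

```python
import re

text = "(a)(b)"

# Greedy: .* takes as much as possible, so the match runs to the last ")".
greedy = re.search(r"\(.*\)", text).group(0)

# Non-greedy: .*? takes as little as possible, so it stops at the first ")".
lazy = re.search(r"\(.*?\)", text).group(0)

print(greedy)  # (a)(b)
print(lazy)    # (a)
```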

The standard "anchor" characters apply:

^  ---> anchor to start of subject string
$  ---> anchor to end of subject string

You can also use round brackets to specify "captures", similar to normal regular expressions:

You see (.*) here

Here, whatever matches (.*) becomes the first capture.

You can also refer to matched substrings (captures) later on in an expression:

printr find ("You see dogs and dogs", "You see (.*) and %1"); // 0    20    dogs
printr find ("You see dogs and cats", "You see (.*) and %1"); // NULL

This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).
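
The same back-reference idea exists in conventional regular expressions, where %1 is written \1. The following Python sketch is an analogy only. Note that Python reports the end of a match as an exclusive index, which differs from the positions Find returns.

```python
import re

PATTERN = r"You see (.*) and \1"  # \1 must repeat whatever (.*) captured

hit = re.search(PATTERN, "You see dogs and dogs")
miss = re.search(PATTERN, "You see dogs and cats")

print(hit.group(1))  # dogs
print(miss)          # None
```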

As a special case, an empty capture returns its own position in the string as the captured value, e.g.

printr find ("You see dogs and cats", "You .* ()dogs .*"); // 1    21    9

What this is saying is that the word "dogs" starts at column 9.

Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:

printr find ("I see a (big fish (swimming) in the pond) here", "%b()"); // 9    41

After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".

Example

Search for raw text in a string (No patterns)

my $Test = "Hello cat world!";
printr Find($Test, "cat", 0, true);
// Prints
// Array
// (
//     [0] => 6
//     [1] => 8
// )

Search for raw text in a string but handle the result manually

my $Test = "the quick brown fox";
my List ($Pos, $PosEnd) = Find($Test, "brown", 0, true);
say "Position: $Pos";
say "EndPosition: $PosEnd";
say "String: $Test";
// Prints
// Position: 10
// EndPosition: 14
// String: the quick brown fox
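
The two plain-text examples above suggest that Find reports a 0-based start position and an inclusive end position. The Python sketch below models that convention as inferred from the printed results; find_plain is a hypothetical name and this is not Sputnik's implementation.

```python
def find_plain(text, needle, offset=0):
    """Model of Find(text, needle, offset, true): return [start, end] with a
    0-based start and an inclusive end, or None when there is no match."""
    start = text.find(needle, offset)
    if start == -1:
        return None
    return [start, start + len(needle) - 1]


test = "the quick brown fox"
start, end = find_plain(test, "brown")
print(start, end)            # 10 14
print(test[start:end + 1])   # brown
```

Because the end position is inclusive, the length of the match is end - start + 1.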

Use a pattern to find it. Note that only the start and end positions are returned when the pattern contains no capture groups

my $Test = "Hello cat world!";
printr Find($Test, "cat");
// Prints
// Array
// (
//     [0] => 6
//     [1] => 8
// )

Another pattern match with no groups

my $Test = "the quick brown fox";
printr Find($Test, "quick");
// Prints
// Array
// (
//     [0] => 4
//     [1] => 8
// )

A group capture pattern this time

my $Test = "the quick brown fox";
printr Find($Test, "(%a+)");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 2
//     [2] => the
// )

Another group capture pattern, this time starting the search at offset 10

my $Test = "the quick brown fox";
printr Find($Test, "(%a+)", 10);
// Prints
// Array
// (
//     [0] => 10
//     [1] => 14
//     [2] => brown
// )

Another pattern, but this time we will handle the capture ourselves

my $Test = "the quick brown fox";
my List ($Pos, $PosEnd) = Find($Test, "(%a+)", 10);
say "Position: $Pos";
say "PosEnd: $PosEnd";
say "String: " . substr($Test, $Pos, ($PosEnd - $Pos) + 1);
// Prints
// Position: 10
// PosEnd: 14
// String: brown

What happens when no match is found using a pattern

my $Test = "the quick brown fox";
printr vardump(Find($Test, "fruit"));
// Prints
// NULL

What happens when no match is found in plain-text mode (no patterns)

my $Test = "the quick brown fox";
printr vardump(Find($Test, "fruit", 0, true));
// Prints
// NULL

More examples

my $Test = "You see dogs and dogs";
printr Find($Test, "You see (.*)");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 20
//     [2] => dogs and dogs
// )

You can also refer to matched substrings (captures) later on in an expression:

my $Test = "You see dogs and dogs";
printr Find($Test, "You see (.*) and %1");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 20
//     [2] => dogs
// )

As shown here, NULL is returned when the back-referenced substring is not found

my $Test = "You see dogs and cats";
printr vardump(Find($Test, "You see (.*) and %1"));
// Prints
// NULL

Another example of referring to matched substrings (captures) later on in an expression:

my $Test = "You sir see dogs and dogs = sir";
printr Find($Test, "You (.*) see (.*) and %2 = %1");
// Prints
// Array
// (
//     [0] => 0
//     [1] => 30
//     [2] => sir
//     [3] => dogs
// )
