Core Function Regex Match

From Sputnik Wiki
====== Supported Named Blocks ======
 
 
Sputnik provides the named blocks listed in the following tables.
 
 
The set of supported named blocks is based on Unicode 4.0 and Perl 5.6.
 
 
<pre>
 
alnum letters and digits
 
alpha letters
 
ascii character codes 0 - 127
 
blank space or tab only
 
cntrl control characters
 
digit decimal digits (same as \d)
 
graph printing characters, excluding space
 
lower lower case letters
 
print printing characters, including space
 
punct printing characters, excluding letters and digits
 
space white space (not quite the same as \s)
 
upper upper case letters
 
word "word" characters (same as \w)
 
xdigit hexadecimal digits
 
// You use these like
 
// say ("CAT" =~ m/\p{alnum}/);
 
// \p{alnum}
 
</pre>
 
 
You can also use the [[Macros|Macros]] for even more character classes.
 
 
Additional classes are:
 
<pre>
 
IsBasicLatin ---> Unicode Range:0000 - 007F
 
IsLatin-1Supplement ---> Unicode Range:0080 - 00FF
 
IsLatinExtended-A ---> Unicode Range:0100 - 017F
 
IsLatinExtended-B ---> Unicode Range:0180 - 024F
 
IsIPAExtensions ---> Unicode Range:0250 - 02AF
 
IsSpacingModifierLetters ---> Unicode Range:02B0 - 02FF
 
IsCombiningDiacriticalMarks ---> Unicode Range:0300 - 036F
 
IsGreek ---> Unicode Range:0370 - 03FF
 
IsCyrillic ---> Unicode Range:0400 - 04FF
 
IsCyrillicSupplement ---> Unicode Range:0500 - 052F
 
IsArmenian ---> Unicode Range:0530 - 058F
 
IsHebrew ---> Unicode Range:0590 - 05FF
 
IsArabic ---> Unicode Range:0600 - 06FF
 
IsSyriac ---> Unicode Range:0700 - 074F
 
IsThaana ---> Unicode Range:0780 - 07BF
 
IsDevanagari ---> Unicode Range:0900 - 097F
 
IsBengali ---> Unicode Range:0980 - 09FF
 
IsGurmukhi ---> Unicode Range:0A00 - 0A7F
 
IsGujarati ---> Unicode Range:0A80 - 0AFF
 
IsOriya ---> Unicode Range:0B00 - 0B7F
 
IsTamil ---> Unicode Range:0B80 - 0BFF
 
IsTelugu ---> Unicode Range:0C00 - 0C7F
 
IsKannada ---> Unicode Range:0C80 - 0CFF
 
IsMalayalam ---> Unicode Range:0D00 - 0D7F
 
IsSinhala ---> Unicode Range:0D80 - 0DFF
 
IsThai ---> Unicode Range:0E00 - 0E7F
 
IsLao ---> Unicode Range:0E80 - 0EFF
 
IsTibetan ---> Unicode Range:0F00 - 0FFF
 
IsMyanmar ---> Unicode Range:1000 - 109F
 
IsGeorgian ---> Unicode Range:10A0 - 10FF
 
IsHangulJamo ---> Unicode Range:1100 - 11FF
 
IsEthiopic ---> Unicode Range:1200 - 137F
 
IsCherokee ---> Unicode Range:13A0 - 13FF
 
IsUnifiedCanadianAboriginalSyllabics ---> Unicode Range:1400 - 167F
 
IsOgham ---> Unicode Range:1680 - 169F
 
IsRunic ---> Unicode Range:16A0 - 16FF
 
IsTagalog ---> Unicode Range:1700 - 171F
 
IsHanunoo ---> Unicode Range:1720 - 173F
 
IsBuhid ---> Unicode Range:1740 - 175F
 
IsTagbanwa ---> Unicode Range:1760 - 177F
 
IsKhmer ---> Unicode Range:1780 - 17FF
 
IsMongolian ---> Unicode Range:1800 - 18AF
 
IsLimbu ---> Unicode Range:1900 - 194F
 
IsTaiLe ---> Unicode Range:1950 - 197F
 
IsKhmerSymbols ---> Unicode Range:19E0 - 19FF
 
IsPhoneticExtensions ---> Unicode Range:1D00 - 1D7F
 
IsLatinExtendedAdditional ---> Unicode Range:1E00 - 1EFF
 
IsGreekExtended ---> Unicode Range:1F00 - 1FFF
 
IsGeneralPunctuation ---> Unicode Range:2000 - 206F
 
IsSuperscriptsandSubscripts ---> Unicode Range:2070 - 209F
 
IsCurrencySymbols ---> Unicode Range:20A0 - 20CF
 
IsCombiningDiacriticalMarksforSymbols ---> Unicode Range:20D0 - 20FF
 
IsLetterlikeSymbols ---> Unicode Range:2100 - 214F
 
IsNumberForms ---> Unicode Range:2150 - 218F
 
IsArrows ---> Unicode Range:2190 - 21FF
 
IsMathematicalOperators ---> Unicode Range:2200 - 22FF
 
IsMiscellaneousTechnical ---> Unicode Range:2300 - 23FF
 
IsControlPictures ---> Unicode Range:2400 - 243F
 
IsOpticalCharacterRecognition ---> Unicode Range:2440 - 245F
 
IsEnclosedAlphanumerics ---> Unicode Range:2460 - 24FF
 
IsBoxDrawing ---> Unicode Range:2500 - 257F
 
IsBlockElements ---> Unicode Range:2580 - 259F
 
IsGeometricShapes ---> Unicode Range:25A0 - 25FF
 
IsMiscellaneousSymbols ---> Unicode Range:2600 - 26FF
 
IsDingbats ---> Unicode Range:2700 - 27BF
 
IsMiscellaneousMathematicalSymbols-A ---> Unicode Range:27C0 - 27EF
 
IsSupplementalArrows-A ---> Unicode Range:27F0 - 27FF
 
IsBraillePatterns ---> Unicode Range:2800 - 28FF
 
IsSupplementalArrows-B ---> Unicode Range:2900 - 297F
 
IsMiscellaneousMathematicalSymbols-B ---> Unicode Range:2980 - 29FF
 
IsSupplementalMathematicalOperators ---> Unicode Range:2A00 - 2AFF
 
IsMiscellaneousSymbolsandArrows ---> Unicode Range:2B00 - 2BFF
 
IsCJKRadicalsSupplement ---> Unicode Range:2E80 - 2EFF
 
IsKangxiRadicals ---> Unicode Range:2F00 - 2FDF
 
IsIdeographicDescriptionCharacters ---> Unicode Range:2FF0 - 2FFF
 
IsCJKSymbolsandPunctuation ---> Unicode Range:3000 - 303F
 
IsHiragana ---> Unicode Range:3040 - 309F
 
IsKatakana ---> Unicode Range:30A0 - 30FF
 
IsBopomofo ---> Unicode Range:3100 - 312F
 
IsHangulCompatibilityJamo ---> Unicode Range:3130 - 318F
 
IsKanbun ---> Unicode Range:3190 - 319F
 
IsBopomofoExtended ---> Unicode Range:31A0 - 31BF
 
IsKatakanaPhoneticExtensions ---> Unicode Range:31F0 - 31FF
 
IsEnclosedCJKLettersandMonths ---> Unicode Range:3200 - 32FF
 
IsCJKCompatibility ---> Unicode Range:3300 - 33FF
 
IsCJKUnifiedIdeographsExtensionA ---> Unicode Range:3400 - 4DBF
 
IsYijingHexagramSymbols ---> Unicode Range:4DC0 - 4DFF
 
IsCJKUnifiedIdeographs ---> Unicode Range:4E00 - 9FFF
 
IsYiSyllables ---> Unicode Range:A000 - A48F
 
IsYiRadicals ---> Unicode Range:A490 - A4CF
 
IsHangulSyllables ---> Unicode Range:AC00 - D7AF
 
IsHighSurrogates ---> Unicode Range:D800 - DB7F
 
IsHighPrivateUseSurrogates ---> Unicode Range:DB80 - DBFF
 
IsLowSurrogates ---> Unicode Range:DC00 - DFFF
 
IsPrivateUse or IsPrivateUseArea ---> Unicode Range:E000 - F8FF
 
IsCJKCompatibilityIdeographs ---> Unicode Range:F900 - FAFF
 
IsAlphabeticPresentationForms ---> Unicode Range:FB00 - FB4F
 
IsArabicPresentationForms-A ---> Unicode Range:FB50 - FDFF
 
IsVariationSelectors ---> Unicode Range:FE00 - FE0F
 
IsCombiningHalfMarks ---> Unicode Range:FE20 - FE2F
 
IsCJKCompatibilityForms ---> Unicode Range:FE30 - FE4F
 
IsSmallFormVariants ---> Unicode Range:FE50 - FE6F
 
IsArabicPresentationForms-B ---> Unicode Range:FE70 - FEFF
 
IsHalfwidthandFullwidthForms ---> Unicode Range:FF00 - FFEF
 
IsSpecials ---> Unicode Range:FFF0 - FFFF
 
// As usual you use these like
 
// \p{IsBasicLatin}
 
// $var =~ m/\p{IsBasicLatin}/;
 
 
</pre>
 
  

Revision as of 02:04, 13 September 2013

<Expression> =~ m/pattern/flags


Description

Match a string against a regular expression pattern, check whether it matches, and optionally return the captured groups.

Parameters

Expression

Any valid expression that is a string.

pattern

The regular expression pattern to match.

flags

Optional. The flags to apply to the pattern.

i = Ignore case.

m = Treat the string as multiple lines.

s = Treat the string as a single line.

o = Do not recompile the regular expression after the first compile (Improves speed of your matches if you run the pattern many times).

g = Match all occurrences of the pattern in the string (Default is only match the first).

d = Return a single dimension array when using flag "g" (Default is to return a multidimensional array).

x = Allows newlines and comments in the regex and ignores whitespace.

Note - If flag "g" is used, the $_rg array will contain all matches from first to last; it will not contain the text it matched, only the matches. However, if "g" is not used, the first element will be the matched text, followed by the matched groups 1, 2, 3, 4 and so on.
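The effect of the i, m, s and x flags can be sketched outside Sputnik. The block below is only an illustration in Python, whose `re` module happens to offer flags with the same letters; it is not Sputnik code, and the sample text is made up. Python has no "g" flag, so `re.findall` stands in for it here, and the "o" and "d" flags have no counterpart in this sketch.

```python
import re

text = "Alpha\nbeta"

# i: ignore case ("alpha" matches "Alpha").
ignore_case = re.search(r"alpha", text, re.I) is not None

# m: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.M)

# s: "." also matches a newline.
dot_default = re.search(r"Alpha.beta", text) is not None
dot_single_line = re.search(r"Alpha.beta", text, re.S) is not None

# x: whitespace and comments inside the pattern are ignored.
verbose = re.search(
    r"""
    (\w+)   # first word
    \s+     # separator
    (\w+)   # second word
    """,
    text,
    re.X,
)

# Python has no g flag; re.findall plays a similar role (assumption:
# this mirrors "match all occurrences", not Sputnik's $_rg layout).
first_word = re.search(r"\w+", text).group()
all_words = re.findall(r"\w+", text)

print(ignore_case, line_starts, dot_default, dot_single_line)
print(verbose.groups(), first_word, all_words)
```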

Return Value

Success: Returns 1 if the match was successful.

Failure: Returns 0.

Remarks

Note - The below is only for single matches not flag "g" matches.

If a regular expression pattern matches and the pattern contains capture groups, the groups will be set to local variables, for example:

Group 0 will be $0.

Group 1 will be $1.

Group 2 will be $2.

And so on.

Of course captured NAMED groups will also be returned as follows:

Named group "Test" will be $_rg["Test"]

Named group "Moo" will be $_rg["Moo"]

Named group "Cat" will be $_rg["Cat"]

And so on.

After each regular expression match, all capture groups from the previous match will be deleted, so it's best to copy them if you intend to keep using them.

Regular expressions

Regular expression notation is a compact way of specifying a pattern for strings that can be searched. Regular expressions are character strings in which plain text characters indicate what text should exist in the target string, and some characters are given special meanings to indicate what variability is allowed in the target string. Sputnik regular expressions are normally case-sensitive.

Regular expressions are constructed of one or more of the following simple regular expression specifiers. If the character is not in the following table, then it will match only itself.

Repeating characters (*, +, ?, {...} ) will try to match the largest set possible, which allows the following characters to match as well, unless followed immediately by a question mark; then it will find the smallest pattern that allows the following characters to match as well.
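The difference between the largest and the smallest match is easiest to see side by side. The sketch below uses Python's `re` module as a stand-in (not Sputnik code), on the assumption that the greedy and lazy rules described above work the same way in both engines; the sample string is made up.

```python
import re

text = "<b>bold</b>"

# Greedy: ".+" grabs as much as possible while still letting ">" match.
greedy = re.search(r"<.+>", text).group()

# Lazy: ".+?" grabs as little as possible while still letting ">" match.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```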

Nested groups are allowed, but keep in mind that all the groups, except non-capturing groups, assign to the returned array, with the outer groups assigning after the inner groups.

Character escapes

The backslash character (\) in a regular expression indicates that the character that follows it either is a special character (as shown in the following table), or should be interpreted literally. For more information, see Character Escapes in Regular Expressions.

Escaped character   Description                                Pattern          Matches
\a                  Matches a bell character, \u0007.          \a               "\u0007" in "Error!" + '\u0007'
\b                  In a character class, matches a
                    backspace, \u0008.                         [\b]{3,}         "\b\b\b\b" in "\b\b\b\b"

\t                  Matches a tab, \u0009.                     (\w+)\t          "item1\t", "item2\t" in "item1\titem2\t"
\r                  Matches a carriage return, \u000D.
                    (\r is not equivalent to the
                    newline character, \n.)                    \r\n(\w+)        "\r\nThese" in "\r\nThese are\ntwo lines."
\v                  Matches a vertical tab, \u000B.            [\v]{2,}         "\v\v\v" in "\v\v\v"
\f                  Matches a form feed, \u000C.               [\f]{2,}         "\f\f\f" in "\f\f\f"
\n                  Matches a new line, \u000A.                \r\n(\w+)        "\r\nThese" in "\r\nThese are\ntwo lines."
\e                  Matches an escape, \u001B.                 \e               "\x001B" in "\x001B"
\ nnn               Uses octal representation to specify
                    a character (nnn consists of two or
                    three digits).                             \w\040\w         "a b", "c d" in "a bc d"
\x nn               Uses hexadecimal representation to
                    specify a character (nn consists of
                    exactly two digits).                       \w\x20\w         "a b", "c d" in "a bc d"
\c X
\c x                Matches the ASCII control character
                    that is specified by X or x, where
                    X or x is the letter of the control
                    character.                                 \cC              "\x0003" in "\x0003" (Ctrl-C)
\u nnnn             Matches a Unicode character by
                    using hexadecimal representation
                    (exactly four digits, as represented
                    by nnnn).                                  \w\u0020\w       "a b", "c d" in "a bc d"
\                   When followed by a character that is
                    not recognized as an escaped character
                    in this and other tables in this topic,
                    matches that character. For example, \*
                    is the same as \x2A, and \. is the same
                    as \x2E. This allows the regular
                    expression engine to disambiguate language
                    elements (such as * or ?) and character
                    literals (represented by \* or \?).        \d+[\+-x\*]\d+   "2+2" and "3*9" in "(2+2) * 3*9"
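As a rough illustration of escaping, the sketch below matches literal `*` and `+` characters and a tab. It is written in Python's `re` module as a stand-in, not in Sputnik, and assumes these particular escapes (`\*`, `\+`, `\x2A`, `\t`) behave the same way in both engines.

```python
import re

text = "(2+2) * 3*9"

# A backslash turns the quantifiers * and + into literal characters.
products = re.findall(r"\d+\*\d+", text)
sums = re.findall(r"\d+\+\d+", text)

# \x2A is the hexadecimal escape for "*", so it matches the same text.
products_hex = re.findall(r"\d+\x2A\d+", text)

# \t matches a tab; findall returns the captured group for each match.
items = re.findall(r"(\w+)\t", "item1\titem2\t")

print(products, sums, products_hex, items)
```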
Repeating Characters
Character classes

A character class matches any one of a set of characters. Character classes include the language elements listed in the following table. For more information, see Character Classes in Regular Expressions.

Character class            Description                                 Pattern                      Matches
[ character_group ]        Matches any single character in
                           character_group. By default,
                           the match is case-sensitive.                [ae]                         "a" in "gray"
                                                                                                    "a", "e" in "lane"
[^ character_group ]       Negation: Matches any single character
                           that is not in character_group. By
                           default, characters in character_group
                           are case-sensitive.                         [^aei]                       "r", "g", "n" in "reign"
[ first - last ]           Character range: Matches any single
                           character in the range from first to
                           last.                                       [A-Z]                        "A", "B" in "AB123"
.                          Wildcard: Matches any single character
                           except \n.
                           To match a literal period character
                           (. or \u002E), you must precede it with
                           the escape character (\.).                  a.e                          "ave" in "nave"
                                                                                                    "ate" in "water"
\p{ name }                 Matches any single character in the
                           Unicode general category or named block
                           specified by name.                          \p{Lu}                       "C", "L" in "City Lights"
                                                                       \p{IsCyrillic}               "Д", "Ж" in "ДЖem"
\P{ name }                 Matches any single character that is
                           not in the Unicode general category or
                           named block specified by name.              \P{Lu}                       "i", "t", "y" in "City"
                                                                       \P{IsCyrillic}               "e", "m" in "ДЖem"
\w                         Matches any word character.                 \w                           "I", "D", "A", "1"
                                                                                                    "3" in "ID A1.3"
\W                         Matches any non-word character.             \W                           " ", "." in "ID A1.3"
\s                         Matches any white-space character.          \w\s                         "D " in "ID A1.3"
\S                         Matches any non-white-space character.      \s\S                         " _" in "int __ctr"
\d                         Matches any decimal digit.                  \d                           "4" in "4 = IV"
\D                         Matches any character other than a
                           decimal digit.                              \D                           " ", "=", " ", "I"
                                                                                                    "V" in "4 = IV" 
Anchors

Anchors, or atomic zero-width assertions, cause a match to succeed or fail depending on the current position in the string, but they do not cause the engine to advance through the string or consume characters. The metacharacters listed in the following table are anchors. For more information, see Anchors in Regular Expressions.

Assertion   Description                                       Pattern                Matches
^           The match must start at the beginning of the
            string or line.                                   ^\d{3}                 "901" in "901-333-"
$           The match must occur at the end of the string
            or before \n at the end of the line or string.    -\d{3}$                "-333" in "-901-333"
\A          The match must occur at the start of the string.  \A\d{3}                "901" in "901-333-"
\Z          The match must occur at the end of the string
            or before \n at the end of the string.            -\d{3}\Z               "-333" in "-901-333"
\z          The match must occur at the end of the string.    -\d{3}\z               "-333" in "-901-333"
\G          The match must occur at the point where the
            previous match ended.                             \G\(\d\)               "(1)", "(3)", "(5)" in "(1)(3)(5)[7](9)"
\b          The match must occur on a boundary between
            a \w (alphanumeric) and a \W (nonalphanumeric)
            character.                                        \b\w+\s\w+\b           "them theme", "them them" in
                                                                                     "them theme them them"
\B          The match must not occur on a \b boundary.         \Bend\w*\b            "ends", "ender" in
                                                                                     "end sends endure lender"

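Some of the anchor rows above can be checked with a short sketch. It uses Python's `re` module as a stand-in rather than Sputnik, and only covers `^`, `$`, `\b` and `\B`; the `\Z`, `\z` and `\G` rows are left out because Python's `re` does not treat them the same way.

```python
import re

# ^ anchors the match to the start of the string.
start = re.findall(r"^\d{3}", "901-333-")

# $ anchors the match to the end of the string.
end = re.findall(r"-\d{3}$", "-901-333")

# \b requires a boundary between a word and a non-word character.
pairs = re.findall(r"\b\w+\s\w+\b", "them theme them them")

# \B requires that the position is NOT a word boundary, so "end"
# only matches when it sits inside a longer word.
inner = re.findall(r"\Bend\w*\b", "end sends endure lender")

print(start, end, pairs, inner)
```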
Example

Check if a string matches a given pattern :

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/\w+,\s+\w+!/ )
{
	println("True");
}
else
{
	println("False");
}

Check if a string matches a given pattern case insensitive :

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/hello,\s+WORLD!/i )
{
	println("True");
}
else
{
	println("False");
}

Simple match on a string, returning 2 captured groups :

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
println("Account '$1' Credits '$2'");
// Prints
// Account 'Test' Credits '777'

Simple match on a string, returning 2 captured groups and saving them to variables :

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
$Account = $1;
$Credits = $2;
 
println("Account '$Account' Credits '$Credits'");
// Prints
// Account 'Test' Credits '777'

Same thing but this time parsing multiple lines of accounts :

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig;
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$Account = $_rg[$i][1];
	$Credits = $_rg[$i][2];
	println("Match ($i) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}
// Prints
// Match (0) |  Account 'Test' | Credits '777'
// Match (1) |  Account 'FoX' | Credits '1337'
// Match (2) |  Account 'Cat' | Credits '100'
// Match (3) |  Account 'Dog' | Credits '50'

Case insensitive match on a string to capture all possible matches and return them as a multi-dimensional array :

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/ig;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$match = $_rg[$i];
	println("Match ($i) |  Text '" . $match[0] . "' | Group text '" . $match[1] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Text '<test>a</test>' | Group text 'a'
// Match (1) |  Text '<test>b</test>' | Group text 'b'
// Match (2) |  Text '<test>c</Test>' | Group text 'c'

Case insensitive match on a string to capture all possible matches and return them as a single dimension array :

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/igd;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	println("Match ($i) |  Group text '" . $_rg[$i] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Group text 'a'
// Match (1) |  Group text 'b'
// Match (2) |  Group text 'c'

Example of using the /x flag

my $a = "xor eax, edx";
$a =~ m/
		(\w+) # You can add comments
		\s*
		(\w+)
		\s*	# Yup comments all over
		,
		\s*
		(\w+)
		/x;
print( "'$1' -> '$2' -> '$3'" );

Example of using a While loop (While loops with regexp won't work properly without the /g flag)

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
while( $str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig )
{
	$Account = $_rg[$_][1];
	$Credits = $_rg[$_][2];
	println("Match ($_) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}

Example of named capture groups

$str = "xor eax, edx";
 
if( $str =~ m/xor\s*(?<first>\w*),\s*(?<second>\w*)/ )
{
	println("True: " . $_rg["first"] . " | " . $_rg["second"]);
}
else
{
	println("False");
}