Core Function Regex Match

From Sputnik Wiki

Revision as of 02:04, 13 September 2013

<Expression> =~ m/pattern/flags

Regular expression notation is a compact way of specifying a pattern of strings to search for. Regular expressions are character strings in which plain text characters indicate what text should exist in the target string, and some characters are given special meanings to indicate what variability is allowed in the target string. Sputnik regular expressions are normally case-sensitive.
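
As a rough illustration of the paragraph above, the sketch below uses Python's re module rather than Sputnik. That choice is an assumption made only so the snippet can be run standalone; Sputnik's own syntax is the `$str =~ m/pattern/flags` form used in the examples on this page. It shows plain characters matching themselves, a special sequence allowing variation, and matching being case-sensitive unless a flag says otherwise.

```python
import re

text = "Hello, World!"

# Plain characters match only themselves.
literal = re.search(r"World", text)

# \w+ is a special sequence: one or more word characters.
variable = re.search(r"\w+, \w+!", text)

# Matching is case-sensitive by default ...
wrong_case = re.search(r"world", text)

# ... unless a case-insensitive flag is given (comparable to /i in Sputnik).
ignore_case = re.search(r"world", text, re.IGNORECASE)

print(literal.group(0))
print(variable.group(0))
print(wrong_case)
print(ignore_case.group(0))
```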

Regular expressions are constructed of one or more of the following simple regular expression specifiers. If the character is not in the following table, then it will match only itself.

Repeating characters (*, +, ?, {...}) are greedy by default: they match the largest set possible that still allows the characters that follow to match. If a repeating character is followed immediately by a question mark, it instead matches the smallest set that still allows the characters that follow to match.
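
The difference between the two behaviours is easiest to see side by side. The following is a minimal sketch in Python's re module (an assumption; it is not Sputnik code, though the quantifier rules described above are the same in both engines as far as this example goes). The input string is a shortened version of the one used in the examples further down.

```python
import re

text = "<test>a</test> <test>b</test>"

# Largest match: .* takes as much as it can while still letting </test> match.
greedy = re.findall(r"<test>(.*)</test>", text)

# Smallest match: .*? takes as little as it can while still letting </test> match.
lazy = re.findall(r"<test>(.*?)</test>", text)

# The same rule applies to {...} repetition.
greedy_count = re.search(r"\d{2,4}", "12345").group(0)
lazy_count = re.search(r"\d{2,4}?", "12345").group(0)

print(greedy)        # ['a</test> <test>b']
print(lazy)          # ['a', 'b']
print(greedy_count)  # 1234
print(lazy_count)    # 12
``` <test>b"]
assert lazy == ["a", "b"]
assert greedy_count == "1234"
assert lazy_count == "12"
</test>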

Nested groups are allowed, but keep in mind that all the groups, except non-capturing groups, assign to the returned array, with the outer groups assigning after the inner groups.

Character escapes

The backslash character (\) in a regular expression indicates that the character that follows it either is a special character (as shown in the following table), or should be interpreted literally. For more information, see Character Escapes in Regular Expressions.

Escaped character   Description                                Pattern          Matches
\a                  Matches a bell character, \u0007.          \a               "\u0007" in "Error!" + '\u0007'
\b                  In a character class, matches a
                    backspace, \u0008.                         [\b]{3,}         "\b\b\b\b" in "\b\b\b\b"

\t                  Matches a tab, \u0009.                     (\w+)\t          "item1\t", "item2\t" in "item1\titem2\t"
\r                  Matches a carriage return, \u000D.
                    (\r is not equivalent to the
                    newline character, \n.)                    \r\n(\w+)        "\r\nThese" in "\r\nThese are\ntwo lines."
\v                  Matches a vertical tab, \u000B.            [\v]{2,}         "\v\v\v" in "\v\v\v"
\f                  Matches a form feed, \u000C.               [\f]{2,}         "\f\f\f" in "\f\f\f"
\n                  Matches a new line, \u000A.                \r\n(\w+)        "\r\nThese" in "\r\nThese are\ntwo lines."
\e                  Matches an escape, \u001B.                 \e               "\x001B" in "\x001B"
\nnn                Uses octal representation to specify
                    a character (nnn consists of two or
                    three digits).                             \w\040\w         "a b", "c d" in "a bc d"
\xnn                Uses hexadecimal representation to
                    specify a character (nn consists of
                    exactly two digits).                       \w\x20\w         "a b", "c d" in "a bc d"
\cX
\cx                 Matches the ASCII control character
                    that is specified by X or x, where
                    X or x is the letter of the control
                    character.                                 \cC              "\x0003" in "\x0003" (Ctrl-C)
\unnnn              Matches a Unicode character by
                    using hexadecimal representation
                    (exactly four digits, as represented
                    by nnnn).                                  \w\u0020\w       "a b", "c d" in "a bc d"
\                   When followed by a character that is
                    not recognized as an escaped character
                    in this and other tables in this topic,
                    matches that character. For example, \*
                    is the same as \x2A, and \. is the same
                    as \x2E. This allows the regular
                    expression engine to disambiguate language
                    elements (such as * or ?) and character
                    literals (represented by \* or \?).        \d+[\+-x\*]\d+   "2+2" and "3*9" in "(2+2) * 3*9"
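
Several rows of the table above can be checked with a short script. This is a sketch in Python's re module, not Sputnik, so it covers only the escapes that both engines are known to share; \e and \cX are left out because Python's re does not accept them. The patterns and input strings are the ones from the table.

```python
import re

# \t matches a tab; group(0) is the whole match including the tab.
tabbed = [m.group(0) for m in re.finditer(r"(\w+)\t", "item1\titem2\t")]

# \x20, \040 and \u0020 are three spellings of the space character.
hex_space = re.findall(r"\w\x20\w", "a bc d")
octal_space = re.findall(r"\w\040\w", "a bc d")
unicode_space = re.findall(r"\w\u0020\w", "a bc d")

# \r\n matches a carriage return followed by a new line.
crlf = re.search(r"\r\n(\w+)", "\r\nThese are\ntwo lines.")

# A backslash in front of + or * makes the character literal.
sums = re.findall(r"\d+[\+-x\*]\d+", "(2+2) * 3*9")

print(tabbed)
print(hex_space, octal_space, unicode_space)
print(repr(crlf.group(0)))
print(sums)
```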

Repeating Characters

Character classes

A character class matches any one of a set of characters. Character classes include the language elements listed in the following table. For more information, see Character Classes in Regular Expressions.

Character class            Description                                 Pattern                      Matches
[ character_group ]        Matches any single character in
                           character_group. By default,
                           the match is case-sensitive.                [ae]                         "a" in "gray"
                                                                                                    "a", "e" in "lane"
[^ character_group ]       Negation: Matches any single character
                           that is not in character_group. By
                           default, characters in character_group
                           are case-sensitive.                         [^aei]                       "r", "g", "n" in "reign"
[ first - last ]           Character range: Matches any single
                           character in the range from first to
                           last.                                       [A-Z]                        "A", "B" in "AB123"
.                          Wildcard: Matches any single character
                           except \n.
                           To match a literal period character
                           (. or \u002E), you must precede it with
                           the escape character (\.).                  a.e                          "ave" in "nave"
                                                                                                    "ate" in "water"
\p{ name }                 Matches any single character in the
                           Unicode general category or named block
                           specified by name.                          \p{Lu}                       "C", "L" in "City Lights"
                                                                       \p{IsCyrillic}               "Д", "Ж" in "ДЖem"
\P{ name }                 Matches any single character that is
                           not in the Unicode general category or
                           named block specified by name.              \P{Lu}                       "i", "t", "y" in "City"
                                                                       \P{IsCyrillic}               "e", "m" in "ДЖem"
\w                         Matches any word character.                 \w                           "I", "D", "A", "1",
                                                                                                    "3" in "ID A1.3"
\W                         Matches any non-word character.             \W                           " ", "." in "ID A1.3"
\s                         Matches any white-space character.          \w\s                         "D " in "ID A1.3"
\S                         Matches any non-white-space character.      \s\S                         " _" in "int __ctr"
\d                         Matches any decimal digit.                  \d                           "4" in "4 = IV"
\D                         Matches any character other than a
                           decimal digit.                              \D                           " ", "=", " ", "I",
                                                                                                    "V" in "4 = IV"
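
The rows of the character class table can be replayed as a quick check. The sketch below uses Python's re module as a stand-in for Sputnik (an assumption). The \p{name} and \P{name} rows are omitted because Python's re does not support them. The `cases` list and `results` dictionary are names introduced here for illustration only.

```python
import re

# (pattern, input) pairs taken from the table above.
cases = [
    (r"[ae]", "lane"),       # character group
    (r"[^aei]", "reign"),    # negated character group
    (r"[A-Z]", "AB123"),     # character range
    (r"a.e", "water"),       # wildcard
    (r"\w", "ID A1.3"),      # word characters
    (r"\W", "ID A1.3"),      # non-word characters
    (r"\w\s", "ID A1.3"),    # word character followed by white space
    (r"\s\S", "int __ctr"),  # white space followed by non-white space
    (r"\d", "4 = IV"),       # decimal digits
    (r"\D", "4 = IV"),       # everything except decimal digits
]

# Collect every match for each pair.
results = {(p, s): re.findall(p, s) for p, s in cases}

for (pattern, subject), found in results.items():
    print(f"{pattern!r:12} in {subject!r:12} -> {found}")
```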

Anchors

Anchors, or atomic zero-width assertions, cause a match to succeed or fail depending on the current position in the string, but they do not cause the engine to advance through the string or consume characters. The metacharacters listed in the following table are anchors. For more information, see Anchors in Regular Expressions.

Assertion   Description                                       Pattern                Matches
^           The match must start at the beginning of the
            string or line.                                   ^\d{3}                 "901" in "901-333-"
$           The match must occur at the end of the string
            or before \n at the end of the line or string.    -\d{3}$                "-333" in "-901-333"
\A          The match must occur at the start of the string.  \A\d{3}                "901" in "901-333-"
\Z          The match must occur at the end of the string
            or before \n at the end of the string.            -\d{3}\Z               "-333" in "-901-333"
\z          The match must occur at the end of the string.    -\d{3}\z               "-333" in "-901-333"
\G          The match must occur at the point where the
            previous match ended.                             \G\(\d\)               "(1)", "(3)", "(5)" in "(1)(3)(5)[7](9)"
\b          The match must occur on a boundary between
            a \w (alphanumeric) and a \W (nonalphanumeric)
            character.                                        \b\w+\s\w+\b           "them theme", "them them" in
                                                                                     "them theme them them"
\B          The match must not occur on a \b boundary.         \Bend\w*\b            "ends", "ender" in
                                                                                     "end sends endure lender"
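
A few of the anchors above, tried out on the same inputs as the table. This is a sketch in Python's re module rather than Sputnik (an assumption). It covers only ^, $, \A, \b and \B; \G is not available in Python's re, and Python's \Z does not behave like the \Z described above, so those rows are left out.

```python
import re

# ^ and \A: the match must start at the beginning of the string.
start_caret = re.findall(r"^\d{3}", "901-333-")
start_a = re.findall(r"\A\d{3}", "901-333-")

# $: the match must end at the end of the string.
end_dollar = re.findall(r"-\d{3}$", "-901-333")

# \b: a word boundary is required on both sides of the two-word match.
word_pairs = re.findall(r"\b\w+\s\w+\b", "them theme them them")

# \B: "end" must not start at a word boundary, so "end" and "endure" are skipped.
inner_end = re.findall(r"\Bend\w*\b", "end sends endure lender")

print(start_caret, start_a)
print(end_dollar)
print(word_pairs)
print(inner_end)
```

Anchors consume no characters, which is why "-333" is reported without anything extra at the end.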

Example

Check if a string matches a given pattern:

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/\w+,\s+\w+!/ )
{
	println("True");
}
else
{
	println("False");
}

Check if a string matches a given pattern, ignoring case:

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/hello,\s+WORLD!/i )
{
	println("True");
}
else
{
	println("False");
}

Match a string and return two captured groups:

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
println("Account '$1' Credits '$2'");
// Prints
// Account 'Test' Credits '777'

Match a string, return two captured groups, and save them in variables:

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
$Account = $1;
$Credits = $2;
 
println("Account '$Account' Credits '$Credits'");
// Prints
// Account 'Test' Credits '777'

The same thing, but this time parsing multiple lines of accounts:

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig;
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$Account = $_rg[$i][1];
	$Credits = $_rg[$i][2];
	println("Match ($i) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}
// Prints
// Match (0) |  Account 'Test' | Credits '777'
// Match (1) |  Account 'FoX' | Credits '1337'
// Match (2) |  Account 'Cat' | Credits '100'
// Match (3) |  Account 'Dog' | Credits '50'

Case-insensitive match that captures all possible matches and returns them as a multi-dimensional array:

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/ig;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$match = $_rg[$i];
	println("Match ($i) |  Text '" . $match[0] . "' | Group text '" . $match[1] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Text '<test>a</test>' | Group text 'a'
// Match (1) |  Text '<test>b</test>' | Group text 'b'
// Match (2) |  Text '<test>c</Test>' | Group text 'c'

Case-insensitive match that captures all possible matches and returns them as a single-dimension array:

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/igd;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	println("Match ($i) |  Group text '" . $_rg[$i] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Group text 'a'
// Match (1) |  Group text 'b'
// Match (2) |  Group text 'c'

Example of using the /x flag, which lets the pattern span several lines and carry comments:

my $a = "xor eax, edx";
$a =~ m/
		(\w+) # You can add comments
		\s*
		(\w+)
		\s*	# Yup comments all over
		,
		\s*
		(\w+)
		/x;
print( "'$1' -> '$2' -> '$3'" );

Example of using a while loop (while loops with a regex match won't work properly without the /g flag):

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
while( $str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig )
{
	$Account = $_rg[$_][1];
	$Credits = $_rg[$_][2];
	println("Match ($_) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}

Example of named capture groups:

$str = "xor eax, edx";
 
if( $str =~ m/xor\s*(?<first>\w*),\s*(?<second>\w*)/ )
{
	println("True: " . $_rg["first"] . " | " . $_rg["second"]);
}
else
{
	println("False");
}

Supported Named Blocks

Sputnik provides the named blocks listed in the following tables. The set of supported named blocks is based on Unicode 4.0 and Perl 5.6.

alnum   letters and digits
alpha   letters
ascii   character codes 0 - 127
blank   space or tab only
cntrl   control characters
digit   decimal digits (same as \d)
graph   printing characters, excluding space
lower   lower case letters
print   printing characters, including space
punct   printing characters, excluding letters and digits
space   white space (not quite the same as \s)
upper   upper case letters
word    "word" characters (same as \w)
xdigit  hexadecimal digits

// You use these like
// say ("CAT" =~ m/\p{alnum}/);
// \p{alnum}

You can also use the Macros for even more character classes. Additional classes are:

IsBasicLatin ---> Unicode Range:0000 - 007F
IsLatin-1Supplement ---> Unicode Range:0080 - 00FF
IsLatinExtended-A ---> Unicode Range:0100 - 017F
IsLatinExtended-B ---> Unicode Range:0180 - 024F
IsIPAExtensions ---> Unicode Range:0250 - 02AF
IsSpacingModifierLetters ---> Unicode Range:02B0 - 02FF
IsCombiningDiacriticalMarks ---> Unicode Range:0300 - 036F
IsGreek ---> Unicode Range:0370 - 03FF
IsCyrillic ---> Unicode Range:0400 - 04FF
IsCyrillicSupplement ---> Unicode Range:0500 - 052F
IsArmenian ---> Unicode Range:0530 - 058F
IsHebrew ---> Unicode Range:0590 - 05FF
IsArabic ---> Unicode Range:0600 - 06FF
IsSyriac ---> Unicode Range:0700 - 074F
IsThaana ---> Unicode Range:0780 - 07BF
IsDevanagari ---> Unicode Range:0900 - 097F
IsBengali ---> Unicode Range:0980 - 09FF
IsGurmukhi ---> Unicode Range:0A00 - 0A7F
IsGujarati ---> Unicode Range:0A80 - 0AFF
IsOriya ---> Unicode Range:0B00 - 0B7F
IsTamil ---> Unicode Range:0B80 - 0BFF
IsTelugu ---> Unicode Range:0C00 - 0C7F
IsKannada ---> Unicode Range:0C80 - 0CFF
IsMalayalam ---> Unicode Range:0D00 - 0D7F
IsSinhala ---> Unicode Range:0D80 - 0DFF
IsThai ---> Unicode Range:0E00 - 0E7F
IsLao ---> Unicode Range:0E80 - 0EFF
IsTibetan ---> Unicode Range:0F00 - 0FFF
IsMyanmar ---> Unicode Range:1000 - 109F
IsGeorgian ---> Unicode Range:10A0 - 10FF
IsHangulJamo ---> Unicode Range:1100 - 11FF
IsEthiopic ---> Unicode Range:1200 - 137F
IsCherokee ---> Unicode Range:13A0 - 13FF
IsUnifiedCanadianAboriginalSyllabics ---> Unicode Range:1400 - 167F
IsOgham ---> Unicode Range:1680 - 169F
IsRunic ---> Unicode Range:16A0 - 16FF
IsTagalog ---> Unicode Range:1700 - 171F
IsHanunoo ---> Unicode Range:1720 - 173F
IsBuhid ---> Unicode Range:1740 - 175F
IsTagbanwa ---> Unicode Range:1760 - 177F
IsKhmer ---> Unicode Range:1780 - 17FF
IsMongolian ---> Unicode Range:1800 - 18AF
IsLimbu ---> Unicode Range:1900 - 194F
IsTaiLe ---> Unicode Range:1950 - 197F
IsKhmerSymbols ---> Unicode Range:19E0 - 19FF
IsPhoneticExtensions ---> Unicode Range:1D00 - 1D7F
IsLatinExtendedAdditional ---> Unicode Range:1E00 - 1EFF
IsGreekExtended ---> Unicode Range:1F00 - 1FFF
IsGeneralPunctuation ---> Unicode Range:2000 - 206F
IsSuperscriptsandSubscripts ---> Unicode Range:2070 - 209F
IsCurrencySymbols ---> Unicode Range:20A0 - 20CF
IsCombiningDiacriticalMarksforSymbols ---> Unicode Range:20D0 - 20FF
IsLetterlikeSymbols ---> Unicode Range:2100 - 214F
IsNumberForms ---> Unicode Range:2150 - 218F
IsArrows ---> Unicode Range:2190 - 21FF
IsMathematicalOperators ---> Unicode Range:2200 - 22FF
IsMiscellaneousTechnical ---> Unicode Range:2300 - 23FF
IsControlPictures ---> Unicode Range:2400 - 243F
IsOpticalCharacterRecognition ---> Unicode Range:2440 - 245F
IsEnclosedAlphanumerics ---> Unicode Range:2460 - 24FF
IsBoxDrawing ---> Unicode Range:2500 - 257F
IsBlockElements ---> Unicode Range:2580 - 259F
IsGeometricShapes ---> Unicode Range:25A0 - 25FF
IsMiscellaneousSymbols ---> Unicode Range:2600 - 26FF
IsDingbats ---> Unicode Range:2700 - 27BF
IsMiscellaneousMathematicalSymbols-A ---> Unicode Range:27C0 - 27EF
IsSupplementalArrows-A ---> Unicode Range:27F0 - 27FF
IsBraillePatterns ---> Unicode Range:2800 - 28FF
IsSupplementalArrows-B ---> Unicode Range:2900 - 297F
IsMiscellaneousMathematicalSymbols-B ---> Unicode Range:2980 - 29FF
IsSupplementalMathematicalOperators ---> Unicode Range:2A00 - 2AFF
IsMiscellaneousSymbolsandArrows ---> Unicode Range:2B00 - 2BFF
IsCJKRadicalsSupplement ---> Unicode Range:2E80 - 2EFF
IsKangxiRadicals ---> Unicode Range:2F00 - 2FDF
IsIdeographicDescriptionCharacters ---> Unicode Range:2FF0 - 2FFF
IsCJKSymbolsandPunctuation ---> Unicode Range:3000 - 303F
IsHiragana ---> Unicode Range:3040 - 309F
IsKatakana ---> Unicode Range:30A0 - 30FF
IsBopomofo ---> Unicode Range:3100 - 312F
IsHangulCompatibilityJamo ---> Unicode Range:3130 - 318F
IsKanbun ---> Unicode Range:3190 - 319F
IsBopomofoExtended ---> Unicode Range:31A0 - 31BF
IsKatakanaPhoneticExtensions ---> Unicode Range:31F0 - 31FF
IsEnclosedCJKLettersandMonths ---> Unicode Range:3200 - 32FF
IsCJKCompatibility ---> Unicode Range:3300 - 33FF
IsCJKUnifiedIdeographsExtensionA ---> Unicode Range:3400 - 4DBF
IsYijingHexagramSymbols ---> Unicode Range:4DC0 - 4DFF
IsCJKUnifiedIdeographs ---> Unicode Range:4E00 - 9FFF
IsYiSyllables ---> Unicode Range:A000 - A48F
IsYiRadicals ---> Unicode Range:A490 - A4CF
IsHangulSyllables ---> Unicode Range:AC00 - D7AF
IsHighSurrogates ---> Unicode Range:D800 - DB7F
IsHighPrivateUseSurrogates ---> Unicode Range:DB80 - DBFF
IsLowSurrogates ---> Unicode Range:DC00 - DFFF
IsPrivateUse or IsPrivateUseArea ---> Unicode Range:E000 - F8FF
IsCJKCompatibilityIdeographs ---> Unicode Range:F900 - FAFF
IsAlphabeticPresentationForms ---> Unicode Range:FB00 - FB4F
IsArabicPresentationForms-A ---> Unicode Range:FB50 - FDFF
IsVariationSelectors ---> Unicode Range:FE00 - FE0F
IsCombiningHalfMarks ---> Unicode Range:FE20 - FE2F
IsCJKCompatibilityForms ---> Unicode Range:FE30 - FE4F
IsSmallFormVariants ---> Unicode Range:FE50 - FE6F
IsArabicPresentationForms-B ---> Unicode Range:FE70 - FEFF
IsHalfwidthandFullwidthForms ---> Unicode Range:FF00 - FFEF
IsSpecials ---> Unicode Range:FFF0 - FFFF

// As usual you use these like
// \p{IsBasicLatin}
// $var =~ m/\p{IsBasicLatin}/;
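
Each named block in the list above stands for a contiguous range of code points. To show what that means, the sketch below spells out the IsCyrillic range (0400 - 04FF) as an explicit character range. It is written for Python's re module, which has no \p{name} support, so the range form is a substitute chosen here for illustration and not something Sputnik requires. The variable names are hypothetical.

```python
import re

# Assumption: IsCyrillic covers U+0400 to U+04FF, as listed above.
cyrillic = r"[\u0400-\u04FF]"        # stands in for \p{IsCyrillic}
not_cyrillic = r"[^\u0400-\u04FF]"   # stands in for \P{IsCyrillic}

inside = re.findall(cyrillic, "ДЖem")
outside = re.findall(not_cyrillic, "ДЖem")

print(inside)
print(outside)
```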

