Core Function Regex Match

From Sputnik Wiki

Revision as of 01:00, 13 September 2013

<Expression> =~ m/pattern/flags

Description

Match a string against a regular expression pattern, check whether it matches, and optionally return the captured groups.

Parameters

Expression

Any valid expression that is a string.

pattern

The regular expression pattern to match.

flags

Optional; The flags to use in the pattern.

i = Ignore case.

m = Treat the string as multiple lines.

s = Treat the string as a single line.

o = Do not recompile the regular expression after the first compile (Improves speed of your matches if you run the pattern many times).

g = Match all occurrences of the pattern in the string (the default is to match only the first).

d = Return a single dimension array when using flag "g" (Default is to return a multidimensional array).

x = Allows newlines and comments in the regex and ignores whitespace.

Note - If flag "g" is used, the $_rg array will contain all matches from first to last; it will not contain the text it matched, only the matches. If "g" is not used, the first element will be the matched text, followed by the matched groups 1, 2, 3, 4 and so on.
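The flags above have close counterparts in most regex engines. The sketch below uses Python's re module purely as an analogy: it is not Sputnik code, Sputnik's engine may differ in detail, and the sample strings are made up for illustration.

```python
import re

text = "Cat\ncat\nCAT"

# "i": ignore case, so "cat" also matches "Cat" and "CAT".
ignore_case = re.findall(r"cat", text, re.IGNORECASE)

# "m": treat the string as multiple lines, so ^ and $ match at each line break.
line_matches = re.findall(r"^c\w+$", text, re.MULTILINE)

# "s": treat the string as a single line, so "." also matches a newline.
across_lines = re.search(r"Cat.cat", text, re.DOTALL) is not None

# "x": whitespace and "#" comments inside the pattern are ignored.
verbose = re.search(r"""
    (\w+)   # first word
    \s+
    (\w+)   # second word
    """, "hello world", re.VERBOSE)

# "g": collect every match instead of stopping at the first one.
first_only = re.search(r"\d+", "10 20 30").group(0)
all_matches = re.findall(r"\d+", "10 20 30")

print(ignore_case, line_matches, across_lines, verbose.groups(), first_only, all_matches)
```

Python has no direct counterpart to the "o" and "d" flags, so they are left out of the sketch.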

Return Value

Success: Returns 1 if the match was successful.

Failure: Returns 0.

Remarks

Note - The following applies only to single matches, not to matches made with flag "g".

If a regular expression pattern matches and it contains capture groups, the groups are assigned to local variables, for example:

Group 0 will be $0.

Group 1 will be $1.

Group 2 will be $2.

And so on.

Named capture groups are also returned, as follows:

Named group "Test" will be $_rg["Test"]

Named group "Moo" will be $_rg["Moo"]

Named group "Cat" will be $_rg["Cat"]

And so on.

After each regular expression match, all capture groups from the previous match are deleted, so it's best to copy them if you intend to keep using them.
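To make the numbering concrete, here is a minimal sketch in Python's re module, used only as a stand-in for illustration (it is not Sputnik code). Note that Python spells named groups as (?P<name>...), and its match object keeps the groups rather than exposing $0, $1 or $_rg; the comments point out the rough equivalents.

```python
import re

m = re.search(r"Account\s+(?P<name>\w+)\s+Credits\s+(\d+)",
              "Account Test Credits 777")

whole = m.group(0)        # roughly $0: the full matched text
first = m.group(1)        # roughly $1: the first capture group
second = m.group(2)       # roughly $2: the second capture group
named = m.group("name")   # roughly $_rg["name"]: a named capture group

print(whole, first, second, named)
```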

Regular expressions

Regular expression notation is a compact way of specifying a pattern for strings that can be searched. Regular expressions are character strings in which plain text characters indicate what text should exist in the target string, and some characters are given special meanings to indicate what variability is allowed in the target string. Matching is normally case-sensitive.

Regular expressions are constructed of one or more of the following simple regular expression specifiers. If the character is not in the following table, then it will match only itself.

Repeating characters (*, +, ?, {...} ) will try to match the largest set possible, which allows the following characters to match as well, unless followed immediately by a question mark; then it will find the smallest pattern that allows the following characters to match as well.
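The difference between the largest and the smallest match is easiest to see side by side. The following is a small sketch using Python's re module as a stand-in (an assumption for illustration, not Sputnik code); the sample string is made up.

```python
import re

html = "<b>one</b> <b>two</b>"

# Greedy: .* takes as much as it can while still letting </b> match.
greedy = re.search(r"<b>.*</b>", html).group(0)

# Lazy: .*? takes as little as it can while still letting </b> match.
lazy = re.search(r"<b>.*?</b>", html).group(0)

print(greedy)
print(lazy)
```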

Nested groups are allowed, but keep in mind that all the groups, except non-capturing groups, assign to the returned array, with the outer groups assigning after the inner groups.

Matching Characters

[ ... ]	Match any character in the set. e.g. [aeiou] matches any lower-case vowel. A contiguous set can be defined using a dash between the starting and ending characters. e.g. [a-z] matches any lower-case character. To include a dash (-) in a set, use it as the first or last character of the set. To include a closing bracket in a set, use it as the first character of the set. e.g. [][] will match either [ or ]. Note that special characters do not retain their special meanings inside a set, except that \\, \^, \-, \[ and \] match the escaped character inside a set.
[^ ... ]	Match any character not in the set. e.g. [^0-9] matches any non-digit. To include a caret (^) in a set, put it anywhere after the first position of the set or escape it (\^).
[:class:]	Match a character in the given class of characters. Valid classes are: alpha (any alphabetic character), alnum (any alphanumeric character), lower (any lower-case letter), upper (any upper-case letter), digit (any decimal digit 0-9), xdigit (any hexadecimal digit, 0-9, A-F, a-f), space (any white space character), blank (only a space or tab), print (any printable character), graph (any printable character except spaces), cntrl (any control character [ascii 127 or <32]) or punct (any punctuation character). So [0-9] is equivalent to [:digit:].
[^:class:]	Match any character not in the class, but only if the caret is the first character.

( ... )	Group. The elements in the group are treated in order and can be repeated together. e.g. (ab)+ will match "ab" or "abab", but not "aba". A group will also store the text matched for use in back-references and in the array returned by the function, depending on flag value.
(?i)	Case-insensitivity flag. This does not operate as a group. It tells the regular expression engine to do case-insensitive matching from that point on.
(?-i)	Case-sensitivity flag. This does not operate as a group. It tells the regular expression engine to do case-sensitive matching from that point on.
((?i) ... )	Case-insensitive group. Behaves just like a normal group, but performs case-insensitive matches within the group.
((?-i) ... )	Case-sensitive group. Behaves just like a normal group, but performs case-sensitive matches within the group. Primarily for use after (?i) flag or inside a case-insensitive group.
(?: ... )	Non-capturing group. Behaves just like a normal group, but does not record the matching characters in the array nor can the matched text be used for back-referencing.
(?i: ... )	Case-insensitive non-capturing group. Behaves just like a non-capturing group, but performs case-insensitive matches within the group.
(?-i: ... )	Case-sensitive non-capturing group. Behaves just like a non-capturing group, but performs case-sensitive matches within the group.

(?m)	^ and $ match newlines within data.
(?s)	. matches anything including newline (by default "." does not match newline).
(?x)	Ignore white space and # comments.
(?U)	Invert greediness of quantifiers.

.	Match any single character (except newline).
|	Or. The expression on one side or the other can be matched.
\	Escape a special character (have it match the actual character) or introduce a special character type (see below).
\\	Match an actual backslash (\).

\a	Alarm, that is, the BEL character (chr(7)).
\A	Match only at beginning of string.
\b	Match at a word boundary.
\B	Match when not at a word boundary.
\c	Match a control character, based on the next character. For example, \cM matches ctrl-M.
\d	Match any digit (0-9).
\D	Match any non-digit.
\e	Match an escape character (chr(27)).
\E	End case modification.
\f	Match a form feed character (chr(12)).
\h	Match any horizontal white space character.
\H	Match any character that is not a horizontal white space character.
\n	Match a linefeed (@LF, chr(10)).
\Q	Quote (disable) pattern meta characters until \E.
\r	Match a carriage return (@CR, chr(13)).
\s	Match any white space character: Chr(9) through Chr(13) which are Horizontal Tab, Line Feed, Vertical Tab, Form Feed, and Carriage Return, and the standard space ( Chr(32) ).
\S	Match any non-white space character.
\t	Match a tab character (chr(9)).
\v	Match any vertical white space character.
\V	Match any character that is not a vertical white space character.
\w	Match any "word" character: a-z, A-Z, 0-9 or underscore (_).
\W	Match any non-word character.
\###	Match the ascii character whose code is given in octal (up to 3 digits), or match a back-reference if a group with that number exists, in which case the text matched by that group must be matched again exactly. For example, ([:alpha:])\1 would match a double letter.
\x##	Match the ascii character whose code is given in hexadecimal. Can be up to 2 digits.
\z	Match only at end of string.
\Z	Match only at end of string, or before newline at the end.
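Several of the entries above can be tried out in almost any regex engine. The sketch below uses Python's re module as a stand-in (an assumption for illustration, not Sputnik code). Python does not support every entry in the table (for example the [:class:] form and \Q...\E are missing), so only a few common ones are shown, and the sample strings are made up.

```python
import re

# Sets: any character in the set, or not in the set.
vowels = re.findall(r"[aeiou]", "regular")
non_digits = re.findall(r"[^0-9]+", "a1b22c")
brackets = re.findall(r"[][]", "[x]")   # "]" placed first in the set is a literal

# Groups: (ab)+ repeats the whole group; fullmatch checks the entire string.
repeated = [re.fullmatch(r"(ab)+", s) is not None for s in ("ab", "abab", "aba")]

# Capturing versus non-capturing groups.
capturing = re.search(r"(\d+)-(\d+)", "10-20").groups()
non_capturing = re.search(r"(?:\d+)-(\d+)", "10-20").groups()

# Back-reference: \1 repeats whatever group 1 matched (a double letter here).
double_letter = re.search(r"([a-z])\1", "hello").group(0)

# Inline flags: (?i) ignores case, (?m) lets ^ and $ match at line breaks,
# (?s) lets "." match a newline.
inline_i = re.search(r"(?i)hello", "HELLO") is not None
inline_m = re.findall(r"(?m)^\w+$", "first\nsecond")
inline_s = re.search(r"(?s)first.second", "first\nsecond") is not None

# Escapes: word boundary, digits, non-white space, hexadecimal character code.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
digits = re.findall(r"\d+", "a1b22")
tokens = re.findall(r"\S+", "a b\tc")
hex_a = re.search(r"\x41", "A") is not None

print(vowels, non_digits, brackets, repeated, capturing, non_capturing)
print(double_letter, inline_i, inline_m, inline_s, whole_words, digits, tokens, hex_a)
```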

Repeating Characters

{x}	Repeat the previous character, set or group exactly x times.
{x,}	Repeat the previous character, set or group at least x times.
{0,x}	Repeat the previous character, set or group at most x times.
{x,y}	Repeat the previous character, set or group between x and y times, inclusive.
*	Repeat the previous character, set or group 0 or more times. Equivalent to {0,}
+	Repeat the previous character, set or group 1 or more times. Equivalent to {1,}
?	The previous character, set or group may or may not appear. Equivalent to {0,1}
?	(after a repeating character) Find the smallest match instead of the largest.
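The repeat counts can be checked with a few whole-string tests. This is a minimal sketch in Python's re module, used only as a stand-in for illustration (not Sputnik code); the helper name full is hypothetical.

```python
import re

def full(pattern, s):
    """Return True when the pattern matches the entire string (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

exactly = [full(r"a{2}", s) for s in ("a", "aa", "aaa")]         # exactly 2
at_least = [full(r"a{2,}", s) for s in ("a", "aa", "aaa")]       # 2 or more
at_most = [full(r"a{0,2}", s) for s in ("", "aa", "aaa")]        # 0 to 2
between = [full(r"a{1,2}", s) for s in ("", "a", "aa", "aaa")]   # 1 to 2
star = [full(r"ab*", s) for s in ("a", "ab", "abbb")]            # 0 or more b
plus = [full(r"ab+", s) for s in ("a", "ab", "abbb")]            # 1 or more b
optional = [full(r"colou?r", s) for s in ("color", "colour", "colouur")]

print(exactly, at_least, at_most, between, star, plus, optional)
```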

Character classes
alnum	letters and digits
alpha	letters
ascii	character codes 0 - 127
blank	space or tab only
cntrl	control characters
digit	decimal digits (same as \d)
graph	printing characters, excluding space
lower	lower case letters
print	printing characters, including space
punct	printing characters, excluding letters and digits
space	white space (not quite the same as \s)
upper	upper case letters
word	"word" characters (same as \w)
xdigit	hexadecimal digits
// You use these like
// say ("CAT" =~ m/\p{alnum}/);
// \p{alnum}

You can also use the Macros for even more character classes.

Example

Check if a string matches a given pattern:

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/\w+,\s+\w+!/ )
{
	println("True");
}
else
{
	println("False");
}

Check if a string matches a given pattern, ignoring case:

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/hello,\s+WORLD!/i )
{
	println("True");
}
else
{
	println("False");
}

Simple match on a string, returning two captured groups:

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
println("Account '$1' Credits '$2'");
// Prints
// Account 'Test' Credits '777'

Simple match on a string, returning two captured groups and saving them in variables:

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
$Account = $1;
$Credits = $2;
 
println("Account '$Account' Credits '$Credits'");
// Prints
// Account 'Test' Credits '777'

The same thing, but this time parsing multiple lines of accounts:

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig;
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$Account = $_rg[$i][1];
	$Credits = $_rg[$i][2];
	println("Match ($i) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}
// Prints
// Match (0) |  Account 'Test' | Credits '777'
// Match (1) |  Account 'FoX' | Credits '1337'
// Match (2) |  Account 'Cat' | Credits '100'
// Match (3) |  Account 'Dog' | Credits '50'

Case-insensitive match on a string to capture all possible matches and return them as a multi-dimensional array:

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/ig;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$match = $_rg[$i];
	println("Match ($i) |  Text '" . $match[0] . "' | Group text '" . $match[1] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Text '<test>a</test>' | Group text 'a'
// Match (1) |  Text '<test>b</test>' | Group text 'b'
// Match (2) |  Text '<test>c</Test>' | Group text 'c'

Case-insensitive match on a string to capture all possible matches and return them as a single dimension array:

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/igd;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	println("Match ($i) |  Group text '" . $_rg[$i] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Group text 'a'
// Match (1) |  Group text 'b'
// Match (2) |  Group text 'c'

Example of using the /x flag

my $a = "xor eax, edx";
$a =~ m/
		(\w+) # You can add comments
		\s*
		(\w+)
		\s*	# Yup comments all over
		,
		\s*
		(\w+)
		/x;
print( "'$1' -> '$2' -> '$3'" );

Example of using a while loop (while loops with a regexp won't work properly without the /g flag):

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
while( $str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig )
{
	$Account = $_rg[$_][1];
	$Credits = $_rg[$_][2];
	println("Match ($_) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}

Example of named capture groups

$str = "xor eax, edx";
 
if( $str =~ m/xor\s*(?<first>\w*),\s*(?<second>\w*)/ )
{
	println("True: " . $_rg["first"] . " | " . $_rg["second"]);
}
else
{
	println("False");
}