Core Function Regex Match

From Sputnik Wiki
Revision as of 01:48, 13 September 2013

<Expression> =~ m/pattern/flags

Description

Match a string against a regular expression pattern, check whether it matches, and optionally return the captured groups.

Parameters

Expression

Any valid expression that is a string.

pattern

The regular expression pattern to match.

flags

Optional; The flags to use in the pattern.

i = Ignore case.

m = Treat the string as multiple lines.

s = Treat the string as a single line.

o = Do not recompile the regular expression after the first compile (Improves speed of your matches if you run the pattern many times).

g = Match all occurrences of the pattern in the string (Default is only match the first).

d = Return a single dimension array when using flag "g" (Default is to return a multidimensional array).

x = Allows newlines and comments, and ignores whitespace in the regex.

Note - If flag "g" is used, the $_rg array will contain all matches from first to last; it will not contain the text it matched, only the matches. However, if "g" is not used, the first element will be the matched text, followed by all the matched groups 1, 2, 3, 4, etc.
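The effect of the main flags can be sketched outside Sputnik. The block below uses Python's re module purely as a stand-in, which is an assumption of this sketch: it does not model Sputnik's engine, the $_rg array, or the "o" and "d" flags, and the sample text is made up for illustration.

```python
import re

text = "Account Test Credits 777\naccount FoX Credits 1337"

# "i": ignore case, so the pattern "account" also matches "Account".
first = re.search(r"account\s+(\w+)", text, re.IGNORECASE)
first_groups = [first.group(0), first.group(1)]

# "g": Python has no such flag; findall() plays the same role by
# returning every match instead of only the first one.
all_names = re.findall(r"account\s+(\w+)", text, re.IGNORECASE)

# "m": ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# "s": "." also matches a newline, so the match can span two lines.
across_lines = re.search(r"777.account", text, re.DOTALL)

# "x": whitespace and # comments inside the pattern are ignored.
verbose = re.search(
    r"""
    (\w+)   # first word
    \s+
    (\w+)   # second word
    """,
    text,
    re.VERBOSE,
)

print(first_groups)
print(all_names)
print(line_starts)
print(across_lines is not None)
print(verbose.groups())
```

Without re.DOTALL the "777.account" search finds nothing, because "." stops at the newline by default.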

Return Value

Success: Returns 1 if the match was successful.

Failure: Returns 0.

Remarks

Note - The following applies only to single matches, not to flag "g" matches.

If a regular expression pattern matches and it contains capture groups, the groups are assigned to local variables. For example:

Group 0 will be $0.

Group 1 will be $1.

Group 2 will be $2.

And so on.

Captured named groups are also returned, as follows:

Named group "Test" will be $_rg["Test"]

Named group "Moo" will be $_rg["Moo"]

Named group "Cat" will be $_rg["Cat"]

And so on.

After each regular expression match, all capture groups from the previous match are deleted, so it's best to copy them if you intend to keep using them.

Regular expressions

Regular expression notation is a compact way of specifying a pattern for strings that can be searched. Regular expressions are character strings in which plain text characters indicate what text should exist in the target string, and some characters are given special meanings to indicate what variability is allowed in the target string. Regular expressions are normally case-sensitive.

Regular expressions are constructed of one or more of the following simple regular expression specifiers. If the character is not in the following table, then it will match only itself.

Repeating characters (*, +, ?, {...} ) will try to match the largest set possible, which allows the following characters to match as well, unless followed immediately by a question mark; then it will find the smallest pattern that allows the following characters to match as well.
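The difference between the largest and the smallest match can be seen in a short sketch. It is written in Python's re module as a stand-in (an assumption; Sputnik's own engine is not exercised here), and the sample string is illustrative only.

```python
import re

text = "<test>a</test> <test>b</test>"

# Greedy: .* grabs as much as possible, then backs off only far enough
# for the final </test> to match.
greedy = re.search(r"<test>(.*)</test>", text).group(1)

# Lazy: .*? grabs as little as possible, stopping at the first </test>.
lazy = re.search(r"<test>(.*?)</test>", text).group(1)

print(greedy)  # a</test> <test>b
print(lazy)    # a
``` <test>b"
assert lazy == "a"
</test>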

Nested groups are allowed, but keep in mind that all the groups, except non-capturing groups, assign to the returned array, with the outer groups assigning after the inner groups.

Character escapes

The backslash character (\) in a regular expression indicates that the character that follows it either is a special character (as shown in the following table), or should be interpreted literally. For more information, see Character Escapes in Regular Expressions.

Escaped character   Description                                Pattern          Matches
\a                  Matches a bell character, \u0007.          \a               "\u0007" in "Error!" + '\u0007'
\b                  In a character class, matches a
                    backspace, \u0008.                         [\b]{3,}         "\b\b\b\b" in "\b\b\b\b"
\t                  Matches a tab, \u0009.                     (\w+)\t          "item1\t", "item2\t" in "item1\titem2\t"
\r                  Matches a carriage return, \u000D.
                    (\r is not equivalent to the
                    newline character, \n.)                    \r\n(\w+)        "\r\nThese" in "\r\nThese are\ntwo lines."
\v                  Matches a vertical tab, \u000B.            [\v]{2,}         "\v\v\v" in "\v\v\v"
\f                  Matches a form feed, \u000C.               [\f]{2,}         "\f\f\f" in "\f\f\f"
\n                  Matches a new line, \u000A.                \r\n(\w+)        "\r\nThese" in "\r\nThese are\ntwo lines."
\e                  Matches an escape, \u001B.                 \e               "\x001B" in "\x001B"
\nnn                Uses octal representation to specify
                    a character (nnn consists of two or
                    three digits).                             \w\040\w         "a b", "c d" in "a bc d"
\xnn                Uses hexadecimal representation to
                    specify a character (nn consists of
                    exactly two digits).                       \w\x20\w         "a b", "c d" in "a bc d"
\cX, \cx            Matches the ASCII control character
                    that is specified by X or x, where
                    X or x is the letter of the control
                    character.                                 \cC              "\x0003" in "\x0003" (Ctrl-C)
\unnnn              Matches a Unicode character by
                    using hexadecimal representation
                    (exactly four digits, as represented
                    by nnnn).                                  \w\u0020\w       "a b", "c d" in "a bc d"
\                   When followed by a character that is
                    not recognized as an escaped character
                    in this and other tables in this topic,
                    matches that character. For example, \*
                    is the same as \x2A, and \. is the same
                    as \x2E. This allows the regular
                    expression engine to disambiguate language
                    elements (such as * or ?) and character
                    literals (represented by \* or \?).        \d+[\+-x\*]\d+   "2+2" and "3*9" in "(2+2) * 3*9"

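A few rows of the table can be checked with a short sketch. Python's re module is used as a stand-in here (an assumption; it happens to accept the same octal, hexadecimal and \u escapes, but Sputnik's engine is not being tested), and the inputs are the ones from the "Matches" column.

```python
import re

text = "a bc d"

# Three spellings of the same space character (code 32).
by_octal = re.findall(r"\w\040\w", text)
by_hex = re.findall(r"\w\x20\w", text)
by_unicode = re.findall(r"\w\u0020\w", text)

# \+ and \* inside the set are literal characters, not quantifiers.
arithmetic = re.findall(r"\d+[\+-x\*]\d+", "(2+2) * 3*9")

print(by_octal)    # ['a b', 'c d']
print(by_hex)      # ['a b', 'c d']
print(by_unicode)  # ['a b', 'c d']
print(arithmetic)  # ['2+2', '3*9']
```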
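Repeating Characters

{x}     Repeat the previous character, set or group exactly x times.
{x,}    Repeat the previous character, set or group at least x times.
{0,x}   Repeat the previous character, set or group at most x times.
{x,y}   Repeat the previous character, set or group between x and y times, inclusive.
*       Repeat the previous character, set or group 0 or more times. Equivalent to {0,}.
+       Repeat the previous character, set or group 1 or more times. Equivalent to {1,}.
?       The previous character, set or group may or may not appear. Equivalent to {0,1}.
?       (After a repeating character) Find the smallest match instead of the largest.

Character classes
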
alnum	letters and digits
alpha	letters
ascii	character codes 0 - 127
blank	space or tab only
cntrl	control characters
digit	decimal digits (same as \d)
graph	printing characters, excluding space
lower	lower case letters
print	printing characters, including space
punct	printing characters, excluding letters and digits
space	white space (not quite the same as \s)
upper	upper case letters
word	"word" characters (same as \w)
xdigit	hexadecimal digits
// You use these like
// say ("CAT" =~ m/\p{alnum}/);
// \p{alnum}

You can also use the Macros for even more character classes.

Additional classes are:

IsBasicLatin ---> Unicode Range:0000 - 007F
IsLatin-1Supplement ---> Unicode Range:0080 - 00FF
IsLatinExtended-A ---> Unicode Range:0100 - 017F
IsLatinExtended-B ---> Unicode Range:0180 - 024F
IsIPAExtensions ---> Unicode Range:0250 - 02AF
IsSpacingModifierLetters ---> Unicode Range:02B0 - 02FF
IsCombiningDiacriticalMarks ---> Unicode Range:0300 - 036F
IsGreek ---> Unicode Range:0370 - 03FF
IsCyrillic ---> Unicode Range:0400 - 04FF
IsCyrillicSupplement ---> Unicode Range:0500 - 052F
IsArmenian ---> Unicode Range:0530 - 058F
IsHebrew ---> Unicode Range:0590 - 05FF
IsArabic ---> Unicode Range:0600 - 06FF
IsSyriac ---> Unicode Range:0700 - 074F
IsThaana ---> Unicode Range:0780 - 07BF
IsDevanagari ---> Unicode Range:0900 - 097F
IsBengali ---> Unicode Range:0980 - 09FF
IsGurmukhi ---> Unicode Range:0A00 - 0A7F
IsGujarati ---> Unicode Range:0A80 - 0AFF
IsOriya ---> Unicode Range:0B00 - 0B7F
IsTamil ---> Unicode Range:0B80 - 0BFF
IsTelugu ---> Unicode Range:0C00 - 0C7F
IsKannada ---> Unicode Range:0C80 - 0CFF
IsMalayalam ---> Unicode Range:0D00 - 0D7F
IsSinhala ---> Unicode Range:0D80 - 0DFF
IsThai ---> Unicode Range:0E00 - 0E7F
IsLao ---> Unicode Range:0E80 - 0EFF
IsTibetan ---> Unicode Range:0F00 - 0FFF
IsMyanmar ---> Unicode Range:1000 - 109F
IsGeorgian ---> Unicode Range:10A0 - 10FF
IsHangulJamo ---> Unicode Range:1100 - 11FF
IsEthiopic ---> Unicode Range:1200 - 137F
IsCherokee ---> Unicode Range:13A0 - 13FF
IsUnifiedCanadianAboriginalSyllabics ---> Unicode Range:1400 - 167F
IsOgham ---> Unicode Range:1680 - 169F
IsRunic ---> Unicode Range:16A0 - 16FF
IsTagalog ---> Unicode Range:1700 - 171F
IsHanunoo ---> Unicode Range:1720 - 173F
IsBuhid ---> Unicode Range:1740 - 175F
IsTagbanwa ---> Unicode Range:1760 - 177F
IsKhmer ---> Unicode Range:1780 - 17FF
IsMongolian ---> Unicode Range:1800 - 18AF
IsLimbu ---> Unicode Range:1900 - 194F
IsTaiLe ---> Unicode Range:1950 - 197F
IsKhmerSymbols ---> Unicode Range:19E0 - 19FF
IsPhoneticExtensions ---> Unicode Range:1D00 - 1D7F
IsLatinExtendedAdditional ---> Unicode Range:1E00 - 1EFF
IsGreekExtended ---> Unicode Range:1F00 - 1FFF
IsGeneralPunctuation ---> Unicode Range:2000 - 206F
IsSuperscriptsandSubscripts ---> Unicode Range:2070 - 209F
IsCurrencySymbols ---> Unicode Range:20A0 - 20CF
IsCombiningDiacriticalMarksforSymbols ---> Unicode Range:20D0 - 20FF
IsLetterlikeSymbols ---> Unicode Range:2100 - 214F
IsNumberForms ---> Unicode Range:2150 - 218F
IsArrows ---> Unicode Range:2190 - 21FF
IsMathematicalOperators ---> Unicode Range:2200 - 22FF
IsMiscellaneousTechnical ---> Unicode Range:2300 - 23FF
IsControlPictures ---> Unicode Range:2400 - 243F
IsOpticalCharacterRecognition ---> Unicode Range:2440 - 245F
IsEnclosedAlphanumerics ---> Unicode Range:2460 - 24FF
IsBoxDrawing ---> Unicode Range:2500 - 257F
IsBlockElements ---> Unicode Range:2580 - 259F
IsGeometricShapes ---> Unicode Range:25A0 - 25FF
IsMiscellaneousSymbols ---> Unicode Range:2600 - 26FF
IsDingbats ---> Unicode Range:2700 - 27BF
IsMiscellaneousMathematicalSymbols-A ---> Unicode Range:27C0 - 27EF
IsSupplementalArrows-A ---> Unicode Range:27F0 - 27FF
IsBraillePatterns ---> Unicode Range:2800 - 28FF
IsSupplementalArrows-B ---> Unicode Range:2900 - 297F
IsMiscellaneousMathematicalSymbols-B ---> Unicode Range:2980 - 29FF
IsSupplementalMathematicalOperators ---> Unicode Range:2A00 - 2AFF
IsMiscellaneousSymbolsandArrows ---> Unicode Range:2B00 - 2BFF
IsCJKRadicalsSupplement ---> Unicode Range:2E80 - 2EFF
IsKangxiRadicals ---> Unicode Range:2F00 - 2FDF
IsIdeographicDescriptionCharacters ---> Unicode Range:2FF0 - 2FFF
IsCJKSymbolsandPunctuation ---> Unicode Range:3000 - 303F
IsHiragana ---> Unicode Range:3040 - 309F
IsKatakana ---> Unicode Range:30A0 - 30FF
IsBopomofo ---> Unicode Range:3100 - 312F
IsHangulCompatibilityJamo ---> Unicode Range:3130 - 318F
IsKanbun ---> Unicode Range:3190 - 319F
IsBopomofoExtended ---> Unicode Range:31A0 - 31BF
IsKatakanaPhoneticExtensions ---> Unicode Range:31F0 - 31FF
IsEnclosedCJKLettersandMonths ---> Unicode Range:3200 - 32FF
IsCJKCompatibility ---> Unicode Range:3300 - 33FF
IsCJKUnifiedIdeographsExtensionA ---> Unicode Range:3400 - 4DBF
IsYijingHexagramSymbols ---> Unicode Range:4DC0 - 4DFF
IsCJKUnifiedIdeographs ---> Unicode Range:4E00 - 9FFF
IsYiSyllables ---> Unicode Range:A000 - A48F
IsYiRadicals ---> Unicode Range:A490 - A4CF
IsHangulSyllables ---> Unicode Range:AC00 - D7AF
IsHighSurrogates ---> Unicode Range:D800 - DB7F
IsHighPrivateUseSurrogates ---> Unicode Range:DB80 - DBFF
IsLowSurrogates ---> Unicode Range:DC00 - DFFF
IsPrivateUse or IsPrivateUseArea ---> Unicode Range:E000 - F8FF
IsCJKCompatibilityIdeographs ---> Unicode Range:F900 - FAFF
IsAlphabeticPresentationForms ---> Unicode Range:FB00 - FB4F
IsArabicPresentationForms-A ---> Unicode Range:FB50 - FDFF
IsVariationSelectors ---> Unicode Range:FE00 - FE0F
IsCombiningHalfMarks ---> Unicode Range:FE20 - FE2F
IsCJKCompatibilityForms ---> Unicode Range:FE30 - FE4F
IsSmallFormVariants ---> Unicode Range:FE50 - FE6F
IsArabicPresentationForms-B ---> Unicode Range:FE70 - FEFF
IsHalfwidthandFullwidthForms ---> Unicode Range:FF00 - FFEF
IsSpecials ---> Unicode Range:FFF0 - FFFF
// As usual you use these like
// \p{IsBasicLatin}
// $var =~ m/\p{IsBasicLatin}/;
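
Where \p{Is...} names are not available, a block can be approximated with an explicit range taken from the list above. The following is a minimal sketch in Python, whose standard re module has no \p{...} support; the BLOCKS table copies three ranges from the list, and block_class is a hypothetical helper introduced only for this illustration.

```python
import re

# Ranges copied from the list above; only a few blocks are included.
BLOCKS = {
    "IsBasicLatin": (0x0000, 0x007F),
    "IsGreek": (0x0370, 0x03FF),
    "IsCyrillic": (0x0400, 0x04FF),
}


def block_class(name: str) -> str:
    """Build a character class covering the named block (hypothetical helper)."""
    low, high = BLOCKS[name]
    return "[\\u{:04X}-\\u{:04X}]".format(low, high)


# One or more consecutive characters from the Greek block.
greek = re.compile(block_class("IsGreek") + "+")
found = greek.findall("alpha αβγ beta δ")

print(block_class("IsGreek"))
print(found)
```

This only covers the listed range; it does not reproduce whatever else the engine's own \p{IsGreek} class may do.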

Example

Check if a string matches a given pattern:

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/\w+,\s+\w+!/ )
{
	println("True");
}
else
{
	println("False");
}

Check if a string matches a given pattern, ignoring case:

// Set a string to parse
$str = "Hello, World!";
 
if( $str =~ m/hello,\s+WORLD!/i )
{
	println("True");
}
else
{
	println("False");
}

Simple match on a string, returning 2 captured groups:

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
println("Account '$1' Credits '$2'");
// Prints
// Account 'Test' Credits '777'

Simple match on a string, returning 2 captured groups and saving them to variables:

// Set a string to parse
$str = 'Account Test Credits 777';
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/i;
 
$Account = $1;
$Credits = $2;
 
println("Account '$Account' Credits '$Credits'");
// Prints
// Account 'Test' Credits '777'

Same thing, but this time parsing multiple lines of accounts:

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
// Do the regex match
$str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig;
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$Account = $_rg[$i][1];
	$Credits = $_rg[$i][2];
	println("Match ($i) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}
// Prints
// Match (0) |  Account 'Test' | Credits '777'
// Match (1) |  Account 'FoX' | Credits '1337'
// Match (2) |  Account 'Cat' | Credits '100'
// Match (3) |  Account 'Dog' | Credits '50'

Case-insensitive match on a string to capture all possible matches and return them as a multi-dimensional array:

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/ig;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	$match = $_rg[$i];
	println("Match ($i) |  Text '" . $match[0] . "' | Group text '" . $match[1] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Text '<test>a</test>' | Group text 'a'
// Match (1) |  Text '<test>b</test>' | Group text 'b'
// Match (2) |  Text '<test>c</Test>' | Group text 'c'

Case-insensitive match on a string to capture all possible matches and return them as a single-dimension array:

// Set a string to parse
$str = '<test>a</test> <test>b</test> <test>c</Test>';
 
// Do the regex match
$str =~ m/<(?i)test>(.*?)<\/(?i)test>/igd;
 
// How many groups did we find?
println("Found groups: " . @Groups);
 
// Print them all
for($i = 0; $i < @Groups; $i++)
{
	println("Match ($i) |  Group text '" . $_rg[$i] . "'" );
}
// Prints
// Found groups: 3
// Match (0) |  Group text 'a'
// Match (1) |  Group text 'b'
// Match (2) |  Group text 'c'

Example of using the /x flag

my $a = "xor eax, edx";
$a =~ m/
		(\w+) # You can add comments
		\s*
		(\w+)
		\s*	# Yup comments all over
		,
		\s*
		(\w+)
		/x;
print( "'$1' -> '$2' -> '$3'" );

Example of using a While loop (While loops with regexp won't work properly without the /g flag)

// Set a string to parse
$str = 'Account Test Credits 777' . @CRLF;
$str .= 'Account FoX Credits 1337' . @CRLF;
$str .= 'Account Cat Credits 100' . @CRLF;
$str .= 'Account Dog Credits 50' . @CRLF;
 
while( $str =~ m/Account\s+(\w+)\s+\w+\s+(\d+)/ig )
{
	$Account = $_rg[$_][1];
	$Credits = $_rg[$_][2];
	println("Match ($_) |  Account '" . $Account . "' | Credits '" . $Credits . "'" );
}

Example of named capture groups

$str = "xor eax, edx";
 
if( $str =~ m/xor\s*(?<first>\w*),\s*(?<second>\w*)/ )
{
	println("True: " . $_rg["first"] . " | " . $_rg["second"]);
}
else
{
	println("False");
}