Unofficial LSL Reference

[[articles:regex]]


Unofficial LSL reference

User Tools

Login

You are currently not logged in! Enter your authentication credentials below to log in. You need to have cookies enabled to log in.

Login

Forgotten your password? Get a new one: Set new password

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

articles:regex [2020-09-13 11:05 SLT]
sei Fix ) behaviour, minor formatting
articles:regex [2023-04-08 15:31 SLT] (current)
sei Wizardry and Steamworks pages now deleted due to a DMCA
Line 11: Line 11:
   * Non-greedy matching operators ''??'',​ ''​*?'',​ ''​+?''​ are not supported.   * Non-greedy matching operators ''??'',​ ''​*?'',​ ''​+?''​ are not supported.
   * Special groups starting with ''​(?''​ or ''​(*''​ are not supported.   * Special groups starting with ''​(?''​ or ''​(*''​ are not supported.
-  * No flags supported. Matches are always case-sensitive. 
   * There'​s no error checking; errors in the expression may have bad consequences.   * There'​s no error checking; errors in the expression may have bad consequences.
 +  * Due to limitations in ''​llToUpper''/''​llToLower''​ in LSL, case insensitive matching is limited to the Basic Multilingual Plane, i.e. characters with a Unicode codepoint greater than 65535 will be matched case-sensitively regardless of the case insensitive setting.
  
 So what //is// supported? So what //is// supported?
Line 21: Line 21:
   * ''​.'' ​ ->  matches any character including newline ''​\n''​   * ''​.'' ​ ->  matches any character including newline ''​\n''​
   * ''​`\n'' ​ ->  matches a newline   * ''​`\n'' ​ ->  matches a newline
-  * ''​`\<​symbol>'' ​ ->  matches the symbol as a character; for example, ''​`\`^''​ matches the character ''​`^''​ and ''​`\`\''​ matches the character ''​`\''​.+  * ''​`\<​symbol>'' ​ ->  matches the symbol as a character; for example, ''​`\`^''​ matches the character ''​`^''​ and ''​`\`\''​ matches the character ''​`\'' ​(remember that inside LSL strings, the ''​`\''​ character itself needs to be escaped in turn).
   * ''​expr1''''​expr2'' ​ ->  matches the first expression followed by the second   * ''​expr1''''​expr2'' ​ ->  matches the first expression followed by the second
   * ''​expr1`|expr2''​ -> matches when ''​expr1''​ matches or ''​expr2''​ matches (or both). Note that ''​`|''​ has the lowest precedence.   * ''​expr1`|expr2''​ -> matches when ''​expr1''​ matches or ''​expr2''​ matches (or both). Note that ''​`|''​ has the lowest precedence.
Line 28: Line 28:
   * ''​expr*'' ​ ->  matches zero or more expressions   * ''​expr*'' ​ ->  matches zero or more expressions
   * ''​expr?'' ​ ->  matches zero or one expressions   * ''​expr?'' ​ ->  matches zero or one expressions
-  * ''​`^'' ​ ->  matches at the beginning of the string +  * ''​`^'' ​ ->  matches ​only at the beginning of the string ​(zero-length match) 
-  * ''​`$'' ​ ->  matches at the end of the string+  * ''​`$'' ​ ->  matches ​only at the end of the string ​(zero-length match)
   * ''​[''​...''​]'' ​ ->  matches a set of characters   * ''​[''​...''​]'' ​ ->  matches a set of characters
 +  * Case insensitive flag.
 +  * In POSIX EREs, empty subexpressions are undefined. In this engine they are valid and are equivalent to an empty RE, so for example ''​(c|)''​ is equivalent to ''​c?''​ and ''​a()b''​ is equivalent to ''​ab''​.
  
 == Set syntax == == Set syntax ==
Line 41: Line 43:
   * ''​[-a-d]'' ​ ->  same as the previous one   * ''​[-a-d]'' ​ ->  same as the previous one
   * ''​[^]a-d-]'' ​ ->  matches anything except ''​]'',​ ''​a'',​ ''​b'',​ ''​c'',​ ''​d'',​ ''​-''​   * ''​[^]a-d-]'' ​ ->  matches anything except ''​]'',​ ''​a'',​ ''​b'',​ ''​c'',​ ''​d'',​ ''​-''​
-  * ''​[`\]'' ​ ->  the character ''​`\''​. ​All symbols ​lose their special meaning ​within the brackets, except:+  * ''​[`\]'' ​ ->  the character ''​`\''​. ​Within the brackets, no symbols ​have any special meaning, except:
     * ''​]''​ at a position other than the first or immediately after a leading ''​^''​     * ''​]''​ at a position other than the first or immediately after a leading ''​^''​
     * ''​-''​ at a position other than the last, first or immediately after a leading ''​^''​     * ''​-''​ at a position other than the last, first or immediately after a leading ''​^''​
-    * ''​^''​ at a position other than the first.+    * ''​^''​ at the first position.
  
 === Usage === === Usage ===
Line 51: Line 53:
  
 If you intend to use this code for something like validating strings against pre-defined patterns, it may be best to compile the expression to a list in advance, then use this list and only the matching code in the final script. To that end, we've divided the script into the common section, the compiling section, the matching section and the test code. To use only matching, copy the common section and the matching section to your program. If you intend to use this code for something like validating strings against pre-defined patterns, it may be best to compile the expression to a list in advance, then use this list and only the matching code in the final script. To that end, we've divided the script into the common section, the compiling section, the matching section and the test code. To use only matching, copy the common section and the matching section to your program.
 +
 +Case sensitivity is decided when matching, not when compiling. If you want a case insensitive match, set the global variable ''​I''​ to ''​TRUE''​ before calling ''​match()'',​ or to ''​FALSE''​ for case sensitive. It starts with ''​FALSE''​.
  
 == Common section == == Common section ==
Line 91: Line 95:
 //      llOwnerSay("​does not match!"​);​ //      llOwnerSay("​does not match!"​);​
  
-integer p; // parse position 
-list mono=[p]; // If you get an error here, you're not compiling in Mono mode. 
-               // This program requires Mono. It won't work otherwise. 
-</​code>​ 
- 
-== Compilation section == 
- 
-<code lsl2> 
 // Limitations:​ // Limitations:​
 // - Only matches / does not match. No captures. // - Only matches / does not match. No captures.
Line 126: Line 122:
 //     - Any predefined range or special like \S, \s, \b, \d, \a, etc. //     - Any predefined range or special like \S, \s, \b, \d, \a, etc.
 //     - {[n][,[m]]} for "​between n and m times" //     - {[n][,[m]]} for "​between n and m times"
-//     - POSIX collations ​[= ... =], [: ... :], [. ... .] +//     - POSIX collation related stuff [= ... =], [: ... :], [. ... .] 
-//     - Non-greedy matching operator, e.g. a*?+//     - Non-greedy matching operator, e.g. c*?
 //     - Special groups that begin with (? or (* //     - Special groups that begin with (? or (*
 // - No error checking. For example an unmatched [ may have bad consequences. // - No error checking. For example an unmatched [ may have bad consequences.
  
 +integer p; // parse position
 +list mono=[p]; // If you get an error here, you're not compiling in Mono mode.
 +               // This program requires Mono. It won't work otherwise.
 +</​code>​
 +
 +== Compilation section ==
 +
 +<code lsl2>
 // Compile a regular expression into an NFA // Compile a regular expression into an NFA
 // //
 // The compiled format is a strided list of stride 2. // The compiled format is a strided list of stride 2.
-// - If the 1st element is a string, it's a match range, positive ​or negative.+// - If the first element is a string, it'​s ​either ​set or a dot "​."​. 
 +//   ​Positive sets are encoded with a "​["​ character followed by the set; 
 +//   negative ​sets are encoded with a "​^"​ character followed by the set. 
 +//   ​1-character matches are encoded as a set of 1 character, i.e. "​["​ + c.
 //   - The second element is an integer: the offset of the next state relative //   - The second element is an integer: the offset of the next state relative
 //     to the first element of the stride. //     to the first element of the stride.
Line 142: Line 149:
 //     are offsets for the next state in that case. //     are offsets for the next state in that case.
 //   - If the second element is a string, it's either "​^"​ for a left anchor, //   - If the second element is a string, it's either "​^"​ for a left anchor,
-//     "​$"​ for a right anchor, or ""​ for a node that always matches. The first+//     "​$"​ for a right anchor, or ""​ for a state with just 1 edge. The first
 //     ​element is the state to go to (as an offset) when the current position //     ​element is the state to go to (as an offset) when the current position
-//     is the expected one or when the second element is ""​.+//     matches ​the anchor ​or when the second element is ""​.
  
 string rx; // regular expression being parsed string rx; // regular expression being parsed
Line 256: Line 263:
 list compile(string s) list compile(string s)
 { {
 +    // Initialize globals
     p = 0;     p = 0;
-    // Our match engine only performs anchored matches; modify the regex to allow +    ​rx = s; 
-    // anything before and after it, as usual. Anchoring will sill work. + 
-    ​rx = ".*(" + + ").*"; +    ​// Our match engine only performs anchored matches; modify the regex to 
-    return _compile();+    // allow anything before and after it, as is customary. Anchoring will 
 +    // still work. This is equivalent to surrounding the expression with 
 +    ​// .*(regex).* except that an unmatched closing parenthesis inside the 
 +    // regex will not close the outer paren. 
 +    //return [2,4,".",​-2] ​_compile() ​[2,4,"​."​,-2]; 
 +    // optimized version: 
 +    return ​(list)2 + 4 + "​."​ + -2 + (_compile() + ((list)2 + 4 + "​."​ + -2));
 } }
 </​code>​ </​code>​
Line 270: Line 284:
 <code lsl2> <code lsl2>
 // Match section - Run the NFA one input character at a time // Match section - Run the NFA one input character at a time
- 
-// Tell whether a character is between two others, Unicode-wise 
-integer _between(string s, string a, string b) 
-{ 
-    // Works for same-length arguments 
-    return a + s + b == (string)llListSort((list)a + s + b, 1, TRUE); 
-/* 
-    // Works for arguments of any length 
-    list aux = llListSort((list)a + s + b, 1, TRUE); 
-    return a == llList2String(aux,​ 0) & b == llList2String(aux,​ 2); 
-*/ 
-} 
  
 integer L; // length of input string, used to evaluate anchors integer L; // length of input string, used to evaluate anchors
 +integer I; // case insensitivity flag
  
-// Follow the edges that are not associated to an input character+// Follow the edges that are not associated to an input character.
 // Returns the input list plus the first state which *is* associated // Returns the input list plus the first state which *is* associated
 // to a character, if it wasn't in the list already. // to a character, if it wasn't in the list already.
Line 316: Line 319:
     integer acceptState = llGetListLength(rc);​     integer acceptState = llGetListLength(rc);​
  
-    ​@again; +    ​do
- +
-    list new = []; +
-    string c = llGetSubString(s,​ p, p); +
-    ++p; +
-    integer job = llGetListLength(jobs);​ +
-    while (job)+
     {     {
-        ​st llList2Integer(jobs,​ --job)+        ​list new []
-        ​if (st == acceptState) +        ​integer job llGetListLength(jobs);
-            jump continue; // the current character can't advance from accept; +
-                           // this happens when there are extra chars after +
-                           // the RE +
-        m = llList2String(rc, st);+
  
-        ​if (!llSubStringIndex(m"​["​| !llSubStringIndex(m, "​^"​))+        ​string c = llGetSubString(sp, p)
 +        string C = c; 
 +        if (I)
         {         {
-            // starts with [ or ^, it's a range or single character +            // Invert case of C to test both 
-            ​integer mlen llStringLength(m); +            ​llToUpper(c); 
-            ​integer match; +            ​if (C == c) C = llToLower(c)
-            integer inv !llSubStringIndex(m, "​^"​);+        } 
 +        string s1; 
 +        string s2; 
 +        ++p;
  
-            if (mlen == 2)+        while (job) 
 +        { 
 +            st llList2Integer(jobs,​ --job); 
 +            // check if the state is not an accept state, 
 +            // else the current character can't advance from accept; 
 +            // this happens when there are extra chars after the RE 
 +            if (st !acceptState)
             {             {
-                ​// Fast-track single-character matches/​non-matches +                ​m = llList2String(rc,​ st); 
-                ​match = "​["​ + c == m "​^" ​+ c == m; +                ​integer inv !llSubStringIndex(m"​^"​);​
-            } +
-            else if (mlen > 2) +
-            { +
-                // Multi-character matches are more difficult +
-                integer pm = 1; +
-                @loop;+
  
-                if (mlen > pm+2 & "​-"​ == llGetSubString(m, pm+1, pm+1))+                if (inv | !llSubStringIndex(m, "​["​))
                 {                 {
-                    // Range +                    // starts with [ or ^, it's a range or single character 
-                    match = match | _between(cllGetSubString(m,​ pm, pm), llGetSubString(m,​ pm+2, pm+2)); +                    ​integer mlen = llStringLength(m);​ 
-                    pm += 3;+                    integer ​match
 + 
 +                    if (mlen == 2) 
 +                    { 
 +                        // Fast-track single-character matches/​non-matches 
 +                        ​match = "​["​ + c == m "​^"​ + == m | "​["​ + C == m | "​^"​ + C == m; 
 +                    } 
 +                    else if (mlen > 2) 
 +                    { 
 +                        // Multi-character matches are more difficult 
 +                        integer pm = 1; 
 +                        do 
 +                        { 
 +                            if (mlen > pm+2 & "​-"​ == llGetSubString(m,​ pm+1, pm+1)
 +                            { 
 +                                // Range 
 +                                s1 = llGetSubString(mpm, pm); 
 +                                s2 = llGetSubString(m,​ pm+2, pm+2)
 +                                match = match | s1 + c + s2 == (string)llListSort((list)s1 + c + s2, 1, TRUE); 
 +                                if (I) 
 +                                    match = match | s1 + C + s2 == (string)llListSort((list)s1 + C + s2, 1, TRUE); 
 +                                ​pm += 3; 
 +                            } 
 +                            else 
 +                            { 
 +                                s1 = llGetSubString(m,​ pm, pm); 
 +                                match = match | c == s1 | C == s1; 
 +                                ++pm; 
 +                            } 
 +                        } 
 +                        while (pm < mlen); 
 +                    } 
 +                    if (inv ^ match) 
 +                    { 
 +                        // Match found; advance this job 
 +                        new = _follow(st + llList2Integer(rc,​ st+1), rc, new); 
 +                    }
                 }                 }
-                else+                else if (m == "​."​)
                 {                 {
-                    ​match = match | c == llGetSubString(mpmpm)+                    ​// always matches 
-                    ++pm;+                    new _follow(st + llList2Integer(rcst+1)rc, new);
                 }                 }
- 
-                if (pm < mlen) 
-                    jump loop; 
-            } 
-            if (inv ^ match) 
-            { 
-                // Match found; advance this job 
-                new = _follow(st + llList2Integer(rc,​ st+1), rc, new); 
             }             }
         }         }
-        ​else if (m == "​."​) +        ​jobs = new;
-        { +
-            // always matches +
-            new = _follow(st + llList2Integer(rc,​ st+1), rc, new); +
-        }+
  
-        @continue; 
     }     }
-    ​jobs = new; +    ​while (jobs != [] & -(p != L));
- +
-    if (jobs != [] & -(p != L)) +
-        jump again;+
  
     return ~llListFindList(jobs,​ (list)acceptState);​     return ~llListFindList(jobs,​ (list)acceptState);​
Line 387: Line 405:
 </​code>​ </​code>​
  
-== Testing section (with list dumping tool) ==+== Testing ​suite section (with list dumping tool) ==
  
 This section also includes a function to dump a list to LSL format, allowing you to copy paste it to another script. This section also includes a function to dump a list to LSL format, allowing you to copy paste it to another script.
Line 454: Line 472:
 { {
     integer i;     integer i;
-    llOwnerSay("​pattern:​ " + pattern);+    ​if (I) 
 +        llOwnerSay("​pattern (case insensitive):​ " + pattern); 
 +    else 
 +        ​llOwnerSay("​pattern:​ " + pattern);
     list rc = compile(pattern);​     list rc = compile(pattern);​
     llOwnerSay("​compiled pattern:"​);​     llOwnerSay("​compiled pattern:"​);​
Line 472: Line 493:
     state_entry()     state_entry()
     {     {
 +        I = TRUE;
 +        test("​^aËc$",​ ["​Aëc",​ "​aËcc",​ "​aËcd",​ "​aaËc",​ "​aË"​]);​
 +        I = FALSE;
         test("​^abc$",​ ["​abc",​ "​abcc",​ "​abcd",​ "​aabc",​ "​ab"​]);​         test("​^abc$",​ ["​abc",​ "​abcc",​ "​abcd",​ "​aabc",​ "​ab"​]);​
         test("​^a*b$",​ ["​abc",​ "​bb",​ "​b",​ "​aab",​ "​ba"​]);​         test("​^a*b$",​ ["​abc",​ "​bb",​ "​b",​ "​aab",​ "​ba"​]);​
-        test("​^(ab)+c$",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "​ab"​]);​ +        test("​^(ab)+c$",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "ab", "​c"​]);​ 
-        test("​^(a|b)+c$",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "​ab"​]);​ +        test("​^(ab)*c$",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "​ab",​ "c"]); 
-        test("​abc",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "​ababcde"​]);​+        test("​^(a|b)+c$",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "ab", "c"]); 
 +        test("​abc",​ ["​abc",​ "​ababc",​ "​abac",​ "​bac",​ "​ababcde", "​ddabcdd"]);
  
         // Validate strings like "​19xx-1-1"​ or "​20xx-01-01",​ no leap years         // Validate strings like "​19xx-1-1"​ or "​20xx-01-01",​ no leap years
Line 493: Line 518:
         // Regex for '​string that does not contain the word "​string"'​         // Regex for '​string that does not contain the word "​string"'​
         test("​^([^s]|s(s|t(s|r(s|i(s|ns))))*([^st]|t([^rs]|r([^is]|i([^ns]|n[^gs])))))*(s(s|t(s|r(s|i(s|ns))))*(t(r?​|rin?​))?​)?​$",​ [         test("​^([^s]|s(s|t(s|r(s|i(s|ns))))*([^st]|t([^rs]|r([^is]|i([^ns]|n[^gs])))))*(s(s|t(s|r(s|i(s|ns))))*(t(r?​|rin?​))?​)?​$",​ [
-          "no strings allowed",​ "stringstringstring", "​it'​s stringalicious",​ "this does not contain the word s-t-r-i-n-g even if it ends in strin"+          "no strings allowed",​ "strinstringstrin",​ "​strinstrinstrin", "​it'​s stringalicious",​ "this does not contain the word s-t-r-i-n-g even if it ends in strin"
         ]);         ]);
     }     }
Line 501: Line 526:
 ==== Motivation ==== ==== Motivation ====
  
-There'​s ​a page in the Second Life wiki that introduces ​a so-called "[[http://​wiki.secondlife.com/​wiki/​LSL_Regex_Engine|regex engine]]"to disprove the argument that a "​regular expression engine in LSL will not work well". However, the argument ​is basically framed as a [[https://​en.wikipedia.org/​wiki/​Straw_man|straw-man]]:​ it builds ​an engine that is so limited that it isn't even capable of parsing most regular languages, and therefore ​can't really be characterized as a regular expression engine in the theoretical sense.+There was a page in the Second Life wiki titled "LSL Regex Engine",​ which is now gone due to a DMCA applying to all of the Wizardry and Steamworks pages, ​that introduced ​a so-called "regex engine"​ to disprove the argument that a "​regular expression engine in LSL will not work well". However, the argument ​was basically framed as a [[https://​en.wikipedia.org/​wiki/​Straw_man|straw-man]]:​ it built an engine that was so limited that it wasn't even capable of parsing most regular languages, and therefore ​couldn't really be characterized as a regular expression engine in the theoretical sense.
  
-So I (Sei Lisa) set up to build an engine that is able to parse any regular language (limited in practice by LSL memory of course), using a large enough subset of the syntax of [[https://​pubs.opengroup.org/​onlinepubs/​7908799/​xbd/​re.html#​tag_007_004|POSIX EREs]] as to guarantee that all regular languages are covered. The result is not too practical, but it also isn't as bad as I though. It is able to understand somewhat complex regular expressions ​(see tests), though the cost of parsing (compiling) is prohibitive with complex expressions for run-time use; however, the cost of matching is not extremely bad, and it might even find some practical applications. My advice is not to use it with very long input text (say 1K or more), because matching is not too fast. Matching speed also depends on the complexity of the (compiled version of the) regular expression itself, so be careful with that if using it in any serious application. Run some benchmarks with your regex and be careful to limit the length of the input text if necessary.+So I (Sei Lisa) set up to build an engine that is able to parse any regular language (limited in practice by LSL memory of course), using a large enough subset of the syntax of [[https://​pubs.opengroup.org/​onlinepubs/​7908799/​xbd/​re.html#​tag_007_004|POSIX EREs]] as to guarantee that all regular languages are covered. The result is not too practical, but it also isn't as bad as I thought. It is able to understand somewhat complex regular expressions,​ though the cost of parsing (compiling) is prohibitive with complex expressions for run-time use; however, the cost of matching is not extremely bad, and it might even find some practical applications. My advice is not to use it with long input textsay 40 chars or more, because matching is not too fast. Matching speed also depends on the complexity of the (compiled version of the) regular expression itself, so be careful with that if using it in any serious application. Run some benchmarks with your regex and be careful to limit the length of the input text if necessary.