Unofficial LSL Reference

[[articles:regex]]


Unofficial LSL reference

User Tools

Login

You are currently not logged in! Enter your authentication credentials below to log in. You need to have cookies enabled to log in.

Login

Forgotten your password? Get a new one: Set new password

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

articles:regex [2020-10-11 13:12 SLT]
sei Explain better what an empty subexpression does
articles:regex [2023-04-08 15:31 SLT] (current)
sei Wizardry and Steamworks pages now deleted due to a DMCA
Line 21: Line 21:
   * ''​.'' ​ ->  matches any character including newline ''​\n''​   * ''​.'' ​ ->  matches any character including newline ''​\n''​
   * ''​`\n'' ​ ->  matches a newline   * ''​`\n'' ​ ->  matches a newline
-  * ''​`\<​symbol>'' ​ ->  matches the symbol as a character; for example, ''​`\`^''​ matches the character ''​`^''​ and ''​`\`\''​ matches the character ''​`\''​ (remember that inside LSL strings, the ''​`\''​ character itself needs to be escaped ​too).+  * ''​`\<​symbol>'' ​ ->  matches the symbol as a character; for example, ''​`\`^''​ matches the character ''​`^''​ and ''​`\`\''​ matches the character ''​`\''​ (remember that inside LSL strings, the ''​`\''​ character itself needs to be escaped ​in turn).
   * ''​expr1''''​expr2'' ​ ->  matches the first expression followed by the second   * ''​expr1''''​expr2'' ​ ->  matches the first expression followed by the second
   * ''​expr1`|expr2''​ -> matches when ''​expr1''​ matches or ''​expr2''​ matches (or both). Note that ''​`|''​ has the lowest precedence.   * ''​expr1`|expr2''​ -> matches when ''​expr1''​ matches or ''​expr2''​ matches (or both). Note that ''​`|''​ has the lowest precedence.
Line 28: Line 28:
   * ''​expr*'' ​ ->  matches zero or more expressions   * ''​expr*'' ​ ->  matches zero or more expressions
   * ''​expr?'' ​ ->  matches zero or one expressions   * ''​expr?'' ​ ->  matches zero or one expressions
-  * ''​`^'' ​ ->  matches at the beginning of the string +  * ''​`^'' ​ ->  matches ​only at the beginning of the string ​(zero-length match) 
-  * ''​`$'' ​ ->  matches at the end of the string+  * ''​`$'' ​ ->  matches ​only at the end of the string ​(zero-length match)
   * ''​[''​...''​]'' ​ ->  matches a set of characters   * ''​[''​...''​]'' ​ ->  matches a set of characters
   * Case insensitive flag.   * Case insensitive flag.
-  * In POSIX EREs, empty subexpressions are undefined. In this engine they are valid and are equivalent to an empty RE, so for example ''​(c|)''​ is equivalent to ''​c?''​.+  * In POSIX EREs, empty subexpressions are undefined. In this engine they are valid and are equivalent to an empty RE, so for example ''​(c|)''​ is equivalent to ''​c?​''​ and ''​a()b''​ is equivalent to ''​ab''​.
  
 == Set syntax == == Set syntax ==
Line 339: Line 339:
         {         {
             st = llList2Integer(jobs,​ --job);             st = llList2Integer(jobs,​ --job);
-            if (st == acceptState) +            ​// check if the state is not an accept state, 
-                jump continue; ​// the current character can't advance from accept; +            // else the current character can't advance from accept; 
-                               ​// this happens when there are extra chars after +            // this happens when there are extra chars after the RE 
-                               // ​the RE +            ​if (st !acceptState)
-            ​m = llList2String(rc, st); +
-            integer inv !llSubStringIndex(m,​ "​^"​);​ +
- +
-            if (inv | !llSubStringIndex(m,​ "​["​))+
             {             {
-                ​// starts with [ or ^it's a range or single character +                ​m = llList2String(rcst); 
-                integer ​mlen llStringLength(m)+                integer ​inv !llSubStringIndex(m, "​^"​);
-                integer match;+
  
-                if (mlen == 2)+                if (inv | !llSubStringIndex(m,​ "​["​))
                 {                 {
-                    // Fast-track ​single-character ​matches/​non-matches +                    // starts with [ or ^, it's a range or single character 
-                    ​match = "​["​ + c == m | "​^"​ + c == m | "​["​ + C == m | "​^"​ + C == m; +                    ​integer mlen llStringLength(m)
-                } +                    integer ​match;
-                else if (mlen > 2) +
-                { +
-                    // Multi-character matches are more difficult +
-                    integer ​pm = 1; +
-                    @loop;+
  
-                    if (mlen > pm+2 & "​-" ​== llGetSubString(m,​ pm+1, pm+1))+                    if (mlen == 2)
                     {                     {
-                        // Range +                        // Fast-track single-character matches/​non-matches 
-                        s1 = llGetSubString(m,​ pm, pm); +                        match = "​[" ​+ c == m | "​^" ​+ c == m "​[" ​+ C == m | "​^" ​+ C == m;
-                        s2 = llGetSubString(m,​ pm+2, pm+2); +
-                        match = match | s1 + c + s2 == (string)llListSort((list)s1 ​+ c + s2, 1, TRUE); +
-                        match match s1 + C + s2 == (string)llListSort((list)s1 ​+ C + s2, 1, TRUE); +
-                        pm +3;+
                     }                     }
-                    else+                    else if (mlen > 2)
                     {                     {
-                        s1 = llGetSubString(m,​ pm, pm); +                        ​// Multi-character matches are more difficult 
-                        match = match | c == s1 | C == s1; +                        integer pm = 1; 
-                        ++pm;+                        do 
 +                        { 
 +                            if (mlen > pm+2 & "​-"​ == llGetSubString(m,​ pm+1, pm+1)) 
 +                            { 
 +                                // Range 
 +                                ​s1 = llGetSubString(m,​ pm, pm); 
 +                                s2 = llGetSubString(m,​ pm+2, pm+2); 
 +                                ​match = match | s1 + + s2 == (string)llListSort((list)s1 + c + s2, 1, TRUE); 
 +                                if (I) 
 +                                    match = match s1 + + s2 == (string)llListSort((list)s1 + C + s2, 1, TRUE)
 +                                pm += 3; 
 +                            } 
 +                            else 
 +                            { 
 +                                s1 = llGetSubString(m,​ pm, pm); 
 +                                match = match | c == s1 | C == s1; 
 +                                ​++pm
 +                            } 
 +                        } 
 +                        while (pm < mlen); 
 +                    } 
 +                    if (inv ^ match) 
 +                    { 
 +                        // Match found; advance this job 
 +                        new = _follow(st + llList2Integer(rc,​ st+1), rc, new);
                     }                     }
- 
-                    if (pm < mlen) 
-                        jump loop; 
                 }                 }
-                if (inv ^ match)+                ​else if (m == "​."​)
                 {                 {
-                    // Match found; advance this job+                    // always matches
                     new = _follow(st + llList2Integer(rc,​ st+1), rc, new);                     new = _follow(st + llList2Integer(rc,​ st+1), rc, new);
                 }                 }
             }             }
-            else if (m == "​."​) 
-            { 
-                // always matches 
-                new = _follow(st + llList2Integer(rc,​ st+1), rc, new); 
-            } 
- 
-            @continue; 
         }         }
         jobs = new;         jobs = new;
Line 518: Line 518:
         // Regex for '​string that does not contain the word "​string"'​         // Regex for '​string that does not contain the word "​string"'​
         test("​^([^s]|s(s|t(s|r(s|i(s|ns))))*([^st]|t([^rs]|r([^is]|i([^ns]|n[^gs])))))*(s(s|t(s|r(s|i(s|ns))))*(t(r?​|rin?​))?​)?​$",​ [         test("​^([^s]|s(s|t(s|r(s|i(s|ns))))*([^st]|t([^rs]|r([^is]|i([^ns]|n[^gs])))))*(s(s|t(s|r(s|i(s|ns))))*(t(r?​|rin?​))?​)?​$",​ [
-          "no strings allowed",​ "stringstringstring", "​it'​s stringalicious",​ "this does not contain the word s-t-r-i-n-g even if it ends in strin"+          "no strings allowed",​ "strinstringstrin",​ "​strinstrinstrin", "​it'​s stringalicious",​ "this does not contain the word s-t-r-i-n-g even if it ends in strin"
         ]);         ]);
     }     }
Line 526: Line 526:
 ==== Motivation ==== ==== Motivation ====
  
-There'​s ​a page in the Second Life wiki that introduces ​a so-called "[[http://​wiki.secondlife.com/​wiki/​LSL_Regex_Engine|regex engine]]"to disprove the argument that a "​regular expression engine in LSL will not work well". However, the argument ​is basically framed as a [[https://​en.wikipedia.org/​wiki/​Straw_man|straw-man]]:​ it builds ​an engine that is so limited that it isn't even capable of parsing most regular languages, and therefore ​can't really be characterized as a regular expression engine in the theoretical sense.+There was a page in the Second Life wiki titled "LSL Regex Engine",​ which is now gone due to a DMCA applying to all of the Wizardry and Steamworks pages, ​that introduced ​a so-called "regex engine"​ to disprove the argument that a "​regular expression engine in LSL will not work well". However, the argument ​was basically framed as a [[https://​en.wikipedia.org/​wiki/​Straw_man|straw-man]]:​ it built an engine that was so limited that it wasn't even capable of parsing most regular languages, and therefore ​couldn't really be characterized as a regular expression engine in the theoretical sense.
  
-So I (Sei Lisa) set up to build an engine that is able to parse any regular language (limited in practice by LSL memory of course), using a large enough subset of the syntax of [[https://​pubs.opengroup.org/​onlinepubs/​7908799/​xbd/​re.html#​tag_007_004|POSIX EREs]] as to guarantee that all regular languages are covered. The result is not too practical, but it also isn't as bad as I thought. It is able to understand somewhat complex regular expressions ​(see tests), though the cost of parsing (compiling) is prohibitive with complex expressions for run-time use; however, the cost of matching is not extremely bad, and it might even find some practical applications. My advice is not to use it with very long input text (say 1K or more), because matching is not too fast. Matching speed also depends on the complexity of the (compiled version of the) regular expression itself, so be careful with that if using it in any serious application. Run some benchmarks with your regex and be careful to limit the length of the input text if necessary.+So I (Sei Lisa) set up to build an engine that is able to parse any regular language (limited in practice by LSL memory of course), using a large enough subset of the syntax of [[https://​pubs.opengroup.org/​onlinepubs/​7908799/​xbd/​re.html#​tag_007_004|POSIX EREs]] as to guarantee that all regular languages are covered. The result is not too practical, but it also isn't as bad as I thought. It is able to understand somewhat complex regular expressions,​ though the cost of parsing (compiling) is prohibitive with complex expressions for run-time use; however, the cost of matching is not extremely bad, and it might even find some practical applications. My advice is not to use it with long input textsay 40 chars or more, because matching is not too fast. Matching speed also depends on the complexity of the (compiled version of the) regular expression itself, so be careful with that if using it in any serious application. Run some benchmarks with your regex and be careful to limit the length of the input text if necessary.