Regular expressions

Regular expressions are a language for matching strings to certain patterns. In this language, certain symbols are reserved for things like "match one or more of these", "match any of these possibilities", "match any character", etc.

We present here an implementation of regular expressions in LSL. Albeit very limited, it provides a good subset of POSIX extended regular expressions. Due to these limitations, it's basically restricted to validating strings and nothing else.

So what is supported?

Regex language accepted

Set syntax

Usage

The regular expression must first be compiled before being used; the result of the compilation is a list. This compiled list can be used to check if any number of strings matches the original expression that was compiled.

If you intend to use this code for something like validating strings against pre-defined patterns, it may be best to compile the expression to a list in advance, then use this list and only the matching code in the final script. To that end, we've divided the script into the common section, the compiling section, the matching section and the test code. To use only matching, copy the common section and the matching section to your program.

Case sensitivity is decided when matching, not when compiling. If you want a case insensitive match, set the global variable I to TRUE before calling match(), or to FALSE for case sensitive. It starts with FALSE.

Common section

This must be included with both the compilation section and the matching section.

// Regular Expression Engine in LSL
//
// Copyright © 2020 Sei Lisa
//
// Permission is hereby granted, free of charge, to any person obtaining a
// copy of this software and associated documentation files (the "Software"),
// to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
// DEALINGS IN THE SOFTWARE.
//
// (End of legal blurb)

// Thanks to Russell Cox for showing the way.
// https://swtch.com/~rsc/regexp/regexp1.html

// Usage:
//  list my_compiled_regex = compile("regex");
//  if (match(my_compiled_regex, "string"))
//      llOwnerSay("matches!");
//  else
//      llOwnerSay("does not match!");

// Limitations:
// - Only matches / does not match. No captures.
// - The accepted syntax is:
//     - c   -> character
//     - .   -> any character including newline (\n)
//     - \n  -> newline
//     - \<symbol> -> matches the symbol as a character e.g. \+, \\
//     - e1|e2 -> either e1 or e2
//     - (e) -> subexpression
//     - e*  -> zero or more
//     - e+  -> one or more
//     - e?  -> zero or one
//     - ^   -> matches at the beginning
//     - $   -> matches at the end
//     - [abc]  -> any of a, b, c
//     - [a-df] -> any of a, b, c, d or f
//     - [^abc] -> anything except a, b or c
//     - []abc] -> the character ] or a or b or c
//     - [-abc] -> the character - or a or b or c
//     - [abc-] -> the character - or a or b or c
//     - [+--]  -> the character + (2B), comma (2C) or - (2D)
//     - [^]a-c-] -> anything except ], a, b, c or -
//     - [*]    -> the character * (works for all symbols except ^)
// - The \ character is NOT special within brackets.
// - Does NOT support:
//     - Any predefined range or special like \S, \s, \b, \d, \a, etc.
//     - {[n][,[m]]} for "between n and m times"
//     - POSIX collation related stuff [= ... =], [: ... :], [. ... .]
//     - Non-greedy matching operator, e.g. c*?
//     - Special groups that begin with (? or (*
// - No error checking. For example an unmatched [ may have bad consequences.

integer p; // parse position
list mono=[p]; // If you get an error here, you're not compiling in Mono mode.
               // This program requires Mono. It won't work otherwise.
Compilation section
// Compile a regular expression into an NFA
//
// The compiled format is a strided list of stride 2.
// - If the first element is a string, it's either a set or a dot ".".
//   Positive sets are encoded with a "[" character followed by the set;
//   negative sets are encoded with a "^" character followed by the set.
//   1-character matches are encoded as a set of 1 character, i.e. "[" + c.
//   - The second element is an integer: the offset of the next state relative
//     to the first element of the stride.
// - If the 1st element is an integer, it's an edge of the graph that does
//   not consume characters.
//   - If the second element is an integer, the node is a split; both elements
//     are offsets for the next state in that case.
//   - If the second element is a string, it's either "^" for a left anchor,
//     "$" for a right anchor, or "" for a state with just 1 edge. The first
//     element is the state to go to (as an offset) when the current position
//     matches the anchor or when the second element is "".

string rx; // regular expression being parsed
integer Lv; // nesting level of ()

// Find and process suffixes
list _suff(list rc)
{
    string c = llGetSubString(rx, p, p);
    integer rclen = llGetListLength(rc);
    if (c == "*")
    {
        ++p;
        // Add a split at the beginning that goes to the end of the list,
        // and a dummy node to go back to the split. This last state could
        // be avoided by patching the list, but that's harder.
        //return [2, rclen] + rc + [-2 - rclen, ""];
        // optimized version:
        return 2 + (rclen + 4 + (rc + -(2 + rclen) + ""));
    }
    if (c == "+")
    {
        ++p;
        // Add a split after the list, that goes back to the list
        return rc + 2 + -rclen;
    }
    if (c == "?")
    {
        ++p;
        // Add a split before the list, that skips the list
        return 2 + (2 + rclen + rc);
    }
    return rc;
}

// Main compilation engine
list _compile()
{
    list rc;
    list aux;

    string c;

    @again;

    c = llGetSubString(rx, p, p+!llSubStringIndex(rx, "\\")); // grab 1 or 2 chars

    if (c == "" | -(c == ")") & Lv)
        jump break;

    if (c == "(")
    {
        ++p;
        ++Lv;
        aux = _compile();
        // assume we're in a matching ")"
        --Lv;
        ++p;
        rc += _suff(aux);
        jump again;
    }
    if (c == "|")
    {
        ++p;
        aux = _compile();

        // We're adding here an extra state which could be avoided by patching the list
        rc = 2 + (4 + llGetListLength(rc) + rc + (2 + llGetListLength(aux)) + "" + aux);
        jump break;
    }
    if (c == "^" | c == "$")
    {
        ++p;
        rc = rc + 2 + c;
        jump again;
    }
    if (c == "[")
    {
        integer p2 = p + 1;
        if (llGetSubString(rx, p2, p2) == "^")
            ++p2;
        c = llGetSubString(rx, p2 + -1, p = p2 + llSubStringIndex(llGetSubString(rx, p2 + 1, -1), "]"));
        p += 2; // skip past ]
    }
    else if (c == "\\n")
    {
        c = "[\n";
        p += 2;
    }
    else if (1 < llStringLength(c))
    {
        c = "[" + llGetSubString(c, 1, 1);
        p += 2;
    }
    else if (c == ".")
    {
        ++p;
    }
    else
    {
        c = "[" + c;
        ++p;
    }
    rc += _suff((list)c + 2);

    jump again;
    @break;
    return rc;
}

list compile(string s)
{
    // Initialize globals
    p = 0;
    rx = s;

    // Our match engine only performs anchored matches; modify the regex to
    // allow anything before and after it, as is customary. Anchoring will
    // still work. This is equivalent to surrounding the expression with
    // .*(regex).* except that an unmatched closing parenthesis inside the
    // regex will not close the outer paren.
    //return [2,4,".",-2] + _compile() + [2,4,".",-2];
    // optimized version:
    return (list)2 + 4 + "." + -2 + (_compile() + ((list)2 + 4 + "." + -2));
}
Matching section

This section provides the function match(compiled_regex, string_to_match). It needs the common section above.

// Match section - Run the NFA one input character at a time

integer L; // length of input string, used to evaluate anchors
integer I; // case insensitivity flag

// Follow the edges that are not associated to an input character.
// Returns the input list plus the first state which *is* associated
// to a character, if it wasn't in the list already.
list _follow(integer st, list rc, list new)
{
    if (TYPE_INTEGER == llGetListEntryType(rc, st))
    {
        // Split or anchor - follow if condition met
        string s = llList2String(rc, st+1);
        if (s != "^" & s != "$" | s == "^" & !p | s == "$" & p == L)
          new = _follow(st + llList2Integer(rc, st), rc, new);
        if (s != "" & s != "^" & s != "$")
          new = _follow(st + (integer)s, rc, new);
        return new;
    }
    if (!~llListFindList(new, (list)st))
        new += st;
    return new;
}

// Match a compiled regular expression against a string
// Return 0 (FALSE) if it doesn't match; something else if it matches
integer match(list rc, string s)
{
    L = llStringLength(s);
    integer st;
    p = st;
    list jobs = _follow(st, rc, []);
    string m; // current match being tested
    integer acceptState = llGetListLength(rc);

    do
    {
        list new = [];
        integer job = llGetListLength(jobs);

        string c = llGetSubString(s, p, p);
        string C = c;
        if (I)
        {
            // Invert case of C to test both
            C = llToUpper(c);
            if (C == c) C = llToLower(c);
        }
        string s1;
        string s2;
        ++p;

        while (job)
        {
            st = llList2Integer(jobs, --job);
            // check if the state is not an accept state,
            // else the current character can't advance from accept;
            // this happens when there are extra chars after the RE
            if (st != acceptState)
            {
                m = llList2String(rc, st);
                integer inv = !llSubStringIndex(m, "^");

                if (inv | !llSubStringIndex(m, "["))
                {
                    // starts with [ or ^, it's a range or single character
                    integer mlen = llStringLength(m);
                    integer match;

                    if (mlen == 2)
                    {
                        // Fast-track single-character matches/non-matches
                        match = "[" + c == m | "^" + c == m | "[" + C == m | "^" + C == m;
                    }
                    else if (mlen > 2)
                    {
                        // Multi-character matches are more difficult
                        integer pm = 1;
                        do
                        {
                            if (mlen > pm+2 & "-" == llGetSubString(m, pm+1, pm+1))
                            {
                                // Range
                                s1 = llGetSubString(m, pm, pm);
                                s2 = llGetSubString(m, pm+2, pm+2);
                                match = match | s1 + c + s2 == (string)llListSort((list)s1 + c + s2, 1, TRUE);
                                if (I)
                                    match = match | s1 + C + s2 == (string)llListSort((list)s1 + C + s2, 1, TRUE);
                                pm += 3;
                            }
                            else
                            {
                                s1 = llGetSubString(m, pm, pm);
                                match = match | c == s1 | C == s1;
                                ++pm;
                            }
                        }
                        while (pm < mlen);
                    }
                    if (inv ^ match)
                    {
                        // Match found; advance this job
                        new = _follow(st + llList2Integer(rc, st+1), rc, new);
                    }
                }
                else if (m == ".")
                {
                    // always matches
                    new = _follow(st + llList2Integer(rc, st+1), rc, new);
                }
            }
        }
        jobs = new;

    }
    while (jobs != [] & -(p != L));

    return ~llListFindList(jobs, (list)acceptState);
}
Testing suite section (with list dumping tool)

This section also includes a function to dump a list to LSL format, allowing you to copy paste it to another script.

string QuoteString(string s)
{
    s = llDumpList2String(llParseStringKeepNulls(s, ["\\"], []), "\\\\");
    s = llDumpList2String(llParseStringKeepNulls(s, ["\n"], []), "\\n");
    return "\"" + llDumpList2String(llParseStringKeepNulls(s, ["\""], []), "\\\"") + "\"";
}

string AddLine(string lines, string lineToAdd)
{
    if (llStringLength(llStringToBase64(lines + "\n, " + lineToAdd)) > 1364)
    {
        llOwnerSay(lines);
        lines = "\n, " + lineToAdd;
        llSleep(1);
    }
    else
    {
        if (lines == "")
            lines = "\n[ " + lineToAdd;
        else
            lines = lines + ", " + lineToAdd;
    }
    return lines;
}

DumpListLSL(list L)
{
    string ret = "";
    integer len = llGetListLength(L);
    integer i;
    for (i = 0; i < len; ++i)
    {
        integer typ = llGetListEntryType(L, i);
        if (typ == TYPE_KEY | typ == TYPE_STRING)
        {
            if (typ == TYPE_KEY)
                ret = AddLine(ret, "((key)" + QuoteString(llList2String(L, i)) + ")");
            else
                ret = AddLine(ret, QuoteString(llList2String(L, i)));
        }
        else if (typ == TYPE_VECTOR | typ == TYPE_ROTATION)
        {
            if (typ == TYPE_VECTOR)
            {
                vector v = llList2Vector(L, i);
                ret = AddLine(ret, "<" + llList2CSV((list)v.x + v.y + v.z) + ">");
            }
            else
            {
                rotation v = llList2Rot(L, i);
                ret = AddLine(ret, "<" + llList2CSV((list)v.x + v.y + v.z + v.s) + ">");
            }
        }
        else
            ret = AddLine(ret, llList2CSV(llList2List(L, i, i)));
    }
    llOwnerSay(ret + "]");
}

test(string pattern, list subjects)
{
    integer i;
    if (I)
        llOwnerSay("pattern (case insensitive): " + pattern);
    else
        llOwnerSay("pattern: " + pattern);
    list rc = compile(pattern);
    llOwnerSay("compiled pattern:");
    DumpListLSL(rc);
    for (i = 0; i < llGetListLength(subjects); ++i)
    {
        string s = llList2String(subjects, i);
        string result = "no";
        if (match(rc, s))
            result = "yes";
        llOwnerSay("match(compiled," + QuoteString(s) + "): " + result);
    }
}

default
{
    state_entry()
    {
        I = TRUE;
        test("^aËc$", ["Aëc", "aËcc", "aËcd", "aaËc", "aË"]);
        I = FALSE;
        test("^abc$", ["abc", "abcc", "abcd", "aabc", "ab"]);
        test("^a*b$", ["abc", "bb", "b", "aab", "ba"]);
        test("^(ab)+c$", ["abc", "ababc", "abac", "bac", "ab", "c"]);
        test("^(ab)*c$", ["abc", "ababc", "abac", "bac", "ab", "c"]);
        test("^(a|b)+c$", ["abc", "ababc", "abac", "bac", "ab", "c"]);
        test("abc", ["abc", "ababc", "abac", "bac", "ababcde", "ddabcdd"]);

        // Validate strings like "19xx-1-1" or "20xx-01-01", no leap years
        test("^((19|20)[0-9][0-9])-(0?[1-9]|1[012])-(0?[1-9]|[12][0-9]|3[01])$", [
          "2020-01-31", "2020-01-32", "2099-12-31", "2100-01-01", "2020-2-28",
          "2020-02-19", "2099-02-31"
        ]);

        // Validate dates of the form YYYY-MM-DD including leap years
        // (valid for years 1600 to 9999)
        test("^((1[6-9]|[2-9][0-9])[0-9][0-9]-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))$|^((1[6-9]|[2-9][0-9])[0-9][0-9]-(0[469]|11)-(0[1-9]|[12][0-9]|30))$|^((16|[248][048]|[3579][26])00)|(1[6-9]|[2-9][0-9])(0[48]|[13579][26]|[2468][048])-02-(0[1-9]|[12][0-9])$|^(1[6-9]|[2-9][0-9])[0-9][0-9]-02-(0[1-9]|1[0-9]|2[0-8])$", [
          "2020-01-31", "2020-01-32", "2099-12-31", "2100-01-01", "2020-02-28",
          "2020-02-29", "2021-02-29", "2030-04-31", "2030-04-30"
        ]);

        // Regex for 'string that does not contain the word "string"'
        test("^([^s]|s(s|t(s|r(s|i(s|ns))))*([^st]|t([^rs]|r([^is]|i([^ns]|n[^gs])))))*(s(s|t(s|r(s|i(s|ns))))*(t(r?|rin?))?)?$", [
          "no strings allowed", "strinstringstrin", "strinstrinstrin", "it's stringalicious", "this does not contain the word s-t-r-i-n-g even if it ends in strin"
        ]);
    }
}

Motivation

There was a page in the Second Life wiki titled "LSL Regex Engine", which is now gone due to a DMCA applying to all of the Wizardry and Steamworks pages, that introduced a so-called "regex engine" to disprove the argument that a "regular expression engine in LSL will not work well". However, the argument was basically framed as a straw-man: it built an engine that was so limited that it wasn't even capable of parsing most regular languages, and therefore couldn't really be characterized as a regular expression engine in the theoretical sense.

So I (Sei Lisa) set up to build an engine that is able to parse any regular language (limited in practice by LSL memory of course), using a large enough subset of the syntax of POSIX EREs as to guarantee that all regular languages are covered. The result is not too practical, but it also isn't as bad as I thought. It is able to understand somewhat complex regular expressions, though the cost of parsing (compiling) is prohibitive with complex expressions for run-time use; however, the cost of matching is not extremely bad, and it might even find some practical applications. My advice is not to use it with long input text, say 40 chars or more, because matching is not too fast. Matching speed also depends on the complexity of the (compiled version of the) regular expression itself, so be careful with that if using it in any serious application. Run some benchmarks with your regex and be careful to limit the length of the input text if necessary.