I want to parse a custom string format that persists an object graph's state. This is an ASP.NET scenario, and I wanted something easy to use on both the client (JavaScript) and the server (C#).

I have a format something like {Name1|Value1|Value2|...|ValueN}{Name2|Value1|...}{...}{NameN|...}. In this format I have 3 delimiters: {, }, and |. Further, because these characters can conceivably appear in the names/values, I defined an escape sequence using the very common \, such that \{, \} and \| are all interpreted as the literal versions of themselves, and of course \\ is a backslash. All pretty standard.
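To make the format concrete, here is a minimal sketch of the encoding side in JavaScript. The helper names escapeValue and encodeRecord are hypothetical and not part of the original question; the sketch assumes every reserved character (\, {, } and |) is escaped by prefixing it with a backslash.

```javascript
// Hypothetical helpers; the names are illustrative only.

// Prefix each reserved character (\, {, }, |) with a backslash.
function escapeValue(value) {
  return String(value).replace(/[\\{}|]/g, "\\$&");
}

// Build one "{Name|Value1|...|ValueN}" record from a name and its values.
function encodeRecord(name, values) {
  return "{" + [name, ...values].map(escapeValue).join("|") + "}";
}

// The values here are: foo}   a|b   c\d
console.log(encodeRecord("category", ["foo}", "a|b", "c\\d"]));
// prints {category|foo\}|a\|b|c\\d}
```

Any decoder then has to undo exactly this step, which is where the escaped backslash becomes awkward.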

Originally I tried to use a regex to parse out the string representation of an object, with something like this: (?<!\\)\{(.*?)(?<!\\)\}. Keep in mind that \, {, and } are all reserved in regexes. This of course will be able to parse out something like {category|foo\}|bar\{} correctly. However, I realized it would fail with something like {category|foo|bar\\}, where the closing brace is preceded by an escaped backslash.

It only took me a minute or so to try this (?<!(?<!\\)\\)\{(.*?)(?<!(?<!\\)\\)\} and realize that this approach was not possible, given that you'd need an infinite number of nested negative lookbehinds to deal with a potentially infinite run of escape characters. Of course it's unlikely that I'd ever have more than one or two levels, so I could probably hard-code it. However, I feel that this is a common enough problem that it should have a well-defined solution.
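The failure modes described above can be reproduced with a short JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later); the variable names are illustrative only.

```javascript
// One-level lookbehind pattern from the question.
const oneLevel = /(?<!\\)\{(.*?)(?<!\\)\}/;
// Two-level lookbehind pattern from the question.
const twoLevel = /(?<!(?<!\\)\\)\{(.*?)(?<!(?<!\\)\\)\}/;

// A record whose last value ends in an escaped backslash: {category|foo|bar\\}
const endsWithBackslash = "{category|foo|bar\\\\}";
// An escaped backslash followed by an escaped brace: {a\\\}b}
const backslashThenBrace = "{a\\\\\\}b}";

// One level: the real closing brace is mistaken for an escaped one, so nothing matches.
console.log(oneLevel.exec(endsWithBackslash)); // null

// Two levels: the first input now works and captures the body up to the real closing brace.
console.log(twoLevel.exec(endsWithBackslash)[1]);

// ...but three backslashes fool it: it stops at the escaped brace and drops "}b".
console.log(twoLevel.exec(backslashThenBrace)[1]);
```

Each extra nesting level only moves the problem one backslash further, which is why a fixed depth of lookbehinds cannot cover the general case.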

My next approach was to try to write a hand-written parser that scans the input buffer and consumes each character in a forward-only manner. I haven't actually finished this yet, but it seems overly complicated and I feel I must be missing something obvious. I mean, we've had parsers as long as we've had computer languages.

So my question is: what is the simplest, most efficient, and most elegant way to decode an input buffer like this, with possible escape sequences?

+2  A: 
(?<!\\)(?:\\\\)*\{(.*?(?<!\\)(?:\\\\)*)\}

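(?<!\\) asserts that the character immediately before this point is not a \.

(?:\\\\)* allows any number of escaped backslashes (\\ pairs).

\{ matches an opening brace.

( begins a capture group.

.*? matches the content, including any |, as lazily as possible.

(?<!\\) asserts that the character immediately before this point is not a \.

(?:\\\\)* allows any number of escaped backslashes (\\ pairs), so a brace is treated as closing only when it follows an even number of backslashes.

) ends the capture group.

\} matches a closing brace.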
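As a rough illustration of how this pattern might be used on the client side, the JavaScript sketch below pulls out the body of each record. The helper name extractRecordBodies is hypothetical, and the sketch assumes an engine that supports lookbehind and String.prototype.matchAll.

```javascript
// The pattern from this answer, with the global flag so matchAll can walk every record.
const recordPattern = /(?<!\\)(?:\\\\)*\{(.*?(?<!\\)(?:\\\\)*)\}/g;

// Hypothetical helper: returns the raw (still escaped) body of each record.
function extractRecordBodies(input) {
  return Array.from(input.matchAll(recordPattern), (m) => m[1]);
}

// The input is: {category|foo|bar\\}{x\}y}
const bodies = extractRecordBodies("{category|foo|bar\\\\}{x\\}y}");
console.log(bodies.length); // 2
console.log(bodies[0]);     // first body ends in the two-character escape for a backslash
console.log(bodies[1]);     // second body still contains the escaped brace
```

The captured bodies are still escaped, so splitting them on unescaped | characters and removing the backslashes remains a separate step.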

MizardX
+1  A: 

This sort of parser is fairly easy to write with minimal state tracking. The code below took me a few minutes, is fairly ugly, and even does a tiny bit of error checking. :)

Arguably, it's a more readable approach than complicated regexes, though the regexes are a bit more concise.

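using System;
using System.Collections.Generic;
using System.Text;

// One record: the name followed by its values, already unescaped.
struct RECORD
{
    public string[] Entries;
}
// The whole input: every record, in order of appearance.
struct FILE
{
    public RECORD[] Records;
}

// Scans the input once, left to right. Must be declared inside a class.
static FILE parseFile(string input)
{
    List<RECORD> records = new List<RECORD>();
    List<string> entries = new List<string>();
    bool escaped = false;   // true when the previous character was an unescaped backslash
    bool inRecord = false;  // true between an opening { and its closing }
    StringBuilder sb = new StringBuilder();  // accumulates the current entry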
    foreach (char c in input)
    {
        switch (c)
        {
            case '|':
                if (escaped)
                {
                    sb.Append('|');
                    escaped = false;
                }
                else if (inRecord)
                {
                    entries.Add(sb.ToString());
                    sb = new StringBuilder();
                }
                else
                    throw new Exception("Invalid sequence");
                break;
            case '{':
                if (escaped)
                {
                    sb.Append('{');
                    escaped = false;
                }
                else if (inRecord)
                    throw new Exception("Invalid sequence");
                else
                {
                    inRecord = true;
                    sb = new StringBuilder();
                }
                break;
            case '}':
                if (escaped)
                {
                    sb.Append('}');
                    escaped = false;
                }
                else if (inRecord)
                {
                    inRecord = false;
                    entries.Add(sb.ToString());
                    sb = new StringBuilder();
                    records.Add(new RECORD(){Entries = entries.ToArray()});
                    entries.Clear();
                }
                else
                    throw new Exception("Invalid sequence");
                break;
            case '\\':
                if (escaped)
                {
                    sb.Append('\\');
                    escaped = false;
                }
                else if (!inRecord)
                    throw new Exception("Invalid sequence");
                else
                    escaped = true;
                break;
            default:
                if (escaped)
                    throw new Exception("Unrecognized escape sequence");
                else
                    sb.Append(c);
                break;
        }
    }
    if (inRecord)
        throw new Exception("Invalid sequence");
    return new FILE() { Records = records.ToArray() };
}
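
Since the question also asks for something usable from JavaScript, here is a rough client-side counterpart of the same state machine. The function name parseRecords is hypothetical, and this is only a sketch under the same assumptions as the C# version: records are not nested, and a backslash may only precede \, {, } or |.

```javascript
// Hypothetical counterpart of parseFile: returns an array of records,
// each record being an array of unescaped entries.
function parseRecords(input) {
  const records = [];
  let entries = [];
  let current = "";     // accumulates the current entry
  let escaped = false;  // true when the previous character was an unescaped backslash
  let inRecord = false; // true between an opening { and its closing }

  for (const c of input) {
    if (escaped) {
      // Only the four reserved characters may follow a backslash.
      if (!"\\{}|".includes(c)) throw new Error("Unrecognized escape sequence");
      current += c;
      escaped = false;
    } else if (c === "\\") {
      if (!inRecord) throw new Error("Invalid sequence");
      escaped = true;
    } else if (c === "{") {
      if (inRecord) throw new Error("Invalid sequence");
      inRecord = true;
      current = "";
    } else if (c === "|" || c === "}") {
      if (!inRecord) throw new Error("Invalid sequence");
      entries.push(current);
      current = "";
      if (c === "}") {
        // End of record: store it and start a fresh entry list.
        records.push(entries);
        entries = [];
        inRecord = false;
      }
    } else {
      current += c;
    }
  }
  if (inRecord) throw new Error("Invalid sequence");
  return records;
}

// The input is: {category|foo|bar\\}{x\}y}
console.log(parseRecords("{category|foo|bar\\\\}{x\\}y}"));
```

For that input the result holds two records: the first has the entries category, foo and bar\ (a single backslash), and the second has the single entry x}y.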