views:

595

answers:

5

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.

Here's the rules:

  1. You are starting with delimited/escaped data to split into an array.
  2. The delimiter is one arbitrary character
  3. The escape character is one arbitrary character
  4. Both the delimiter and the escape character could occur in data
  5. Regex is fine, but a good-performance solution is best
  6. Edit: Empty elements (including leading or ending delimiters) can be ignored

The code signature (in C# would be, basically)

public static string[] smartSplit(
                         string delimitedData, 
                         char delimiter, 
                         char escape) {}

The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,

Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.

A: 

You'ew looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.

Charlie Martin
+1  A: 

The implementation of this kind of tokenizer in terms of a FSM is fairly straight forward.

You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).


Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:

state(input)     action
========================
BEGIN(*):         token.clear(); state=START;
END(*):           return;
*(\n\0):          token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE):    state=ESC; // NB: the input is *not* added to the token!
START(*):         token.append(input); state=NORM;
NORM(DELIMITER):  token.emit(); token.clear(); state=START;
NORM(ESCAPE):     state=ESC; // NB: the input is *not* added to the token!
NORM(*):          token.append(input);
ESC(*):           token.append(input); state=NORM;

This kind of implementation has the advantage of dealing with consecutive excapes naturally, and can be easily extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.appeand(TAB)).

dmckee
+3  A: 

A simple state machine is usually the easiest and fastest way. Example in Python:

def extract(input, delim, escape):
  # states
  parsing = 0
  escaped = 1

  state = parsing
  found = []
  parsed = ""

  for c in input:
    if state == parsing:
      if c == delim:
        found.append(parsed)
        parsed = ""
      elif c == escape:
        state = escaped
      else:
        parsed += c
    else: # state == escaped
       parsed += c
       state = parsing

  if parsed:
    found.append(parsed)

  return found
sth
+2  A: 
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
    enum State { NORMAL, IN_ESC };
    State state = NORMAL;
    string frag;

    for (size_t i = 0; i<text.length(); ++i)
    {
     char c = text[i];
     switch (state)
     {
     case NORMAL:
      if (c == delim)
      {
       if (!frag.empty())
        tokens.push_back(frag);
       frag.clear();
      }
      else if (c == esc)
       state = IN_ESC;
      else
       frag.append(1, c);
      break;
     case IN_ESC:
      frag.append(1, c);
      state = NORMAL;
      break;
     }
    }
    if (!frag.empty())
     tokens.push_back(frag);
}
KenE
Pretty good. I'm porting this to C# and will post.
danieltalsky
Actually.... this must not have been run. This certainly got me on the right path, but, for instance, those break; statements need to be continue;'s or the loop will only run once, and only one character will be added to the array.
danieltalsky
Actually it has been run and it works. The breaks apply to the switch statement, not the for loop.
KenE
A: 

Here's my ported function in C#

    public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
    {
        bool currentlyEscaped = false;
        StringBuilder fragment = new StringBuilder();

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (currentlyEscaped)
            {
                fragment.Append(c);
                currentlyEscaped = false;
            }
            else 
            {
                if (c == delim)
                {
                    if (fragment.Length > 0)
                    {
                        listToBuild.Add(fragment.ToString());
                        fragment.Remove(0, fragment.Length);
                    }

                }
                else if (c == esc)
                    currentlyEscaped = true;
                else
                    fragment.Append(c);
            }
        }

        if (fragment.Length > 0)
        {
            listToBuild.Add(fragment.ToString());
        }
    }

Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.

danieltalsky