Hello
I wonder if someone could help me figure out how to parse a string having the following format:

;field1-field2-fieldN;field1-field2-fieldN;

Each record is delimited by ';' and each field within a record is delimited by '-'. The complication is that the individual fields may contain escaped delimiter characters, such as "\;" or "\-", which causes my simple parsing code below to fail. So I'm trying to come up with regular expressions that match the delimiters but not the escaped delimiters. My regex knowledge is not that great, but I expect there must be a way of combining "([^\;])" and "([;])" to get what I require.

public static List<ParsedRecord> parse(String data) {
    // List is an interface, so a concrete ArrayList is instantiated here
    List<ParsedRecord> parsedRecords = new ArrayList<ParsedRecord>();
    String[] records = data.split(";");
    for (String record : records) {
        // split the current record, not the whole input
        String[] fields = record.split("-");
        parsedRecords.add(new ParsedRecord(fields));
    }
    return parsedRecords;
}

Thanks very much in advance.

+2  A: 

You're likely to be best off doing the unescaping and the splitting in the same pass. I know it feels wrong in terms of keeping the two pieces of functionality separate, but it avoids some awkward corner cases (imagine "foo\\;bar" for example, where the ; follows a backslash but is still a delimiter, because that backslash is itself escaped).

Here's some extremely simplistic code to do the parsing - it assumes that any backslash basically means "treat the next character as plain input" but that's all.

import java.util.*;

public class Test
{
    public static void main(String[] args)
    {
        List<String> parsed = parse(args[0]);
        for (String x : parsed)
        {
            System.out.println(x);
        }
    }

    public static List<String> parse(String text)
    {
        List<String> ret = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean escaping = false;

        for (int i=0; i < text.length(); i++)
        {
            char c = text.charAt(i);
            if (escaping)
            {
                current.append(c);
                escaping = false;
            }
            else
            {
                if (c == '\\')
                {
                    escaping = true;
                }
                else if (c == ';')
                {
                    ret.add(current.toString());
                    current = new StringBuilder();
                }
                else
                {
                    current.append(c);
                }
            }
        }
        if (escaping)
        {
            throw new IllegalArgumentException("Ended in escape sequence");
        }
        ret.add(current.toString());
        return ret;
    }
}

(Note that this doesn't do the business of splitting each record into multiple fields, but you'd just need to change what you do with ';' and also react to '-' - the principle is the same.)
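
As a rough sketch of that extension, the same loop can keep a current field and a current record side by side. The class name RecordParser, the List<List<String>> return shape, and the choice to drop the empty record produced by a leading ';' are all assumptions made for illustration; they are not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordParser
{
    // Hypothetical helper: ';' ends a record, '-' ends a field, and a
    // backslash means "take the next character literally".
    public static List<List<String>> parse(String text)
    {
        List<List<String>> records = new ArrayList<List<String>>();
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean escaping = false;

        for (int i = 0; i < text.length(); i++)
        {
            char c = text.charAt(i);
            if (escaping)
            {
                current.append(c);
                escaping = false;
            }
            else if (c == '\\')
            {
                escaping = true;
            }
            else if (c == '-' || c == ';')
            {
                // Either delimiter finishes the current field.
                fields.add(current.toString());
                current = new StringBuilder();
                if (c == ';')
                {
                    // Assumption: a record consisting of one empty field
                    // (e.g. from the leading ';') is skipped.
                    boolean empty = fields.size() == 1 && fields.get(0).isEmpty();
                    if (!empty)
                    {
                        records.add(fields);
                    }
                    fields = new ArrayList<String>();
                }
            }
            else
            {
                current.append(c);
            }
        }
        if (escaping)
        {
            throw new IllegalArgumentException("Ended in escape sequence");
        }
        // Keep a trailing record that has no closing ';'.
        if (current.length() > 0 || !fields.isEmpty())
        {
            fields.add(current.toString());
            records.add(fields);
        }
        return records;
    }
}
```

With this sketch, the input ;a-b;c\;d-e; should come back as two records, the second of which has "c;d" as its first field, already unescaped.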

Jon Skeet
+5  A: 

You could perhaps refine the regular expression you pass to split by using a negative lookbehind:

split("(?<!\\\\);")

This splits at any ";" that is not immediately preceded by a "\". The same works for the dashes:

split("(?<!\\\\)-")
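
A minimal sketch of splitting on unescaped delimiters with a regex, assuming a negative lookbehind is acceptable. The class name SplitDemo and the method splitRecords are hypothetical. Two caveats apply: the escape backslashes stay in the resulting strings and still need to be removed afterwards, and a ';' that follows an escaped backslash (as in foo\\;bar) is not treated as a delimiter even though it arguably should be.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo
{
    // Regex (?<!\\); means: a ';' not immediately preceded by a backslash.
    // The backslash is doubled once for the regex and once for the Java literal.
    static final Pattern RECORD_DELIM = Pattern.compile("(?<!\\\\);");

    // Hypothetical helper: splits on unescaped ';' only; escapes are left in place.
    public static String[] splitRecords(String data)
    {
        return RECORD_DELIM.split(data);
    }

    public static void main(String[] args)
    {
        // Input a-b;c\;d-e splits once, at the unescaped ';'.
        System.out.println(Arrays.toString(splitRecords("a-b;c\\;d-e")));
        // Input foo\\;bar is not split at all: the lookbehind only sees
        // the single character before the ';'.
        System.out.println(Arrays.toString(splitRecords("foo\\\\;bar")));
    }
}
```

For input where backslashes themselves can be escaped, a character-by-character pass is the safer choice.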
Fabian Steeg
Thanks! I used a combination of your answer and Jon's to get the parser working. Much appreciated!
leftbrainlogic