I have a text data file which contains text like this:

"[category.type.group.subgroup]" - "2934:10,4388:20,3949:30"
"[category.type.group.subgroup]" - "2934:10,4388:20,3949:30"
"[category.type.group.subgroup]" - "2934:10,4388:20,3949:30"
"[category.type.group.subgroup]" - "2934:10,4388:20,3949:30"
34i23042034002340 ----- 
"[category.type.group.subgroup]" - "2934:10,4388:20,3949:30"
"[category.type.group.subgroup]" - "2934:10,4388:20,3949:30"
828728382 ------ 3498293485  AAAAAAA

I need the best way to parse this data. Specifically, I need the category, type, group, subgroup, and the numeric values inside the quotes. I was thinking of using Regex, but are there other approaches that avoid several IF statements to analyze the data?

A: 

Try the FileHelpers library. It takes a little work to set up, but it will save you a lot of work dealing with the tricky situations that come up when parsing a file like that. It can handle delimited, fixed-width, or record-based parsing.

jball
Very cool - checking this out now!
Joker
+2  A: 

If you use Regex, you won't need several IF statements. Something like this would read several values with one regular expression:

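// Matches the first "id:value,id" run on a line: num1 is the first id,
// num2 is its value, and num3 is the id that starts the next pair.
Regex parseLine = new Regex(@"(?<num1>\d+)\:(?<num2>\d+)\,(?<num3>\d+)", RegexOptions.Compiled);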
foreach (string line in File.ReadAllLines(yourFilePath))
{
  var match = parseLine.Match(line);
  if (match.Success) {
    var num1 = match.Groups["num1"].Value;
    var num2 = match.Groups["num2"].Value;
    var num3 = match.Groups["num3"].Value;
    // use the values.
  }
}
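
Since the number of "id:value" pairs can vary from line to line, one option is to match a single pair and let Regex.Matches return every occurrence. The following is a minimal sketch under that assumption; the PairParser class and ParsePairs method are hypothetical names chosen for illustration, and the pattern will pick up any digits:digits run anywhere on the line, so it may need tightening for real data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Matches one "id:value" pair, e.g. "2934:10".
    private static readonly Regex Pair =
        new Regex(@"(?<id>\d+):(?<value>\d+)", RegexOptions.Compiled);

    // Returns every id:value pair found in the line, in order of appearance.
    // A line without any pair yields an empty list.
    public static List<KeyValuePair<string, string>> ParsePairs(string line)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(line))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["id"].Value, m.Groups["value"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string line = "\"[category.type.group.subgroup]\" - \"2934:10,4388:20,3949:30\"";
        foreach (var pair in ParsePairs(line))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

For the sample line this prints the three pairs 2934 = 10, 4388 = 20 and 3949 = 30, while lines such as "828728382 ------ 3498293485  AAAAAAA" produce no pairs because they contain no colon.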
John Fisher
That's not going to be a nightmare to maintain as you add in gotchas for all the variations that are sure to arise...
jball
I understand so far - however the data varies. Sometimes I could have 10-20 sets of "id:value" between the quotes, sometimes none. I need to parse out the data between the "[ and ]" (name) in addition to any "id:value" sets that may follow it (if they exist). I created a struct with name, id, value (as strings for simplicity). The name is between the "[ and ]", the id is the first integer before the ':' and the value is the integers following the ':'. Does that make sense?
Joker
A: 
string reg = "\"\\[([^.]+)\\.([^.]+)\\.([^.]+)\\.([^.]+)\\]\"\\s+-\\s+\"([0-9]+):([0-9]+),([0-9]+):([0-9]+),([0-9]+):([0-9]+)\"";
Regex r = new Regex(reg);
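// aline is a single line read from the data file.
Match m = r.Match(aline);
if (m.Success)
{
    // Groups[n] is a Group object; use .Value to get the matched text.
    string category = m.Groups[1].Value;
    string type = m.Groups[2].Value;
    string group = m.Groups[3].Value;
    string subgroup = m.Groups[4].Value;
    string num1 = m.Groups[5].Value;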
    // and so on...
}

EDIT Just saw that you can have an arbitrary number of number sets. The following should handle that:

        string reg = "\"\\[([^.]+)\\.([^.]+)\\.([^.]+)\\.([^.]+)\\]\"(\\s+-\\s+\"(([0-9]+):([0-9]+),?)+\")?";
        string reg2 = "([0-9]+):([0-9]+),?";
        Regex r = new Regex(reg);

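        // a is a single line read from the data file.
        Console.WriteLine(a);
        Console.WriteLine(reg);
        Match m = r.Match(a);
        if (m.Success)
        {
            // Groups[n] is a Group object; use .Value to get the matched text.
            string category = m.Groups[1].Value;
            string type = m.Groups[2].Value;
            string group = m.Groups[3].Value;
            string subgroup = m.Groups[4].Value;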

            MatchCollection mc = Regex.Matches(m.Groups[5].Value, reg2);
            List<string> numbers = new List<string>();
            foreach (Match match in mc)
            {
                numbers.Add(match.Groups[1].Value);
                numbers.Add(match.Groups[2].Value);
            }
        }
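
As an alternative to running a second regex over the matched text, .NET records every repetition of a group in Group.Captures, so a single pattern can return all the pairs. The sketch below assumes each record occupies one line in the layout shown in the question; CaptureDemo and TryParse are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // One pattern for the whole line. The "id" and "value" groups sit inside a
    // repeated group, so each repetition is recorded in Group.Captures.
    private static readonly Regex Line = new Regex(
        @"^""\[(?<name>[^\]]+)\]""(?:\s+-\s+""(?:(?<id>\d+):(?<value>\d+),?)*"")?\s*$");

    // Returns false for lines that do not follow the record layout.
    public static bool TryParse(string line, out string[] nameParts,
                                out List<KeyValuePair<string, string>> pairs)
    {
        nameParts = new string[0];
        pairs = new List<KeyValuePair<string, string>>();
        Match m = Line.Match(line);
        if (!m.Success) return false;

        // Split the dotted name into category, type, group, subgroup.
        nameParts = m.Groups["name"].Value.Split('.');

        // id and value are captured together, so both collections have the same length.
        CaptureCollection ids = m.Groups["id"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < ids.Count; i++)
        {
            pairs.Add(new KeyValuePair<string, string>(ids[i].Value, values[i].Value));
        }
        return true;
    }

    public static void Main()
    {
        string a = "\"[category.type.group.subgroup]\" - \"2934:10,4388:20,3949:30\"";
        if (TryParse(a, out string[] name, out List<KeyValuePair<string, string>> pairs))
        {
            Console.WriteLine(string.Join(" / ", name));
            foreach (var pair in pairs)
            {
                Console.WriteLine(pair.Key + " -> " + pair.Value);
            }
        }
    }
}
```

Because the pattern is anchored with ^ and $, stray lines such as "34i23042034002340 -----" are rejected outright rather than partially matched, and a record with empty quotes or no quoted part at all still parses with an empty pair list.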
Mark Synowiec