I have a plain text file something like this:

Ford\tTaurus
  F-150
  F-250
Toyota\tCamry
  Corsica

In other words, this is a two-level hierarchy: the first child is on the same line as its parent, and each subsequent child is on its own following line, marked as a child rather than a parent by a two-space prefix. (\t above represents a literal tab in the text.)

I need to convert it to this using RegEx:

Ford\tTaurus
Ford\tF-150
Ford\tF-250
Toyota\tCamry
Toyota\tCorsica

So, I need to capture the parent (the text between \r\n and \t on a line that does not start with \s\s) and insert it, followed by a tab, in place of the \s\s after every \r\n until the next parent.

I have a feeling this can be done with some sort of nested groups, but I think I need more caffeine or something; I can't seem to work out the pattern.

(Using .NET, with IgnorePatternWhitespace off and Multiline off.)

+3  A: 

Any particular reason you want to use regular expressions for this? Here's code which does what I think you want, without bothering to work out regular expressions:

using System;
using System.IO;

class Test
{
    static void Main(string[] args)
    {
        string currentManufacturer = null;

        using (TextReader reader = File.OpenText(args[0]))
        using (TextWriter writer = File.CreateText(args[1]))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string car;
                if (line.StartsWith("  "))
                {
                    if (currentManufacturer == null)
                    {
                        // Handle this properly in reality :)
                        throw new Exception("Invalid data");
                    }
                    car = line.Substring(2);
                }
                else
                {
                    string[] bits = line.Split('\t');
                    if (bits.Length != 2)
                    {
                        // Handle this properly in reality :)
                        throw new Exception("Invalid data");
                    }
                    currentManufacturer = bits[0];
                    car = bits[1];
                }
                writer.WriteLine("{0}\t{1}", currentManufacturer, car);
            }
        }
    }
}
Jon Skeet
Thanks Jon... I have an intranet app that scrubs text from various data sources by applying RegEx replacements pulled from a database table, based on the input source. This allows me to handle wonky data from dozens of sources without recompiling. The app has the ability to call custom functions instead, but I avoid using that functionality where practical. Looks like I may be without a choice here.
richardtallent
@richardtallent: Well, it may be possible - but it feels a heck of a lot simpler to me without the regex voodoo going on :)
Jon Skeet
A: 

It is simple (but not wise or fast) to achieve this by using regular expressions.

Replace

(?<=^(Ford\t|Toyota\t).*?)^  

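with $1. Make sure ^ and $ match at line beginnings and endings (RegexOptions.Multiline) and that . matches newlines (RegexOptions.Singleline). Note that the pattern ends with two literal spaces, which match the child prefix.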
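
For illustration, here is a minimal C# sketch of how a replacement like this might be applied. It is not the pattern above verbatim: it assumes a generalised parent (any line that starts with a non-whitespace character and contains a tab) in place of the hard-coded Ford/Toyota alternation, and the class and method names are hypothetical. It relies on .NET allowing variable-length lookbehind and keeping the group captured inside it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexFillDown
{
    // Assumption: a parent line starts with a non-whitespace character and
    // holds the parent name followed by a tab. The trailing two spaces of
    // the pattern are the child prefix that gets replaced.
    private const string Pattern = @"(?<=^(\S[^\t\r\n]*\t).*?)^  ";

    public static string FillDown(string input)
    {
        // Multiline: ^ matches at the start of every line.
        // Singleline: . also matches \r and \n, so the lookbehind can reach
        // back across earlier lines to the nearest parent.
        return Regex.Replace(input, Pattern, "$1",
            RegexOptions.Multiline | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string input =
            "Ford\tTaurus\r\n  F-150\r\n  F-250\r\nToyota\tCamry\r\n  Corsica";
        Console.WriteLine(FillDown(input));
    }
}
```

Every child line triggers a backward scan to its parent, so the cost grows with the distance between a child and its parent; that is in line with the "not wise or fast" caveat. The options could also be embedded in the pattern as (?ms) if the replacement table only stores pattern strings.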

tiftik