I have a file with many rows. Each row has a column which may contain comma-separated values. I need each row to be distinct (i.e. no comma-separated values).

Here is an example row:

AB  AB10,AB11,AB12,AB15,AB16,AB21,AB22,AB23,AB24,AB25,AB99  ABERDEEN    Aberdeenshire

The columns are tab separated (Postcode area, Postcode districts, Post town, Former postal county); only the Postcode districts column contains comma-separated values.

So the above row would get turned into:

AB  AB10    ABERDEEN    Aberdeenshire
AB  AB11    ABERDEEN    Aberdeenshire
AB  AB12    ABERDEEN    Aberdeenshire
...
...

I tried the following but it didn't work...

(.+)\t(([0-9A-Z]+),)+\t(.+)\t(.+)
A: 
stakx
Damn... That's not good news!
Nick
There is only 1 column with multiple values (the district), e.g. AB10, AB11, etc...
Nick
A: 

I agree that regexes are not the best tool for this, but if that's all you have available to you, this should work. Run the replacement repeatedly until there are no more matches; each pass splits one value off the front of the list.

Edit

Updated with the OP's final solution from the comments.

Find: (.+)\t([^,\s]+),([^\t]+)\t(.+)
Replace: \1\t\2\t\4\r\1\t\3\t\4
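For anyone who wants to script the "repeat until no more matches" step rather than re-run the replace by hand, here is a minimal Python sketch. It assumes the same find pattern as above; the function name `expand_rows` is made up for illustration, and `\n` is used as the line separator instead of the `\r` shown in the editor replacement.

```python
import re

# Same find pattern as above:
# (prefix) TAB (first value) , (remaining values) TAB (rest of the row)
PATTERN = re.compile(r"(.+)\t([^,\s]+),([^\t]+)\t(.+)")

# Same replacement, except that "\n" is used instead of "\r" (an assumption;
# pick whichever line ending your file uses).
REPLACEMENT = r"\1\t\2\t\4\n\1\t\3\t\4"


def expand_rows(text):
    """Apply the find/replace repeatedly until nothing matches any more."""
    while True:
        text, count = PATTERN.subn(REPLACEMENT, text)
        if count == 0:
            return text


if __name__ == "__main__":
    row = "AB\tAB10,AB11,AB12\tABERDEEN\tAberdeenshire"
    print(expand_rows(row))
```

Each pass peels the first value off every row that still has a comma in the district column, so a row with N values needs N-1 passes.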
Martin Smith
That works very well - thanks... This is a slight modification which worked a little better. Thanks! Find: "(.+)\t([^,\s]+),([^\t]+)\t(.+)" Replace: "\1\t\2\t\4\r\1\t\3\t\4" EDIT: Seems these comments can't contain markup...
Nick
Yes the penny just dropped that your fields were tab delimited and I had just come back to update my post. Glad you spotted it!
Martin Smith
@Nick Re: markup in the comments, you can do a limited amount. See the accepted answer here: http://meta.stackoverflow.com/questions/4481/apply-markup-code-in-comments
Martin Smith
A: 

I agree with stakx that this doesn't sound like a good place for regexes.

I would write a small program instead which read each line, split the line into columns, split each relevant column into a list of values, and then iterated over all combinations of those, outputting a line each time.

Assuming it's only that one column which can have multiple tokens, it would basically look like this:

import sys

for line in sys.stdin:  # rows are read from standard input
    columns = line.rstrip('\n').split('\t')  # 0-based list, so indexes 0-3
    if len(columns) < 4:
        continue  # skip blank or malformed lines
    # Write one output row per value in the comma-separated second column.
    for value in columns[1].split(','):
        print('\t'.join([columns[0], value, columns[2], columns[3]]))

If multiple columns can have multiple values, simply put another loop inside the for each.
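
As a rough illustration of the multi-column case, the nested loops can be expressed with `itertools.product`. This is only a sketch: the function name `expand_line` and the `multi_value_columns` parameter are made up here, column indexes are 0-based, and the sample row in the usage example is invented test data.

```python
from itertools import product


def expand_line(line, multi_value_columns=(1,)):
    """Return one tab-separated row per combination of comma-separated values.

    multi_value_columns holds the 0-based indexes of the columns that may
    contain several comma-separated values (hypothetical parameter).
    """
    columns = line.rstrip("\n").split("\t")
    # Each column becomes a list of choices; single-value columns get one choice.
    choices = [
        column.split(",") if index in multi_value_columns else [column]
        for index, column in enumerate(columns)
    ]
    # product() plays the role of the nested "for each" loops.
    return ["\t".join(combination) for combination in product(*choices)]


if __name__ == "__main__":
    # Invented row with two multi-value columns (indexes 1 and 2).
    for row in expand_line("A\tb1,b2\tc1,c2\tD", multi_value_columns=(1, 2)):
        print(row)
```

Columns not listed in `multi_value_columns` are left untouched even if they contain a comma, which avoids accidentally splitting a column that was never meant to hold a list.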

Michael Madsen
Yeah a script would work too - was just hoping that a Regex would do the trick.
Nick