If I have a comma-separated file like the following:

foo,bar,n
,a,bc,d
one,two,three
,a,bc,d

And I want to join the lines wherever a \n is followed by a comma, to produce this:

foo,bar,n,a,bc,d
one,two,three,a,bc,d

What is the regex trick? I thought that an if (/\n,/) would catch this.

Also, will I need to do anything special for a UTF-8 encoded file?

Finally, a solution in Groovy would also be helpful.

+12  A: 

You should be using Text::CSV_XS instead of doing this yourself. It supports newlines embedded in records as well as Unicode files. You need to specify the right options when creating the parser, so be sure to read the documentation carefully.

Michael Carman
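As a rough illustration of the "right options" mentioned above, here is a minimal sketch of reading records whose fields contain newlines. It assumes the embedded newline sits inside a quoted field (the sample data below is hypothetical, not the file from the question) and that Text::CSV_XS is installed from CPAN. For a UTF-8 file you would typically also open the handle with an :encoding(UTF-8) layer.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample: the second field of the first record contains a
# newline inside a quoted field.
my $data = qq{foo,"bar\nbaz",n\none,two,three\n};

# binary => 1 allows line feeds (and other binary characters) in quoted fields.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# Read from an in-memory file handle; a real file would be opened by name.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records;
while (my $row = $csv->getline($fh)) {
    push @records, $row;
}
close $fh;

printf "%d records\n", scalar @records;   # prints "2 records"
```

Note that unquoted line breaks, as in the question's sample, are not something the parser can repair by itself; it only keeps a newline inside a field when that field is quoted.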
A: 

This works for me:

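use strict;
use warnings;

# Slurp the whole file so the regex can match across line breaks.
open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my $s = do { local $/; <$fh> };
close($fh);

# Drop any newline that is immediately followed by a comma.
$s =~ s/\n,/,/g;
print $s;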

$ cat test.txt
foo,bar,n
,a,bc,d
one,two,three
,a,bc,d
$ perl test.pl 
foo,bar,n,a,bc,d
one,two,three,a,bc,d
Greg Hewgill
This doesn't work for records where the first field is empty (and the line therefore starts with a comma). Usually you'll have to read a line, see if it has the right number of fields, then decide what to do next.
brian d foy
That's true, but I opted to answer the original question ("What is the regex trick?") rather than guess about what might need to be done in the case of empty initial fields.
Greg Hewgill
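The field-counting idea from the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: every complete record has a known, fixed number of fields (6 here), and no field contains a quoted comma. The helper name join_broken_records is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: glue physical lines together until the pending
# record has at least the expected number of fields.
sub join_broken_records {
    my ($expected, @lines) = @_;
    my @records;
    my $pending = '';
    for my $line (@lines) {
        chomp(my $text = $line);
        $pending .= $text;
        # A limit of -1 keeps trailing empty fields in the count.
        my @fields = split /,/, $pending, -1;
        if (@fields >= $expected) {
            push @records, $pending;
            $pending = '';
        }
    }
    # Keep whatever is left over, even if it is an incomplete record.
    push @records, $pending if length $pending;
    return @records;
}

# The second record has an empty first field, so its first line starts
# with a comma; counting fields still splits the records correctly.
my @input = ("foo,bar,n\n", ",a,bc,d\n", ",two,three\n", ",a,bc,d\n");
print "$_\n" for join_broken_records(6, @input);
```

With this input the sketch prints foo,bar,n,a,bc,d and ,two,three,a,bc,d, whereas a plain s/\n,/,/g would have merged all four lines into one record.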
A: 

Here's a Groovy version. Depending on the requirements, there are some nuances that this might not catch (like quoted strings that contain commas). It would also have to be tweaked if the newline can occur in the middle of a field rather than always at the end.

def input = """foo,bar,n
,a,bc,d
one,two,three
,a,bc,d"""

// Non-capturing groups, so each match is a plain String rather than a list of groups.
def answer = (input =~ /(?:.*\n?,){5}.*(?:\n|$)/).inject ("") { ans, match  ->
    ans << match.replaceAll("\n","") << "\n"
}

assert answer.toString() == 
"""foo,bar,n,a,bc,d
one,two,three,a,bc,d
"""
Ted Naleid
A: 

This might be too simple (or might not handle the general case well enough):

def input = """foo,bar,n
,a,bc,d
one,two,three
,a,bc,d"""

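// Start with an empty string so a leading-comma first line is never appended to null.
def last = ''
input.eachLine {
    // A line starting with a comma continues the previous record.
    if (it.startsWith(',')) {
        last += it
        return
    }
    if (last)
        println last
    last = it
}
println last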

emits:

foo,bar,n,a,bc,d
one,two,three,a,bc,d
Bob Herrmann
A: 

This is primarily in response to your UTF-8 encoding question.

Depending on the specific encoding, you may also need to look for null bytes. In UTF-8 itself the newline and the comma are single bytes, so the plain substitution is safe; in a 16-bit encoding such as UTF-16, each ASCII character is accompanied by a null byte. If the above advice did not work for you, replacing 's/\n,/,/g' with 's/\c@?\n(\c@?,)/$1/g' might work without breaking the encoding, though it may be safer to do it iteratively (applying 's/\c@?\n(\c@?,)/$1/' to each line instead of concatenating the lines and applying it globally). This is really a hack, not a substitute for real Unicode support, but if you just need a quick fix, or if you have guarantees concerning the encoding, it could help.
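
As an alternative to matching null bytes by hand, one could decode the data to characters first, apply the plain substitution, and encode it again. The following is a minimal sketch of that idea using the core Encode module; the UTF-16LE input is a made-up example, and the actual encoding of your file is an assumption you would need to verify.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the sample record encoded as UTF-16LE, where every
# ASCII character is followed by a null byte.
my $bytes = encode('UTF-16LE', "foo,bar,n\n,a,bc,d\n");

# Decode to characters, apply the plain substitution, then encode again.
my $text = decode('UTF-16LE', $bytes);
$text =~ s/\n,/,/g;
my $fixed = encode('UTF-16LE', $text);

print decode('UTF-16LE', $fixed);   # prints "foo,bar,n,a,bc,d"
```

When reading from a file instead of a string, the same effect can usually be had by opening the handle with an encoding layer, for example '<:encoding(UTF-8)', so that the regex always operates on characters rather than raw bytes.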