ansaurus

Question

Parsing delimited text with escape characters

Answer 1

+2 A:

I'd give the CSV class a try.

And a regex solution (hack?) might look like this:

#!/usr/bin/ruby -w

# contents of test.csv:
#   a,b,c
#   \a,b\,c
#   a,b,c\
#   d
#   a,b\\\,c

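# A token is either a run of escaped pairs (\x) and ordinary characters,
# or a line break. "+" is used rather than "*" so that scan never yields
# empty matches; as a consequence, empty fields are skipped.
tokens = File.read("test.csv").scan(/(?:\\.|[^,\r\n])+|\r?\n/m)
puts "-----------"
tokens.each do |token|
  if token == "\n" || token == "\r\n"
    puts "-----------"
  else
    puts ">" + token + "<"
  end
end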

which will produce the output:

-----------
>a<
>b<
>c<
-----------
>\a<
>b\,c<
-----------
>a<
>b<
>c\
d<
-----------
>a<
>b\\\,c<
-----------

Bart Kiers 2010-02-12 20:53:18

If you squint at this the right way, it kinda looks like CSV. It's not, of course, but if you're lucky you can convince the `csv` library that it is.

Jörg W Mittag 2010-02-12 21:51:31

The CSV class is too slow for what I need, unfortunately.

Stephen Touset 2010-02-13 16:41:48

I am pretty sure a regex solution will be slower than a (CSV) parser.

Bart Kiers 2010-02-13 17:42:45

Answer 2

+3 A:

Try this:

s.scan(/((?:\\.|[^,])*,?)/m)

It doesn't translate the characters following a \, but that can be done afterwards as a separate step.
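As a rough sketch of that two-step approach (split first, unescape afterwards), something along these lines might work. The helper names `split_fields`, `unescape` and `parse_line` are made up for illustration, and the comma appended before scanning is an assumption of this sketch, added so that every field is terminated and `scan` does not emit a spurious empty match at the end of the string:

```ruby
# Hypothetical helpers; the names are illustrative, not from the answer.

# Step 1: split on unescaped commas, leaving escape sequences untouched.
# A sentinel comma is appended so that the last field is terminated too.
def split_fields(line)
  (line + ",").scan(/((?:\\.|[^,])*),/m).flatten
end

# Step 2: replace every backslash-plus-character pair with the character.
def unescape(field)
  field.gsub(/\\(.)/m, '\1')
end

def parse_line(line)
  split_fields(line).map { |field| unescape(field) }
end

p parse_line("a,b\\,c")   # the input text is: a,b\,c
```

A line that ends in a lone backslash is treated leniently here: the backslash is kept as an ordinary character.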

Mark Byers 2010-02-12 21:01:32

Very nice, but it doesn't change the escaped characters to literal characters as the OP wanted. Which probably can't be done in a regex.

Tim Pietzcker 2010-02-12 21:06:08
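If splitting and unescaping need to happen together, one option is to step away from a single `scan` call and walk the string with `StringScanner` from the standard library. The following is only a sketch under that assumption; `parse_escaped` is a hypothetical name, and no claim is made here about how its speed compares with the regex or CSV approaches:

```ruby
require "strscan"

# Hypothetical helper: splits on unescaped commas and drops the escaping
# backslashes in the same pass over the input.
def parse_escaped(line)
  scanner = StringScanner.new(line)
  fields = []
  current = +""
  until scanner.eos?
    if (chunk = scanner.scan(/[^\\,]+/))
      current << chunk            # run of ordinary characters
    elsif scanner.scan(/\\(.)/m)
      current << scanner[1]       # escaped character, taken literally
    elsif scanner.scan(/,/)
      fields << current           # unescaped comma ends the field
      current = +""
    else
      current << scanner.getch    # lone backslash at the end of input
    end
  end
  fields << current
end

p parse_escaped("a,b\\,c")   # the input text is: a,b\,c
```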

I think this modification works, and doesn't capture the trailing commas themselves: `s.scan(/((?:\\.|[^,])*),?/m)`

Stephen Touset 2010-02-13 15:20:48

`s.scan(/((?:\\.|[^,])*)(?:,|\n|$)/mx)` seems to be a little more robust.

Stephen Touset 2010-02-13 15:34:47
