I'm trying to parse (in Ruby) what's effectively the UNIX passwd file format, but with comma delimiters and an escape character \ such that any escaped character is taken literally. I'm trying to use a regular expression for this, but I'm coming up short, even when using Oniguruma for lookahead/lookbehind assertions.

Essentially, all of the following should work:

a,b,c    # => ["a", "b", "c"]
\a,b\,c  # => ["a", "b,c"]
a,b,c\
d        # => ["a", "b", "c\nd"]
a,b\\\,c # => ["a", "b\\,c"]

Any ideas?

The first response looks pretty good. With a file containing

\a,,b\\\,c\,d,e\\f,\\,\
g

it gives:

[["\\a,"], [","], ["b\\\\\\,c\\,d,"], ["e\\\\f,"], ["\\\\,"], ["\\\ng\n"], [""]]

which is pretty close. I don't need the unescaping done on this first pass, as long as everything splits correctly on the commas. I tried Oniguruma and ended up with this much longer pattern:

Oniguruma::ORegexp.new(%{
  (?:       # - begins with (but doesn't capture)
    (?<=\A) #   - start of line
    |       #   - (or) 
    (?<=,)  #   - a comma
  )

  (?:           # - contains (but doesn't capture)
    .*?         #   - any set of characters
    [^\\\\]?    #   - not ending in a slash
    (\\\\\\\\)* #   - followed by an even number of slashes
  )*?

  (?:      # - ends with (but doesn't capture)
    (?=\Z) #   - end of line
    |      #   - (or)
    (?=,)) #   - a comma
  },

  'mx'
).scan(s)
+2  A: 

I'd give the CSV class a try.

And a regex solution (hack?) might look like this:

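#!/usr/bin/ruby -w

# contents of test.csv:
#   a,b,c
#   \a,b\,c
#   a,b,c\
#   d
#   a,b\\\,c

# A token is either a non-empty run of escaped pairs (backslash + any
# character) and characters other than comma/CR/LF, or a line break.
# Note the `+` rather than `*`: with `*` the first alternative would also
# match the empty string, so the line-break alternative would never be tried.
# The trade-off is that empty fields (as in "a,,b") are skipped.
tokens = File.read("test.csv").scan(/(?:\\.|[^,\r\n])+|\r?\n/m)
puts "-----------"
tokens.each do |token|
  if token == "\n" || token == "\r\n"
    puts "-----------"   # an unescaped line break ends the record
  else
    puts ">" + token + "<"
  end
end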

which will produce the output:

-----------
>a<
>b<
>c<
-----------
>\a<
>b\,c<
-----------
>a<
>b<
>c\
d<
-----------
>a<
>b\\\,c<
-----------
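
If you want the tokens grouped per record rather than printed, something along these lines might do. It's only a sketch: `parse_records` is a made-up name, it assumes the `+` form of the pattern (so empty fields are dropped), and it leaves the escapes in place.

```ruby
# Hypothetical helper, not part of any library: collects the tokens of each
# record (a line ended by an unescaped line break) into its own array.
def parse_records(text)
  records = [[]]
  # Line breaks are tried first; otherwise take a non-empty run of escaped
  # pairs and characters that are not a comma, CR or LF.
  text.scan(/\r?\n|(?:\\.|[^,\r\n])+/m) do |token|
    if token == "\n" || token == "\r\n"
      records << []            # unescaped line break: start a new record
    else
      records.last << token    # field token, escapes still in place
    end
  end
  records.pop if records.last.empty?  # drop the empty record after a final line break
  records
end

p parse_records("a,b\nc\\\nd,e\n")  # => [["a", "b"], ["c\\\nd", "e"]]
```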
Bart Kiers
If you squint at this the right way, it kinda looks like CSV. It's not, of course, but if you're lucky you can convince the `csv` library that it is.
Jörg W Mittag
The CSV class is too slow for what I need, unfortunately.
Stephen Touset
I am pretty sure a regex solution will be slower than a (CSV) parser.
Bart Kiers
+3  A: 

Try this:

s.scan(/((?:\\.|[^,])*,?)/m)

It doesn't translate the characters following a \, but that can be done afterwards as a separate step.
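
A rough sketch of how the two steps could fit together for a single record. `split_escaped` is a hypothetical name, and the sentinel comma is my own assumption to stop `scan` from returning a spurious empty match at the end of the input:

```ruby
# Hypothetical helper: split one record on unescaped commas, then unescape
# each field in a second pass.
def split_escaped(record)
  # With a sentinel comma appended, every field is comma-terminated, so the
  # pattern never has to match an empty string at the end of the input.
  fields = (record + ",").scan(/((?:\\.|[^,])*),/m).flatten
  # Second pass: replace each backslash-escaped character by the character itself.
  fields.map { |field| field.gsub(/\\(.)/m) { $1 } }
end

p split_escaped('a,b,c')          # => ["a", "b", "c"]
p split_escaped('\a,b\,c')        # => ["a", "b,c"]
p split_escaped("a,b,c\\\nd")     # => ["a", "b", "c\nd"]
p split_escaped('a,b\\\\\\,c')    # => ["a", "b\\,c"]  (input text is a,b\\\,c)
```

A lone backslash at the very end of the input has nothing to escape; in this sketch it is kept as a literal backslash.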

Mark Byers
Very nice, but it doesn't change the escaped characters to literal characters as the OP wanted. Which probably can't be done in a regex.
Tim Pietzcker
I think this modification works, and doesn't capture the trailing commas themselves: `s.scan(/((?:\\.|[^,])*),?/m)`
Stephen Touset
`s.scan /((?:\\.|[^,])*)[,\n$]/mx` seems to be a little more robust.
Stephen Touset