+2  Q: 

Parse a TSV file

Hi,

I need to parse a file in TSV format (tab-separated values). I use a regex to break the file down into lines, but I cannot find a satisfactory one to parse each line. For now I've come up with this:

(?<g>("[^"]+")+|[^\t]+)

But it does not work if an item in the line has more than 2 consecutive double quotes.

Here's how the file is formatted: elements are separated by tabs. If an item contains a tab, it is enclosed in double quotes. If an item contains a double quote, the quote is doubled. But sometimes an element contains 4 consecutive double quotes, and the above regex splits that element into 2 different ones.

Examples:

item1ok "item""2""oK"

is correctly parsed into 2 elements: item1ok and item"2"ok (after trimming of the unnecessary quotes), but:

item1oK "item""""2oK"

is parsed into 3 elements: item1ok, item and "2ok (after trimming again).

Does anyone have an idea how to make the regex handle this case? Or is there another simple way to parse TSV? (I'm doing this in C#.)

A: 

Instead of using a regex, maybe you could try the String.Split(Char[]) method.
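
To see how a plain split treats the format described in the question, here is a minimal sketch in Python (used because it is the only language with code in this thread). It assumes that `str.split('\t')` behaves like C#'s `String.Split` for this purpose, and the sample line is made up for illustration.

```python
# A line with two fields; the second is quoted because it contains a tab.
line = 'item1ok\t"item\t2ok"'

# Naive approach: split on every tab, ignoring the quotes entirely.
fields = line.split('\t')

print(len(fields))  # 3, although the line only holds 2 fields
print(fields)       # ['item1ok', '"item', '2ok"']
```

The split has no notion of quoting, so a tab inside a quoted item is treated like any other delimiter.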

DaveB
String.Split() will consider encased tabulations as delimiters as well, so it's not correct.
Antoine
I thought of that as soon as I hit the save button. What can I say? I know, I suck.
DaveB
+5  A: 

Instead of trying to build your own CSV/TSV file parser (or using String.Split), I'd recommend you have a look at "Fast CSV Reader" or "FileHelpers library".

I'm using the first one, and am very happy with it (it supports any separator characters, e.g. comma, semicolon, tab).

M4N
I've used the Lumenworks CSV reader; it works well and would make a good base for a TSV reader.
Lazarus
+1 for FileHelpers!! Excellent library.
marc_s
That's surely a good solution, but I want to avoid adding dependencies to my code, so the .NET class answer suits my needs better.
Antoine
+5  A: 

You could use the TextFieldParser. This is technically a VB class, but you can use it from C# as well by referencing the Microsoft.VisualBasic assembly and importing the Microsoft.VisualBasic.FileIO namespace.

The example at the link above even shows using it on a tab-separated file.
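
TextFieldParser itself is .NET-only, so it cannot be shown in the Python used elsewhere in this thread. As a rough analogue of the same approach (a standard-library delimited-text reader configured with a tab delimiter and quoted fields), here is a sketch using Python's `csv` module. This is not TextFieldParser's API; the sample data is taken from the question and is assumed to use real tabs between items.

```python
import csv
import io

# Two lines from the question; \t is the field separator.
data = 'item1ok\t"item""2""oK"\nitem1oK\t"item""""2oK"\n'

# delimiter='\t' makes the reader tab-separated; doublequote=True (the
# default) treats "" inside a quoted field as one literal double quote.
reader = csv.reader(io.StringIO(data, newline=''),
                    delimiter='\t', quotechar='"', doublequote=True)
rows = list(reader)

for row in rows:
    print(row)
# ['item1ok', 'item"2"oK']
# ['item1oK', 'item""2oK']
```

The item with four consecutive double quotes comes back as a single field containing two literal quotes, which is the result the question asks for.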

Adam Neal
+1 It's part of the .NET Framework: it's supported by Microsoft, and it doesn't need separate deployment.
MarkJ
+1  A: 

I don't know C#, but this should do the trick (in Python):

import re

txt = 'item1ok\t"item""2""oK"\titem1oK\t"item""""2oK"\tsomething else'
# raw string: in verbose mode a literal tab in the pattern would be ignored,
# so \t must reach the regex engine as an escape sequence
regex = r'''
(?:                    # definition of a field
 "((?:[^"]|"")*)"   # either a double quoted field (allowing consecutive "")
 |                  # or
 ([^"\t]*)          # an unquoted field: anything except a double quote or a tab
)                      # end of field
(?:$|\t)               # each field followed by a tab (except the last one)
'''
r = re.compile(regex, re.X)
# now find each match, and replace "" by " in quoted fields
# also drop the last entry in the list (an empty match at the end of the string)
columns = [t[0].replace('""', '"') if t[0] != '' else t[1] for t in r.findall(txt)][:-1]
print(columns)
# prints: ['item1ok', 'item"2"oK', 'item1oK', 'item""2oK', 'something else']
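
If the regex becomes hard to maintain, the quoting rules from the question can also be written as a small character-by-character parser. The following is only a sketch under those rules; `parse_tsv_line` is a hypothetical helper name, not something from this thread or from any library, and it handles a single line (no embedded newlines).

```python
def parse_tsv_line(line):
    """Split one TSV line into fields (hypothetical helper, for illustration)."""
    fields = []
    current = []          # characters of the field being built
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    current.append('"')   # doubled quote -> one literal quote
                    i += 1                # skip the second quote of the pair
                else:
                    in_quotes = False     # closing quote
            else:
                current.append(ch)        # a tab is an ordinary character here
        elif ch == '"' and not current:
            in_quotes = True              # opening quote at the start of a field
        elif ch == '\t':
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append(''.join(current))       # last field has no trailing tab
    return fields


print(parse_tsv_line('item1oK\t"item""""2oK"'))
# ['item1oK', 'item""2oK']
```

Unlike a split on tabs, this keeps a tab inside a quoted item as part of the item, and it also keeps empty fields.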
Alex