I know Haskell's Data.ByteString.Lazy already has a function to split a CSV on a single character:

split :: Word8 -> ByteString -> [ByteString]

But I want to split on a multi-character ByteString (like splitting on a String instead of a Char):

split :: ByteString -> ByteString -> [ByteString]

I have multi-character separators in a csv-like text file that I need to parse, and the individual characters themselves appear in some of the fields, so choosing just one separator character and discarding the others would contaminate the data import.

I've had some ideas on how to do this, but they seem kind of hacky (e.g. take three Word8s, test if they're the separator combination, start a new field if they are, recurse further), and I imagine I would be reinventing a wheel anyway. Is there a way to do this without rebuilding the function from scratch?

+2  A: 

There are a few functions in bytestring for splitting on subsequences:

breakSubstring :: ByteString -> ByteString -> (ByteString,ByteString)

There's also a

Don Stewart
I would have to convert lazy ByteStrings to strict ByteStrings to use breakSubstring, but it looks like it might be worth it.
daniel
It looks like breakSubstring isn't in GHC 6.8 libs... is that right?
Jared Updike
+2  A: 

The documentation for Data.ByteString's breakSubstring contains a function that does what you are asking for:

tokenise x y = h : if null t then [] else tokenise x (drop (length x) t)
    where (h,t) = breakSubstring x y
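
The snippet above is written as if it lived inside Data.ByteString, so null, drop and length refer to the ByteString versions rather than the Prelude ones. A minimal self-contained sketch of the same idea, using qualified imports, might look like the following. The names splitOn and example are illustrative choices, not part of the bytestring library, and the guard for an empty separator is an addition that the documentation snippet does not have.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- | Split a strict ByteString on every occurrence of a multi-byte separator.
-- splitOn is a hypothetical name chosen for this sketch.
splitOn :: B.ByteString -> B.ByteString -> [B.ByteString]
splitOn sep str
  | B.null sep = [str]  -- added guard: an empty separator cannot advance through the input
  | otherwise  = go str
  where
    go s =
      let (h, t) = B.breakSubstring sep s  -- h: text before the separator, t: the rest
      in if B.null t
           then [h]                               -- no separator left: h is the last field
           else h : go (B.drop (B.length sep) t)  -- skip the separator and continue

-- | Illustrative input: the fields contain '|' and '~' on their own,
-- but only the full three-byte sequence "|~|" is treated as a separator.
example :: [B.ByteString]
example = splitOn (BC.pack "|~|") (BC.pack "a|b|~|c~d|~|e")
```

Loaded into GHCi, example should evaluate to the three fields "a|b", "c~d" and "e", so the single '|' and '~' characters inside the fields survive the split.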
sth
Nice function there, read my mind. It looks like we have a consensus of 3 for breakSubstring, even though I will still need to "toChunks" and "fromChunks" my ByteStrings to strict ByteStrings and back to use this. Any reason breakSubstring isn't in ByteString.Lazy?
daniel
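
For the lazy-to-strict round trip mentioned in the comments, one possible sketch is below. It assumes the whole input fits in memory, because collapsing the chunks into one strict ByteString gives up laziness. The names splitOnLazy, toStrictBS, fromStrictBS and exampleLazy are hypothetical helpers for this sketch; only toChunks, fromChunks, concat, breakSubstring, null, drop and length come from the bytestring library.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- | Collapse a lazy ByteString into one strict ByteString (forces the whole input).
toStrictBS :: BL.ByteString -> B.ByteString
toStrictBS = B.concat . BL.toChunks

-- | Wrap a strict ByteString as a single-chunk lazy ByteString.
fromStrictBS :: B.ByteString -> BL.ByteString
fromStrictBS s = BL.fromChunks [s]

-- | Split a lazy ByteString on a multi-byte separator by going through
-- strict ByteStrings and Data.ByteString.breakSubstring.
splitOnLazy :: BL.ByteString -> BL.ByteString -> [BL.ByteString]
splitOnLazy sep str
  | B.null sep' = [str]  -- added guard: an empty separator cannot advance through the input
  | otherwise   = map fromStrictBS (go (toStrictBS str))
  where
    sep' = toStrictBS sep
    go s =
      let (h, t) = B.breakSubstring sep' s
      in if B.null t
           then [h]
           else h : go (B.drop (B.length sep') t)

-- | Illustrative input whose separator "|~|" straddles a chunk boundary.
exampleLazy :: [BL.ByteString]
exampleLazy =
  splitOnLazy (fromStrictBS (BC.pack "|~|"))
              (BL.fromChunks [BC.pack "a|b|~", BC.pack "|c~d|~|e"])
```

Because the chunks are concatenated before searching, a separator that is split across two chunks is still found. The cost is that the file is no longer processed lazily, so for very large inputs a chunk-aware search would be needed instead.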