I am converting some working Haskell code that uses Parsec to use Attoparsec instead, in the hope of getting better performance. I have made the changes and everything compiles, but my parser does not work correctly.

I am parsing a file that consists of various record types, one per line. Each of my individual functions for parsing a record or a comment works correctly, but when I try to write a function to parse a sequence of records, the parser always returns a Partial result because it is expecting more input.

These are the two main variations that I've tried. Both have the same problem.

items :: Parser [Item]
items = sepBy (comment <|> recordType1 <|> recordType2) endOfLine
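
For concreteness, here is a hypothetical, minimal set of comment/record parsers that could sit behind this first variant (the question does not show the real definitions) and that still produces the Partial result:

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative
import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)

data Item = Comment ByteString | Record1 ByteString | Record2 ByteString
  deriving Show

-- Hypothetical line parsers: a '#' comment and two record types.
-- None of them consume the trailing end-of-line character.
comment, recordType1, recordType2 :: Parser Item
comment     = Comment <$> (char '#'     *> restOfLine)
recordType1 = Record1 <$> (string "R1:" *> restOfLine)
recordType2 = Record2 <$> (string "R2:" *> restOfLine)

restOfLine :: Parser ByteString
restOfLine = takeTill (\c -> c == '\n' || c == '\r')

With these definitions, running parse items on a complete, well-formed input in GHCi still prints Partial _ instead of a Done result, which is exactly the behaviour described above.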

For this second one I changed the record/comment parsers to consume the end-of-line characters.

items :: Parser [Item]
items = manyTill (comment <|> recordType1 <|> recordType2) endOfInput

Is there anything wrong with my approach? Is there some other way to achieve what I am attempting?

A: 

You give quite little information, which makes it hard to give you good help. However, there are a couple of comments I would like to make:

  • Perhaps the parser doesn't realize that the input is done and is waiting for either an EOL or another record, hence the partial result. Try feeding it the equivalent of an EOL to see whether that forces it to finish.
  • I can't remember the code, but using the Alternative instance (<|>) may be detrimental to parsing performance. If that is the case, you may want to branch explicitly between the comment and record parsers instead of backtracking through the alternatives (see the sketch after this list).
  • I use cereal for a lot of binary parsing and it is also extremely fast, although attoparsec seems better as a text parser. It is definitely an option worth considering.
  • Another option, in the longer run, is to use iteratee-based IO. John Lato wrote an excellent article on iteratees in the latest Monad Reader (issue #16, I believe). The end-of-input (EOF) condition is then for the iteratee to signal. Beware, though, that the iteratee types are quite daunting and take some time to get used to.
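
A sketch of that explicit dispatch, assuming the hypothetical parsers sketched in the question above (comments start with '#'); peekChar comes from Data.Attoparsec.ByteString.Char8:

item :: Parser Item
item = do
  -- Look at the first character of the line and pick a parser directly,
  -- instead of backtracking through the alternatives with <|>.
  c <- peekChar
  case c of
    Just '#' -> comment
    _        -> recordType1 <|> recordType2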
jlouis
Thanks for the suggestions. I've tried reducing the parser to the minimal version that exhibits the problem, such as removing the alternatives and just having a sequence of comments. In GHCi the comment function works as I expect, but the items function is problematic. I was wondering if there was something fundamentally wrong with my approach. I'm happy to share any extra information that might be useful. I tried feeding it an extra EOL but it made no difference. I saw iteratees mentioned in a few places but I've not encountered them before so I was trying to avoid them for the moment.
Dan Dyer
+2  A: 

I've run into this problem before, and my understanding is that it's caused by the way <|> works in the definition of sepBy1 (which sepBy is built on):

sepBy1 :: Alternative f => f a -> f s -> f [a]
sepBy1 p s = scan
    where scan = liftA2 (:) p ((s *> scan) <|> pure [])

This will only fall back to pure [] once (s *> scan) has failed, and that won't happen just because you're at the end of the input: rather than failing there, attoparsec suspends the parser and returns a Partial result asking for more.

My solution has been just to call feed with an empty ByteString on the Result returned by parse. This might be kind of a hack, but it also seems to be how attoparsec-iteratee deals with the issue:

f k (EOF Nothing)  = finalChunk $ feed (k S.empty) S.empty

As far as I can tell this is the only reason that attoparsec-iteratee works here and plain old parse doesn't.
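
In practice that looks something like the following sketch, reusing the items parser and Item type from the question:

import qualified Data.ByteString as B
import Data.Attoparsec.ByteString.Char8 (IResult (..), feed, parse)

-- Parse the whole input, then feed an empty chunk so that a Partial
-- continuation learns that no more input is coming.
parseItems :: B.ByteString -> Either String [Item]
parseItems input =
  case feed (parse items input) B.empty of
    Done _ xs    -> Right xs
    Fail _ _ err -> Left err
    Partial _    -> Left "still waiting for input"  -- shouldn't happen after the empty chunk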

Travis Brown
Thanks, that solved my problem.
Dan Dyer
+2  A: 

If you write an attoparsec parser that consumes as much input as possible before failing, you must tell the partial result continuation when you've reached the end of your input.
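
Concretely, "telling the continuation" means applying the Partial continuation to an empty chunk, which is what feed with an empty ByteString does in the answer above. A minimal sketch:

import qualified Data.ByteString as B
import Data.Attoparsec.ByteString.Char8 (IResult (Partial), Result)

-- Signal end of input to a suspended parser by handing its
-- continuation an empty chunk.
finish :: Result r -> Result r
finish (Partial k) = k B.empty
finish r           = r

For what it's worth, current attoparsec also provides parseOnly, which runs a parser against a single, complete input and returns Either String a, so there is no Partial result to handle at all.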

Bryan O'Sullivan
So `feed` is actually the correct way to deal with this situation? It might be a good idea to make this a little clearer in the documentation—I know it confused me when I first came across it.
Travis Brown