Link to truncated version of example document

I'm trying to extract the large chunk of text in the last "pre", process it, and output it. For the purposes of argument, let's say I want to apply

unwords . take 62 . drop 11 . lines

to the text and output it.

This takes over 400 MB of memory on a 4 MB HTML document when I do it.

The code I have is pretty simple, so at first I didn't include it for fear of biasing responses. Here is one iteration of the code:

import Text.XML.HXT.Arrow   -- HXT 8.x-style API

-- Don't validate; parse the input as HTML
file = readDocument [(a_validate, v_0), (a_parse_html, v_1)] "Cache entry information.xhtml"

-- Collect the text children of every <pre>, keeping the last result
text = fmap last $ runX $
  file >>>
  deep (hasName "pre") />
  isText >>>
--  changeText (unwords . take 62 . drop 11 . lines) >>>
  getText
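
A small driver along these lines (my sketch, not part of the original post) would run the arrow and apply the transformation afterwards:

main :: IO ()
main = do
  s <- text
  putStrLn . unwords . take 62 . drop 11 . lines $ s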

I think the problem is that the way I'm doing it, HXT is trying to keep all the text in memory as it reads it.

According to this, it appears that HXT at least needs to read the whole document, although it does not have to store it all in memory.

I'm going to try other parsers, HaXml being the next one.
N.B. I have solved the initial problem by treating the input file as plain text, with the desired portion delimited by "<pre>00000000:" and "</pre></body>\n</html>".
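
A minimal sketch of that workaround (my illustration, not the poster's actual code; it relies on lazy String I/O, and the delimiter strings are the ones quoted above):

import Data.List (isPrefixOf)

-- Keep only the text between the two delimiters, scanning lazily
-- so output is produced as the input streams past.
between :: String -> String -> String -> String
between open close = takeUntil close . dropUntil open
  where
    dropUntil pat s@(_:rest)
      | pat `isPrefixOf` s = drop (length pat) s
      | otherwise          = dropUntil pat rest
    dropUntil _ [] = []
    takeUntil pat s@(c:rest)
      | pat `isPrefixOf` s = []
      | otherwise          = c : takeUntil pat rest
    takeUntil _ [] = []

main :: IO ()
main = do
  doc <- readFile "Cache entry information.xhtml"
  putStr (between "<pre>00000000:" "</pre></body>\n</html>" doc)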

A: 

Try using a ByteString from the module Data.ByteString.Lazy. The usual String is a linked list of Chars; it is convenient for recursion but behaves pretty badly with large amounts of data. You can also try to make your functions stricter (e.g. using seq) to avoid a large overhead from unevaluated thunks. But be careful, as this can make things even worse if applied wrongly.
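
For example (a sketch of mine, not code from this answer; the file name is made up), lazy ByteString I/O combined with a strict left fold keeps memory flat:

import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')

main :: IO ()
main = do
  doc <- L.readFile "big.xhtml"   -- hypothetical input; read lazily, in chunks
  -- foldl' forces the accumulator at each step (it uses seq internally),
  -- so no chain of unevaluated thunks builds up while counting lines.
  print (foldl' (\n _ -> n + 1) (0 :: Int) (L.lines doc))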

PS: It's always a good idea to supply a brief example.

FUZxxl
Do you mean ByteString? I hadn't heard of BitString until you mentioned it, and it seems like it's specialized for when I care about individual bits.
Alex R
Yes... You're right.
FUZxxl
That was the first thing I tried, even before I moved to HXT. It didn't help at all; the problem was with the actual parsing.
Alex R
Did you use `Data.ByteString` or `Data.ByteString.Lazy`? Only the second one is lazy, while the first one will - as in your example - copy the whole thing into your precious RAM first.
FUZxxl
I used Data.ByteString.Lazy. Profiling the code, it became clear that the String vs. ByteString issue was not the major one.
Alex R
A: 

Is HXT's parser an "online" parser?

The example you have works fine for String, provided each line isn't pathologically long:

unwords . take 62 . drop 11 . lines

Here you will only consume 73 lines of input: 11 that you drop and 62 that you operate on. However, the example is mostly irrelevant to XML processing. If HXT's parser is not an online parser, you will have to read the whole file into memory before you can operate on any embedded string data.
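
You can convince yourself of this laziness (my example, not from the answer) by running the pipeline on an infinite input; it terminates because only the first 73 lines are ever demanded:

main :: IO ()
main = putStrLn . unwords . take 62 . drop 11 . lines
     $ unlines [ "line " ++ show n | n <- [1 :: Int ..] ]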

I'm afraid I don't know whether or not HXT is an online parser, but that would seem to be the crux of your problem.

Stephen Tetley
Unfortunately, I'm not sure whether HXT is an online parser or not; I tried TagSoup first, and it had a similar problem. I have seen people claiming to use HXT on files gigabytes in size, so I suspect it is an online parser.
Alex R
I'd suspect HXT is not an online (a.k.a. streaming) parser. Generally speaking, it is more work to design an online parser than a regular one, so unless it is stated otherwise I'd assume a given parser is non-streaming. A quick search turned up this thread: http://stackoverflow.com/questions/2292729/with-haskell-how-do-i-process-large-volumes-of-xml. Certainly hexpat would be a good candidate, as it is built on a SAX parser; HaXml also supports online (streaming) parsing, via importing the right module.
Stephen Tetley
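
To illustrate the streaming route mentioned in the last comment, here is a minimal sketch of mine (not Stephen's code) using hexpat's SAX interface. It assumes the input is well-formed XHTML, that <pre> elements do not nest, and that the event list from Text.XML.Expat.SAX.parse is produced lazily:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as L
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.XML.Expat.SAX

-- Emit the character data inside <pre> elements as it streams past.
preText :: [SAXEvent T.Text T.Text] -> [T.Text]
preText = go False
  where
    go _    (StartElement "pre" _ : es) = go True es
    go _    (EndElement "pre"     : es) = go False es
    go True (CharacterData t      : es) = t : go True es
    go b    (_                    : es) = go b es
    go _    []                          = []

main :: IO ()
main = do
  doc <- L.readFile "Cache entry information.xhtml"
  mapM_ TIO.putStr (preText (parse defaultParseOptions doc))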