Link to truncated version of example document

I'm trying to extract the large chunk of text in the last "pre", process it, and output it. For the purposes of argument, let's say I want to apply

unwords . take 62 . drop 11 . lines

to the text and output it.

This takes over 400 MB of memory on a 4 MB HTML document when I do it.

The code I have is pretty simple, so at first I didn't include it for fear of biasing responses. Here is one iteration of the code:

import Text.XML.HXT.Arrow   -- HXT 8.x-style API

-- Don't validate; parse the input as HTML
file = readDocument [(a_validate, v_0), (a_parse_html, v_1)] "Cache entry information.xhtml"

-- Collect the text children of every <pre>, keeping the last result
text = fmap last $ runX $
  file >>>
  deep (hasName "pre") />
  isText >>>
--  changeText (unwords . take 62 . drop 11 . lines) >>>
  getText
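
A small driver along these lines (my sketch, not part of the original post) would run the arrow and apply the transformation afterwards:

main :: IO ()
main = do
  s <- text
  putStrLn . unwords . take 62 . drop 11 . lines $ s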

I think the problem is that the way I'm doing it, HXT is trying to keep all the text in memory as it reads it.

According to this, it appears that HXT at least needs to read the whole document, although it does not have to store it all in memory.

I'm going to try other parsers, HaXml being the next one.
N.B. I have solved the initial problem by treating the input file as plain text, with the desired portion delimited by "<pre>00000000:" and "</pre></body>\n</html>".
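
A minimal sketch of that workaround (my illustration, not the poster's actual code; it relies on lazy String I/O, and the delimiter strings are the ones quoted above):

import Data.List (isPrefixOf)

-- Keep only the text between the two delimiters, scanning lazily
-- so output is produced as the input streams past.
between :: String -> String -> String -> String
between open close = takeUntil close . dropUntil open
  where
    dropUntil pat s@(_:rest)
      | pat `isPrefixOf` s = drop (length pat) s
      | otherwise          = dropUntil pat rest
    dropUntil _ [] = []
    takeUntil pat s@(c:rest)
      | pat `isPrefixOf` s = []
      | otherwise          = c : takeUntil pat rest
    takeUntil _ [] = []

main :: IO ()
main = do
  doc <- readFile "Cache entry information.xhtml"
  putStr (between "<pre>00000000:" "</pre></body>\n</html>" doc)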

A: 

Try using a ByteString from the module Data.ByteString.Lazy. The usual String is a linked list of Chars; it is convenient for recursion but behaves pretty badly with large amounts of data. You can also try to make your functions stricter (e.g. using seq) to avoid a large overhead from unevaluated thunks. But be careful, as this can make things even worse if applied wrongly.
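
For example (a sketch of mine, not code from this answer; the file name is made up), lazy ByteString I/O combined with a strict left fold keeps memory flat:

import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')

main :: IO ()
main = do
  doc <- L.readFile "big.xhtml"   -- hypothetical input; read lazily, in chunks
  -- foldl' forces the accumulator at each step (it uses seq internally),
  -- so no chain of unevaluated thunks builds up while counting lines.
  print (foldl' (\n _ -> n + 1) (0 :: Int) (L.lines doc))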

PS: It's always a good idea to supply a brief example.

FUZxxl
Do you mean ByteString? I hadn't heard of BitString until you mentioned it, and it seems like it's specialized for when I care about individual bits.
Alex R
Yes... You're right.
FUZxxl
That was the first thing I tried, even before I moved to HXT. It didn't help at all; the problem was with the actual parsing.
Alex R
Did you use `Data.ByteString` or `Data.ByteString.Lazy`? Only the second one is lazy, while the first one will - as in your example - copy the whole thing into your precious RAM first.
FUZxxl
I used Data.ByteString.Lazy. Profiling the code, it became clear that the String vs. ByteString issue was not the major one.
Alex R
A: 

Is HXT's parser an "online" parser?

The example you have works fine for String, provided each line isn't pathologically long:

unwords . take 62 . drop 11 . lines

Here you will only consume 73 lines of input: 11 that you drop and 62 that you operate on. However, the example is mostly irrelevant to XML processing. If HXT's parser is not an online parser, you will have to read the whole file into memory before you can operate on any embedded string data.
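
You can convince yourself of this laziness (my example, not from the answer) by running the pipeline on an infinite input; it terminates because only the first 73 lines are ever demanded:

main :: IO ()
main = putStrLn . unwords . take 62 . drop 11 . lines
     $ unlines [ "line " ++ show n | n <- [1 :: Int ..] ]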

I'm afraid I don't know whether or not HXT is an online parser, but that would seem to be the crux of your problem.

Stephen Tetley
Unfortunately, I'm not sure whether HXT is an online parser or not; I tried TagSoup first, and it had a similar problem. I have seen people claiming to use HXT on files gigabytes in size, so I suspect it is an online parser.
Alex R
I'd suspect HXT is not an online (a.k.a. streaming) parser. Generally speaking, it is more work to design an online parser than a regular one, so unless it is stated otherwise I'd assume a given parser is non-streaming. A quick search turned up this thread: http://stackoverflow.com/questions/2292729/with-haskell-how-do-i-process-large-volumes-of-xml. Certainly hexpat would be a good candidate, as it is built on a SAX parser; HaXml also supports online (streaming) parsing, via importing the right module.
Stephen Tetley
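
To illustrate the streaming route mentioned in the last comment, here is a minimal sketch of mine (not Stephen's code) using hexpat's SAX interface. It assumes the input is well-formed XHTML, that <pre> elements do not nest, and that the event list from Text.XML.Expat.SAX.parse is produced lazily:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as L
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.XML.Expat.SAX

-- Emit the character data inside <pre> elements as it streams past.
preText :: [SAXEvent T.Text T.Text] -> [T.Text]
preText = go False
  where
    go _    (StartElement "pre" _ : es) = go True es
    go _    (EndElement "pre"     : es) = go False es
    go True (CharacterData t      : es) = t : go True es
    go b    (_                    : es) = go b es
    go _    []                          = []

main :: IO ()
main = do
  doc <- L.readFile "Cache entry information.xhtml"
  mapM_ TIO.putStr (preText (parse defaultParseOptions doc))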