Link to truncated version of example document
I'm trying to extract the large chunk of text in the last "pre", process it, and output it. For the purposes of argument, let's say I want to apply
unwords . take 62 . drop 11 . lines
to the text and output it.
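Concretely, that pipeline splits the text into lines, drops the first 11, takes the next 62, and joins them with spaces. With smaller numbers, in GHCi:

ghci> (unwords . take 2 . drop 1 . lines) "a\nb\nc\nd"
"b c"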
As written, this takes over 400 MB of memory on a 4 MB HTML document.
The code I have is pretty simple; I originally left it out for fear of biasing responses, but here is one iteration of it:
file = readDocument [(a_validate, v_0), (a_parse_html, v_1)]
                    "Cache entry information.xhtml"

text = fmap last $ runX $
    file
    >>> deep (hasName "pre")    -- find every <pre> element
    />  isText                  -- select its text-node children
    -- >>> changeText (unwords . take 62 . drop 11 . lines)
    >>> getText                 -- extract the text as a String
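For reference, a self-contained wrapper around this arrow looks roughly like the following sketch. It uses the newer HXT 9 option style (withValidate/withParseHTML) in place of the (a_validate, v_0) pairs above, which play the same role; the file name is the one from the snippet.

import Text.XML.HXT.Core

main :: IO ()
main = do
    texts <- runX $
        readDocument [withValidate no, withParseHTML yes]
                     "Cache entry information.xhtml"
        >>> deep (hasName "pre")
        />  isText
        >>> getText
    putStrLn (last texts)    -- keep only the text of the last <pre>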
I think the problem is that, written this way, HXT tries to keep all of the text in memory as it reads it.
According to this, it appears that HXT at least needs to read the whole document, though not necessarily hold it all in memory.
I'm going to try other parsers, with HaXml being the next one.
N.B. I have since solved the initial problem by treating the input file as plain text, with the desired portion delimited by "<pre>00000000:" and "</pre></body>\n</html>".
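A minimal sketch of that workaround, assuming the marker strings above; the helper between is hypothetical. Lazy readFile keeps memory proportional to the retained portion rather than to the whole file.

import Data.List (isPrefixOf)

-- Text between the first occurrence of `start` and the next
-- occurrence of `end`, excluding both markers.
between :: String -> String -> String -> String
between start end =
    takeUntil end . drop (length start) . dropUntil start
  where
    dropUntil pat s@(_:rest)
        | pat `isPrefixOf` s = s
        | otherwise          = dropUntil pat rest
    dropUntil _ [] = []
    takeUntil pat s@(c:rest)
        | pat `isPrefixOf` s = []
        | otherwise          = c : takeUntil pat rest
    takeUntil _ [] = []

main :: IO ()
main = do
    doc <- readFile "Cache entry information.xhtml"   -- lazy I/O
    putStr . unwords . take 62 . drop 11 . lines
           $ between "<pre>00000000:" "</pre></body>\n</html>" doc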