I'm trying to write an application that analyzes data stored in fairly large XML files (from 10 to 800 MB). Each set of data is stored as a single tag, with the concrete data given as attributes. I'm currently using saxParse from HaXml, and I'm not satisfied with its memory usage: parsing a 15 MB XML file consumes more than 1 GB of memory, even though I tried not to accumulate the data in lists and to process it immediately. I use the following code:
import Control.Monad (forM_)
import Text.XML.HaXml.SAX (saxParse)

-- 'stripUnicodeBOM' and 'extractAttrs' are helper functions defined elsewhere.
importOneFile file proc ioproc = do
    xml <- readFile file
    let (sxs, res) = saxParse file $ stripUnicodeBOM xml
    case res of
        Just str -> putStrLn $ "Error: " ++ str
        Nothing  -> forM_ sxs (ioproc . proc . extractAttrs "row")
where proc is a function that converts the data from the attributes into a record, and ioproc is a function that performs some IO action with it: printing to the screen, storing it in a database, and so on.
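For concreteness, here is roughly the shape of proc and ioproc in my code. The record, field names, and attribute representation below are simplified stand-ins, not my real definitions (in particular, I've reduced the attributes to plain (name, value) pairs):

import Data.Maybe (fromMaybe)

-- A made-up record for one row of data; the real one has more fields.
data Row = Row { rowId :: Int, rowName :: String }

-- 'proc': build a record from a tag's attributes.
toRow :: [(String, String)] -> Row
toRow attrs = Row
    { rowId   = maybe 0 read (lookup "id" attrs)
    , rowName = fromMaybe "" (lookup "name" attrs)
    }

-- 'ioproc': perform some IO action with the record, e.g. print it.
printRow :: Row -> IO ()
printRow row = putStrLn (show (rowId row) ++ ": " ++ rowName row)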
How can I decrease memory consumption during XML parsing? Would switching to another XML parser help?
Update: also, which parsers support different input encodings: UTF-8, UTF-16, UTF-32, and so on?
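To illustrate what I mean: for a UTF-16 file I would apparently have to decode the bytes myself before handing the text to the parser. A sketch of that, assuming the bytestring and text packages (I haven't actually tried this):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Read a UTF-16LE file and decode it to a String for the parser.
readUtf16File :: FilePath -> IO String
readUtf16File path = do
    bytes <- B.readFile path
    return (T.unpack (TE.decodeUtf16LE bytes))

It would be nicer if the parser could detect and handle the encoding itself.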