I'm trying to write an application that analyzes data stored in fairly large XML files (from 10 to 800 MB). Each data set is stored as a single tag, with the actual data given as attributes. I'm currently using saxParse from HaXml, and I'm not satisfied with its memory usage. When parsing a 15 MB XML file it consumes more than 1 GB of memory, even though I tried not to store the data in lists and to process it immediately. I use the following code:

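import Control.Monad (forM_)
import Text.XML.HaXml.SAX (saxParse)

-- stripUnicodeBOM and extractAttrs are helpers that are not shown here.
importOneFile file proc ioproc = do
  xml <- readFile file
  let (sxs, res) = saxParse file $ stripUnicodeBOM xml
  case res of
      Just str -> putStrLn $ "Error: " ++ str
      Nothing -> forM_ sxs (ioproc . proc . (extractAttrs "row"))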

where 'proc' is a procedure that converts the data from attributes into a record, and 'ioproc' is a procedure that performs some IO action: output to the screen, storing in a database, etc.

How can I decrease memory consumption during XML parsing? Would switching to another XML parser help?

Update: and which parsers support different input encodings (utf-8, utf-16, utf-32, etc.)?

+2  A: 

I'm no Haskell expert, but what you're running into sounds like a classic space-leak (i.e., a situation in which Haskell's lazy evaluation is causing it to reserve more memory than necessary). You may be able to solve it by forcing strictness on your saxParse output.
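To make the space-leak idea concrete, here is a minimal sketch that uses only base. It is not HaXml code: fakeParse is a hypothetical stand-in for a SAX-style parser that returns a lazy list of events together with an error that is only known once the whole input has been read. One plausible cause of the behaviour in the question is that the error is inspected before the events are consumed, which forces the whole parse while the event list is still referenced. The sketch consumes the events first and looks at the error last; whether this actually helps with saxParse is an assumption you would need to confirm by profiling.

```haskell
import Control.Monad (forM_)

-- Hypothetical stand-in for a SAX-style parser: each input line is one
-- "event", and a line consisting of "!" is treated as a parse error.
-- The event list is produced lazily; the error is known only at the end.
fakeParse :: String -> ([String], Maybe String)
fakeParse input = go (lines input)
  where
    go []         = ([], Nothing)
    go ("!" : _)  = ([], Just "unexpected '!'")
    go (l : rest) = let (evs, err) = go rest in (l : evs, err)

-- Handle every event first and inspect the error afterwards, so that
-- checking for the error does not force the whole event list up front.
processStreaming :: (String -> IO ()) -> String -> IO ()
processStreaming act input = do
  let (events, err) = fakeParse input
  forM_ events act
  case err of
    Just msg -> putStrLn ("Error: " ++ msg)
    Nothing  -> return ()

main :: IO ()
main = processStreaming putStrLn "a\nb\nc\n"
```

The trade-off is that events seen before a late error have already been processed by the time the error is reported, so this ordering only fits if partial processing is acceptable.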

There's also a good chapter on profiling and optimization in Real World Haskell.

EDIT: Found another good resource on profiling/finding bottlenecks here.

rtperson
+3  A: 

If you're willing to assume that your inputs are valid, consider looking at TagSoup, or at Text.XML.Light from the Galois folks.

These take strings as input, so you can (indirectly) feed them anything Data.Encoding understands, namely

  • ASCII
  • UTF8
  • UTF16
  • UTF32
  • KOI8R
  • KOI8U
  • ISO88591
  • GB18030
  • BootString
  • ISO88592
  • ISO88593
  • ISO88594
  • ISO88595
  • ISO88596
  • ISO88597
  • ISO88598
  • ISO88599
  • ISO885910
  • ISO885911
  • ISO885913
  • ISO885914
  • ISO885915
  • ISO885916
  • CP1250
  • CP1251
  • CP1252
  • CP1253
  • CP1254
  • CP1255
  • CP1256
  • CP1257
  • CP1258
  • MacOSRoman
  • JISX0201
  • JISX0208
  • ISO2022JP
  • JISX0212
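
Since these parsers only need a String, any way of decoding the file will do. The sketch below deliberately does not use Data.Encoding; it swaps in the text encodings from base's System.IO (mkTextEncoding and hSetEncoding) as one dependency-free way of getting a decoded String. The function name readFileWithEncoding and the file name example-utf16.xml are made up for illustration, and encodings beyond the built-in UTF and Latin-1 ones may depend on what your platform provides.

```haskell
import System.IO

-- Read a file lazily as a String, decoding it with the named encoding,
-- for example "UTF-8", "UTF-16" or "UTF-32".
-- (readFileWithEncoding is a hypothetical helper, not a library function.)
readFileWithEncoding :: String -> FilePath -> IO String
readFileWithEncoding encName path = do
  enc <- mkTextEncoding encName
  h   <- openFile path ReadMode
  hSetEncoding h enc
  hGetContents h  -- lazy; the handle is closed once all input is consumed

main :: IO ()
main = do
  -- Write a small UTF-16 sample file (hypothetical name) to read back.
  withFile "example-utf16.xml" WriteMode $ \h -> do
    hSetEncoding h utf16
    hPutStr h "<row a=\"1\"/>"
  xml <- readFileWithEncoding "UTF-16" "example-utf16.xml"
  -- The decoded String can now be handed to a String-based parser.
  putStrLn xml
```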
Greg Bacon