ansaurus

Question

With Haskell, how do I process large volumes of XML?

Answer 1

+10 A:

I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently: String is [Char], a linked list of characters, which is an inappropriate representation for very large flat files.
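As a rough illustration of the point, here is a minimal sketch that scans input as a lazy ByteString instead of a String. It uses only the bytestring package. The function name countRows and the sample data are made up for this example, and it assumes one row element per line, which may not hold for every file.

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as BL

-- | Count the lines that open a <row ...> element.
-- Assumption: each row element starts on its own line.
countRows :: BL.ByteString -> Int
countRows = length . filter isRow . BL.lines
  where
    -- Skip leading spaces, then check for the opening tag.
    isRow = BL.isPrefixOf (BL.pack "<row ") . BL.dropWhile (== ' ')

-- | Made-up sample input standing in for a real posts.xml.
sample :: BL.ByteString
sample = BL.pack "<posts>\n  <row Id=\"1\" />\n  <row Id=\"2\" />\n</posts>\n"

main :: IO ()
main = print (countRows sample)
```

For a real file, BL.readFile "posts.xml" >>= print . countRows reads the input lazily in chunks, so the whole file need not be resident at once.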

That then implies you'll need to use a Haskell XML library that supports ByteStrings. The couple of dozen XML libraries are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml

I'm not sure which of them support ByteStrings, but that's the criterion to look for.

Don Stewart 2010-02-18 23:36:39

+1 for ByteStrings, those (still) don't get as much love as they deserve. Forgetting the poor performance of long `String`s is an all-too-easy mistake--don't leave things in lists just because they're easy, folks!

camccann 2010-02-19 01:40:49

Answer 2

+1 A:

Perhaps you need a lazy XML parser: your usage looks like a pretty straightforward scan through the input. HaXml has a lazy parser, although you must ask for it explicitly by importing the correct module.

Malcolm Wallace 2010-02-19 11:55:22

Answer 3

+2 A:

Below is an example that uses hexpat:

{-# LANGUAGE PatternGuards #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (StartElement "row" as)
  | Just ouid <- lookup "OwnerUserId" as = ouid == uid
  | otherwise = False
ownedBy _ _ = False

The definition of ownedBy is a little clunky. A view pattern may read better:

{-# LANGUAGE ViewPatterns #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
ownedBy _ _ = False

ownerUserId :: SAXEvent String String -> Maybe String
ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
ownerUserId _ = Nothing
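Two details of the code above can be sketched without hexpat: head fails when no row matches, and the scan only needs to run until the first match. The following is a minimal sketch under stated assumptions. The Event type is a hypothetical stand-in for SAXEvent String String, not the library's own type, and the sample events are invented.

```haskell
module Main where

import Data.List (find)

-- Hypothetical stand-in for hexpat's SAXEvent String String, cut down to the
-- two constructors this example needs; it is not the library's type.
data Event
  = StartElement String [(String, String)]
  | EndElement String
  deriving (Eq, Show)

-- Same test as ownedBy above, written as a comparison of Maybe values.
ownedBy :: String -> Event -> Bool
ownedBy uid (StartElement "row" attrs) = lookup "OwnerUserId" attrs == Just uid
ownedBy _ _ = False

-- find returns Nothing when nothing matches, where head . filter would crash.
earliest :: String -> [Event] -> Maybe Event
earliest uid = find (ownedBy uid)

-- Made-up sample events standing in for the parser's output.
events :: [Event]
events =
  [ StartElement "posts" []
  , StartElement "row" [("Id", "4"), ("OwnerUserId", "8")]
  , EndElement "row"
  , StartElement "row" [("Id", "7"), ("OwnerUserId", "83805")]
  , EndElement "row"
  , EndElement "posts"
  ]

main :: IO ()
main = do
  print (earliest "83805" events)
  print (earliest "no-such-user" events)
  -- The scan stops at the first match, so even an endless stream terminates.
  print (earliest "83805" (cycle events))
```

Because find is lazy, a parser that yields its events lazily only has to read as far as the first matching row.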

Greg Bacon 2010-02-22 20:02:33

Answer 4

+2 A:

TagSoup supports ByteString via its StringLike class (module Text.StringLike). The only changes needed to your example were to call readFile from Data.ByteString.Lazy and to add a fromString to the argument of fromAttrib:

import Control.Monad (liftM)
import Text.HTML.TagSoup
import Text.StringLike
import qualified Data.ByteString.Lazy as BSL

userid = "83805"
file = "blah//posts.xml"

main = do
  -- Reading a lazy ByteString makes parseTags produce ByteString tags
  posts <- liftM parseTags (BSL.readFile file)
  -- Print the Id attribute of the first row owned by userid
  print $ head $ map (fromAttrib (fromString "Id")) $
    filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
    posts

Your example ran for me (on a machine with 4 GB of RAM), taking 6 minutes; the ByteString version took 10 minutes.

ja 2010-02-28 17:17:39
