I am working with very large XML files (100s of MBs). The tree is fairly simple:

<items>
  <item>
    <column1>ABC</column1>
    <column2>DEF</column2>
  </item>
  <item>
    <column1>GHI</column1>
    <column2>KLM</column2>
  </item>
</items>

I need to parse this document and remove some <item> elements. So far, the best performance I have achieved is by using XmlReader, caching each <item> in memory, then writing it back out with XmlWriter if it meets the criteria and simply ignoring it if it doesn't. Is there anything I can do to make it faster?
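
For reference, the approach described above might look roughly like the following. This is only a sketch, not the asker's actual code: the `ItemFilter` class, the `keep` predicate, and the use of `XNode.ReadFrom` to buffer one <item> at a time are all assumptions made for illustration. It also assumes the simple tree shown above and does not copy attributes on the root element.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class ItemFilter
{
    // Streams the document from input to output, keeping only the <item>
    // elements for which keep(item) returns true. Only one <item> is held
    // in memory at a time.
    public static void Filter(TextReader input, TextWriter output, Func<XElement, bool> keep)
    {
        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output))
        {
            reader.MoveToContent();                       // position on the root, e.g. <items>
            writer.WriteStartElement(reader.LocalName);   // root attributes are not copied in this sketch
            reader.Read();                                // step inside the root

            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "item")
                {
                    // Buffers just this <item>. ReadFrom leaves the reader on the
                    // node after </item>, so no extra Read() is needed here.
                    var item = (XElement)XNode.ReadFrom(reader);
                    if (keep(item))
                    {
                        item.WriteTo(writer);
                    }
                }
                else
                {
                    reader.Read();                        // skip whitespace, </items>, and so on
                }
            }

            writer.WriteEndElement();                     // close the root
        }
    }
}
```

A call such as `ItemFilter.Filter(File.OpenText("in.xml"), File.CreateText("out.xml"), item => (string)item.Element("column1") != "ABC")` would drop every item whose <column1> is "ABC"; the file names and the criterion are made up for the example.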

A: 

You could use Perl or shell scripting to strip out the unwanted items, if you can write a quick regular expression that matches them. That would avoid loading the whole thing into memory and writing it back out.

Matt
In general, regular expressions cannot be used to match XML (or HTML), because they are not regular languages.
John Saunders
+1  A: 

You might be able to save a step by implementing a subclass of XmlReader whose Read method skips over the item elements you're not interested in. Right now, you seem to have two steps: reading and filtering the document with an XmlReader and then using XmlWriter to write it to something that you presumably then read it from. Subclassing XmlReader eliminates that second step; you use the subclassed XmlReader as the input to your XSLT transform or XmlDocument or whatever, and it never builds an intermediate representation of the filtered XML document.

Robert Rossney
This may work, but once I read forward, if my item is good, I'll need to move my "cursor" back to the start of the item. How do I do that?
Pasha
Well, there are (at least) two ways. You can have your XmlReader check its Stream's CanSeek property at creation and throw an exception if it can't seek; then you know you can save the position in the Stream when you start parsing an element, and if the element's good you can parse it again. The better way is to build some kind of intermediate representation for each node (the XmlNodeType, Name, Value, etc.) and save it in a list. Then either throw the list away or update the XmlReader's properties from the next item in the list when Read is called.
Robert Rossney
A: 

See if you can use XPath queries to select what you do and don't want to read with the XmlDocument object. Look into the following methods of that class: SelectSingleNode(), which returns an XmlNode object, and SelectNodes(), which returns an XmlNodeList object. See if that helps.
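
As a rough illustration of the SelectNodes() idea, something like the following could work. The `XPathFilter` class, its `RemoveMatches` method, and the XPath expression in the note below are hypothetical names chosen for this sketch. Note that XmlDocument loads the entire document into memory, so with files in the 100s of MBs this is likely to cost far more memory than a streaming XmlReader.

```csharp
using System.Collections.Generic;
using System.Xml;

static class XPathFilter
{
    // Loads the XML, removes every node matched by the XPath expression,
    // and returns the remaining document as a string.
    public static string RemoveMatches(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Copy the matches first: the XmlNodeList returned by SelectNodes
        // is not guaranteed to stay valid while the document is modified.
        var toRemove = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            toRemove.Add(node);
        }

        foreach (XmlNode node in toRemove)
        {
            // ParentNode is null only for detached nodes or the document itself.
            if (node.ParentNode != null)
            {
                node.ParentNode.RemoveChild(node);
            }
        }

        return doc.OuterXml;
    }
}
```

For the sample document in the question, `XPathFilter.RemoveMatches(xml, "/items/item[column1='ABC']")` would remove the first <item> and leave the second one in place.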

kd
A: 

This URL has the answer you are looking for:

http://stackoverflow.com/questions/62423/how-to-update-large-xml-file

vtd-xml-author
Note that Mr. Zhang is the author of VTD-XML.
John Saunders