views:

3513

answers:

10

I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this through the System.Xml libraries likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.

I don't need to update the files at all, just read them and query the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child relationships - I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.

One way I can see of making it work is to perform the simple analysis using a stream-based approach, and perhaps to wrap the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.

Alternatively, I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.

I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...

A: 

Have you tried XPathDocument? This class is optimized for handling XPath queries efficiently.
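
For illustration, a minimal sketch of the XPathDocument approach (the file name and query are placeholders, not taken from the question):

    using System;
    using System.Xml.XPath;

    class Example
    {
        static void Main()
        {
            // XPathDocument builds a read-only, XPath-optimized in-memory model
            XPathDocument doc = new XPathDocument("input.xml");
            XPathNavigator nav = doc.CreateNavigator();

            // placeholder query; any XPath 1.0 expression works here
            XPathNodeIterator results = nav.Select("//order[customer/@id = '42']/total");
            while (results.MoveNext())
                Console.WriteLine(results.Current.Value);
        }
    }

Note that, as pointed out further down the thread, XPathDocument still loads the whole document into memory, just in a lighter-weight form than XmlDocument.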

If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.

0xA3
A: 

You've outlined your choices already.

Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.

If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
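
A rough sketch of the chunking approach, assuming the large file is essentially a flat list of repeating elements (the "record" element name and the query are invented for illustration):

    using System;
    using System.Xml;
    using System.Xml.XPath;

    class ChunkedQuery
    {
        static void Main()
        {
            using (XmlReader reader = XmlReader.Create("huge.xml"))
            {
                // Stream through the file, materializing one "record" subtree at a time
                while (reader.ReadToFollowing("record"))
                {
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        // Only this fragment is held in memory
                        XPathDocument fragment = new XPathDocument(subtree);
                        XPathNavigator nav = fragment.CreateNavigator();

                        // placeholder query, evaluated relative to the fragment
                        XPathNodeIterator hits = nav.Select("//item[@status='active']");
                        while (hits.MoveNext())
                            Console.WriteLine(hits.Current.Value);
                    }
                }
            }
        }
    }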

AnthonyWJones
+1  A: 

In order to perform XPath queries with the standard .NET classes, the whole document tree needs to be loaded in memory, which might not be a good idea if it can take up to a gigabyte. IMHO, XmlReader is a nice class for handling such tasks.
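
For queries that amount to "find every element of type X and read a value off it", a plain forward-only XmlReader pass is enough; a sketch, with the element and attribute names invented:

    using System;
    using System.Xml;

    class StreamingScan
    {
        static void Main()
        {
            using (XmlReader reader = XmlReader.Create("huge.xml"))
            {
                while (reader.Read())
                {
                    // Only the current node is held in memory at any time
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                    {
                        // GetAttribute does not move the reader, so the loop stays simple
                        string id = reader.GetAttribute("id");
                        Console.WriteLine("item " + id + " at depth " + reader.Depth);
                    }
                }
            }
        }
    }

The trade-off is that multi-level parent-child conditions have to be tracked by hand, e.g. with a stack of ancestor names, rather than expressed as a single XPath expression.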

Darin Dimitrov
XPathDocument is a light-weight class too.
0xA3
The problem with XPathDocument is that the whole document will be loaded in memory.
Darin Dimitrov
A: 
Dimitre Novatchev
A: 

I don't believe that using the classes in the System.Xml namespace is the answer here. Even if you streamed the file, depending on the location of the elements you are looking for, it could still take a lot of time.

My recommendation would be to use SQL Server and its support for XML. You can store the XML in a column, then apply indexes to it and query the column using your XPath statements.

SQL Server is going to do a much better job of traversing the tree and of managing the memory consumed by hosting the document.

If this is for a networked application, then it makes even more sense. If the installs are local, then use SQL Server Express.
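
If you go this route, a sketch of what the query side could look like from C# (the connection string, table, column and paths are all invented; it assumes an xml column, ideally with an XML index on it):

    using System;
    using System.Data.SqlClient;

    class XmlColumnQuery
    {
        static void Main()
        {
            // hypothetical schema: CREATE TABLE Docs (Id int PRIMARY KEY, Body xml)
            string connectionString = "Data Source=.;Initial Catalog=XmlStore;Integrated Security=True";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                // value()/exist() are the XQuery methods on the xml data type
                @"SELECT Body.value('(/orders/order/total)[1]', 'decimal(18,2)')
                  FROM Docs
                  WHERE Body.exist('/orders/order[customer/@id = ""42""]') = 1", conn))
            {
                conn.Open();
                Console.WriteLine(cmd.ExecuteScalar());
            }
        }
    }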

casperOne
+1  A: 

How about just reading the whole thing into a database and then working with the temp database? That might be better, because then your queries can be done more efficiently using T-SQL.

Donny V.
Another option could be to create a generic list with a data class, fill it with the XML data and then query it using LINQ. I've been doing that a lot lately and it works really well.
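
A sketch of that pattern, streaming the XML into a small data class so that only the flattened objects stay in memory (the class, element and attribute names are invented):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml;

    class Order
    {
        public int Id;
        public decimal Total;
    }

    class LoadAndQuery
    {
        static void Main()
        {
            var orders = new List<Order>();

            using (XmlReader reader = XmlReader.Create("huge.xml"))
            {
                while (reader.ReadToFollowing("order"))
                {
                    // GetAttribute does not advance the reader, so sibling elements are not skipped
                    orders.Add(new Order
                    {
                        Id = int.Parse(reader.GetAttribute("id")),
                        Total = decimal.Parse(reader.GetAttribute("total"))
                    });
                }
            }

            // LINQ to Objects takes the place of the XPath query once the data is flattened
            foreach (var expensive in orders.Where(o => o.Total > 1000m))
                Console.WriteLine(expensive.Id);
        }
    }
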
Donny V.
+2  A: 

Gigabyte XML files! I don't envy you this task.

Is there any way that the files could be sent in a better way? E.g. are they being sent over the net to you? If they are, then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea, but it could be very time-consuming indeed.

I wouldn't try to do it all in memory by reading the entire file - unless you have a 64-bit OS and lots of memory. What if the file becomes 2, 3, or 4 GB?

One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
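
.NET has no SAX parser out of the box, but the same split can be done with a forward-only XmlReader, writing each logical chunk out as its own file; a sketch, with the "batch" split element invented:

    using System.IO;
    using System.Xml;

    class Splitter
    {
        static void Main()
        {
            int i = 0;
            using (XmlReader reader = XmlReader.Create("huge.xml"))
            {
                reader.MoveToContent();   // position on the root element
                reader.Read();            // step into the root's children
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "batch")
                    {
                        // ReadOuterXml consumes the element and advances past its end tag,
                        // so no extra Read() is needed on this branch
                        File.WriteAllText("chunk" + (i++) + ".xml", reader.ReadOuterXml());
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }
    }

Each chunk file can then be loaded and queried with XPath as usual.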

I did read that XML::Twig (a Perl CPAN module) was written explicitly to handle SAX-based XPath parsing. Can you use a different language?

This might also help http://articles.techrepublic.com.com/5100-10878_11-1044772.html

Fortyrunner
A: 

I think the best solution is to write your own XML parser that can read small chunks rather than the whole file, or you can split the large file into small files and use the .NET classes on those. The problem is that you cannot parse some of the data until all of it is available, so I recommend using your own parser rather than the .NET classes.

Ahmed Said
+3  A: 

XPathReader is the answer. It isn't part of the .NET Framework, but it is available for download from Microsoft. Here is an MSDN article.

If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.

I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.

Quoting from the download below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".

Download from Microsoft
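
If memory of the sample serves, usage looked roughly like the sketch below; the class and method names (XPathCollection, ReadUntilMatch) come from that old download rather than the framework, so verify them against the package:

    using System;
    using System.Xml;
    // plus a using directive for the XPathReader sample's namespace once its assembly is referenced

    class StreamingXPath
    {
        static void Main()
        {
            // register the XPath expressions to match while streaming (the query is a placeholder)
            XPathCollection queries = new XPathCollection();
            queries.Add("/orders/order/total");

            XmlTextReader source = new XmlTextReader("huge.xml");
            XPathReader xpathReader = new XPathReader(source, queries);

            // ReadUntilMatch advances the underlying reader until one of the expressions matches
            while (xpathReader.ReadUntilMatch())
                Console.WriteLine(xpathReader.ReadString());
        }
    }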

Richard Wolf
The status/version of XPathReader is unclear. It apparently hasn't been updated since 2004. See http://stackoverflow.com/questions/465237/what-ever-happened-to-xpathreader
mjv
A: 

Since in your case the data size can run into gigabytes, have you considered using ADO.NET with XML as a database? In addition, the memory footprint would not be huge.

Another approach would be to use LINQ to XML with elements like XElementStream. Hope this helps.
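
One way to combine LINQ to XML with streaming is a custom iterator that yields one XElement at a time from an XmlReader; a sketch, with the file name, element name and query invented:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    class StreamingLinq
    {
        // Yields one fully built XElement per matching element, never the whole document
        static IEnumerable<XElement> StreamElements(string path, string elementName)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                reader.MoveToContent();
                reader.Read();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                    {
                        // ReadFrom consumes the element and leaves the reader on the next node
                        yield return (XElement)XNode.ReadFrom(reader);
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }

        static void Main()
        {
            var totals = from order in StreamElements("huge.xml", "order")
                         where (string)order.Element("customer") == "Smith"
                         select (decimal)order.Element("total");

            Console.WriteLine(totals.Sum());
        }
    }
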

StevenzNPaul