Background
We have a project that was started in .NET 1.1, moved to .NET 2.0, and recently moved again to .NET 3.5. The project is extremely data-driven and utilizes XML for many of its data files. Some of these XML files are quite large and I would like to take the opportunity I currently have to improve the application's interaction with them. If possible, I want to avoid having to hold them entirely in memory at all times, but on the other hand, I want to make accessing their data fast.
The current setup uses XmlDocument and XPathDocument (depending on when it was written and by whom). The data is looked up when first requested and cached in an internal data structure (rather than as XML, which would take up more memory in most scenarios). In the past, this was a nice model as it had fast access times and low memory footprint (or at least, satisfactory memory footprint). Now, however, there is a feature that queries a large proportion of the information in one go, rather than the nicely spread out requests we previously had. This causes the XML loading, validation, and parsing to be a visible bottleneck in performance.
Question
Given a large XML file, what is the most efficient and responsive way to query its contents (such as, "does element A with id=B exist?") repeatedly without having the XML in memory?
Note that the data itself can be in memory, just not in its more bloated XML form if we can help it. In the worst case, we could accept a single file being loaded into memory to be parsed and then unloaded again to free resources, but I'd like to avoid that if at all possible.
Considering that we're already caching data where we can, this question could also be read as "which is faster and uses less memory: XmlDocument, XPathDocument, parsing based on XmlReader, or XDocument/LINQ-to-XML?"
Edit: Even simpler, can we randomly access the XML on disk without reading in the entire file at once?
Example
An XML file has some records:
<MyXml>
  <Record id='1'/>
  <Record id='2'/>
  <Record id='3'/>
</MyXml>
Our user interface wants to know if a record exists with an id of 3. We want to find out without having to parse and load every record in the file, if we can. So, if the record is in our cache, there is no XML interaction; if it isn't, we load just that record into the cache and respond to the request.
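To make the example concrete, here is a minimal sketch of the kind of lookup I have in mind, using XmlReader to stream forward and stop at the first match. The RecordLookup class and its members are hypothetical names for illustration, not our actual code, and the cache here only remembers ids, where a real one would hold the parsed record data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: class and member names are illustrative only.
public class RecordLookup
{
    private readonly string _path;

    // Ids already confirmed to exist. A real cache would store the parsed
    // record data rather than just the id.
    private readonly HashSet<string> _knownIds = new HashSet<string>();

    public RecordLookup(string path)
    {
        _path = path;
    }

    public bool RecordExists(string id)
    {
        // Cache hit: no XML interaction at all.
        if (_knownIds.Contains(id))
        {
            return true;
        }

        // Cache miss: stream forward through the file without building a
        // document tree, and stop as soon as the record is found.
        using (XmlReader reader = XmlReader.Create(_path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "Record"
                    && reader.GetAttribute("id") == id)
                {
                    _knownIds.Add(id);
                    return true;
                }
            }
        }

        return false;
    }
}
```

This keeps only one node in memory at a time, but on a cache miss it still parses every record ahead of the match, and the whole file when the id is absent, which is exactly the cost I would like to avoid.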
Goal
To have a scalable, fast way of querying and caching XML data files so that our user interface is responsive without resorting to multiple threads or the long-term retention of entire XML files in memory.
I realize that there may well be a blog or MSDN article on this somewhere and I will be continuing to Google after I've posted this question, but if anyone has some data that might help, or some examples of when one approach is better or faster than another, that would be great. Thanks.