In my application, I have a known offset of interest in an XML string, and want to answer questions like "what is my parent element?" without parsing the whole document.

This article mentions a library which appears to be in Objective-C for "backwards" XML parsing. My application doesn't require full XML support, so I'm happy to put up with all the caveats about not being able to parse completely robustly. Is there anything like this for C#/.NET?

Clarification: I'm not asking about parsing solutions or performance tradeoffs in general. I'm interested in the particular situation where I am at some point midway through a text stream and just need to know something about the local structure. Imagine a situation where I don't want to fetch the top of the document because accesses have very high latency.

+2  A: 

Sounds like XPathDocument might be what you are looking for. This class provides a fast, read-only, in-memory representation of an XML document. It doesn't build up a DOM and is optimized for XPath queries.

XPathDocument can also be used to parse XML fragments. To do so you have to create it from an XmlReader that has its conformance level set to fragment.

The following sample code first selects a set of XML nodes from an XML fragment and then selects the parent of each node based on an XPath expression:

using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

class Program
{
    static void Main(string[] args)
    {
        string xml = File.ReadAllText(@"C:\tmp\smplInput.xml");

        XmlReaderSettings xrs = new XmlReaderSettings();
        xrs.ConformanceLevel = ConformanceLevel.Fragment;

        using (TextReader textReader = new StringReader(xml))
        {
            using (XmlReader xmlReader = XmlReader.Create(textReader, xrs))
            {
                // Create a new XPathDocument   
                XPathDocument doc = new XPathDocument(xmlReader);

                // Create navigator   
                XPathNavigator navigator = doc.CreateNavigator();

                // Set up namespace manager for XPath   
                XmlNamespaceManager ns = new XmlNamespaceManager(navigator.NameTable);
                ns.AddNamespace("w", "http://www.example.com/2010/");

                // Select nodes  
                XPathNodeIterator users = navigator.Select("//w:user", ns);

                while (users.MoveNext())
                {
                    XPathNavigator user = users.Current;
                    XPathNavigator department = user.SelectSingleNode("parent::node()", ns);
                    Console.WriteLine(string.Format("User {0} is in department {1}",
                        user.GetAttribute("name", ns.DefaultNamespace),
                        department.GetAttribute("type", ns.DefaultNamespace)));
                }
            }
        }
    }
}

To try the code you could use the following XML input document:

<?xml version="1.0" encoding="utf-8" ?>
<w:departments xmlns:w="http://www.example.com/2010/">
  <w:department type="A">
    <w:user name="w" />
    <w:user name="x" />
    <w:department type="B">
      <w:user name="x" />
      <w:user name="y" />
    </w:department>
    <w:department type="C">
      <w:user name="x" />
      <w:user name="y" />
      <w:user name="z" />
    </w:department>
  </w:department>
  <w:department type="D">
    <w:user name="w" />
  </w:department>
</w:departments>
0xA3
Can you give an example of how I might use it to achieve this? In the normal way of using XPathDocument I still pass it a whole string, without indicating where in that string parsing should start.
kdt
I added an example. Just pass the whole string and execute your XPath queries to select the nodes of interest. Relying on a textual offset doesn't seem to be a good idea. XPathDocument should run with reasonable performance in most cases, so before trying to write your own parser I would give this a try and see if it is fast enough (writing your own parser would seem a bit like premature optimization). Please also note that performance can often be improved by fine-tuning the XPath queries.
0xA3
Okay, so XPathDocument isn't what I'm looking for -- it's not about general speed or efficiency, it's about the specific case where I know where in the text I want to start, and I want to completely avoid looking anywhere else. For example, I've got part of a file, and getting any other parts of it would involve going to high-latency storage like a tape robot.
kdt
@jcs: `XPathDocument` can also be used for parsing XML fragments. The only condition being that it is a well-formed fragment, i.e. every opening tag must have a corresponding closing tag on the same level. See my updated sample.
0xA3
+3  A: 

It's not possible to do this without making some significant assumptions about the nature of your text. Most notably, you have to assume that it's well-formed XML, and that it contains neither CDATA sections nor namespaces.

If you start at any position in the middle of a stream and back up until you hit what appears to be the start of an element, you have no way of knowing that the text you're looking at actually is the start of an element. It could be CDATA. And you can't tell that it's not CDATA until you've backtracked through the entire stream looking for <![CDATA[ and haven't found it.

Namespaces present a similar problem. If you find a start tag like <Foo, you can't know for certain that Foo is in the default namespace until you've backtracked all the way to the document's root element and ascertained that no ancestor element has a namespace declaration. If you find <x:Foo, you have to backtrack until you find an enclosing element with an xmlns:x declaration.

If you know for sure that the text is well-formed XML, that it doesn't contain CDATA, and that its use of namespaces is limited (i.e. you can tell what namespace an element is in just by looking at its start tag), then some of what you're trying to do is at least possible.

You can back up to the first start tag you encounter, create a StreamReader whose origin is that position, and use that to create an XPathDocument that's set up to handle document fragments. Note, by the way, that you have no assurance that the XPathDocument won't read all the way to the end of the text the first time you use it unless, again, you have knowledge about the nature of the text and you know that the matching end tag is going to be present.
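
As a rough illustration of the fragment approach just described, here is a minimal sketch. It assumes the text is already held in a string, that startTagOffset points at the '<' of a start tag, and that the element uses no namespace prefixes declared further up the document (an unbound prefix makes the reader throw). The names LocalFragment and LoadElementAt are hypothetical, and a StringReader stands in for the StreamReader mentioned above. Wrapping the reader with ReadSubtree is an addition of mine, intended to keep the XPathDocument from reading past the element's matching end tag.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class LocalFragment
{
    // Hypothetical helper: loads only the element whose start tag begins
    // at startTagOffset and returns a navigator positioned on its root.
    public static XPathNavigator LoadElementAt(string text, int startTagOffset)
    {
        var settings = new XmlReaderSettings
        {
            // Allow input that is not a complete document.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (var textReader = new StringReader(text.Substring(startTagOffset)))
        using (var xmlReader = XmlReader.Create(textReader, settings))
        {
            // Position the reader on the start tag itself.
            xmlReader.MoveToContent();

            // ReadSubtree yields a reader limited to this element and its
            // descendants, so loading stops at the matching end tag.
            using (var subtree = xmlReader.ReadSubtree())
            {
                return new XPathDocument(subtree).CreateNavigator();
            }
        }
    }
}
```

For example, LocalFragment.LoadElementAt(text, offset).SelectSingleNode("/b/c/@x") would query inside a b element that starts at offset. If the matching end tag is missing from the text, the reader will still run to the end of the input and fail there.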

But this won't handle the specific case you mentioned, i.e. finding the parent element. To find the parent element you'd need to find a start tag that isn't preceded (as you move backwards) by a matching end tag. This isn't terribly difficult to do - every < character you find is going to be the beginning of either a start tag, an end tag, or an empty element, and you can just put end tags on a stack and pop them off when you find their matching start tag. When you hit a start tag and the stack is empty, you're at the start of the parent element.
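
The backward scan with a stack of end tags might look something like the sketch below. It is only a sketch under the assumptions already stated: well-formed XML, no CDATA sections, and additionally no comments or processing instructions containing '<' or '>', and no '>' inside attribute values. BackwardScan, FindEnclosingStartTag and NameOf are hypothetical names. The scan only considers tags that begin strictly before the given offset, so passing the offset of a child's start tag yields that child's parent.

```csharp
using System;
using System.Collections.Generic;

static class BackwardScan
{
    // Returns the index of the '<' that opens the element enclosing `offset`,
    // or -1 if the scan reaches the start of the text without finding one.
    public static int FindEnclosingStartTag(string text, int offset)
    {
        var pendingEndTags = new Stack<string>();
        int pos = Math.Min(offset, text.Length) - 1;

        while (pos >= 0)
        {
            int lt = text.LastIndexOf('<', pos);
            if (lt < 0) break;

            int gt = text.IndexOf('>', lt);
            string tag = gt < 0 ? text.Substring(lt) : text.Substring(lt, gt - lt + 1);

            if (tag.StartsWith("</", StringComparison.Ordinal))
            {
                // End tag: remember it until its start tag turns up.
                pendingEndTags.Push(NameOf(tag));
            }
            else if (tag.StartsWith("<?", StringComparison.Ordinal) ||
                     tag.StartsWith("<!", StringComparison.Ordinal) ||
                     tag.EndsWith("/>", StringComparison.Ordinal))
            {
                // Declarations, comments and empty elements do not nest; skip.
            }
            else
            {
                // Start tag: with nothing pending, this is the enclosing element.
                if (pendingEndTags.Count == 0) return lt;
                if (pendingEndTags.Pop() != NameOf(tag))
                    throw new FormatException("Mismatched tags near offset " + lt);
            }

            pos = lt - 1;
        }
        return -1;
    }

    // Extracts the element name from a start or end tag such as "<x:Foo a='1'>".
    private static string NameOf(string tag)
    {
        int start = tag.StartsWith("</", StringComparison.Ordinal) ? 2 : 1;
        int end = start;
        while (end < tag.Length && !char.IsWhiteSpace(tag[end]) &&
               tag[end] != '>' && tag[end] != '/')
        {
            end++;
        }
        return tag.Substring(start, end - start);
    }
}
```

The stack holds names rather than a simple counter so that a mismatch, which would mean one of the assumptions does not hold, is reported instead of silently producing a wrong answer.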

But this too is a process that might result in your backtracking all the way to the stream's origin, especially in the trivial case where the XML you're looking at is the classically moronic XML log format:

<log>
   <entry>...</entry>
   <entry>...</entry>

...repeated ad infinitum

Robert Rossney
+1  A: 

Another approach is to parse the XML once and generate an index for it, so that next time you can load the index and don't need to parse the XML again. See the article below:

http://xml.sys-con.com/node/453082

vtd-xml-author
Jimmy, please realise that there are some questions to which VTD is not the answer. This is one of them. Here's a useful link... http://meta.stackoverflow.com/questions/21823/what-constitutes-spam
kdt
There is nothing wrong with proposing a general approach rather than a concrete answer... why is that link even relevant?
vtd-xml-author
@kdt: From the question as you asked it, it's not clear that VTD could not be an interesting approach. "I am at some point midway through a text stream and just need to know something about the local structure." This seems to be satisfied by VTD if you have already created the index. And regardless of whether you use VTD, pre-indexing your XML might tackle your actual problem. Maybe you need to give some more details in your question, as it seems none of the answers so far meet your requirements.
0xA3
Note that Mr. Zhang is the author of VTD-XML.
John Saunders
A: 

CAX from xponentsoftware does exactly what you want.

bill seacham