ansaurus

Question

Is there a backwards XML parser for .NET?

Answer 1

+2 A:

Sounds like XPathDocument might be what you are looking for. This class provides a fast, read-only, in-memory representation of an XML document. It doesn't build up a DOM and is optimized for XPath queries.

XPathDocument can also be used to parse XML fragments. To do so you have to create it from an XmlReader that has its conformance level set to fragment.

The following sample code first selects a set of XML nodes from an XML fragment and then selects the parent of each node based on an XPath expression:

using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

class Program
{
    static void Main(string[] args)
    {
        string xml = File.ReadAllText(@"C:\tmp\smplInput.xml");

        XmlReaderSettings xrs = new XmlReaderSettings();
        xrs.ConformanceLevel = ConformanceLevel.Fragment;

        using (TextReader textReader = new StringReader(xml))
        {
            using (XmlReader xmlReader = XmlReader.Create(textReader, xrs))
            {
                // Create a new XPathDocument   
                XPathDocument doc = new XPathDocument(xmlReader);

                // Create navigator   
                XPathNavigator navigator = doc.CreateNavigator();

                // Set up namespace manager for XPath   
                XmlNamespaceManager ns = new XmlNamespaceManager(navigator.NameTable);
                ns.AddNamespace("w", "http://www.example.com/2010/");

                // Select nodes  
                XPathNodeIterator users = navigator.Select("//w:user", ns);

                while (users.MoveNext())
                {
                    XPathNavigator user = users.Current;
                    XPathNavigator department = user.SelectSingleNode("parent::node()", ns);
                    // Unprefixed attributes are in no namespace, so pass an empty namespace URI
                    Console.WriteLine(string.Format("User {0} is in department {1}",
                        user.GetAttribute("name", string.Empty),
                        department.GetAttribute("type", string.Empty)));
                }
            }
        }
    }
}

To try the code you could use the following XML input document:

<?xml version="1.0" encoding="utf-8" ?>
<w:departments xmlns:w="http://www.example.com/2010/">
  <w:department type="A">
    <w:user name="w" />
    <w:user name="x" />
    <w:department type="B">
      <w:user name="x" />
      <w:user name="y" />
    </w:department>
    <w:department type="C">
      <w:user name="x" />
      <w:user name="y" />
      <w:user name="z" />
    </w:department>
  </w:department>
  <w:department type="D">
    <w:user name="w" />
  </w:department>
</w:departments>

0xA3 2010-01-04 11:27:14

Can you give an example of how I might use it to achieve this? In the normal way of using XPathDocument I still pass it a whole string, without indicating where in that string parsing should start.

kdt 2010-01-04 11:37:19

I added an example. Just pass the whole string and execute your XPath queries to select the nodes of interest. Relying on a textual offset doesn't seem to be a good idea. XPathDocument should run with reasonable performance in most cases, so before trying to write your own parser I would give this a try and see whether it is fast enough; writing your own parser would seem a bit like premature optimization. Note also that performance can often be improved by fine-tuning the XPath queries.

0xA3 2010-01-04 11:50:47

Okay, so XPathDocument isn't what I'm looking for -- it's not about general speed or efficiency, it's about the specific case where I know where in the text I want to start, and I want to completely avoid looking anywhere else. For example, I've got part of a file, and getting any other parts of it would involve going to high-latency storage like a tape robot.

kdt 2010-01-04 14:27:18

@jcs: `XPathDocument` can also be used for parsing XML fragments. The only condition is that the fragment is well-formed, i.e. every opening tag must have a corresponding closing tag on the same level. See my updated sample.

0xA3 2010-01-04 15:53:51

Answer 2

+3 A:

It's not possible to do this without making some significant assumptions about the nature of your text. Most notably, you have to assume that it's well-formed XML, and that it contains neither CDATA sections nor namespaces.

If you start at any position in the middle of a stream and back up until you hit what appears to be the start of an element, you have no way of knowing that the text you're looking at actually is the start of an element. It could be CDATA. And you can't tell that it's not CDATA until you've backtracked through the entire stream looking for <![CDATA[ and haven't found it.

Namespaces present a similar problem. If you find a start tag like <Foo, you can't know for certain that Foo is in the default namespace until you've backtracked all the way to the document's root element and ascertained that no ancestor element has a namespace declaration. If you find <x:Foo, you have to backtrack until you find an enclosing element with an xmlns:x declaration.

If you know for sure that the text is well-formed XML, that it doesn't contain CDATA, and that its use of namespaces is limited (i.e. you can tell what namespace an element is in just by looking at its start tag), then some of what you're trying to do is at least possible.
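If the prefixes used in the text are known in advance, they can be handed to the reader up front so that nothing has to be looked up in ancestor elements. The following is only a sketch under that assumption: the class and method names are made up for illustration, and the prefix and URI are simply the ones from the sample document in the first answer.

```csharp
using System;
using System.IO;
using System.Xml;

static class KnownNamespaces
{
    // Creates a reader for a fragment whose prefixes are declared
    // outside the text that is actually available.
    public static XmlReader CreateFragmentReader(string fragment)
    {
        // Assumption: we already know which URI the "w" prefix maps to.
        var namespaces = new XmlNamespaceManager(new NameTable());
        namespaces.AddNamespace("w", "http://www.example.com/2010/");

        // The parser context supplies the bindings that the missing
        // ancestor elements would normally have provided.
        var context = new XmlParserContext(null, namespaces, null, XmlSpace.None);

        var settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        return XmlReader.Create(new StringReader(fragment), settings, context);
    }
}
```

Calling `KnownNamespaces.CreateFragmentReader("<w:user name=\"x\" />")` and then `MoveToContent()` should leave the reader on the `w:user` element with its namespace resolved. Without the parser context, the reader would reject the fragment because `w` is undeclared.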

You can back up to the first start tag you encounter, create a StreamReader whose origin is that position, and use that to create an XPathDocument that's set up to handle document fragments. Note, by the way, that you have no assurance that the XPathDocument won't read all the way to the end of the text the first time you use it unless, again, you have knowledge about the nature of the text and you know that the matching end tag is going to be present.
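One way to keep the reader from running past the element of interest is to load only that element's subtree. The sketch below assumes the text is already in a string and that the index of the start tag is known. With a real stream you would seek to the corresponding byte offset and wrap the stream in a StreamReader instead. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class FragmentAtOffset
{
    // Loads only the element whose start tag begins at startTagIndex.
    public static XPathNavigator LoadElementAt(string text, int startTagIndex)
    {
        var settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (var textReader = new StringReader(text.Substring(startTagIndex)))
        using (XmlReader reader = XmlReader.Create(textReader, settings))
        {
            // Position the reader on the element itself.
            reader.MoveToContent();

            // ReadSubtree stops at the matching end tag, so whatever follows
            // the element (such as an unmatched </log>) is never parsed.
            using (XmlReader subtree = reader.ReadSubtree())
            {
                return new XPathDocument(subtree).CreateNavigator();
            }
        }
    }
}
```

The returned navigator is positioned on the document root, so `MoveToFirstChild()` moves it to the loaded element. The underlying reader may still buffer some text beyond the end tag, so this limits what gets parsed, not necessarily what gets read from storage.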

But this won't handle the specific case you mentioned, i.e. finding the parent element. To find the parent element you'd need to find a start tag that isn't preceded (as you move backwards) by a matching end tag. This isn't terribly difficult to do - every < character you find is going to be the beginning of either a start tag, an end tag, or an empty element, and you can just put end tags on a stack and pop them off when you find their matching start tag. When you hit a start tag and the stack is empty, you're at the start of the parent element.
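The stack-based scan described above could be sketched roughly as follows. This is only an illustration under the assumptions already stated (well-formed text, no CDATA sections), plus a few more of my own: no comments, no `>` inside attribute values, and a starting offset that lies between tags, not inside one. The names `BackwardScan` and `FindParentStart` are made up.

```csharp
using System;
using System.Collections.Generic;

static class BackwardScan
{
    // Returns the index of the '<' that opens the enclosing (parent) start tag,
    // or -1 if none is found before the beginning of the text.
    public static int FindParentStart(string text, int offset)
    {
        var pendingEndTags = new Stack<string>();

        for (int p = Math.Min(offset, text.Length) - 1; p >= 0; p--)
        {
            if (text[p] != '<') continue;

            int q = text.IndexOf('>', p);
            if (q < 0) return -1;                       // truncated tag
            string tag = text.Substring(p, q - p + 1);  // the whole tag, '<' through '>'

            if (tag.StartsWith("<?", StringComparison.Ordinal) ||
                tag.StartsWith("<!", StringComparison.Ordinal))
            {
                continue;                               // XML declaration or similar: ignore
            }
            if (tag.StartsWith("</", StringComparison.Ordinal))
            {
                // End tag: remember it until its start tag shows up.
                pendingEndTags.Push(tag.Substring(2, tag.Length - 3).Trim());
            }
            else if (tag.EndsWith("/>", StringComparison.Ordinal))
            {
                continue;                               // empty element: nothing to match
            }
            else if (pendingEndTags.Count == 0)
            {
                return p;                               // unmatched start tag: the parent
            }
            else if (pendingEndTags.Pop() != NameOf(tag))
            {
                return -1;                              // mismatched tags: not well-formed
            }
        }
        return -1;                                      // reached the origin without a parent
    }

    // Extracts the element name from a start tag, e.g. "entry" from <entry id="1">.
    private static string NameOf(string startTag)
    {
        int end = startTag.IndexOfAny(new[] { ' ', '\t', '\r', '\n', '>' }, 1);
        return startTag.Substring(1, end - 1);
    }
}
```

The loop only ever moves towards the origin, so the amount of text it touches depends on how far back the parent's start tag is.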

But this too is a process that might result in your backtracking all the way to the stream's origin, especially in the trivial case where the XML you're looking at is the classically moronic XML log format:

<log>
   <entry>...</entry>
   <entry>...</entry>

...repeated ad infinitum

Robert Rossney 2010-01-04 17:06:31

Answer 3

+1 A:

Another approach is to parse the XML once and generate an index for it, so that next time you can load the index instead of parsing the XML again. See the article below:

http://xml.sys-con.com/node/453082

vtd-xml-author 2010-01-04 21:42:07

Jimmy, please realise that there are some questions to which VTD is not the answer. This is one of them. Here's a useful link: http://meta.stackoverflow.com/questions/21823/what-constitutes-spam

kdt 2010-01-04 22:56:19

There is nothing wrong with proposing a general approach rather than a concrete answer... Why is that link even relevant?

vtd-xml-author 2010-01-05 00:40:30

@kdt: From the question as you asked it, it's not clear that VTD could not be an interesting approach. "I am at some point midway through a text stream and just need to know something about the local structure." This seems to be satisfied by VTD if you have already created the index. And whether or not you end up using VTD, pre-indexing your XML might tackle your actual problem. Maybe you need to give some more details in your question, as none of the answers so far seems to meet your requirements.

0xA3 2010-01-05 10:40:02

Note that Mr. Zhang is the author of VTD-XML.

John Saunders 2010-03-09 09:23:56

Answer 4

A:

CAX from xponentsoftware does exactly what you want.

bill seacham 2010-01-06 04:09:30
