I am struggling to find a sensible loop for picking nodes out of an XML file that is too large to use with the XPath-supporting .NET classes.

I am attempting to replace the single line of code I had (which called SelectNodes with an XPath query string) with code that does the same but uses an XmlTextReader.

I have to go several levels down, as illustrated by the previously used XPath query (shown for reference):

ConfigurationRelease/Profiles/Profile[Name='MyProfileName']/Screens/Screen[Id='MyScreenId']/Settings/Setting[Name='MySettingName']

I thought this would be annoying but simple. However, I just can't seem to get the loop right.

I need to get a node, check a node under it to see whether its value matches a target string, and then walk down further if it does or skip that branch if it doesn't.

In fact, I think my problem is that I don't know how to ignore a branch if I'm not interested in it. I can't allow it to walk irrelevant branches because the element names are not unique (as illustrated by the XPath query).

I thought I could maintain some booleans, e.g. bool expectingProfileName, that get set to true when I hit a Profile node. However, if it's not the particular Profile node I want, I can't get out of that branch.

So...hopefully this makes sense to someone...I've been staring at the problem for a couple hours and may just be missing something obvious.....

I'd like to post a portion of the file up but can't figure out how so the structure is roughly:

ConfigRelease > Profiles > Profile > Name > Screens > Screen > Settings > Setting > Name

I will know ProfileName, ScreenName and SettingName and I need the Setting node.

I am trying to avoid reading the whole file in one hit, e.g. at app start-up, because half the stuff in it won't ever be used. I also have no control over what generates the XML file, so I cannot change it to, say, produce multiple smaller files.

Any tips will be greatly appreciated.

UPDATE

I have re-opened this. A poster suggested XPathDocument, which should have been perfect. Unfortunately, I didn't mention that this is a mobile app and XPathDocument is not supported.

The file isn't large by most standards, which is why the system was originally coded to use XmlDocument. It is currently 4MB, which is apparently large enough to crash a mobile app when it is loaded into an XmlDocument. It's probably just as well it came up now, as the file is expected to get much bigger. Anyway, I am now trying the DataSet suggestion but am still open to other ideas.

UPDATE 2

I got suspicious because quite a few people have said they would not expect a file this size to crash the system. Further experiments have shown that this is an intermittent crash. Yesterday it crashed every time, but this morning, after I reset the device, I can't reproduce it. I am now trying to figure out a reliable set of reproduction steps, and also to decide the best way to handle the problem, which I'm sure is still there. I can't just leave it, because if the app can't access this file it is useless, and I don't think I can tell my users that they can't run anything else on their devices while my app is running...

+2  A: 

Try loading the file into a dataset:

// requires: using System.Data;
DataSet ds = new DataSet();
ds.ReadXml(@"C:\MyXmlFile.xml");

Then you can use LINQ to search it.
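As a rough illustration of the idea above, here is a minimal sketch that loads a small inline document into a DataSet and searches it with LINQ. The Settings/Setting/Name/Value layout, the class name DataSetLookup and the method FindValue are all made up for the example, not taken from the asker's file. It also assumes System.Linq is available on the target platform, and it still holds the whole document in memory.

```csharp
using System;
using System.Data;
using System.IO;
using System.Linq;

static class DataSetLookup
{
    // Hypothetical helper: returns the Value of the first Setting row
    // whose Name column equals settingName, or null if there is none.
    public static string FindValue(string xml, string settingName)
    {
        DataSet ds = new DataSet();
        // ReadXml infers a table for each repeating element ("Setting" here)
        // and a column for each simple child element (Name, Value).
        ds.ReadXml(new StringReader(xml));

        DataRow match = ds.Tables["Setting"].Rows
            .Cast<DataRow>()
            .FirstOrDefault(row => (string)row["Name"] == settingName);

        return match == null ? null : (string)match["Value"];
    }

    static void Main()
    {
        // Illustrative data only.
        string xml =
            "<Settings>" +
            "<Setting><Name>Timeout</Name><Value>30</Value></Setting>" +
            "<Setting><Name>Retries</Name><Value>5</Value></Setting>" +
            "</Settings>";
        Console.WriteLine(FindValue(xml, "Retries"));
    }
}
```

A deeply nested file like the one in the question would be inferred as several related tables rather than one, so the query would need to follow the generated relations.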

Joel Coehoorn
If it's dying due to an OOM on the XML, a DataSet representation is likely to take up a large chunk of memory as well and will likely also die.
ctacke
+8  A: 

Have a look at XPathDocument.

XPathDocument is more light-weight than XmlDocument and is optimized for read-only XPath queries.
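For the desktop framework, a minimal sketch of what that could look like with the query from the question. The inline XML, the Value child element, and the names XPathDocumentLookup and FindSettingValue are assumptions made up for the example. This only applies where System.Xml.XPath.XPathDocument exists.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class XPathDocumentLookup
{
    // Hypothetical helper: runs the XPath query and returns the text of the
    // matched node's <Value> child, or null when nothing matches.
    public static string FindSettingValue(TextReader input, string xpath)
    {
        XPathDocument doc = new XPathDocument(input);   // read-only store
        XPathNavigator setting = doc.CreateNavigator().SelectSingleNode(xpath);
        if (setting == null) return null;

        XPathNavigator value = setting.SelectSingleNode("Value");
        return value == null ? null : value.Value;
    }

    static void Main()
    {
        // Illustrative data only; Name and Id are child elements here,
        // which is what the predicates in the question's query test.
        string xml =
            "<ConfigurationRelease><Profiles><Profile><Name>MyProfileName</Name>" +
            "<Screens><Screen><Id>MyScreenId</Id><Settings>" +
            "<Setting><Name>MySettingName</Name><Value>42</Value></Setting>" +
            "</Settings></Screen></Screens></Profile></Profiles></ConfigurationRelease>";

        string query = "ConfigurationRelease/Profiles/Profile[Name='MyProfileName']" +
                       "/Screens/Screen[Id='MyScreenId']/Settings/Setting[Name='MySettingName']";

        Console.WriteLine(FindSettingValue(new StringReader(xml), query));
    }
}
```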

0xA3
Aaaaah - there is an overload that takes a stream - this is looking exactly like what I need - thanks!!!
J M
I wanted to vote this up but my rep isn't high enough yet. So please accept my thanks anyway - this was right to the point of my issue i.e. I dn't want a massive set of data in memory. Ta!
J M
Wait, what? XPathDocument doesn't appear to be part of Compact Framework.
Jeffrey Hantin
+3  A: 

Okay I was entertained by this so I hacked some code together. It isn't pretty and only really supports this one use case but I think it does the job you're looking for and acts as a decent platform to start from. I haven't tested it that thoroughly either. Finally you'll need to modify the code to get it to return the contents (see the method called Output()).

Here is the code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

namespace XPathInCE
{
    class Program
    {
     static void Main(string[] args)
     {
      try
      {
       if (args.Length != 2)
       {
        ShowUsage();
       }
       else
       {
        Extract(args[0], args[1]);
       }
      }
      catch (Exception ex)
      {
       Console.WriteLine("{0} was thrown", ex.GetType());
       Console.WriteLine(ex.Message);
       Console.WriteLine(ex.StackTrace);
      }

      Console.WriteLine("Press ENTER to exit");
      Console.ReadLine();
     }

     private static void Extract(string filePath, string queryString)
     {
      if (!File.Exists(filePath))
      {
       Console.WriteLine("File not found! Path: {0}", filePath);
       return;
      }

      XmlReaderSettings settings = new XmlReaderSettings { IgnoreComments = true, IgnoreWhitespace = true };
      using (XmlReader reader = XmlReader.Create(filePath, settings))
      {
       XPathQuery query = new XPathQuery(queryString);
       query.Find(reader);
      }
     }

     static void ShowUsage()
     {
      Console.WriteLine("No file specified or incorrect number of parameters");
      Console.WriteLine("Args must be: Filename XPath");
      Console.WriteLine();
      Console.WriteLine("Sample usage:");
      Console.WriteLine("XPathInCE someXmlFile.xml ConfigurationRelease/Profiles/Profile[Name='MyProfileName']/Screens/Screen[Id='MyScreenId']/Settings/Setting[Name='MySettingName']");
     }

     class XPathQuery
     {
      private readonly LinkedList<ElementOfInterest> list = new LinkedList<ElementOfInterest>();
      private LinkedListNode<ElementOfInterest> currentNode;

      internal XPathQuery(string query)
      {
       Parse(query);
       currentNode = list.First;
      }

      internal void Find(XmlReader reader)
      {
       bool skip = false;
       while (true)
       {
        if (skip)
        {
         reader.Skip();
         skip = false;
        }
        else
        {
         if (!reader.Read())
         {
          break;
         }
        }
        if (reader.NodeType == XmlNodeType.EndElement
          && currentNode.Previous != null
          && String.Compare(reader.Name, currentNode.Previous.Value.ElementName, StringComparison.CurrentCultureIgnoreCase) == 0)
        {
         currentNode = currentNode.Previous ?? currentNode;
         continue;
        }
        if (reader.NodeType == XmlNodeType.Element)
        {
         string currentElementName = reader.Name;
         Console.WriteLine("Considering element: {0}", reader.Name);

         if (String.Compare(reader.Name, currentNode.Value.ElementName, StringComparison.CurrentCultureIgnoreCase) != 0)
         {
          // don't want
          Console.WriteLine("Skipping");
          skip = true;
          continue;
         }
         if (!FindAttributes(reader))
         {
          // don't want
          Console.WriteLine("Skipping");
          skip = true;
          continue;
         }

         // is there more?
         if (currentNode.Next != null)
         {
          currentNode = currentNode.Next;
          continue;
         }

         // we're at the end, this is a match! :D
         Console.WriteLine("XPath match found!");
         Output(reader, currentElementName);
        }
       }
      }

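      // Each [Name='value'] predicate is tested against an XML attribute here.
      // (Standard XPath would read it as a child element; an attribute test is written [@Name='value'].)
      private bool FindAttributes(XmlReader reader)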
      {
       foreach (AttributeOfInterest attributeOfInterest in currentNode.Value.Attributes)
       {
        if (String.Compare(reader.GetAttribute(attributeOfInterest.Name), attributeOfInterest.Value,
               StringComparison.CurrentCultureIgnoreCase) != 0)
        {
         return false;
        }
       }
       return true;
      }

      private static void Output(XmlReader reader, string name)
      {
       while (reader.Read())
       {
        // break condition
        if (reader.NodeType == XmlNodeType.EndElement
         && String.Compare(reader.Name, name, StringComparison.CurrentCultureIgnoreCase) == 0)
        {
         return;
        }

        if (reader.NodeType == XmlNodeType.Element)
        {
         Console.WriteLine("Element {0}", reader.Name);
         Console.WriteLine("Attributes");
         for (int i = 0; i < reader.AttributeCount; i++)
         {
          reader.MoveToAttribute(i);
          Console.WriteLine("Attribute: {0} Value: {1}", reader.Name, reader.Value);
         }
        }

        if (reader.NodeType == XmlNodeType.Text)
        {
         Console.WriteLine("Element value: {0}", reader.Value);
        }
       }
      }

      private void Parse(string query)
      {
       IList<string> elements = query.Split('/');
       foreach (string element in elements)
       {
        ElementOfInterest interestingElement = null;
        string elementName = element;
        int attributeQueryStartIndex = element.IndexOf('[');
        if (attributeQueryStartIndex != -1)
        {
         int attributeQueryEndIndex = element.IndexOf(']');
         if (attributeQueryEndIndex == -1)
         {
          throw new ArgumentException(String.Format("Argument: {0} has a [ without a corresponding ]", query));
         }
         elementName = elementName.Substring(0, attributeQueryStartIndex);
         string attributeQuery = element.Substring(attributeQueryStartIndex + 1,
            (attributeQueryEndIndex - attributeQueryStartIndex) - 1);
         string[] keyValPair = attributeQuery.Split('=');
         if (keyValPair.Length != 2)
         {
          throw new ArgumentException(String.Format("Argument: {0} has an attribute query that either has too many or insufficient = marks. We currently only support one", query));
         }
         interestingElement = new ElementOfInterest(elementName);
         interestingElement.Add(new AttributeOfInterest(keyValPair[0].Trim().Replace("'", ""),
          keyValPair[1].Trim().Replace("'", "")));
        }
        else
        {
         interestingElement = new ElementOfInterest(elementName);
        }

        list.AddLast(interestingElement);
       }
      }

      class ElementOfInterest
      {
       private readonly string elementName;
       private readonly List<AttributeOfInterest> attributes = new List<AttributeOfInterest>();

       public ElementOfInterest(string elementName)
       {
        this.elementName = elementName;
       }

       public string ElementName
       {
        get { return elementName; }
       }

       public List<AttributeOfInterest> Attributes
       {
        get { return attributes; }
       }

       public void Add(AttributeOfInterest attribute)
       {
        Attributes.Add(attribute);
       }
      }

      class AttributeOfInterest
      {
       private readonly string name;
       private readonly string value;

       public AttributeOfInterest(string name, string value)
       {
        this.name = name;
        this.value = value;
       }

       public string Value
       {
        get { return value; }
       }

       public string Name
       {
        get { return name; }
       }
      }
     }
    }
}

This is the test input I was using:

<?xml version="1.0" encoding="utf-8" ?>
<ConfigurationRelease>
  <Profiles>
    <Profile Name ="MyProfileName">
      <Screens>
        <Screen Id="MyScreenId">
          <Settings>
            <Setting Name="MySettingName">
              <Paydirt>Good stuff</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
    <Profile Name ="SomeProfile">
      <Screens>
        <Screen Id="MyScreenId">
          <Settings>
            <Setting Name="Boring">
              <Paydirt>NOES you should not find this!!!</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
    <Profile Name ="SomeProfile">
      <Screens>
        <Screen Id="Boring">
          <Settings>
            <Setting Name="MySettingName">
              <Paydirt>NOES you should not find this!!!</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
    <Profile Name ="Boring">
      <Screens>
        <Screen Id="MyScreenId">
          <Settings>
            <Setting Name="MySettingName">
              <Paydirt>NOES you should not find this!!!</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
  </Profiles>
</ConfigurationRelease>

And this is the output I got.

C:\Sandbox\XPathInCE\XPathInCE\bin\Debug>XPathInCE MyXmlFile.xml ConfigurationRelease/Profiles/Profile[Name='MyProfileName']/Screens/Screen[Id='MyScreenId']/Settings/Setting[Name='MySettingName']
Considering element: ConfigurationRelease
Considering element: Profiles
Considering element: Profile
Considering element: Screens
Considering element: Screen
Considering element: Settings
Considering element: Setting
XPath match found!
Element Paydirt
Attributes
Element value: Good stuff
Considering element: Profile
Skipping
Considering element: Profile
Skipping
Considering element: Profile
Skipping
Press ENTER to exit

I ran it on the desktop, but it was a CF 2.0 .exe that I generated, so it should work fine on CE. As you can see, it skips when it doesn't match, so it won't walk the whole file.
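If only the first match is needed, the same skip-on-mismatch idea can be written more compactly by using reader.Depth to decide which step of the path applies, and returning as soon as the last step matches. This is a sketch under assumptions, not a replacement for the program above: Step, FirstMatchFinder and FindFirst are made-up names, predicates are attribute tests as in the test input, and I have not checked every call against the Compact Framework.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical type: one path step, with an optional attribute test.
class Step
{
    public readonly string Name, AttrName, AttrValue;
    public Step(string name, string attrName, string attrValue)
    {
        Name = name; AttrName = attrName; AttrValue = attrValue;
    }
}

static class FirstMatchFinder
{
    // Returns the outer XML of the first element matching every step,
    // or null. Stops reading the input at the first match.
    public static string FindFirst(XmlReader reader, Step[] steps)
    {
        reader.MoveToContent();                  // position on the root element
        while (!reader.EOF)
        {
            if (reader.NodeType != XmlNodeType.Element)
            {
                reader.Read();                   // end tags, text, whitespace
                continue;
            }
            int depth = reader.Depth;            // root element is depth 0
            if (depth < steps.Length && Matches(reader, steps[depth]))
            {
                if (depth == steps.Length - 1)
                    return reader.ReadOuterXml(); // found it: stop here
                reader.Read();                   // descend into the children
            }
            else
            {
                reader.Skip();                   // jump over the whole subtree
            }
        }
        return null;
    }

    static bool Matches(XmlReader reader, Step step)
    {
        return reader.Name == step.Name
            && (step.AttrName == null
                || reader.GetAttribute(step.AttrName) == step.AttrValue);
    }

    static void Main()
    {
        string xml =
            "<Profiles>" +
            "<Profile Name='Boring'><Setting Name='S'>no</Setting></Profile>" +
            "<Profile Name='Mine'><Setting Name='S'>yes</Setting></Profile>" +
            "</Profiles>";
        Step[] steps =
        {
            new Step("Profiles", null, null),
            new Step("Profile", "Name", "Mine"),
            new Step("Setting", "Name", "S")
        };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            Console.WriteLine(FindFirst(reader, steps));
        }
    }
}
```

Because Skip() leaves the reader on the node after the skipped element, the loop must not call Read() again on that iteration, which is why each branch advances the reader exactly once.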

Feedback from anyone is appreciated, especially if people have pointers to make the code more concise.

Quibblesome
Thanks for this. I came up with something similar, but yours is better because mine doesn't include parsing any XPath query (and is limited to only the particular search I need now). One comment I have is that once I get to the good stuff I want to stop completely.
J M
It also became a mute point this morning when our tech architect decided xml was the wrong route in the first place and we should have implemented a custom file format. The whole issue is now on hold until we can do this. I am hence about to close this down.
J M
Well, to be honest I agree; life is bad when your XML file is too big to load. However, you might want to suggest looking at a database as well. SQLite is pretty damn fast and has a free ADO.NET wrapper available. (btw you wanted "moot" not "mute")
Quibblesome
Lol, thanks for the correction. Embarrassing, but at least I won't do that one again.
J M
+2  A: 

I'm adding this as the issue is now dead but the selected solution doesn't match anything listed so far.

Our technical architect took this issue over and decided that we should never have implemented Xml in the first place. This decision was partly due to this issue but also due to some complaints about the level of data transfer charges.

His verdict is that we should have implemented a custom file format (with indexing) optimised for size and speed of query.

So, the issue is on hold until that work is approved and properly specced.

The end for now.

J M
A: 

Loading it into a DataSet isn't gonna work - that will take up still more memory.

When faced with something similar, I used an XmlReader and built an in-memory index at load time. I presented the index, and then when the user clicked on a link or activated a search, I re-read the XML document, again with XmlReader, and loaded the appropriate subset.
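A minimal sketch of that two-pass idea, assuming the thing being indexed is a run of sibling Profile elements keyed by a Name attribute (both assumptions, as are the names ProfileIndex, Build and Load). The index stores only each profile's ordinal position, and the second pass skips forward to that position and returns just that subtree.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class ProfileIndex
{
    // Pass 1: map each Profile's Name attribute to its ordinal position.
    // ReadToNextSibling skips over the children, so nothing else is examined.
    public static Dictionary<string, int> Build(XmlReader reader)
    {
        var index = new Dictionary<string, int>();
        int ordinal = 0;
        if (!reader.ReadToFollowing("Profile")) return index;
        do
        {
            string name = reader.GetAttribute("Name");
            if (name != null && !index.ContainsKey(name)) index[name] = ordinal;
            ordinal++;
        } while (reader.ReadToNextSibling("Profile"));
        return index;
    }

    // Pass 2: re-read from the start and return only the chosen Profile.
    public static string Load(XmlReader reader, int ordinal)
    {
        if (!reader.ReadToFollowing("Profile")) return null;
        for (int i = 0; i < ordinal; i++)
        {
            if (!reader.ReadToNextSibling("Profile")) return null;
        }
        return reader.ReadOuterXml();
    }

    static void Main()
    {
        // Illustrative data only.
        string xml =
            "<Profiles>" +
            "<Profile Name='A'><Data>first</Data></Profile>" +
            "<Profile Name='B'><Data>second</Data></Profile>" +
            "</Profiles>";

        Dictionary<string, int> index;
        using (XmlReader r = XmlReader.Create(new StringReader(xml)))
            index = Build(r);

        using (XmlReader r = XmlReader.Create(new StringReader(xml)))
            Console.WriteLine(Load(r, index["B"]));
    }
}
```

The index costs a few bytes per entry; the price is a second sequential read each time a subset is loaded, which is the CPU-for-memory trade described in this answer.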

This sounds laborious, and I guess it is, in some ways. It trades CPU cycles for memory. But it works, and the app is responsive enough. The data size is only 2 MB, not that large, but I was getting OOM with a DataSet. Then I went to XmlSerializer and that worked for a while, but again I hit an OOM. So I finally stepped back to this custom index thing.

Cheeso
A: 

You could implement a SAX-based parser, so that you only take the branches you are interested in when parsing the XML. This would be the best approach because it doesn't load the entire XML as a document.

Optimally, you would design your custom parser around your needs, on an as-needed basis, and do all the parsing for everything in a single pass. For example, if you may be interested in specific nodes later, save references to them so you can start there later rather than reparsing or traversing again.

The downside here is that it's a bit of custom programming.

The upside is that you will only be reading what you are interested in and handling the XML document based on your requirements. You can also start processing results before finishing a pass through the document. This is great for kicking off worker threads based on the contents of the document. For example, you can take the entire contents of an element as the root of another XML document and then load it separately (and use XPath, etc. on it), or you can copy the contents into a buffer and hand that off to a worker to process.
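.NET has no built-in SAX parser, so the sketch below swaps in XmlReader (a pull parser) to illustrate the "take one element as the root of another document" idea: it streams to the first matching element, loads only that subtree into a small XmlDocument, and runs XPath against it. The element and attribute names, and the names SubtreeLoader and LoadFirst, are assumptions for the example.

```csharp
using System;
using System.IO;
using System.Xml;

static class SubtreeLoader
{
    // Streams to the first <elementName> whose attribute matches, then loads
    // only that element (and its children) into an XmlDocument.
    public static XmlDocument LoadFirst(XmlReader reader, string elementName,
                                        string attrName, string attrValue)
    {
        if (!reader.ReadToFollowing(elementName)) return null;
        do
        {
            if (reader.GetAttribute(attrName) == attrValue)
            {
                XmlDocument doc = new XmlDocument();
                // ReadSubtree exposes just the current element as a document.
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    doc.Load(subtree);
                }
                return doc;
            }
        } while (reader.ReadToNextSibling(elementName)); // skips non-matches
        return null;
    }

    static void Main()
    {
        // Illustrative data only.
        string xml =
            "<Profiles>" +
            "<Profile Name='Boring'><Screens><Screen Id='X'><Setting Name='S'>no</Setting></Screen></Screens></Profile>" +
            "<Profile Name='Mine'><Screens><Screen Id='X'><Setting Name='S'>Good stuff</Setting></Screen></Screens></Profile>" +
            "</Profiles>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            XmlDocument profile = LoadFirst(reader, "Profile", "Name", "Mine");
            XmlNode setting = profile.SelectSingleNode(
                "/Profile/Screens/Screen[@Id='X']/Setting[@Name='S']");
            Console.WriteLine(setting.InnerText);
        }
    }
}
```

Only one profile is ever held in memory as a DOM, so the peak footprint depends on the largest single profile rather than on the size of the whole file.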

I have used this a long time ago using libxml2 for C, but there are C# bindings as well (and many other languages).

Klathzazt