ansaurus

Question

Large XML file, XmlDocument not feasible but need to be able to search

Answer 1

+2 A:

Try loading the file into a dataset:

DataSet ds = new DataSet();
ds.ReadXml(@"C:\MyXmlFile.xml");

Then you can use LINQ to search it.

Joel Coehoorn 2009-02-17 16:09:12

If it's dying due to an OOM on the XML, a DataSet representation is likely to take up a large chunk of memory as well and will likely also die.

ctacke 2009-02-18 13:19:04

Answer 2

+8 A:

Have a look at XPathDocument.

XPathDocument is more light-weight than XmlDocument and is optimized for read-only XPath queries.
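
As a rough sketch of what that could look like (assuming the full desktop framework, where System.Xml.XPath is available; the class and method names here are made up for illustration), you can load the document from a stream and run a read-only query through a navigator. Note that XPathDocument still holds the document in memory, just in a read-only form:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical helper, not part of any library.
static class XPathDocumentSearch
{
    // Returns the string value of every node matched by the XPath expression.
    public static List<string> Select(Stream xml, string xpath)
    {
        // XPathDocument has a constructor overload that takes a Stream.
        XPathDocument document = new XPathDocument(xml);
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator matches = navigator.Select(xpath);

        List<string> values = new List<string>();
        while (matches.MoveNext())
        {
            // Current is a navigator positioned on the matched node.
            values.Add(matches.Current.Value);
        }
        return values;
    }
}
```

Calling it with a FileStream and a query such as /Profiles/Profile[@Name='A']/Paydirt would return the text of each matching Paydirt element.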

0xA3 2009-02-17 16:12:39

Aaaaah - there is an overload that takes a stream - this is looking exactly like what I need - thanks!!!

J M 2009-02-17 16:15:52

I wanted to vote this up but my rep isn't high enough yet. So please accept my thanks anyway - this was right to the point of my issue, i.e. I don't want a massive set of data in memory. Ta!

J M 2009-02-17 16:21:47

Wait, what? XPathDocument doesn't appear to be part of Compact Framework.

Jeffrey Hantin 2010-02-08 17:50:50

Answer 3

+3 A:

Okay I was entertained by this so I hacked some code together. It isn't pretty and only really supports this one use case but I think it does the job you're looking for and acts as a decent platform to start from. I haven't tested it that thoroughly either. Finally you'll need to modify the code to get it to return the contents (see the method called Output()).

Here is the code:

using System;

using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

namespace XPathInCE
{
    class Program
    {
     static void Main(string[] args)
     {
      try
      {
       if (args.Length != 2)
       {
        ShowUsage();
       }
       else
       {
        Extract(args[0], args[1]);
       }
      }
      catch (Exception ex)
      {
       Console.WriteLine("{0} was thrown", ex.GetType());
       Console.WriteLine(ex.Message);
       Console.WriteLine(ex.StackTrace);
      }

      Console.WriteLine("Press ENTER to exit");
      Console.ReadLine();
     }

     private static void Extract(string filePath, string queryString)
     {
      if (!File.Exists(filePath))
      {
       Console.WriteLine("File not found! Path: {0}", filePath);
       return;
      }

      XmlReaderSettings settings = new XmlReaderSettings { IgnoreComments = true, IgnoreWhitespace = true };
      using (XmlReader reader = XmlReader.Create(filePath, settings))
      {
       XPathQuery query = new XPathQuery(queryString);
       query.Find(reader);
      }
     }

     static void ShowUsage()
     {
      Console.WriteLine("No file specified or incorrect number of parameters");
      Console.WriteLine("Args must be: Filename XPath");
      Console.WriteLine();
      Console.WriteLine("Sample usage:");
      Console.WriteLine("XPathInCE someXmlFile.xml ConfigurationRelease/Profiles/Profile[Name='MyProfileName']/Screens/Screen[Id='MyScreenId']/Settings/Setting[Name='MySettingName']");
     }

     class XPathQuery
     {
      private readonly LinkedList<ElementOfInterest> list = new LinkedList<ElementOfInterest>();
      private LinkedListNode<ElementOfInterest> currentNode;

      internal XPathQuery(string query)
      {
       Parse(query);
       currentNode = list.First;
      }

      internal void Find(XmlReader reader)
      {
       bool skip = false;
       while (true)
       {
        if (skip)
        {
         reader.Skip();
         skip = false;
        }
        else
        {
         if (!reader.Read())
         {
          break;
         }
        }
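        // Step back up the query when the parent element closes.
        // The null check avoids a NullReferenceException at the first query step.
        if (reader.NodeType == XmlNodeType.EndElement
          && currentNode.Previous != null
          && String.Compare(reader.Name, currentNode.Previous.Value.ElementName, StringComparison.CurrentCultureIgnoreCase) == 0)
        {
         currentNode = currentNode.Previous;
         continue;
        }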
        if (reader.NodeType == XmlNodeType.Element)
        {
         string currentElementName = reader.Name;
         Console.WriteLine("Considering element: {0}", reader.Name);

         if (String.Compare(reader.Name, currentNode.Value.ElementName, StringComparison.CurrentCultureIgnoreCase) != 0)
         {
          // don't want
          Console.WriteLine("Skipping");
          skip = true;
          continue;
         }
         if (!FindAttributes(reader))
         {
          // don't want
          Console.WriteLine("Skipping");
          skip = true;
          continue;
         }

         // is there more?
         if (currentNode.Next != null)
         {
          currentNode = currentNode.Next;
          continue;
         }

         // we're at the end, this is a match! :D
         Console.WriteLine("XPath match found!");
         Output(reader, currentElementName);
        }
       }
      }

      private bool FindAttributes(XmlReader reader)
      {
       foreach (AttributeOfInterest attributeOfInterest in currentNode.Value.Attributes)
       {
        if (String.Compare(reader.GetAttribute(attributeOfInterest.Name), attributeOfInterest.Value,
               StringComparison.CurrentCultureIgnoreCase) != 0)
        {
         return false;
        }
       }
       return true;
      }

      private static void Output(XmlReader reader, string name)
      {
       while (reader.Read())
       {
        // break condition
        if (reader.NodeType == XmlNodeType.EndElement
         && String.Compare(reader.Name, name, StringComparison.CurrentCultureIgnoreCase) == 0)
        {
         return;
        }

        if (reader.NodeType == XmlNodeType.Element)
        {
         Console.WriteLine("Element {0}", reader.Name);
         Console.WriteLine("Attributes");
         for (int i = 0; i < reader.AttributeCount; i++)
         {
          reader.MoveToAttribute(i);
          Console.WriteLine("Attribute: {0} Value: {1}", reader.Name, reader.Value);
         }
        }

        if (reader.NodeType == XmlNodeType.Text)
        {
         Console.WriteLine("Element value: {0}", reader.Value);
        }
       }
      }

      private void Parse(string query)
      {
       IList<string> elements = query.Split('/');
       foreach (string element in elements)
       {
        ElementOfInterest interestingElement = null;
        string elementName = element;
        int attributeQueryStartIndex = element.IndexOf('[');
        if (attributeQueryStartIndex != -1)
        {
         int attributeQueryEndIndex = element.IndexOf(']');
         if (attributeQueryEndIndex == -1)
         {
          throw new ArgumentException(String.Format("Argument: {0} has a [ without a corresponding ]", query));
         }
         elementName = elementName.Substring(0, attributeQueryStartIndex);
         // Take the text strictly between '[' and ']' (the original "- 2" dropped the last character).
         string attributeQuery = element.Substring(attributeQueryStartIndex + 1,
            (attributeQueryEndIndex - attributeQueryStartIndex) - 1);
         string[] keyValPair = attributeQuery.Split('=');
         if (keyValPair.Length != 2)
         {
          throw new ArgumentException(String.Format("Argument: {0} has an attribute query that either has too many or insufficient = marks. We currently only support one", query));
         }
         interestingElement = new ElementOfInterest(elementName);
         interestingElement.Add(new AttributeOfInterest(keyValPair[0].Trim().Replace("'", ""),
          keyValPair[1].Trim().Replace("'", "")));
        }
        else
        {
         interestingElement = new ElementOfInterest(elementName);
        }

        list.AddLast(interestingElement);
       }
      }

      class ElementOfInterest
      {
       private readonly string elementName;
       private readonly List<AttributeOfInterest> attributes = new List<AttributeOfInterest>();

       public ElementOfInterest(string elementName)
       {
        this.elementName = elementName;
       }

       public string ElementName
       {
        get { return elementName; }
       }

       public List<AttributeOfInterest> Attributes
       {
        get { return attributes; }
       }

       public void Add(AttributeOfInterest attribute)
       {
        Attributes.Add(attribute);
       }
      }

      class AttributeOfInterest
      {
       private readonly string name;
       private readonly string value;

       public AttributeOfInterest(string name, string value)
       {
        this.name = name;
        this.value = value;
       }

       public string Value
       {
        get { return value; }
       }

       public string Name
       {
        get { return name; }
       }
      }
     }
    }
}

This is the test input I was using:

<?xml version="1.0" encoding="utf-8" ?>
<ConfigurationRelease>
  <Profiles>
    <Profile Name ="MyProfileName">
      <Screens>
        <Screen Id="MyScreenId">
          <Settings>
            <Setting Name="MySettingName">
              <Paydirt>Good stuff</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
    <Profile Name ="SomeProfile">
      <Screens>
        <Screen Id="MyScreenId">
          <Settings>
            <Setting Name="Boring">
              <Paydirt>NOES you should not find this!!!</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
    <Profile Name ="SomeProfile">
      <Screens>
        <Screen Id="Boring">
          <Settings>
            <Setting Name="MySettingName">
              <Paydirt>NOES you should not find this!!!</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
    <Profile Name ="Boring">
      <Screens>
        <Screen Id="MyScreenId">
          <Settings>
            <Setting Name="MySettingName">
              <Paydirt>NOES you should not find this!!!</Paydirt>
            </Setting>
          </Settings>
        </Screen>
      </Screens>
    </Profile>
  </Profiles>
</ConfigurationRelease>

And this is the output I got.

C:\Sandbox\XPathInCE\XPathInCE\bin\Debug>XPathInCE MyXmlFile.xml ConfigurationRelease/Profiles/Profile[Name='MyProfileName']/Screens/Screen[Id='MyScreenId']/Settings/Setting[Name='MySettingName']
Considering element: ConfigurationRelease
Considering element: Profiles
Considering element: Profile
Considering element: Screens
Considering element: Screen
Considering element: Settings
Considering element: Setting
XPath match found!
Element Paydirt
Attributes
Element value: Good stuff
Considering element: Profile
Skipping
Considering element: Profile
Skipping
Considering element: Profile
Skipping
Press ENTER to exit

I ran it on desktop, but it was a CF 2.0 .exe that I generated, so it should work fine on CE. As you can see, it skips when it doesn't match, so it won't walk the whole file.

Feedback from anyone is appreciated, especially if people have pointers to make the code more concise.

Quibblesome 2009-02-23 22:44:27

Thanks for this. I came up with something similar but yours is better because mine doesn't include parsing any XPath query (and is limited to only the particular search I need now). One comment I have is that once I get to the good stuff I want to completely stop.

J M 2009-02-25 19:18:20

It also became a mute point this morning when our tech architect decided xml was the wrong route in the first place and we should have implemented a custom file format. The whole issue is now on hold until we can do this. I am hence about to close this down.

J M 2009-02-25 19:19:47

Well, to be honest I agree, life is bad when your XML file is too big to load. However, you might want to suggest looking at a database as well. SQLite is pretty damn fast and has a free ADO.NET wrapper available. (btw you wanted "moot" not "mute")

Quibblesome 2009-02-25 23:32:16

Lol, thanks for the correction. Embarrassing, but at least now I won't do that one again.

J M 2009-02-27 10:40:07

Answer 4

+2 A:

I'm adding this as the issue is now dead but the selected solution doesn't match anything listed so far.

Our technical architect took this issue over and decided that we should never have implemented Xml in the first place. This decision was partly due to this issue but also due to some complaints about the level of data transfer charges.

His verdict is that we should have implemented a custom file format (with indexing) optimised for size and speed of query.

So, the issue is on hold until that work is approved and properly specced.

The end for now.

J M 2009-02-25 19:29:22

Answer 5

A:

Loading it into a DataSet isn't gonna work - that will take up still more memory.

When faced with something similar, I used an XmlReader and built an in-memory index at load time. I present the index, and when the user clicks on a link or activates a search, I re-read the XML document, again with XmlReader, and load the appropriate subset.

This sounds laborious, and I guess it is, in some ways. It trades CPU cycles for memory. But it works, and the app is responsive enough. The data size is only 2 MB, not that large, but I was getting OOM with a DataSet. Then I went to XmlSerializer and that worked for a while, but again I hit an OOM. So I finally stepped back to this custom index thing.
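
A minimal sketch of the two-pass idea, under some assumptions of mine: the index maps an attribute value to the ordinal position of its element, and the names XmlElementIndex, Build and Load are hypothetical. What the real index stores would depend on your data.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper illustrating "index at load time, re-read on demand".
static class XmlElementIndex
{
    // Pass 1: stream through the file once and record, for each element
    // named elementName, its key attribute value and its ordinal position.
    public static Dictionary<string, int> Build(string path, string elementName, string keyAttribute)
    {
        Dictionary<string, int> index = new Dictionary<string, int>();
        int ordinal = 0;
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    string key = reader.GetAttribute(keyAttribute);
                    // Keep the first occurrence if a key is repeated.
                    if (key != null && !index.ContainsKey(key))
                    {
                        index.Add(key, ordinal);
                    }
                    ordinal++;
                }
            }
        }
        return index;
    }

    // Pass 2: re-read the file, count matching elements up to the requested
    // ordinal, and return only that element's markup (null if not found).
    public static string Load(string path, string elementName, int ordinal)
    {
        int seen = 0;
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    if (seen == ordinal)
                    {
                        return reader.ReadOuterXml();
                    }
                    seen++;
                }
            }
        }
        return null;
    }
}
```

Only the small dictionary stays in memory between searches; each lookup costs another pass over the file, which is the CPU-for-memory trade described above.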

Cheeso 2009-03-09 06:43:27

Answer 6

A:

You could implement a SAX-based parser, so that you only take the branches you are interested in when parsing the XML. This would be the best approach because it doesn't load the entire XML as a document.

Optimally you would design your custom parser around your needs, on an as-needed basis, and do all the parsing for everything in a single pass. For example, if you may be interested in specific nodes later, save references to them so you can start there later rather than reparsing or traversing again.

The downside here is that it's a bit of custom programming.

The upside is that you only read what you are interested in and handle the XML document according to your requirements. You can also start processing results before finishing a pass through the document, which is great for kicking off worker threads based on the document's contents. For example, you can take the entire contents of an element as the root of another XML document and load it separately (then use XPath, etc.), or copy the contents into a buffer and hand that off to a worker to process.
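
For illustration, here is a rough sketch of the "element as the root of another document" idea. It swaps in .NET's pull-based XmlReader rather than a true SAX parser or libxml2, and the names SubtreeLoader and LoadFirst are hypothetical:

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: streams to one element and loads only its subtree.
static class SubtreeLoader
{
    // Returns a small XmlDocument rooted at the first element named
    // elementName, or null if the input contains no such element.
    public static XmlDocument LoadFirst(TextReader xml, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(xml))
        {
            // Advance through the stream without building a tree.
            if (!reader.ReadToFollowing(elementName))
            {
                return null;
            }

            // ReadSubtree exposes just this element and its descendants.
            using (XmlReader subtree = reader.ReadSubtree())
            {
                XmlDocument fragment = new XmlDocument();
                fragment.Load(subtree);
                return fragment;
            }
        }
    }
}
```

The returned fragment is small enough to query with SelectSingleNode or hand to a worker thread, while the rest of the large file is never materialised as a document.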

I used this a long time ago with libxml2 for C, but there are C# bindings as well (and bindings for many other languages).

Klathzazt 2009-03-09 07:35:41
