ansaurus

Question

Answer 1

A:

If the documents are large, then a stream-based parser (which is fine for your needs) will be faster than using XmlDocument, mostly because of the lower overhead. Check out the documentation for XmlReader.
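As a rough illustration of the stream-based approach, here is a minimal sketch that collects the text of each simple element into a dictionary. The class name `StatsReader` and the assumption that the document consists of flat elements such as `<matches>10</matches>` are mine; the original XML format is not shown in this thread, so the loop would need adapting to the real structure.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class StatsReader
{
    // Reads forward through the XML once and stores the text of every
    // element that directly contains text, keyed by element name.
    public static Dictionary<string, string> ReadStats(TextReader input)
    {
        var values = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            string current = null; // name of the element we are inside, if any
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Empty elements (<x/>) have no text node to wait for.
                        current = reader.IsEmptyElement ? null : reader.Name;
                        break;
                    case XmlNodeType.Text:
                        if (current != null)
                            values[current] = reader.Value;
                        break;
                    case XmlNodeType.EndElement:
                        current = null;
                        break;
                }
            }
        }
        return values;
    }
}
```

A missing element simply never appears in the dictionary, so callers can use `TryGetValue` instead of catching exceptions.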

Adrian 2010-06-15 15:25:30

Could you please give just a small piece of code for my document?

LifeH2O 2010-06-15 15:27:51

Answer 2

+7 A:

You can use an XmlReader for forward only, fast reading.

Carra 2010-06-15 15:25:57

OK, I am going to try that. How should I handle the exceptions when some nodes are missing in a document? I added a lot of try/catch statements to avoid a NullReferenceException on code like `stats["average"].InnerText`, where "average" is the node name.

LifeH2O 2010-06-15 15:30:01

Hmm; if there are a lot of missing elements, part of your performance problem may relate to the number of exceptions being thrown. Exceptions are expensive. Checking for the presence of a node before you reference it is much cheaper.
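A minimal sketch of such a check, assuming the data is held in an `XmlElement` as in the `stats["average"]` example from this thread. The helper name `InnerTextOrDefault` is hypothetical.

```csharp
using System.Xml;

// Hypothetical helper, not part of System.Xml.
public static class XmlElementHelpers
{
    // Returns the text of the named child element, or the fallback
    // when the parent or the child is missing. No exception is thrown.
    public static string InnerTextOrDefault(XmlElement parent, string name, string fallback = "")
    {
        if (parent == null)
            return fallback;

        // The indexer returns null when there is no child element with that name.
        XmlElement child = parent[name];
        return child != null ? child.InnerText : fallback;
    }
}
```

With this in place, `XmlElementHelpers.InnerTextOrDefault(stats, "average")` replaces a try/catch around `stats["average"].InnerText`.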

Adrian 2010-06-15 15:40:43

Have a function like getAttribute(string statname) that uses stats[statname] inside a try/catch block and returns string.Empty whenever an exception is caught.

apoorv020 2010-06-15 15:41:53

If you get a NullReferenceException here, it's because stats["average"] is null. Just add an `if (stats["average"] != null)` check.

Carra 2010-06-15 15:45:08

@apoorv020 I already did that; GetInnerString() does that task.

LifeH2O 2010-06-15 15:47:39

@Adrian OK, I am going to write a function with checks and will try to avoid exceptions and try/catch statements. Thanks.

LifeH2O 2010-06-15 15:48:31

@Carra `stats["average"].InnerText` throws a NullReferenceException when the node "average" is not available.

LifeH2O 2010-06-15 15:50:10

Indeed, just do `if (stats["average"] != null) value = stats["average"].InnerText;`

Carra 2010-06-16 07:38:18

@Carra Thanks, I did that.

LifeH2O 2010-06-16 22:41:38

Answer 3

A:

I wouldn't say LINQ is the best approach. I searched Google and I saw some references to HTML Agility Pack.

I think that if you're going to have a speed bottleneck, it will be in your download process. In other words, it appears that your performance problems are not with your XML code. There may be ways to improve your download speed or your file I/O, but I don't know what they would be.

djangofan 2010-06-15 15:32:28

No, as I already said, for speed I have already downloaded all the files to my PC, so I am no longer getting them from the internet.

LifeH2O 2010-06-15 15:36:04

HTML Agility pack is used to parse *html*. It's more forgiving than parsing xml. Still, checking if the bottleneck is in downloading the files is a good idea.

Carra 2010-06-15 15:36:59

Answer 4

+2 A:

You could try LINQ to XML. Or you can use this to figure out what to use.
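For comparison, here is a small LINQ to XML sketch. It relies on the explicit conversions that `XElement` defines: casting a null element to `string` or to a nullable type such as `int?` yields null instead of throwing, which suits documents with missing nodes. The class and method names are hypothetical, and the element names are taken from examples elsewhere in this thread.

```csharp
using System.Xml.Linq;

// Hypothetical helper built on System.Xml.Linq.
public static class LinqStats
{
    // Returns the element's text, or null if the element does not exist.
    public static string GetText(XElement stats, string name)
    {
        return (string)stats.Element(name);
    }

    // Returns the element's value as an int, or null if the element does
    // not exist. Throws FormatException if the text is not a valid integer.
    public static int? GetInt(XElement stats, string name)
    {
        return (int?)stats.Element(name);
    }
}
```

Usage would look like `LinqStats.GetInt(XElement.Parse(xml), "matches") ?? 0`.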

Chandam 2010-06-15 16:05:45

Thanks. It means I must use XmlReader; I am still looking for a good tutorial for a GUI app.

LifeH2O 2010-06-15 16:32:24

Answer 5

A:

If you know that the XML is consistent and well formed, you can simply avoid doing real XML parsing and just process them as flat text files. This is risky, non-portable, and brittle.

But it'll be the fastest (to run, not to code) solution.

Joshua Muskovitz 2010-06-15 16:59:07

-1 for giving advice on how to create risky, non-portable and brittle solutions.

John Saunders 2010-06-15 17:42:44

+1..for being honest and giving the fastest solution

Luke101 2010-08-12 23:58:36

Answer 6

A:

An XmlReader is the solution for your problem. An XmlDocument stores lots of meta-information that makes the XML easy to access, but that makes it heavy on memory. I have seen XML files smaller than 50 KB grow to a few MB (10 or so) when loaded into an XmlDocument.

Sudesh Sawant 2010-06-15 17:00:38

Can you please give some XmlReader code for my document? So far I have done this: `void Load(string url) { _reader = XmlReader.Create(url); while (_reader.Read()) { } }` The available methods of XmlReader are confusing. What I need is to get the batting and bowling stats completely. Batting and bowling stats are different, while odi, t20, ipl, etc. are the same inside bowling and batting.

LifeH2O 2010-06-15 17:16:12

void Load(string url)
{
   _reader = XmlReader.Create(url);
   while (_reader.Read())
   {
      string name = _reader.Name;   // gives the node name
      string value = _reader.Value; // gives the node value as a string
   }
}

Please check MSDN for more details. You will have to check NodeType, HasValue, HasAttributes, etc.

Sudesh Sawant 2010-06-16 06:57:16

XmlReader is difficult to implement and XmlDocument is easy. The problem was not with XmlDocument; it was slow because of the try/catch statements. Thanks for the help.

LifeH2O 2010-06-16 22:44:08

Answer 7

A:

If you are already converting that information into a DataSet to insert it into tables, just use DataSet.ReadXML() - and work with the default tables it creates from the data.
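A minimal sketch of that call, assuming no schema is supplied so the DataSet infers tables and columns from the XML itself. The wrapper name `DataSetLoader` is hypothetical; every inferred column is of type string unless a schema says otherwise.

```csharp
using System.Data;
using System.IO;

// Hypothetical wrapper around DataSet.ReadXml.
public static class DataSetLoader
{
    // Loads XML into a new DataSet and lets it infer the schema:
    // repeated elements with children become tables, and their
    // simple child elements become columns.
    public static DataSet Load(TextReader input)
    {
        var dataSet = new DataSet();
        dataSet.ReadXml(input);
        return dataSet;
    }
}
```

For input such as `<stats><bowling><matchtype>ODI</matchtype></bowling></stats>`, the rows would be available through `dataSet.Tables["bowling"]`, and a column missing from a given row holds `DBNull.Value` rather than raising an exception.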

This toy app does that, and it works with the format you defined above.

Project file: http://www.dot-dash-dot.com/files/wtfxml.zip Installer: http://www.dot-dash-dot.com/files/WTFXMLSetup_1_8_0.msi

It lets you browse and edit your XML file using a tree and grid format. The tables listed in the grid are the ones automatically created by the DataSet after ReadXML().

Ron Savage 2010-06-15 17:30:18

Thank you! I have made a DataSet for the database, not for the XML. I am parsing XML files, extracting data, and passing this data to TableAdapter.Insert to save it to the database, and then displaying it by binding GUI components to the database.

LifeH2O 2010-06-15 18:00:51

Answer 8

+2 A:

The overhead of throwing exceptions probably dwarfs the overhead of XML parsing. You need to rewrite your code so that it doesn't throw exceptions.

One way is to check for the existence of an element before you ask for its value. That will work, but it's a lot of code. Another way to do it would be to use a map:

Dictionary<string, string> map = new Dictionary<string, string>
{
  { "matchtype", null },
  { "matches", null },
  { "ballsbowled", null }
};

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (map.ContainsKey(elm.Name))
   {
      map[elm.Name] = elm.InnerText;
   }
}

This code will handle all the elements whose names you care about and ignore the ones you don't. If the value in the map is null, it means that an element with that name didn't exist (or had no text).

In fact, if you're putting the data into a DataTable, and the column names in the DataTable are the same as the element names in the XML, you don't even need to build a map, since the DataTable.Columns property is all the map you need. Also, since the DataColumn knows what data type it contains, you don't have to duplicate that knowledge in your code:

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (myTable.Columns.Contains(elm.Name))
   {
      DataColumn c = myTable.Columns[elm.Name];
      if (c.DataType == typeof(string))
      {          
         myRow[elm.Name] = elm.InnerText;
         continue;
      }
      if (c.DataType == typeof(double))
      {
         myRow[elm.Name] = Convert.ToDouble(elm.InnerText);
         continue;
      }
      throw new InvalidOperationException("I didn't implement conversion logic for " + c.DataType.ToString() + ".");
   }
}

Note how I'm not declaring any variables to store this information in, so there's no chance of me screwing up and declaring a variable of a data type different from the column it's stored in, or creating a column in my table and forgetting to implement the logic that populates it.

Edit

Okay, here's something that's a bit tricksy. This is a pretty common technique in Python; in C# I think most people still think there's something weird about it.

If you look at the second example I gave, you can see that it's using the metainformation in the DataColumn to figure out what logic to use for converting an element's value from text to its base type. You can accomplish the same thing by building your own map, e.g.:

Dictionary<string, Type> typeMap = new Dictionary<string, Type>
{
   { "matchtype", typeof(string) },
   { "matches", typeof(int) },
   { "ballsbowled", typeof(int) }
};

and then do pretty much the same thing I showed in the second example:

if (typeMap[elm.Name] == typeof(int))
{
   result[elm.Name] = Convert.ToInt32(elm.InnerText);
   continue;
}

Your results can no longer be a Dictionary<string, string>, since now they can contain things that aren't strings; they have to be a Dictionary<string, object>.

But that logic seems a little ungainly; you're testing each item several times, there are continue statements to break out of it - it's not terrible, but it could be more concise. How? By using another map, one that maps types to conversion functions:

Dictionary<Type, Func<string, object>> conversionMap = 
   new Dictionary<Type, Func<string, object>>
{
   { typeof(string), (x => x) },
   { typeof(int), (x => Convert.ToInt32(x)) },
   { typeof(double), (x => Convert.ToDouble(x)) },
   { typeof(DateTime), (x => Convert.ToDateTime(x)) }
};

That's a little hard to read, if you're not used to lambda expressions. The type Func<string, object> specifies a function that takes a string as its argument and returns an object. And that's what the values in that map are: they're lambda expressions, which is to say functions. They take a string argument (x), and they return an object. (How do we know that x is a string? The Func<string, object> tells us.)

This means that converting an element can take one line of code:

result[elm.Name] = conversionMap[typeMap[elm.Name]](elm.InnerText);

Go from the inner to the outer expression: this looks up the element's type in typeMap, and then looks up the conversion function in conversionMap, and calls that function, passing it elm.InnerText as an argument.
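Putting the two maps together, a self-contained sketch might look like the following. The class name `StatsConverter`, the method name `ConvertStats`, the `TryGetValue` guard for unknown element names, and the use of the invariant culture are my additions, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

// Hypothetical wrapper that combines typeMap and conversionMap.
public static class StatsConverter
{
    // Element name -> target type.
    static readonly Dictionary<string, Type> TypeMap = new Dictionary<string, Type>
    {
        { "matchtype", typeof(string) },
        { "matches", typeof(int) },
        { "ballsbowled", typeof(int) }
    };

    // Target type -> function that converts the element text.
    static readonly Dictionary<Type, Func<string, object>> ConversionMap =
        new Dictionary<Type, Func<string, object>>
    {
        { typeof(string), x => x },
        { typeof(int), x => Convert.ToInt32(x, CultureInfo.InvariantCulture) },
        { typeof(double), x => Convert.ToDouble(x, CultureInfo.InvariantCulture) }
    };

    // Converts every known child element; unknown names are skipped and
    // missing elements are simply absent from the result.
    public static Dictionary<string, object> ConvertStats(XmlElement stats)
    {
        var result = new Dictionary<string, object>();
        foreach (XmlElement elm in stats.SelectNodes("*"))
        {
            Type type;
            if (TypeMap.TryGetValue(elm.Name, out type))
                result[elm.Name] = ConversionMap[type](elm.InnerText);
        }
        return result;
    }
}
```

Supporting a new element is then a one-line change to `TypeMap`, and supporting a new data type is a one-line change to `ConversionMap`.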

This may not be the ideal approach in your case. I really don't know. I show it here because there's a bigger issue at play. As Steve McConnell points out in Code Complete, it's easier to debug data than it is to debug code. This technique lets you turn program logic into data. There are cases where using this technique vastly simplifies the structure of your program. It's worth understanding.

Robert Rossney 2010-06-16 07:59:29

Thank you! I removed all the try/catch statements and replaced them with a function that returns null or zero if the element is null or doesn't exist. Currently I am saving all the data to predefined variables and then using insert(var1, var2, var3, ...). The method you described looks more convenient; I am trying to learn and understand how to implement it.

LifeH2O 2010-06-16 22:40:26

Wow, that's great, now I understand the first method. It is a lot simpler, but I can only use the first method because I am using a TableAdapter to store the data.

LifeH2O 2010-06-16 23:56:16

Since the whole point of a TableAdapter is to simplify adapting DataTables to SQL, this comment doesn't seem to make sense to me.

Robert Rossney 2010-06-17 17:37:36

I am currently using the `Dictionary<string, string>`, but it can store only one type of data, while the XML has different types of data such as int, double, DateTime, TimeSpan, etc. How can I use a dictionary for that?

LifeH2O 2010-06-21 21:59:07

The short answer is that you should use `Dictionary<string, object>`. For the long answer, see my edit.

Robert Rossney 2010-06-22 02:39:29

Fastest way to parse XML files in C#?