I have to load many XML files from the internet. To speed up testing, I downloaded all of them (more than 500 files). They have the following format.

<player-profile>
  <personal-information>
    <id>36</id>
    <fullname>Adam Gilchrist</fullname>
    <majorteam>Australia</majorteam>
    <nickname>Gilchrist</nickname>
    <shortName>A Gilchrist</shortName>
    <dateofbirth>Nov 14, 1971</dateofbirth>
    <battingstyle>Left-hand bat</battingstyle>
    <bowlingstyle>Right-arm offbreak</bowlingstyle>
    <role>Wicket-Keeper</role>
    <teams-played-for>Western Australia, New South Wales, ICC World XI, Deccan Chargers, Australia</teams-played-for>
    <iplteam>Deccan Chargers</iplteam>
  </personal-information>
  <batting-statistics>
    <odi-stats>
      <matchtype>ODI</matchtype>
      <matches>287</matches>
      <innings>279</innings>
      <notouts>11</notouts>
      <runsscored>9619</runsscored>
      <highestscore>172</highestscore>
      <ballstaken>9922</ballstaken>
      <sixes>149</sixes>
      <fours>1000+</fours>
      <ducks>0</ducks>
      <fifties>55</fifties>
      <catches>417</catches>
      <stumpings>55</stumpings>
      <hundreds>16</hundreds>
      <strikerate>96.95</strikerate>
      <average>35.89</average>
    </odi-stats>
    <test-stats>
      .
      .
      .
    </test-stats>
    <t20-stats>
      .
      .
      .    
    </t20-stats>
    <ipl-stats>
      .
      .
      . 
    </ipl-stats>
  </batting-statistics>
  <bowling-statistics>
    <odi-stats>
      <matchtype>ODI</matchtype>
      <matches>378</matches>
      <ballsbowled>58</ballsbowled>
      <runsgiven>64</runsgiven>
      <wickets>3</wickets>
      <fourwicket>0</fourwicket>
      <fivewicket>0</fivewicket>
      <strikerate>19.33</strikerate>
      <economyrate>6.62</economyrate>
      <average>21.33</average>
    </odi-stats>
    <test-stats>
      .
      .
      . 
    </test-stats>
    <t20-stats>
      .
      .
      . 
    </t20-stats>
    <ipl-stats>
      .
      .
      . 
    </ipl-stats>
  </bowling-statistics>
</player-profile>

I am using

XmlNodeList list = _document.SelectNodes("/player-profile/batting-statistics/odi-stats");

And then loop this list with foreach as

foreach (XmlNode stats in list)
  {
     _btMatchType = GetInnerString(stats, "matchtype"); // returns null if the node is not available
     .
     .
     .
     .
     _btAvg = Convert.ToDouble(stats["average"].InnerText);
  }

Even though I am loading all the files offline, parsing is very slow. Is there a faster way to parse them? Or is the problem with SQL? I am saving all the data extracted from the XML to a database using DataSets and TableAdapters with the Insert command.

EDIT: Since XmlReader was suggested, please give some XmlReader code for the above document. So far, I have this:

void Load(string url) 
{
    _reader = XmlReader.Create(url); 
    while (_reader.Read()) 
    { 
    } 
} 

The available XmlReader methods are confusing. What I need is to read the batting and bowling stats completely. The batting and bowling stats have different fields, while the odi, test, t20 and ipl blocks appear in the same way inside both.

A: 

If the documents are large, then a stream-based parser (which is fine for your needs) will be faster than using XmlDocument, mostly because of the lower overhead. Check out the documentation for XmlReader.

Adrian
Could you please give just a small piece of code for my document?
LifeH2O
+7  A: 

You can use an XmlReader for forward only, fast reading.
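
As a rough illustration for the document in the question, here is a minimal sketch of one forward-only pass. It is not code from this thread: the class name `PlayerProfileReader`, the method `ReadLeaves`, and the idea of keying each value by its slash-separated element path are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PlayerProfileReader
{
    // Hypothetical helper: reads the whole document once and returns every
    // text value keyed by its element path, for example
    // "player-profile/batting-statistics/odi-stats/average".
    public static Dictionary<string, string> ReadLeaves(TextReader input)
    {
        var values = new Dictionary<string, string>();
        var path = new List<string>();

        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // An empty element (<x/>) has no EndElement node,
                        // so it must not be pushed onto the path.
                        if (!reader.IsEmptyElement)
                        {
                            path.Add(reader.Name);
                        }
                        break;

                    case XmlNodeType.Text:
                        // The current path names the element that owns this text.
                        values[string.Join("/", path)] = reader.Value;
                        break;

                    case XmlNodeType.EndElement:
                        path.RemoveAt(path.Count - 1);
                        break;
                }
            }
        }
        return values;
    }
}
```

A missing element is simply absent from the dictionary, so `values.TryGetValue("player-profile/bowling-statistics/odi-stats/wickets", out text)` reports it without throwing. Batting and bowling stats stay separate because the section name is part of the key.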

Carra
OK, I am going to try that! How do I handle the exceptions when some nodes are missing in a document? I added a lot of try and catch statements to avoid a NullReferenceException on code like `stats["average"].InnerText`, where "average" is the node name.
LifeH2O
Hmm; if there are a lot of missing elements, part of your performance problem may relate to the number of exceptions being thrown. Exceptions are expensive. Checking for the presence of a node before you reference it is much cheaper.
Adrian
Have a function like getAttribute(string statname) which uses stats[statname] inside a try/catch block and returns string.Empty whenever an exception is caught.
apoorv020
If you get a NullReferenceException here, it's because stats["average"] is null. Just add an `if (stats["average"] != null)` check.
Carra
@apoorv020 I already did that; GetInnerString() does that task.
LifeH2O
@Adrian OK, I am going to write a function with checks and try to avoid exceptions and try/catch statements. Thanks.
LifeH2O
@Carra `stats["average"].InnerText` throws a NullReferenceException when the node "average" is not available.
LifeH2O
Indeed, just do if(stats["average"] != null) stats["average"].InnerText;
Carra
@Carra Thanks, I did that.
LifeH2O
A: 

I wouldn't say LINQ is the best approach. I searched Google and saw some references to HTML Agility Pack.

I think that if you're going to have a speed bottleneck, it will be in your download process. In other words, it appears that your performance problems are not in your XML code. There may be ways to improve your download speed or your file I/O, but I don't know what they would be.

djangofan
No, as I already said, I downloaded all the files to my PC for speed, and I am no longer getting them from the internet.
LifeH2O
HTML Agility pack is used to parse *html*. It's more forgiving than parsing xml. Still, checking if the bottleneck is in downloading the files is a good idea.
Carra
+2  A: 

You could try LINQ to XML. Or you can use this to figure out what to use.
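
For comparison, here is a minimal LINQ to XML sketch against the document in the question. It is an illustration only: the class and method names are made up for the example. It relies on the explicit conversions that `XElement` defines, which return null for a missing element instead of throwing.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class LinqToXmlSketch
{
    // Hypothetical helper: finds one field of the ODI batting block,
    // or returns null when the block or the field is missing.
    private static XElement OdiBattingField(XDocument doc, string name)
    {
        return doc.Descendants("batting-statistics")
                  .Elements("odi-stats")
                  .Elements(name)
                  .FirstOrDefault();
    }

    public static string ReadMatchType(XDocument doc)
    {
        // (string)null-XElement yields null rather than an exception.
        return (string)OdiBattingField(doc, "matchtype");
    }

    public static double? ReadAverage(XDocument doc)
    {
        // (double?) yields null for a missing element; it still throws
        // FormatException if the text is present but not a valid number.
        return (double?)OdiBattingField(doc, "average");
    }
}
```

With a file on disk, `XDocument.Load(path)` produces the document to pass in. Like XmlDocument, this loads the whole file into memory, so it is a convenience option rather than a streaming one.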

Chandam
Thanks, it means I must use XmlReader. I am still looking for a good tutorial for a GUI app.
LifeH2O
A: 

If you know that the XML is consistent and well formed, you can simply avoid doing real XML parsing and just process them as flat text files. This is risky, non-portable, and brittle.

But it'll be the fastest (to run, not to code) solution.

Joshua Muskovitz
-1 for giving advice on how to create risky, non-portable and brittle solutions.
John Saunders
+1..for being honest and giving the fastest solution
Luke101
A: 

An XmlReader is the solution to your problem. An XmlDocument stores a lot of meta-information, which makes the XML easy to access but heavy on memory. I have seen XML files smaller than 50 KB turn into XmlDocument objects of a few MB (10 or so).

Sudesh Sawant
Can you please give some XmlReader code for my document? So far, I have this: `void Load(string url) { _reader = XmlReader.Create(url); while (_reader.Read()) { } }` The available XmlReader methods are confusing. What I need is to read the batting and bowling stats completely. The batting and bowling stats have different fields, while the odi, test, t20 and ipl blocks appear in the same way inside both.
LifeH2O
`void Load(string url) { _reader = XmlReader.Create(url); while (_reader.Read()) { string name = _reader.Name; /* gives the name */ string value = _reader.Value; /* gives the value as a string */ } }` Please check MSDN for more details. You will have to check NodeType, HasValue, HasAttributes, etc.
Sudesh Sawant
XmlReader is difficult to implement and XmlDocument is easy. The problem was not with XmlDocument; it was slow because of the try/catch statements. Thanks for the help.
LifeH2O
A: 

If you are already converting that information into a DataSet to insert it into tables, just use DataSet.ReadXml() and work with the default tables it creates from the data.

This toy app does that, and it works with the format you defined above.

Project file: http://www.dot-dash-dot.com/files/wtfxml.zip Installer: http://www.dot-dash-dot.com/files/WTFXMLSetup_1_8_0.msi

It lets you browse and edit your XML file in a tree and grid format. The tables listed in the grid are the ones automatically created by the DataSet after ReadXml().

Ron Savage
Thank you! I made the DataSet for the database, not for the XML. I am parsing the XML files, extracting the data, and passing it to TableAdapter.Insert to save it to the database, then displaying it by binding GUI components to the database.
LifeH2O
+2  A: 

The overhead of throwing exceptions probably dwarfs the overhead of XML parsing. You need to rewrite your code so that it doesn't throw exceptions.
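
To make that concrete, here is a minimal sketch of exception-free helpers. The name `GetInnerString` is borrowed from the question, but its body here is a guess at what such a helper might look like, and `GetDouble` with its `fallback` parameter is invented for the example. Using `TryParse` matters for this data because the sample contains `<fours>1000+</fours>`, which is not a valid number.

```csharp
using System.Globalization;
using System.Xml;

public static class StatsHelpers
{
    // Returns the text of the named child element, or null if it is missing.
    public static string GetInnerString(XmlNode parent, string name)
    {
        // The XmlNode indexer returns null when there is no such child element.
        XmlElement child = parent[name];
        return child == null ? null : child.InnerText;
    }

    // Returns the child's value as a double, or the fallback when the element
    // is missing or its text is not a plain number (for example "1000+").
    public static double GetDouble(XmlNode parent, string name, double fallback)
    {
        double value;
        string text = GetInnerString(parent, name);
        bool ok = double.TryParse(
            text, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
        return ok ? value : fallback;
    }
}
```

Inside the loop from the question this would read `_btAvg = StatsHelpers.GetDouble(stats, "average", 0.0);`, with no try/catch around it. Passing the invariant culture is a deliberate choice so that "35.89" parses the same way regardless of the machine's regional settings.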

One way is to check for the existence of an element before you ask for its value. That will work, but it's a lot of code. Another way to do it would be to use a map:

Dictionary<string, string> map = new Dictionary<string, string>
{
  { "matchtype", null },
  { "matches", null },
  { "ballsbowled", null }
};

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (map.ContainsKey(elm.Name))
   {
      map[elm.Name] = elm.InnerText;
   }
}

This code will handle all the elements whose names you care about and ignore the ones you don't. If the value in the map is null, it means that an element with that name didn't exist (or had no text).

In fact, if you're putting the data into a DataTable, and the column names in the DataTable are the same as the element names in the XML, you don't even need to build a map, since the DataTable.Columns property is all the map you need. Also, since the DataColumn knows what data type it contains, you don't have to duplicate that knowledge in your code:

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (myTable.Columns.Contains(elm.Name))
   {
      DataColumn c = myTable.Columns[elm.Name];
      if (c.DataType == typeof(string))
      {          
         myRow[elm.Name] = elm.InnerText;
         continue;
      }
      if (c.DataType == typeof(double))
      {
         myRow[elm.Name] = Convert.ToDouble(elm.InnerText);
         continue;
      }
      throw new InvalidOperationException("I didn't implement conversion logic for " + c.DataType.ToString() + ".");
   }
}

Note how I'm not declaring any variables to store this information in, so there's no chance of me screwing up and declaring a variable of a data type different from the column it's stored in, or creating a column in my table and forgetting to implement the logic that populates it.

Edit

Okay, here's something that's a bit tricksy. This is a pretty common technique in Python; in C# I think most people still think there's something weird about it.

If you look at the second example I gave, you can see that it's using the metainformation in the DataColumn to figure out what logic to use for converting an element's value from text to its base type. You can accomplish the same thing by building your own map, e.g.:

Dictionary<string, Type> typeMap = new Dictionary<string, Type>
{
   { "matchtype", typeof(string) },
   { "matches", typeof(int) },
   { "ballsbowled", typeof(int) }
};

and then do pretty much the same thing I showed in the second example:

if (typeMap[elm.Name] == typeof(int))
{
   // XmlElement exposes its text as InnerText
   result[elm.Name] = Convert.ToInt32(elm.InnerText);
   continue;
}

Your results can no longer be a Dictionary<string, string>, since now they can contain things that aren't strings; they have to be a Dictionary<string, object>.

But that logic seems a little ungainly; you're testing each item several times, there are continue statements to break out of it - it's not terrible, but it could be more concise. How? By using another map, one that maps types to conversion functions:

Dictionary<Type, Func<string, object>> conversionMap = 
   new Dictionary<Type, Func<string, object>>
{
   { typeof(string), (x => x) },
   { typeof(int), (x => Convert.ToInt32(x)) },
   { typeof(double), (x => Convert.ToDouble(x)) },
   { typeof(DateTime), (x => Convert.ToDateTime(x)) }
};

That's a little hard to read, if you're not used to lambda expressions. The type Func<string, object> specifies a function that takes a string as its argument and returns an object. And that's what the values in that map are: they're lambda expressions, which is to say functions. They take a string argument (x), and they return an object. (How do we know that x is a string? The Func<string, object> tells us.)

This means that converting an element can take one line of code:

result[elm.Name] = conversionMap[typeMap[elm.Name]](elm.InnerText);

Go from the inner to the outer expression: this looks up the element's type in typeMap, then looks up the conversion function for that type in conversionMap, and calls that function, passing it elm.InnerText as an argument.
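
Putting the two maps together, a self-contained sketch might look like the following. The class name `MapDrivenConversion` and the method `ConvertStats` are made up for the example, and it uses `int.Parse` and `double.Parse` with the invariant culture in place of the `Convert` calls above, which is a substitution made here rather than part of the original technique.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public static class MapDrivenConversion
{
    // Which element names we care about, and the type each one holds.
    private static readonly Dictionary<string, Type> TypeMap =
        new Dictionary<string, Type>
    {
        { "matchtype", typeof(string) },
        { "matches", typeof(int) },
        { "average", typeof(double) }
    };

    // How to turn text into each of those types.
    private static readonly Dictionary<Type, Func<string, object>> ConversionMap =
        new Dictionary<Type, Func<string, object>>
    {
        { typeof(string), x => x },
        { typeof(int), x => int.Parse(x, CultureInfo.InvariantCulture) },
        { typeof(double), x => double.Parse(x, CultureInfo.InvariantCulture) }
    };

    // Converts every known child element of a stats node; unknown names are ignored.
    public static Dictionary<string, object> ConvertStats(XmlNode stats)
    {
        var result = new Dictionary<string, object>();
        foreach (XmlNode node in stats.ChildNodes)
        {
            Type type;
            if (node.NodeType == XmlNodeType.Element &&
                TypeMap.TryGetValue(node.Name, out type))
            {
                result[node.Name] = ConversionMap[type](node.InnerText);
            }
        }
        return result;
    }
}
```

Adding a new field then means adding one entry to `TypeMap`, and adding a new type means adding one entry to `ConversionMap`. Note that the parse calls still throw on malformed text such as "1000+", so a real version would probably want `TryParse`-based converters.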

This may not be the ideal approach in your case. I really don't know. I show it here because there's a bigger issue at play. As Steve McConnell points out in Code Complete, it's easier to debug data than it is to debug code. This technique lets you turn program logic into data. There are cases where using this technique vastly simplifies the structure of your program. It's worth understanding.

Robert Rossney
Thank you! I removed all the try and catch statements and replaced them with a function that returns null or zero if the element is null or doesn't exist. Currently I am saving all the data to predefined variables and then calling insert(var1, var2, var3, ...). The method you describe looks more convenient; I am trying to understand how to implement it.
LifeH2O
Wow, that's great, now I understand the first method. It is a lot simpler, but I can only use the first method, as I am using a TableAdapter to store the data.
LifeH2O
Since the whole point of a TableAdapter is to simplify adapting DataTables to SQL, this comment doesn't seem to make sense to me.
Robert Rossney
I am currently using the `Dictionary<string, string>`, but it can store only one type of data, while the XML has different types of data such as int, double, DateTime, TimeSpan, etc. How can I use a dictionary for that?
LifeH2O
The short answer is that you should use `Dictionary<string, object>`. For the long answer, see my edit.
Robert Rossney