My plan is to read in an XML document using my C# program, search for particular entries which I'd like to change, and then write out the modified document. However, I've come unstuck: with XmlTextReader, which I'm using to read in the file, it's hard to tell whether an element is starting or ending. I could do with a bit of advice to put me on the right track.

The document is an HTML document, so as you can imagine, it's quite complicated.

I'd like to search for an element id within the HTML document, so for example look for this and change the src:

<img border="0" src="bigpicture.png" width="248" height="36" alt="" id="lookforthis" />
A: 

This is a simple article on how to read and write XML files:

http://www.c-sharpcorner.com/uploadfile/mahesh/readwritexmltutmellli2111282005041517am/readwritexmltutmellli21.aspx

It is only a very simple introduction, to get you started on the concepts and namespaces.

PaulStack
Thanks, but I've already read that, and based some of my code on it.
wonea
A: 

Just start by reading the documentation of the System.Xml namespace on MSDN. Then if you have more specific questions, post them here...

md5sum
+1  A: 

If you have smaller documents which fit in the computer's memory, you can use XmlDocument. Otherwise you can use XmlReader to iterate through the document.

Using XmlReader, you can find out each node's type with NodeType:

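while (xml.Read()) {
   switch (xml.NodeType) {
     case XmlNodeType.Element:
      // Start tag; xml.IsEmptyElement is true for self-closing tags like <img ... />
      break;
     case XmlNodeType.Text:
      // Text content between tags
      break;
     case XmlNodeType.EndElement:
      // End tag (not reported for self-closing elements)
      break;
   }
}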
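
One detail that may explain the start/end confusion in the question: a self-closing tag, such as the img being searched for, is reported as a single Element node with IsEmptyElement set to true, and no EndElement follows it. A minimal sketch that lists the tag events the reader reports; the class and method names (NodeLister, ListTags) are made up for illustration:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeLister
{
    // Hypothetical helper: returns one entry per start/end tag seen by the reader.
    public static List<string> ListTags(string xmlText)
    {
        var events = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xmlText)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Self-closing elements never produce an EndElement node.
                        events.Add(reader.IsEmptyElement
                            ? "empty " + reader.Name
                            : "start " + reader.Name);
                        break;
                    case XmlNodeType.EndElement:
                        events.Add("end " + reader.Name);
                        break;
                }
            }
        }
        return events;
    }
}
```

Calling NodeLister.ListTags("<p><img id=\"lookforthis\" /></p>") gives "start p", "empty img", "end p".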
codymanix
A: 

Are the documents you are processing relatively small? If so, you could load them into memory using an XmlDocument object, modify it, and write the changes back out.

XmlDocument doc = new XmlDocument();
doc.Load("path_to_input_file");
// Make changes to the document.
// The using block closes the writer so the output is flushed to disk.
using (XmlTextWriter xtw = new XmlTextWriter("path_to_output_file", Encoding.UTF8))
{
    doc.WriteContentTo(xtw);
}

Depending on the structure of the input XML, this could make your parsing code a bit simpler.

Pat Daburu
A: 

One fairly easy approach would be to create a new XmlDocument, then use the Load() method to populate it. Once you've got the document, you can use CreateNavigator() to get an XPathNavigator object that you can use to find and alter elements in the document. Finally, you can use the Save() method on the XmlDocument to write the changed document back out.
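
A rough sketch of those steps, assuming the input is well-formed XML and that the img is located by its id as in the question. The class and method names are hypothetical, and the example works on strings so it stays self-contained; Load() and Save() would take the place of LoadXml() and OuterXml when working with files:

```csharp
using System.Xml;
using System.Xml.XPath;

public static class NavigatorExample
{
    // Hypothetical helper: sets src on the img whose id matches and returns the new markup.
    public static string ChangeSrc(string xmlText, string id, string newSrc)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xmlText); // doc.Load(path) for a file

        XPathNavigator nav = doc.CreateNavigator();
        // Assumes id contains no single quote, since it is pasted into the XPath.
        XPathNavigator img = nav.SelectSingleNode("//img[@id='" + id + "']");

        // MoveToAttribute returns false when the attribute is absent.
        if (img != null && img.MoveToAttribute("src", ""))
        {
            img.SetValue(newSrc);
        }
        return doc.OuterXml; // doc.Save(path) to write a file
    }
}
```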

ngroot
+4  A: 

If it's actually valid XML, and will easily fit in memory, I'd choose LINQ to XML (XDocument, XElement etc) every time. It's by far the nicest XML API I've used. It's easy to form queries, and easy to construct new elements too.

You can use XPath where that's appropriate, or the built-in axis methods (Elements(), Descendants(), Attributes() etc). If you could let us know what specific bits you're having a hard time with, I'd be happy to help work out how to express them in LINQ to XML.

If, on the other hand, this is HTML which isn't valid XML, you'll have a much harder time, because XML APIs generally expect to work with valid XML documents. You could use HTMLTidy first of course, but that may have undesirable effects.

For your specific example:

foreach (var img in doc.Descendants("img"))
{
    // src will be null if the attribute is missing
    string src = (string) img.Attribute("src");
    img.SetAttributeValue("src", src + "with-changes");
}
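
To touch only the img with a particular id, as the question asks, the query can be filtered first. This is a sketch rather than a drop-in solution: the class and method names are hypothetical, and it assumes the elements are in no XML namespace (an XHTML document with a default namespace would need an XName that includes it).

```csharp
using System.Linq;
using System.Xml.Linq;

public static class LinqToXmlExample
{
    // Hypothetical helper: sets src on every img whose id attribute equals the given id.
    public static string ChangeSrc(string xmlText, string id, string newSrc)
    {
        XDocument doc = XDocument.Parse(xmlText); // XDocument.Load(path) for a file

        // The cast yields null when the id attribute is missing, so such elements are skipped.
        var matches = doc.Descendants("img")
                         .Where(img => (string)img.Attribute("id") == id)
                         .ToList();

        foreach (XElement img in matches)
        {
            img.SetAttributeValue("src", newSrc);
        }
        return doc.ToString(); // doc.Save(path) to write a file
    }
}
```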
Jon Skeet
Bump XDocument for great justice.
annakata
I heartily agree! I had a couple of older apps I had to do the hard way with parsing and the like, and L2X makes it so much easier and more powerful.
Dillie-O
Jon, you may find HtmlAgilityPack very useful: instead of worrying about valid XML, you can use APIs similar to XDocument on dirty, real-world HTML.
Peter J
@Peter: Fortunately I've rarely had to work with dirty HTML - I've found myself using real XML more frequently. I'll bear it in mind though.
Jon Skeet
+1  A: 

For the task in hand (read an existing doc, modify it in a formalised way, and write it out), I'd go with an XPathDocument run through an XslCompiledTransform.

Where you can't formalise, don't have pre-existing docs or generally need more adaptive logic, I'd go with LINQ and XDocument like Skeet says.

Basically: if the task is transformation, use XSLT; if the task is manipulation, use LINQ.
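
As an illustration of the XSLT route, here is a sketch of an identity transform with a single override that rewrites the src of the img from the question. The stylesheet, the replacement file name (smallpicture.png) and the class name are assumptions made for the example, and the input is assumed to be well-formed XML:

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class XsltExample
{
    // Identity transform plus one override; the new src value is an illustrative placeholder.
    private const string Stylesheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='img[@id=""lookforthis""]/@src'>
    <xsl:attribute name='src'>smallpicture.png</xsl:attribute>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xmlText)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        // XPathDocument is a read-only, XPath-optimised view of the input.
        var input = new XPathDocument(new StringReader(xmlText));
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Everything the second template does not match is copied through unchanged by the first, so only the one attribute differs in the output.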

annakata
A: 

My favorite tool for this kind of thing is HtmlAgilityPack. I use it to parse complex HTML documents into LINQ-queryable collections. It is an extremely useful tool for querying and parsing HTML (which is often not valid XML).

For your problem, the code would look like:

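using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(stringOfHtml);
// Attribute tests in XPath need "@" and a quoted value.
var images = htmlDoc.DocumentNode.SelectNodes("//img[@id='lookforthis']");

// SelectNodes returns null when nothing matches.
if (images != null)
{
  foreach (HtmlNode node in images)
  {
      // SetAttributeValue overwrites the existing alt instead of adding a duplicate.
      node.SetAttributeValue("alt", "added an alt to lookforthis images.");
  }
}

htmlDoc.Save("output.html");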
Peter J