What are the advantages of using either XSLT or LINQ to XML for HTML parsing in C#? This assumes that the HTML has been cleaned so that it is valid XHTML. The extracted values will eventually go into a C# object to be validated and processed.

Please let me know if these are valid and if there are other things to consider.

XSLT Advantages:

  • Easy to change quickly and deploy
  • Fairly well known

XSLT Disadvantages:

  • Not compiled, so is slower to process
  • String manipulation can be cumbersome
  • Will be more challenging to get the result into the C# object at the end

Linq to XML Advantages:

  • Compiled, so it runs faster
  • Allows for better string manipulation

Linq to XML Disadvantages:

  • Must be compiled for update

Edit: I should clarify: I want these to run long term, and the website may update its layout once in a while. That was one of the bigger reasons I thought I would use something that didn't require compiling.

+2  A: 

Since you're going to C#, at some point your data is going to go through LINQ (or some other XML code for .NET) anyway, so you may as well keep it all there.

Unless you have some compelling reason to go with XSLT, such as already having a lot of experience with it or a deployment that strongly favours rolling out text files, keep it all in one place.

Adam Ruth
+9  A: 

Without knowing more about your use case, it is hard to give you general recommendations.

Anyhow, you are somewhat comparing apples and oranges. LINQ to XML (and LINQ in general) is a query language whereas XSLT is a programming language to transform XML tree structures. These are different concepts. You would use a query language whenever you want to extract a certain specific piece of information from a data source to do whatever you need to do with it (be it to set fields in a C# object). A transformation, in contrast, would be useful to convert one XML representation of your data into another XML representation.

So if your aim is to create C# objects from XML, you probably don't want to use XSLT but one of the other technologies offered by the .NET Framework to process XML data: the old XmlDocument, XmlReader, XPathDocument, XmlSerializer or XDocument. Each has its own advantages and disadvantages, depending on input size, input complexity, desired output etc.
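As a rough illustration of the "query into a C# object" route, here is a minimal sketch using XDocument. The Product class, the "product" row class and the two-cell row layout are made up for the example and are not taken from the question; it also assumes the cleaned page declares the XHTML namespace.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical target type; the question does not show its real object.
public sealed class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public static class ProductScraper
{
    // Elements in cleaned XHTML live in this namespace when the root declares it.
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    public static Product Parse(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // Query: first table row marked with the (assumed) class "product".
        XElement row = doc.Descendants(Xhtml + "tr")
            .First(tr => (string)tr.Attribute("class") == "product");

        // Take the trimmed text of each cell in document order.
        var cells = row.Elements(Xhtml + "td")
            .Select(td => td.Value.Trim())
            .ToList();

        return new Product
        {
            Name = cells[0],
            Price = decimal.Parse(cells[1], CultureInfo.InvariantCulture)
        };
    }

    public static void Main()
    {
        string page =
            "<html xmlns='http://www.w3.org/1999/xhtml'><body><table>" +
            "<tr class='product'><td>Widget</td><td> 9.95 </td></tr>" +
            "</table></body></html>";

        Product product = Parse(page);
        Console.WriteLine(product.Name + ": " +
            product.Price.ToString(CultureInfo.InvariantCulture));
    }
}
```

If the site changes its layout, only the query inside Parse needs to change, but that does mean a rebuild.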

Since you are dealing with HTML only, you might also want to have a look at the HTML Agility Pack on CodePlex.

0xA3
Thanks, I have been using the agility pack. One of the examples uses XSLT, which led me to research it more.
BenMaddox
LINQ may be a query language, but it's my understanding that Microsoft is holding off on implementing XSLT 2 support in .NET because they want to "encourage" people to use LINQ instead.
zeocrash
A: 

You shouldn't use either if you are just trying to parse HTML. HTML != XML and cannot be treated the same. For instance, the entity reference '&nbsp;' is perfectly valid in HTML but is not a valid entity within an XML document (without severe messing around with DTDs etc.). This will bite you, believe me!
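A small sketch of the entity problem, assuming you feed the markup straight to XDocument.Parse with no DTD: the named HTML entity &nbsp; is rejected as undeclared, while the equivalent numeric character reference is accepted. The helper name ParsesAsXml is just for this example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class EntityCheck
{
    // Returns true when the markup is well-formed XML for XDocument.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for, among other things, a reference to an undeclared entity.
            return false;
        }
    }

    public static void Main()
    {
        // Named HTML entity, not declared anywhere in the document.
        Console.WriteLine(ParsesAsXml("<p>a&nbsp;b</p>"));   // False

        // Numeric character reference for the same character.
        Console.WriteLine(ParsesAsXml("<p>a&#160;b</p>"));   // True
    }
}
```

So whatever cleans the HTML has to rewrite such entities (or supply a DTD) before any XML API will accept the page.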

I would also recommend using the HTML Agility pack - brilliant library.

Dan Diplo
That was already recommended. I forgot to mention that I was using that pack already.
BenMaddox
A: 

HTML Agility pack ?

Let me try.

ariso
A: 

In my experience, XSLT is more concise and readable when you're primarily dealing with rearranging and selecting existing XML elements. XPath is short and easy to understand, and the XML syntax avoids littering your code with XElement and XAttribute statements. XSLT works fine as an XML-tree transform language.

However, its string handling is poor, looping is unintuitive, and there's no meaningful concept of subroutines - you can't transform the output of another transform.

So, if you want to actually fiddle with element and attribute content, then it quickly falls short. There's no problem in using both, incidentally - XSLT to normalize the structure (say, to ensure that all table elements have tbody elements), and LINQ to XML to interpret it. The prioritized conditional matching possibilities mean XSLT is easier to use when dealing with many similar but distinct matches. XSLT is good at document simplification, but it's just missing too many basic features to be sufficient on its own.

Having jumped whole-heartedly on the LINQ to XML bandwagon, I'd say that it has less overlap with XSLT than might seem at first glance. (And I'd positively love to see an XSLT 2.0/XQuery 1.0 implementation for .NET).
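To make the "use both" idea concrete, here is a minimal sketch: an XSLT 1.0 identity transform plus one more specific template that wraps the contents of any table lacking a tbody, run through XslCompiledTransform, with the result then queried using LINQ to XML. The stylesheet is inlined as a string only to keep the example self-contained; in practice it could be loaded from a file so it can change without a rebuild. It assumes un-namespaced elements for brevity, and the class and method names are invented for the example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

public static class TableNormalizer
{
    // The more specific 'table[not(tbody)]' pattern outranks the
    // identity template through XSLT's default priorities.
    private const string Stylesheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='table[not(tbody)]'>
    <xsl:copy>
      <xsl:apply-templates select='@*'/>
      <tbody><xsl:apply-templates select='node()'/></tbody>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    public static XDocument Normalize(XDocument input)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        // Write the transform result directly into a new XDocument.
        var output = new XDocument();
        using (XmlWriter writer = output.CreateWriter())
        using (XmlReader source = input.CreateReader())
        {
            xslt.Transform(source, writer);
        }
        return output;
    }

    public static void Main()
    {
        XDocument raw = XDocument.Parse(
            "<table><tr><td>1</td></tr><tr><td>2</td></tr></table>");

        XDocument normalized = Normalize(raw);

        // After normalization the LINQ side can rely on tbody being present.
        int rows = normalized.Root.Element("tbody").Elements("tr").Count();
        Console.WriteLine(rows);
    }
}
```

The XSLT step only fixes the shape of the tree; anything involving the actual cell content stays on the C# side.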

In terms of performance, both techs are speedy. In fact, since it's so hard to express slow operations in XSLT, you're unlikely to accidentally trigger a slow case (unless you start playing with recursion...). By contrast, the power of LINQ to XML can also make it slow: just use any heavy-weight .NET object in some inner loop and you've got a budding performance problem.

Whatever you do, don't try to abuse XSLT by using it to perform anything but the simplest of logic: it's way more wordy and far less readable than the equivalent C#. If you need a bunch of logic (even simple things like date > DateTime.Now ? "will be" : "has" become huge bloated hacks in XSLT) and you don't want to use both XSLT and Linq to Xml, use Linq.
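For comparison, this is roughly what that kind of date conditional looks like on the LINQ to XML side. It is only a sketch: the event/name/date markup is invented, and the current time is passed in as a parameter so the result is repeatable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class EventPhrases
{
    // Builds one phrase per <event>, choosing the tense by comparing dates.
    public static List<string> Describe(XElement root, DateTime now)
    {
        return root.Elements("event")
            .Select(e => new
            {
                Name = (string)e.Attribute("name"),
                // XAttribute's explicit conversion parses XML-style dates.
                Date = (DateTime)e.Attribute("date")
            })
            .Select(e => e.Name + " " +
                (e.Date > now ? "will be" : "has been") + " held")
            .ToList();
    }

    public static void Main()
    {
        XElement root = XElement.Parse(
            "<events>" +
            "<event name='Launch' date='2030-01-01'/>" +
            "<event name='Beta' date='2010-01-01'/>" +
            "</events>");

        foreach (string line in Describe(root, new DateTime(2020, 1, 1)))
        {
            Console.WriteLine(line);
        }
    }
}
```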

Eamon Nerbonne