I need to convert HTML documents into valid XML, preferably XHTML. What's the best way to do this? Does anybody know of a toolkit, library, or sample that would help me get this done?

To be clearer: my application has to do the conversion automatically at runtime. I am not looking for a tool that helps me move some pages to XHTML manually.

+18  A: 

Convert from HTML to XML with HTML Tidy

Downloadable Binaries

JRoppert, for your needs, I guess you might want to look at the Sources.

c:\temp>tidy -help
tidy [option...] [file...] [option...] [file...]
Utility to clean up and pretty print HTML/XHTML/XML
see http://tidy.sourceforge.net/

Options for HTML Tidy for Windows released on 14 February 2006:

File manipulation
-----------------
 -output <file>, -o  write output to the specified <file>
 <file>
 -config <file>      set configuration options from the specified <file>
 -file <file>, -f    write errors to the specified <file>
 <file>
 -modify, -m         modify the original input files

Processing directives
---------------------
 -indent, -i         indent element content
 -wrap <column>, -w  wrap text at the specified <column>. 0 is assumed if
 <column>            <column> is missing. When this option is omitted, the
                     default of the configuration option "wrap" applies.
 -upper, -u          force tags to upper case
 -clean, -c          replace FONT, NOBR and CENTER tags by CSS
 -bare, -b           strip out smart quotes and em dashes, etc.
 -numeric, -n        output numeric rather than named entities
 -errors, -e         only show errors
 -quiet, -q          suppress nonessential output
 -omit               omit optional end tags
 -xml                specify the input is well formed XML
 -asxml, -asxhtml    convert HTML to well formed XHTML
 -ashtml             force XHTML to well formed HTML
 -access <level>     do additional accessibility checks (<level> = 0, 1, 2, 3).
                     0 is assumed if <level> is missing.

Character encodings
-------------------
 -raw                output values above 127 without conversion to entities
 -ascii              use ISO-8859-1 for input, US-ASCII for output
 -latin0             use ISO-8859-15 for input, US-ASCII for output
 -latin1             use ISO-8859-1 for both input and output
 -iso2022            use ISO-2022 for both input and output
 -utf8               use UTF-8 for both input and output
 -mac                use MacRoman for input, US-ASCII for output
 -win1252            use Windows-1252 for input, US-ASCII for output
 -ibm858             use IBM-858 (CP850+Euro) for input, US-ASCII for output
 -utf16le            use UTF-16LE for both input and output
 -utf16be            use UTF-16BE for both input and output
 -utf16              use UTF-16 for both input and output
 -big5               use Big5 for both input and output
 -shiftjis           use Shift_JIS for both input and output
 -language <lang>    set the two-letter language code <lang> (for future use)

Miscellaneous
-------------
 -version, -v        show the version of Tidy
 -help, -h, -?       list the command line options
 -xml-help           list the command line options in XML format
 -help-config        list all configuration options
 -xml-config         list all configuration options in XML format
 -show-config        list the current configuration settings

Use --blah blarg for any configuration option "blah" with argument "blarg"

Input/Output default to stdin/stdout respectively
Single letter options apart from -f may be combined
as in:  tidy -f errs.txt -imu foo.html
For further info on HTML see http://www.w3.org/MarkUp
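
As a rough sketch of how the options listed above fit the runtime-conversion case, the following reads HTML on stdin and writes XHTML on stdout. It assumes a `tidy` binary is on the PATH; the function name `html_to_xhtml` is made up for illustration, and every flag used appears in the help text above.

```shell
#!/bin/sh
# Sketch only: convert HTML on stdin to XHTML on stdout with HTML Tidy.
# Assumes `tidy` is on PATH; html_to_xhtml is a hypothetical helper name.
html_to_xhtml() {
    # -asxhtml  convert HTML to well formed XHTML
    # -numeric  output numeric rather than named entities
    # -quiet    suppress nonessential output
    # -utf8     use UTF-8 for both input and output
    # Diagnostics are written to stderr and discarded here.
    tidy -asxhtml -numeric -quiet -utf8 2>/dev/null
    status=$?
    # Tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    # Warnings are treated as acceptable in this sketch.
    [ "$status" -le 1 ]
}

if command -v tidy >/dev/null 2>&1; then
    printf '<title>t</title><p>Hello<br>world' | html_to_xhtml
else
    echo "tidy not found on PATH" >&2
fi
```

An application could spawn the same command as a child process, write the HTML to its stdin, and read the XHTML back from its stdout.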
Prakash
Sounds good, I will check this out.
JRoppert
Meanwhile, I have it running in my C# application. Great stuff.
JRoppert
+1  A: 

Have you tried this one and does it suffice?

Ólafur Waage
A: 

The easiest way is to set your Visual Studio IDE to identify the changes you need to make. You can do this in Visual Studio 2008 by going to Tools, Options, Text Editor, HTML, Validation and choosing the appropriate target, for example XHTML 1.1 or XHTML 1.0 Transitional.

For some information on the different types, read: http://msdn.microsoft.com/en-us/library/aa479043.aspx

Then you need to work through the points highlighted on your page.

Bravax
Sorry, I was not clear in my question. I need to do the conversion automatically at runtime.
JRoppert
+3  A: 

You can use the HTML Agility Pack. It's an open-source project on CodePlex.

TcKs
Sounds good, I will check this out.
JRoppert
+3  A: 

The Validator.nu HTML Parser comes with an HTML2XML sample program that does the conversion using the HTML5 parsing algorithm and infoset coercion rules.

hsivonen
A: 

Use Html2Xhtml for .NET 4.0:

In-memory string-to-string conversion:

var xhtml = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToEnd();

In-memory string-to-XDocument conversion:

var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument();

See http://corsis.sourceforge.net/index.php/Html2Xhtml for more information.

Cetin Sert
I had the exact same question and used this answer; it works beautifully, especially for the conversion to XElement.
Beaker
A: 

I have written a tutorial at http://www.bejoy.in/Techzone/Convert-html-to-xhtml based on what I learned about converting HTML to XHTML.

~Bejoy

This doesn't address the question (which is doing it automatically, at run time), has some bad practice examples, confuses "general good practice" and "things that have always been mandatory" with "differences between HTML and XHTML", seems focused on converting some specific low quality documents to XHTML, gets comments and CDATA rules inside script elements hopelessly wrong, claims that W3Schools is a good resource (it isn't), and thinks Tidy is better than it is (it is useful, but not that useful).
David Dorward
A: 

Another option is xmllint.
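
A minimal sketch of what that could look like, assuming the `xmllint` binary from libxml2 is on the PATH. The function name `html_to_xml` is hypothetical. `--html` selects the HTML parser and `--xmlout` asks for the result to be serialized as XML. Note that this produces well-formed XML but, as far as I know, does not add the XHTML namespace.

```shell
#!/bin/sh
# Sketch only: parse HTML from stdin and re-serialize it as well-formed XML.
# Assumes `xmllint` (libxml2) is on PATH; html_to_xml is a hypothetical name.
html_to_xml() {
    # --html    parse the input with the HTML parser
    # --xmlout  save the parsed HTML document using the XML serializer
    # -         read the document from stdin
    # Parser diagnostics go to stderr and are discarded here.
    xmllint --html --xmlout - 2>/dev/null
}

if command -v xmllint >/dev/null 2>&1; then
    printf '<title>t</title><p>Hello<br>world' | html_to_xml
else
    echo "xmllint not found on PATH" >&2
fi
```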

reinierpost