What's the best way to validate that a document conforms to some version of HTML (preferably one that I can specify)? I'd like to know where the failures occur, as in a web-based validator, but from a native Python app.

A: 

XHTML is easy: use lxml.

HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
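As a rough sketch of the "run an external tool and parse its output" approach: the code below assumes nsgmls is on the PATH, that `-s` suppresses its normal output so only diagnostics remain, and that diagnostics arrive on stderr in the shape `program:file:line:column:severity: message`. All three are assumptions to check against your installed version, and the function names and the sample line are made up for illustration.

```python
import re
import subprocess
from typing import List, Tuple

# Assumed diagnostic shape: program:file:line:column:severity: message
SGML_MESSAGE = re.compile(
    r"^[^:]+:(?P<file>.+?):(?P<line>\d+):(?P<column>\d+):"
    r"(?P<severity>[A-Z]): (?P<message>.*)$"
)


def parse_sgml_messages(stderr_text: str) -> List[Tuple[int, int, str, str]]:
    """Return (line, column, severity, message) for each line that matches."""
    results = []
    for raw in stderr_text.splitlines():
        match = SGML_MESSAGE.match(raw)
        if match:
            results.append(
                (
                    int(match["line"]),
                    int(match["column"]),
                    match["severity"],
                    match["message"],
                )
            )
    return results


def validate_with_nsgmls(path: str, executable: str = "nsgmls"):
    """Run the external validator on a file and parse whatever it reports."""
    # "-s" is assumed to mean "suppress normal output"; verify for your build.
    completed = subprocess.run(
        [executable, "-s", path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status is expected for invalid documents
    )
    return parse_sgml_messages(completed.stderr)


# Hypothetical diagnostic line, used only to exercise the parser.
sample = 'nsgmls:page.html:3:10:E: element "FOO" undefined\n'
print(parse_sgml_messages(sample))
```

Keeping the parser separate from the subprocess call makes it easy to adjust the regular expression if your tool formats its messages differently.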

John Millikin
+1  A: 

I think that HTML Tidy will do what you want. There is a Python binding for it.

Neall
+1  A: 

Try tidylib. You can get some really basic bindings as part of the elementtidy module (builds elementtrees from HTML documents). http://effbot.org/downloads/#elementtidy

>>> import _elementtidy
>>> xhtml, log = _elementtidy.fixup("<html></html>")
>>> print(log)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element

Parsing the log should give you pretty much everything you need.
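A minimal sketch of that parsing step, assuming every message follows the `line N column M - Level: text` shape seen in the output above. The names `TidyMessage` and `parse_tidy_log` are made up for this example and are not part of elementtidy.

```python
import re
from typing import List, NamedTuple

# Matches Tidy-style log lines such as:
#   line 1 column 7 - Warning: discarding unexpected </html>
LOG_LINE = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - "
    r"(?P<level>\w+): (?P<message>.*)$"
)


class TidyMessage(NamedTuple):
    line: int
    column: int
    level: str
    message: str


def parse_tidy_log(log: str) -> List[TidyMessage]:
    """Return one TidyMessage per recognizable line; other lines are skipped."""
    messages = []
    for raw in log.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            messages.append(
                TidyMessage(
                    int(match["line"]),
                    int(match["column"]),
                    match["level"],
                    match["message"],
                )
            )
    return messages


# The log text from the session above, reused as sample input.
sample_log = """\
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element
"""

for msg in parse_tidy_log(sample_log):
    print(f"{msg.line}:{msg.column} [{msg.level}] {msg.message}")
```

Lines that do not match the pattern (summary or informational lines, for instance) are silently ignored, so loosen the regular expression if you need those too.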

Aaron Maenpaa
+3  A: 

For HTML5, you can try html5lib.

You can also decide to install the HTML validator locally and create a client to request the validation.

Here is a program I wrote to validate a list of URLs from a text file. It only sends a HEAD request to get the validation status, but if you do a GET you get the full results. Look at the validator's API; there are plenty of options.

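import time
from urllib.parse import quote

import httplib2

h = httplib2.Http(".cache")

# read the URLs to check, one per line
with open("urllistfile.txt", "r") as f:
    urllist = f.readlines()

for url in urllist:
    url = url.strip()
    if not url:
        continue
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    # percent-encode the target URL so its own query string survives intact
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + quote(url, safe="")
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['x-w3c-validator-status'] == "Abort":
            print(url, "FAIL")
        else:
            print(url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings'])
    except (httplib2.HttpLib2Error, KeyError, OSError) as exc:
        # report the problem instead of silently skipping the URL
        print(url, "ERROR", exc)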
karlcow
+1  A: 

http://countergram.com/software/pytidylib is a nice Python binding for HTML Tidy. Their example:

from tidylib import tidy_document

# document is the tidied markup; errors is Tidy's log as a string
document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
    options={'numeric-entities':1})
print(document)
print(errors)
Dave Brondsema