I am using BeautifulSoup and urllib2 for downloading HTML pages and parsing them. Problem is with mis formed HTML pages. Though BeautifulSoup is good at handling mis formed HTML still its not as good as Firefox.
Considering that Firefox or Webkit are more updated and resilient at handling HTML I think its ideal to use them to construct and normalize DOM tree of a page and then manipulate it through Python.
However I cant find any python binding for the same. Can anyone suggest a way ?
I ran into some solutions of running a headless Firefox process and manipulating it through python but is there a more pythonic solution available.