I am using BeautifulSoup and urllib2 to download HTML pages and parse them. The problem is malformed HTML pages. Although BeautifulSoup is good at handling malformed HTML, it is still not as good as Firefox.

Since Firefox and WebKit are more up to date and more resilient at handling HTML, I think it would be ideal to use one of them to construct and normalize the DOM tree of a page, and then manipulate that tree from Python.

However, I can't find any Python bindings for either engine. Can anyone suggest a way?

I ran into some solutions that run a headless Firefox process and control it from Python, but is there a more Pythonic solution available?
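
To make "normalize" concrete, here is a toy sketch of what repairing malformed markup can look like. It is not what Firefox or WebKit do (their error recovery is far more elaborate), and it uses Python 3's standard-library `html.parser` rather than BeautifulSoup or urllib2. The names `ToyNormalizer` and `normalize` are made up for illustration, and attributes are dropped to keep it short.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ToyNormalizer(HTMLParser):
    """Hypothetical helper: re-emits markup with every opened tag closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # pieces of the normalized output

    def handle_starttag(self, tag, attrs):
        # Attributes are deliberately dropped in this sketch.
        self.out.append("<%s>" % tag)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any still-open descendants first, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def normalize(markup):
    """Return markup with unclosed tags closed and stray end tags removed."""
    parser = ToyNormalizer()
    parser.feed(markup)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)


print(normalize("<p><b>bold<i>both</p>tail"))
```

A real browser engine makes many more decisions than this (implied `<html>`/`<body>` elements, table fix-ups, misnested formatting tags), which is why getting the DOM from the engine itself is attractive.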

A: 

Perhaps pywebkitgtk would do what you need.

vezult
No, it won't. pywebkitgtk is "merely a page displayer". You want http://www.gnu.org/software/pythonwebkit, which is a heavily modified version that incorporates WebKit (!) and allows access to the DOM: all 3,000 functions and all 20,000 properties.
A: 

See http://wiki.python.org/moin/WebBrowserProgramming

There are quite a lot of options. I'm maintaining the page above so that I don't keep repeating myself.

You should look at pyjamas-desktop: see the examples/uitest example, because we use exactly this trick to get copies of the HTML page "out", so that the Python-to-JavaScript compiler can be tested by comparing the page results after each unit test.

Each of the runtimes supported and used by pyjamas-desktop allows access to the "innerHTML" property of the document's body element (and a hell of a lot more).

Bottom line: it is trivial to do what you want, but you have to know where to look to find out how to do it.

l.