I'm looking for a way to process a web page and its associated JavaScript from the command line, so that the resulting DOM can be output.

The purpose is to identify forms within the page without doing any nasty HTML (and JavaScript) parsing with regular expressions.

Are there any command-line tools that will do this? Hypothetically speaking, I'm imagining a command-line web browser that downloads the content and outputs the DOM as text rather than rendering a pretty page.

+2  A: 

I don't know of any, but I wanted to highlight one difficulty with what you've suggested:

process a web page and associated Javascript

When would the output be produced? Many web pages have time-sensitive scripts, or onclick/onhover handlers, which affect the DOM. Would you want these to be executed? All of them, or only some? It's not trivial to decide when the page is "finished" and ready for the DOM to be output after JavaScript manipulation. (Before JavaScript manipulation, it's an easier problem; just wait for the document's DOMContentLoaded event...)

Edit: I'm not saying that you don't need JavaScript execution at all: you might want to handle any document.write sections during loading, as they might write out a form... I'm saying it's hard to know when you've done "enough" JavaScript...

Stobor
Good point. I guess "close enough is good enough" in this case. I really just need something that will give me a best-effort listing of form elements.
Steve M
+1  A: 

PyKHTML "handles JavaScript" and lets you traverse the DOM.

Mitja
+1  A: 

For Java, I've had fairly good experiences with HtmlUnit.

I've also used the BeautifulSoup Python library to parse forms and form data. There's no need to write regexps, as it lets you traverse the parsed tree without much effort. Note that it only parses the HTML as served; it doesn't execute any JavaScript.
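
To give an idea of what "traverse the tree and pick out the forms" looks like, here is a minimal sketch. It is not BeautifulSoup: I've swapped in Python's standard-library html.parser so it runs without installing anything. The names FormLister, list_forms and SAMPLE are made up for this example, and like any static parser it only sees forms present in the HTML as downloaded, not ones added later by scripts.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect each <form> and the fields declared inside it."""

    # Tags treated as form fields (an assumption; extend as needed).
    FIELD_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> seen so far
        self._current = None  # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action", ""),
                # HTML defaults to GET when no method is given.
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in self.FIELD_TAGS and self._current is not None:
            self._current["fields"].append(
                {"tag": tag, "name": attrs.get("name"), "type": attrs.get("type")}
            )

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of form descriptions found in the given HTML string."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE = """
<html><body>
<form action="/login" method="POST">
  <input type="text" name="user">
  <input type="password" name="pass">
  <button type="submit">Go</button>
</form>
</body></html>
"""

if __name__ == "__main__":
    for form in list_forms(SAMPLE):
        print(form["method"], form["action"], [f["name"] for f in form["fields"]])
```

In practice you would feed it the downloaded page source instead of SAMPLE. If the forms you care about are built by JavaScript, you'd need something that actually runs the scripts first (HtmlUnit, for instance) and then hand the resulting markup to a parser like this.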

Steen