tags:
views: 55
answers: 2

Hello,

I have been using the java.net classes for a custom-built crawler. The problem is dynamically generated content, such as comments on a blog. Consider the following page: http://www.avc.com/a_vc/2010/09/contrarian-investing.html. If you crawl the page and fetch the source, you can't see the entire content of the page. I need the full content because I'm performing keyword density calculations, so my app has to see exactly what the browser would see. Any suggestions?

I've looked at Apache's HttpClient, but it's the same as the crawler above: it just returns the raw source. I think that particular page has a piece of JavaScript that pulls the comments in from another domain, so I suppose what I need is to parse the source after downloading it and then extract the text. Any help is appreciated.
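
For reference, this is roughly what my fetch looks like (simplified):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class RawSourceFetch {
        public static void main(String[] args) throws Exception {
            URL page = new URL(
                    "http://www.avc.com/a_vc/2010/09/contrarian-investing.html");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(page.openStream()));
            // This yields only the static HTML; the comments are injected by
            // JavaScript after the page loads, so they never show up here.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }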

thanks

Sam

+1  A: 

Web-testing APIs have JavaScript support built in. HttpUnit has some capacity to execute JavaScript via Rhino, I think, though it's been a while since I used it and I remember the JavaScript support being limited. Alternatively, you can try Selenium RC, which drives a real browser and is pretty powerful for that sort of thing, but again, I'm not sure it solves your problem specifically. A sketch of the Selenium RC route follows the links below.

Selenium RC - http://seleniumhq.org/projects/remote-control/
HttpUnit - http://httpunit.sourceforge.net/
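
A minimal sketch with the Selenium RC Java client (it assumes a Selenium server is already running on localhost:4444 and Firefox is installed; the fixed sleep is a crude stand-in for a proper wait condition):

    import com.thoughtworks.selenium.DefaultSelenium;
    import com.thoughtworks.selenium.Selenium;

    public class RenderedSourceFetcher {
        public static void main(String[] args) throws InterruptedException {
            // Assumes a Selenium RC server is already running on
            // localhost:4444 and Firefox is installed on the same machine.
            Selenium selenium = new DefaultSelenium(
                    "localhost", 4444, "*firefox", "http://www.avc.com/");
            selenium.start();
            try {
                // open() blocks until the initial page load finishes.
                selenium.open("/a_vc/2010/09/contrarian-investing.html");
                // Crude wait for comments injected asynchronously after the
                // load event; a real crawler would poll for a DOM marker.
                Thread.sleep(5000);
                // getHtmlSource() returns the DOM as the browser sees it now,
                // including any script-generated content.
                System.out.println(selenium.getHtmlSource());
            } finally {
                selenium.stop();
            }
        }
    }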

Anthony Bishopric
A: 

Try using an existing JavaScript engine (V8 from Google or Rhino from Mozilla) with a timeout on execution time, but that may be very hard. It may be easier to detect the request URL in the JavaScript text and fetch it with your crawler; see the sketch below.
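
A rough sketch of that second approach, using only java.net and a regex (the pattern is purely illustrative; a real crawler should use an HTML parser and resolve relative URLs before requesting them):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ScriptUrlExtractor {
        // Very rough pattern for external script references; real pages
        // need an HTML parser, and relative URLs must be resolved.
        private static final Pattern SCRIPT_SRC = Pattern.compile(
                "<script[^>]+src=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

        static String fetch(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            String html = fetch(
                    "http://www.avc.com/a_vc/2010/09/contrarian-investing.html");
            // Print every candidate URL the browser would request; the
            // comment feed should be among them or referenced by one of them.
            Matcher m = SCRIPT_SRC.matcher(html);
            while (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }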

whalebot.helmsman
I can certainly do the latter for this page, but how can I make the process fully automatic for all other pages? Is it correct to assume that any other page will only use JavaScript for this kind of page generation? Presumably anything else would be server-side, correct?
Sam Mohamed