Hello,
I have been using java.net to build a custom crawler. The problem is with dynamically generated content, such as the comments on a blog. Consider the following page: http://www.avc.com/a_vc/2010/09/contrarian-investing.html . If you crawl the page and fetch its source, the source does not contain the entire content of the page as it appears in a browser. I need the full content because I'm performing keyword density calculations, so my app has to see exactly what the browser would see. Any suggestions?
I've looked at Apache's HttpClient, but it behaves the same as the crawler above: it just returns the raw source. I think that particular page has a piece of JavaScript that loads the comments from another domain, so I suppose what I need is to parse the source after downloading it, find where the comments come from, and then get the text. Any help is appreciated.
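To make the idea concrete, here is a rough sketch of the parsing step I have in mind, assuming the comments are pulled in by a script tag with a src attribute. The class name ScriptSrcFinder and the method findScriptSources are hypothetical names of my own, and the regex is only a quick approximation, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the external script URLs referenced by a page,
// so that each one can be inspected or downloaded in a second request.
class ScriptSrcFinder {

    // Matches <script ... src="..."> or <script ... src='...'>, ignoring case.
    // Assumption: the src value is quoted and the tag contains no '>' before it.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the src values of all script tags, in document order.
    static List<String> findScriptSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }
}
```

Each URL returned this way would still have to be fetched separately, and I'm not sure this would catch scripts that build their requests at runtime, which is why I'm asking whether there is a better approach.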
Thanks,
Sam