Hello Experts,
I know this kind of question must have been asked here before, but by searching I didn't find a solution.
My question is: what are the best Java libraries to fully download any webpage, execute its embedded JavaScript, and then access the rendered page (that is, the DOM tree!) programmatically and get the DOM tree as "HTML source"?
(Something similar to what Firebug does: it renders the page and I get access to the fully rendered DOM tree, as the page looks in the browser. In contrast, if I click "show source" I only get the JavaScript source code. This is not what I want. I need access to the rendered page...)
(By rendering I mean only rendering the DOM tree, not a visual rendering...)
This does not have to be one single library; it's OK to have several libraries that accomplish this together (one downloads, one renders...), but due to the dynamic nature of JavaScript, the JavaScript library will most likely also need some kind of downloader to fully render any asynchronous JS...
Background: In the "good old days", HttpClient (the Apache library) was everything required to build your own very simple crawler. (A lot of crawlers like Nutch or Heritrix are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.) My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient alone, as I definitely need to execute the JavaScript first...
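To make the problem concrete, here is a minimal sketch of what a plain HTTP fetch gives me. It is only an illustration under some assumptions: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) as a stand-in for Apache HttpClient, and the class name RawFetchDemo, the PAGE constant and the tiny local test server are all hypothetical, made up for this example so it runs without any external site.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RawFetchDemo {

    // Hypothetical test page: the <div> is only filled in when a browser runs the script.
    public static final String PAGE =
        "<html><body><div id=\"content\"></div>"
        + "<script>document.getElementById('content').textContent = 'rendered by JS';</script>"
        + "</body></html>";

    // Serves the given page from a throwaway local server and fetches it with a plain HTTP client.
    public static String fetchRawSource(String page) throws Exception {
        InetAddress loopback = InetAddress.getByName("127.0.0.1");
        HttpServer server = HttpServer.create(new InetSocketAddress(loopback, 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = page.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            HttpRequest request =
                HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/")).build();
            // The client only downloads bytes; it never executes the <script> element.
            HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        String raw = fetchRawSource(PAGE);
        System.out.println(raw);
        // Checks whether the <div> is still empty, i.e. the script was downloaded but not run.
        System.out.println(raw.contains("<div id=\"content\"></div>"));
    }
}
```

The body that comes back is just the page source with the script still inside it and the div still empty. What I am looking for is a library that returns the DOM after that script has run.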
Thank you very much!! Tim