I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) equivalent to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but they do not execute JavaScript or any of that 'fancy stuff'.

My ideal solution looks like any of the following (fantasy solutions):

cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source  
(fantasy command line, no idea if flags like these exist)

or

cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"    

As a secondary concern, I also need:

  • dump all included JavaScript source to a file (à la Firebug)
  • dump a PDF/image of the page to a file (print to file)
+1  A: 

HtmlUnit does execute JavaScript. I'm not sure whether you can obtain the HTML code after DOM manipulation, but give it a try.

You could write a little Java program that fits your requirements and execute it from the command line, as in your examples.

I haven't tried the code below, I just had a look at the Javadoc:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpPage {
    public static void main(String[] args) throws IOException {
        String pageURL = args[0];

        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage(pageURL);

        // asXml() serializes the DOM as it stands after JavaScript has run;
        // asText() would only return the visible text
        String pageContents = page.asXml();

        // Save the resulting page to a file
        Files.write(Paths.get("page-source.html"), pageContents.getBytes());
    }
}
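
For the batch case in the question (a whole list of URLs rather than a single one), a rough, untested sketch along the same lines could look like the following; the urls.txt location, the output directory, and the file-naming scheme are my own assumptions, not anything HtmlUnit prescribes.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpSources {
    public static void main(String[] args) throws IOException {
        // Assumed layout: urls.txt in the working directory, output under ./urls-source
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"), StandardCharsets.UTF_8);
        Path outDir = Files.createDirectories(Paths.get("urls-source"));

        WebClient webClient = new WebClient();
        int i = 0;
        for (String url : urls) {
            HtmlPage page = webClient.getPage(url);
            // The DOM as it stands after JavaScript has run
            String source = page.asXml();
            Files.write(outDir.resolve("page-" + (i++) + ".html"),
                        source.getBytes(StandardCharsets.UTF_8));
        }
    }
}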

EDIT:

It seems Selenium (another web testing framework) can take page screenshots.

Search for selenium.captureScreenshot.
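
As a rough illustration of that suggestion, here is a minimal, untested sketch against the Selenium RC (Selenium 1) Java client; the host, port, browser string, and screenshot path are placeholder assumptions, and it requires a Selenium RC server to be running.

import com.thoughtworks.selenium.DefaultSelenium;
import com.thoughtworks.selenium.Selenium;

public class ScreenshotSketch {
    public static void main(String[] args) {
        // Assumes a Selenium RC server is already running on localhost:4444
        Selenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com/");
        selenium.start();
        selenium.open("/");

        // getHtmlSource() returns the DOM as HTML after JavaScript has run
        System.out.println(selenium.getHtmlSource());

        // captureScreenshot() writes a PNG on the machine running the RC server
        selenium.captureScreenshot("/tmp/page.png");

        selenium.stop();
    }
}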

mexique1
+1 for the selenium
Konerak
A: 

You can use the IRobotSoft web scraper to automate this. The page source is in the UpdatedPage variable; you only need to save that variable to a file.

It also has a function, CapturePage(), to capture the web page to an image file.

seagulf