views: 4923

answers: 6
How do you screen scrape ajax pages?

+2  A: 

Depends on the ajax page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally I've used Web Scraper Plus for a lot of screen-scraping tasks because it is cheap, it isn't difficult to get started with, and non-programmers can get it working relatively quickly.

Side note: checking the site's Terms of Use is probably a good idea before doing this. Depending on the site, iterating through everything may raise some flags.
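The iterate-a-variable idea above can be sketched in Python (the URL and the `page` parameter are hypothetical, not from any particular site):

```python
# Sketch: if the AJAX page fetches its data with an incrementing parameter,
# you can often request every chunk directly. URL and parameter are made up.
from urllib.parse import urlencode

def page_urls(base_url, page_count):
    """Build the URLs the page's own requests would hit, one per page."""
    return [base_url + "?" + urlencode({"page": n})
            for n in range(1, page_count + 1)]

for url in page_urls("http://example.com/data", 3):
    print(url)  # fetch each with urllib.request.urlopen(url) in practice
```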

wonderchook
+4  A: 

If you can get at it, try examining the DOM tree. Selenium does this as a part of testing a page. It also has functions to click buttons and follow links, which may be useful.

sblundy
+14  A: 

Overview:

All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX you usually just need to analyze a bit more than the HTML alone.

With AJAX, this just means that the value you want is not in the initial HTML document you requested; instead, JavaScript is executed which asks the server for the extra information you want.

You can therefore usually just analyze the JavaScript, see which request it makes, and call that URL yourself from the start.


Example:

Take this as an example: assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction()
{
  var xmlHttp;
  try
  {
    // Firefox, Opera 8.0+, Safari
    xmlHttp = new XMLHttpRequest();
  }
  catch (e)
  {
    // Internet Explorer
    try
    {
      xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
    }
    catch (e)
    {
      try
      {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
      }
      catch (e)
      {
        alert("Your browser does not support AJAX!");
        return false;
      }
    }
  }
  xmlHttp.onreadystatechange = function()
  {
    if (xmlHttp.readyState == 4)
    {
      document.myForm.time.value = xmlHttp.responseText;
    }
  };
  xmlHttp.open("GET", "time.asp", true);
  xmlHttp.send(null);
}
</script>

Then all you need to do instead is make an HTTP request to time.asp on the same server. (Example from w3schools.)
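In Python, for instance, the direct request can be sketched like this (the page URL is hypothetical; only "time.asp" comes from the script above):

```python
# Sketch: resolve the endpoint the page's XMLHttpRequest targets and call it
# directly. The page URL is made up; "time.asp" is from the script above.
from urllib.parse import urljoin

def ajax_endpoint(page_url, script_target):
    """Relative XHR targets resolve against the page's URL, as in a browser."""
    return urljoin(page_url, script_target)

url = ajax_endpoint("http://example.com/clock.html", "time.asp")
print(url)  # → http://example.com/time.asp
# In practice: body = urllib.request.urlopen(url).read()
```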


Advanced scraping with C++:

For complex usage, and if you're using C++, you could also consider using SpiderMonkey, Mozilla's JavaScript engine, to execute the JavaScript on a page.

Advanced scraping with Java:

For complex usage, and if you're using Java, you could also consider using Rhino, Mozilla's JavaScript engine for Java.

Advanced scraping with .NET:

For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, which was recently replaced with ICodeCompiler/CodeDOM.

Brian R. Bondy
A: 

If you haven't already, you should really check out Jaxer. The ability to construct a DOM on the server is just amazingly powerful.

Rakesh Pai
A: 

I am currently trying to retrieve only visible text using BeautifulSoup, but I keep getting JavaScript and HTML cruft that I didn't really want.

As an alternative, I am playing with the idea of spawning a browser programmatically, sending Ctrl-C key presses to the process, and retrieving the text from the clipboard.

I really like the idea of working with rendered data as opposed to the source data. However, this might not help you if you need to interact with the DOM tree. Personally, I am not happy with the clipboard killing all of my concurrency. My next step is looking at the Chrome source code...
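For the script-cruft problem specifically, here is a small standard-library sketch that keeps only text outside script tags; the HTML fragment below is invented for illustration:

```python
# Sketch: extract visible text from an HTML fragment, skipping the contents
# of <script> tags. The fragment below is made up for illustration.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep text only when we are not inside a script block
        if not self.in_script and data.strip():
            self.parts.append(data.strip())

fragment = "<div><script>var t = 1;</script><span>12:34:56</span></div>"
extractor = TextExtractor()
extractor.feed(fragment)
print(" ".join(extractor.parts))  # → 12:34:56
```

Note this only filters the static source; it does not execute the JavaScript, so text that the page generates at runtime still won't appear.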

A: 

I have to scrape data from AJAX often. Most of the time the dynamic elements are just making separate HTTP requests, and one must simply watch for each response. Most of the time the responses are plain HTML or XML, so little parsing is needed. I use screen-scraper for this.

Jason Bellows