Hi,

I'm trying to parse a bunch of webpages one after the next with PHP, but I noticed that when I fopen the first page, the links to the following pages are hidden in JavaScript.

Is there any way I can continue on to parse the next webpages? If the URL contained a variable like "page=2" I would step through the pages that way, but the URLs are encrypted.

-LPG

A: 

The only way would be to write a regular expression which parses out the JavaScript links and follows them. This would probably only work if the URL of the page appears literally in the JavaScript code, e.g.:

<a href="javascript:open('something/some_page.html');">Something</a>

instead of just

<a href="javascript:open(someField.value);">Something</a>

With the second example, you would actually have to evaluate the JavaScript link from PHP, which can be very challenging.

Keep in mind also that you would have to create website-specific regular expressions, because each site formats its URLs differently. So Cnn.com might format its URLs differently than Reddit.com.
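A minimal sketch of that idea, assuming the links look exactly like the first example above. The function name extractJsLinks and the pattern are illustrative only and would need adjusting for each site:

```php
<?php
// Hypothetical helper: collect URLs that appear as quoted string literals
// inside href="javascript:open('...')" links. Links whose argument is an
// expression (e.g. someField.value) are deliberately not matched.
function extractJsLinks(string $html): array
{
    $pattern = '/href\s*=\s*"javascript:open\(\'([^\']+)\'\)/i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // contents of the first capture group
}

$html = '<a href="javascript:open(\'something/some_page.html\');">Something</a>'
      . '<a href="javascript:open(someField.value);">Something</a>';

print_r(extractJsLinks($html));
```

Only the first link is returned; the second has no literal URL to capture, which is the limitation described above.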

Click Upvote
I'm unsure how I would do that. When I load the 1st page, the links to the other pages are:

<a href="javascript:NextPage(1)"> <a href="javascript:NextPage(2)"> ...

and the JS for the NextPage function is:

function NextPage(page){ document.PageForm.page.value = page; document.PageForm.submit(); }
If clicking "next page" redirects you to a new page, you can note the general URL of that page (it should be something like something.php?page=$id). However, if it loads the results via AJAX, there wouldn't be any way.
Click Upvote
+1  A: 

Basically you've got two choices:

  1. emulate their logic
  2. emulate a valid client

If you want to go with #1 you'll have to read their JavaScript code and figure out how it works. I can't really explain it any better than that since it depends so much on their code; you just have to know JavaScript and "grok" their code. Then, make your code perform the same logic to generate the "next page" URL.
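For the NextPage function quoted in the comment above, the logic is just "set the page field of PageForm and submit it", so it can probably be reproduced by submitting that form yourself. This is a rough sketch, not a definitive implementation: the form's action URL, its method (POST is assumed here) and any other hidden fields are not shown in the thread and would have to be read from the page's HTML. buildPageRequest is a made-up name.

```php
<?php
// Build stream-context options that mimic submitting PageForm with the
// "page" field set, as NextPage(page) does in the browser.
// ASSUMPTION: the form uses POST; $otherFields holds any hidden inputs.
function buildPageRequest(int $page, array $otherFields = []): array
{
    $body = http_build_query(array_merge($otherFields, ['page' => $page]));
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => $body,
        ],
    ];
}

$formAction = 'http://example.com/results.php'; // placeholder, not the real URL
$options = buildPageRequest(2);
echo $options['http']['content'], "\n";

// To actually fetch the page:
// $html = file_get_contents($formAction, false, stream_context_create($options));
```

If the form turns out to use GET, the same query string can simply be appended to the action URL instead.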

If their system uses AJAX you can still emulate it (contrary to what click-upvote said). To do so, use a tool like the Firebug Firefox extension to watch what your browser is sending to their server "behind the scenes". Then, make your code send a fake HTTP request that mimics their AJAX request. You could actually do this even without a tool like Firebug: just infer what your browser will send by looking at the JavaScript code. However, something like Firebug will make things a lot easier (instead of inferring, you can just see what is being sent).
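As an illustration, once you have seen the request in Firebug you can rebuild it in PHP. Everything specific here is an assumption: the endpoint URL, the page parameter and the headers are placeholders for whatever the real request contains (many JavaScript libraries send an X-Requested-With header, but check what the site actually sends). buildAjaxRequest is a hypothetical helper.

```php
<?php
// Rebuild a GET request so it resembles the one the browser's script sends.
// Returns the full URL and the stream-context options to fetch it with.
function buildAjaxRequest(string $endpoint, array $params): array
{
    $url = $endpoint . '?' . http_build_query($params);
    $options = [
        'http' => [
            'method' => 'GET',
            // Copy the headers you observed; these two are only examples.
            'header' => "X-Requested-With: XMLHttpRequest\r\n"
                      . "Accept: application/json\r\n",
        ],
    ];
    return [$url, $options];
}

[$url, $options] = buildAjaxRequest('http://example.com/ajax/list', ['page' => 3]);
echo $url, "\n";

// To actually send it:
// $response = file_get_contents($url, false, stream_context_create($options));
```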

If you want to go with #2 instead, you will need to either use an actual browser (and control it programmatically using something like Selenium) or use something like Rhino to run the JavaScript. Using an actual browser with a control system like Selenium is probably the easiest way to go; however, it will be slow, as it is limited by the time it takes your browser to render the pages and such. A solution using Rhino or something similar will be faster, but it will also involve a lot more work (you'll have to parse the HTML, include all the relevant JS files, etc.), so I'd recommend that only as a last resort.

machineghost