Hi,

I am scraping the wall posts from a Facebook page. Here is the URL:

http://www.facebook.com/GMHTheBook?v=wall&ref=ts#!/GMHTheBook?v=wall&ref=ts

I successfully scraped all the visible wall posts using cURL.

Problem:

At the end of the visible wall posts there is an "Older Posts" link that shows more wall posts when you click it. How do I simulate clicking that link so that I can load and scrape those posts as well?

Is there a solution using any method? I am using cURL, but I am open to any approach that handles this situation.

Update:

I am now using the code below to get all the data, find the next link, fetch the data for that URL, and so on:

ini_set('display_errors', true);
error_reporting(E_ALL);

$data = json_decode(file_get_contents(($url)), true);

$names = array();
$stories = array();

foreach($data['data'] as $post)
{
    $names[] = $post['from']['name'];
    $stories[] = $post['message'];
}

$url = $data['paging']['next'];

// this is meant to scrape data recursively from the next links
while($url !== '')
{
    $url = $data['paging']['next'];
    $data = json_decode(file_get_contents(($url)), true);

    foreach($data['data'] as $post)
    {
        $names[] = $post['from']['name'];
        $stories[] = $post['message'];
    }

    $url = urldecode($data['paging']['next']);
    echo $url . '<br />';
}


for($j = 0; $j < count($names); $j++)
{
  $data .= $names[$j] . '|' . $stories[$j] . "\n";
}

$h = fopen("data.txt", "a+");
fwrite($h, $data);
fclose($h);

The problem is that the script keeps running with no output at all, and no file is created. I have raised the script's time limit, and allow_url_fopen is set to on. Is there anything wrong with the script, or am I not doing the recursion the right way? Is there any solution or alternative?

+2  A: 

The button/link probably fires an XMLHttpRequest, so open Firebug, the developer console, or whatever you use in your browser to see which URL it requests and with which HTTP headers. Then make the same request with cURL and you've got it.
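As a rough sketch of replaying such a request, the snippet below builds a cURL option set from a URL and a list of headers copied from the browser. The helper names buildCurlOptions() and fetchLikeBrowser() are hypothetical, and the header shown in the usage note is only a placeholder for whatever the network panel actually shows.

```php
<?php
// Sketch: replay a request seen in the browser's network panel using cURL.
// buildCurlOptions() and fetchLikeBrowser() are hypothetical helper names.

function buildCurlOptions(string $url, array $headers): array
{
    // Turn ['Name' => 'value'] into the "Name: value" lines cURL expects.
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }

    return [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPHEADER     => $lines,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects, as a browser would
        CURLOPT_TIMEOUT        => 30,   // give up after 30 seconds
    ];
}

function fetchLikeBrowser(string $url, array $headers)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $headers));

    // Returns the response body as a string, or false on failure.
    return curl_exec($ch);
}
```

It could be called as fetchLikeBrowser($ajaxUrl, ['X-Requested-With' => 'XMLHttpRequest']), where $ajaxUrl is the address copied from the console. Whether the server also expects cookies or other headers has to be checked against the real request.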

CharlesLeaf
+4  A: 

You should use the Graph API. The data you are scraping is available in JSON format at

and contains links for getting previous/next pages, e.g. paging.

Example:

$data = json_decode(file_get_contents($url));
foreach($data->data as $post) {
    echo $post->from->name, ': ',
         $post->message,
         PHP_EOL;
}

The above will output all the posts on the wall. For paging do

echo $data->paging->previous;
echo $data->paging->next;

This will output two URLs. All you have to do is load them again.
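"Loading them again" could look roughly like the loop below. This is a minimal sketch, not an official client: collectPosts(), $fetchPage and $fetchJson are hypothetical names, and it assumes each page decodes to an array with a data list and an optional paging.next URL, as in the example above. The loop ends when next is missing, when a page is empty or fails to load, or when a page cap is reached, so it cannot run forever.

```php
<?php
// Sketch: follow paging.next until it is missing, with a page cap as a safety net.
// collectPosts() and $fetchPage are hypothetical names, not part of any Facebook SDK.

function collectPosts(string $startUrl, callable $fetchPage, int $maxPages = 50): array
{
    $posts = [];
    $url = $startUrl;

    for ($page = 0; $url !== null && $page < $maxPages; $page++) {
        $data = $fetchPage($url);

        // Stop on a failed request, invalid JSON, or an empty page.
        if (!is_array($data) || empty($data['data'])) {
            break;
        }

        foreach ($data['data'] as $post) {
            $posts[] = [
                'name'    => $post['from']['name'] ?? '',
                'message' => $post['message'] ?? '', // not every post has a message
            ];
        }

        // A missing "next" key ends the loop instead of refetching the same page.
        $url = $data['paging']['next'] ?? null;
    }

    return $posts;
}

// Fetcher that loads a URL and decodes the JSON body into an array (null on failure).
$fetchJson = function (string $url) {
    $body = @file_get_contents($url);
    return $body === false ? null : json_decode($body, true);
};
```

With $url set to the first page, $posts = collectPosts($url, $fetchJson); gathers the name and message of every post across pages. Passing the fetcher in as a callable keeps the loop testable without network access.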

Gordon
@Gordon: Great, did not know about modifying url that way for the graph api. Thanks
Sarfraz
@Gordon: Please see my update :)
Sarfraz
@Sarfraz That should probably be a follow-up question rather than an update.
Gordon
A: 
http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=139878432710216&viewer_id=(your facebook id)&filter=1&max_time=1283023194&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=1

It is loaded via Ajax. You also need to figure out these variables; max_time is probably the point in time from which posts are shown.

OK, the link above can be shortened (same output):

http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=139878432710216&max_time=1283023194
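
If the URL has to be rebuilt for different pages or timestamps, a small helper can assemble it. The sketch below uses the endpoint and parameter names quoted above; buildStreamUrl() is a hypothetical name, and the idea that changing max_time moves through older posts is only the guess made above, not something confirmed.

```php
<?php
// Sketch: build the shortened stream URL from its two variable parts.
// buildStreamUrl() is a hypothetical helper; the endpoint and parameter
// names are the ones quoted in this answer and may not stay valid.

function buildStreamUrl(string $profileId, int $maxTime): string
{
    // http_build_query() URL-encodes the values and joins them with "&".
    $query = http_build_query([
        '__a'        => 1,
        'profile_id' => $profileId,
        'max_time'   => $maxTime,
    ], '', '&');

    return 'http://www.facebook.com/ajax/stream/profile.php?' . $query;
}

echo buildStreamUrl('139878432710216', 1283023194), PHP_EOL;
```

The profile ID is passed as a string because it can exceed the integer range on 32-bit PHP builds.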
Webarto