For example, I want to scrape http://stackoverflow.com/privileges/user/3 and get the value inside the div <div class="summarycount al">6,525</div>, so that I can store the reputation in a local db along with the user number. I think I can use file_get_contents:

 $data = file_get_contents('http://stackoverflow.com/privileges/user/3');

How do I extract the required data, i.e. 6,525 in the above example?

+2  A: 
  1. You'll need to log in (through PHP) to see the relevant information. This isn't very straightforward and will require some work.

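  2. You can use *shrugs* regex to parse the data, or use an HTML parser like PHP Simple HTML DOM Parser. With regex, using the `$data` string from the question:

    // Capture the text inside the target div; only read the match if one was found.
    if (preg_match('!<div class="summarycount al">(.+?)</div>!', $data, $matches)) {
        $rep = $matches[1];
    }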
    
  3. If you are scraping SO, you can use the SO API instead.

Code:

    $url = 'http://api.stackoverflow.com/1.0/users/3';

    $tuCurl = curl_init();
    curl_setopt($tuCurl, CURLOPT_URL, $url);
    curl_setopt($tuCurl, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($tuCurl, CURLOPT_ENCODING, 'gzip');  // accept and decode gzip-compressed responses

    $data = curl_exec($tuCurl);
    curl_close($tuCurl);

    $parse = json_decode($data, true); // decode the JSON into an associative array
    $rep = $parse['users'][0]['reputation'];

    echo $rep;
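
For the parser route mentioned in point 2, here is a minimal sketch using PHP's built-in DOM extension rather than the Simple HTML DOM Parser library. The helper name `extractReputation` and the sample `$html` string are made up for illustration, and the sketch assumes the target div carries exactly the class attribute `summarycount al`:

```php
<?php
// Sample markup standing in for the downloaded page (assumption: the real
// page contains a div with exactly this class attribute).
$html = '<html><body><div class="summarycount al">6,525</div></body></html>';

/**
 * Hypothetical helper: returns the reputation as an integer,
 * or null if the div is not found.
 */
function extractReputation(string $html): ?int
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@class="summarycount al"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Strip the thousands separator before casting: "6,525" becomes 6525.
    return (int) str_replace(',', '', trim($nodes->item(0)->textContent));
}

$rep = extractReputation($html);
echo $rep === null ? "not found\n" : $rep . "\n";
```

In real use you would pass in the string returned by `file_get_contents` instead of the sample markup. The exact-match XPath breaks if the site reorders or adds classes, so treat the selector as something to adjust per page.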
Rogue
Thanks for the attempt. I am really bad at regex, but I will go through it. The current page does not need a login, so no worries. And this was a generic question with SO as an example. The code works! Thanks.
abel
Time taken: 2.11 seconds. At that rate, getting 10000 users will take about 5.9 hrs. Can I complete the entire thing in one script without timeouts?
abel
@abel Yes, you can change the `max_execution_time` setting. I would strongly recommend using the SO API though, or downloading a [data-dump](http://blog.stackoverflow.com/2010/10/creative-commons-data-dump-oct-10/) and getting info from there.
Rogue
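
A rough sketch of what lifting the time limit could look like for a batch run like this. It uses `set_time_limit(0)`, which changes the limit for the current script only, instead of editing `max_execution_time` in php.ini. The function name `collectReputations` and the stub fetcher are invented for illustration; the stub stands in for the real HTTP request so the example runs offline:

```php
<?php
/**
 * Hypothetical batch runner: calls $fetch for every user id and collects
 * the results. $fetch is any callable returning the reputation for one id.
 */
function collectReputations(array $userIds, callable $fetch): array
{
    // 0 removes the execution time limit for this script run
    // (same effect as max_execution_time = 0 in php.ini).
    set_time_limit(0);

    $results = [];
    foreach ($userIds as $id) {
        $results[$id] = $fetch($id);
    }
    return $results;
}

// Stub standing in for the real HTTP request (assumption: the real
// fetcher would return an integer reputation for the given user id).
$fakeFetch = function (int $id): int {
    return $id * 100;
};

$reps = collectReputations([1, 2, 3], $fakeFetch);
print_r($reps);
```

Some shared hosts disable `set_time_limit`, and a web server or proxy may enforce its own timeout regardless, so running a job of this length from the command line is usually the safer choice.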
@Rogue This isn't about SO per se. I have played with the execution time setting; can I get burstable output? More here: http://stackoverflow.com/questions/3884008/burstable-output-to-long-running-scripts
abel