ansaurus

Question

How do I fetch another website's info from a URL, like Digg's submit button does?

Answer 1

A:

You grab the source of the page in question (with the cURL library, or with file_get_contents() if the fopen() URL wrappers are enabled via allow_url_fopen) and parse it for those details.

Title can be the title element.

Description can be the meta description.

Image can be the largest image (a lot of different ways to look for it).

You can also look for Open Graph protocol tags, which use the property attribute...

<meta property="og:site_name" content="Stack Overflow" />
<meta property="og:url" content="http://www.stackoverflow.com/" />
<meta property="og:title" content="Hello" />
<meta property="og:image" content="http://www.gravatar.com/avatar/5a9f58455ea36c880bc46820255fb084?s=32&amp;d=identicon&amp;r=PG" />
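To make the lookups above concrete, here is a minimal sketch that reads the title element, the meta description, and any Open Graph tags with PHP's DOMDocument. The function name extract_page_info() and the shape of the returned array are made up for illustration, and the sample markup is invented. It accepts og: keys in either the property or the name attribute, on the assumption that not every page follows the protocol exactly.

```php
<?php
// Hypothetical helper: pulls preview details out of an HTML string.
function extract_page_info($html)
{
    $info = array('title' => null, 'description' => null, 'og' => array());
    if (trim($html) === '') {
        return $info; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Prefer "property"; fall back to "name" (assumption: some pages use it).
        $key = $meta->getAttribute('property');
        if ($key === '') {
            $key = $meta->getAttribute('name');
        }
        $key = strtolower($key);
        $content = $meta->getAttribute('content');

        if ($key === 'description') {
            $info['description'] = $content;
        } elseif (strpos($key, 'og:') === 0) {
            $info['og'][substr($key, 3)] = $content; // "og:title" becomes "title"
        }
    }

    return $info;
}

// Invented sample markup, standing in for a fetched page.
$sample = '<html><head><title> Hello </title>'
    . '<meta name="description" content="A sample page">'
    . '<meta property="og:title" content="Hello OG">'
    . '<meta name="og:image" content="http://example.com/a.png">'
    . '</head><body></body></html>';

$info = extract_page_info($sample);
echo "Title: {$info['title']}\n";
```

A reasonable policy would be to prefer the og: values when they exist and fall back to the title element and meta description otherwise, but that choice is up to you.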

alex 2010-10-15 04:21:39

Answer 2

A:

I'm not too familiar with CakePHP, but I can give you a general idea of what you'll need to do.

First step would be to use AJAX to submit the URL to your server.

Then the server will need to grab the HTML source. In PHP you can do:

$source = file_get_contents('http://www.example.com/');

There are other options, such as the cURL functions, but that one should work as long as allow_url_fopen is enabled.
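Since the URL comes from the user, it is probably worth wrapping that call. The sketch below is one way to do it, not the only one: fetch_source() is a made-up name, the five second timeout and the user agent string are arbitrary, and it refuses anything that is not http or https so a submitted local path does not get read off the server.

```php
<?php
// Hypothetical helper: returns the page body as a string, or null on failure.
function fetch_source($url, $timeout = 5)
{
    // Only allow http(s); a bare path or file:// URL would read local files.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), array('http', 'https'), true)) {
        return null;
    }

    $context = stream_context_create(array(
        'http' => array(
            'method'          => 'GET',
            'timeout'         => $timeout,             // seconds
            'follow_location' => 1,                    // follow redirects
            'user_agent'      => 'PreviewFetcher/0.1', // invented name
        ),
    ));

    // @ silences the warning raised on failure; the return value is checked instead.
    $source = @file_get_contents($url, false, $context);

    return $source === false ? null : $source;
}
```

You would call it as $source = fetch_source('http://www.example.com/'); and treat a null result as "could not fetch".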

Once you have the source, you'll have to parse out the data you want. You can use a regular expression or an HTML parser such as DOMDocument for this part.

Then you'll probably want to put the data you need into a PHP array, use

json_encode($my_array)

and return JSON. Then do what you wish with it on the client.
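As a rough sketch of that last step, assuming you already have the title, description, and image list in hand: the helper name build_preview_json() and the array keys are invented here, so pick whatever your client-side code expects.

```php
<?php
// Hypothetical helper: packs the extracted details into a JSON string.
function build_preview_json($url, $title, $description, array $images)
{
    $my_array = array(
        'url'         => $url,
        'title'       => $title,
        'description' => $description,
        'images'      => array_values($images), // reindex so it encodes as a JSON list
    );

    return json_encode($my_array);
}

// Made-up values standing in for whatever the parsing step produced.
$json = build_preview_json(
    'http://www.example.com/',
    'Example Domain',
    null,
    array('http://www.example.com/logo.png')
);

// In a real endpoint, send this header before any output:
// header('Content-Type: application/json');
echo $json, "\n";
```

The AJAX callback can then read the title, description, and images fields straight off the parsed response.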

Hope this helps

Andy Groff 2010-10-15 04:29:46

`$my array` isn't a valid PHP variable.

alex 2010-10-15 04:32:36

haha, of course. Thanks.

Andy Groff 2010-10-15 04:34:56

Answer 3

A:

You'll need to do a few simple things:

1. Use PHP's curl functions to get the source of the web page. The php.net site provides a great example of this.
2. From that source, find the title of the page and any images. The easiest way would probably be through a simple regular expression.

Here's a simple script example which does both:

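<?php
// Fetch the page source with cURL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://stackoverflow.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$output = curl_exec($ch);
curl_close($ch);

if ($output === false) {
    die("Could not fetch the page\n");
}

// Capture the contents of the <title> element (non-greedy, may span lines).
// The result array is passed normally; preg_match_all() takes it by reference itself.
$titles = array();
preg_match_all("/<title[^>]*>(.*?)<\/title>/is", $output, $titles, PREG_PATTERN_ORDER);

// Capture the src attribute of each <img> tag, wherever it appears in the tag.
$images = array();
preg_match_all("/<img[^>]*\ssrc\s*=\s*['\"]([^'\"]*)['\"]/i", $output, $images, PREG_PATTERN_ORDER);

$page_title = isset($titles[1][0]) ? trim($titles[1][0]) : '';
$images_found = $images[1];

echo "Page title was: {$page_title}\n";
foreach ($images_found as $image_src) {
    echo "Image: {$image_src}\n";
}
?>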

The regular expressions I included are imperfect and won't catch every title or every image, but they're both good starting points.

You'll also need to pick which image you want to use from the array $images_found. You can do this randomly, or based on the largest image on the page, or the first one you find, etc.
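For the "largest image" option, one cheap approximation is to compare the width and height attributes declared on each img tag, without downloading anything. This is only a sketch under that assumption: pick_largest_image() is a made-up name, it uses DOMDocument rather than the regular expression above, and images that declare no size count as zero area, so it falls back to the first image when none declare one.

```php
<?php
// Hypothetical helper: returns the src of the <img> with the largest declared
// width * height, or the first <img> src if no sizes are declared.
function pick_largest_image($html)
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $best_src = null;
    $best_area = -1;

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src === '') {
            continue; // nothing usable to return
        }
        // Missing or non-numeric attributes cast to 0.
        $area = (int) $img->getAttribute('width') * (int) $img->getAttribute('height');
        if ($area > $best_area) { // strict comparison, so earlier images win ties
            $best_area = $area;
            $best_src = $src;
        }
    }

    return $best_src;
}

// Invented sample markup.
$html = '<html><body>'
    . '<img src="a.png" width="10" height="10">'
    . '<img src="b.png" width="200" height="100">'
    . '<img src="c.png">'
    . '</body></html>';

echo "Chosen image: ", pick_largest_image($html), "\n";
```

Declared sizes can be missing or wrong. If accuracy matters more than speed, measuring each candidate with getimagesize() is an alternative, at the cost of one extra request per image.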

Jack Shedd 2010-10-15 05:02:04
