I'd like to retrieve a page's content and reformat it to my liking...

For example:

  • Go to example.com
  • Get content within tags with class "x"
  • Pass content to specific variables
  • Spit out the content in some pretty form: array, CSV, XML...

Not too hard, right? I'm a PHP noob! :)

A: 

XSD might do the trick for you. I'd also consider wget + CSS...

sangretu
+2  A: 

Try using PHP Simple HTML DOM Parser.

You can do nice stuff like this:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links with class=x
foreach($html->find('a[class=x]') as $element)
       echo $element->href . '<br>';
ryeguy
+1  A: 

For getting the data, there are three levels of difficulty:

file_get_contents($url); //easy

Unfortunately, a lot of sites won't respond properly unless the request carries a proper user agent. You've got two options here, one a little harder than the other. The intermediate one is Zend HTTP Client:

// Make sure Zend_Http_Client is loaded (e.g. require_once 'Zend/Http/Client.php').
$client = new Zend_Http_Client();
$client->setConfig($params); // $params should include a proper user agent, e.g. array('useragent' => '...')
$client->setUri($aUrl);
$html = $client->request()->getBody();

Option three, which you might not even want to consider unless you really want to keep things more procedural than object-oriented, is to explore PHP's cURL functionality.
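If you do go the cURL route, it might look roughly like the sketch below. The function name fetchHtml and the default user agent string are made up for illustration; the curl_* calls and CURLOPT_* constants are the standard ones from PHP's cURL extension, which has to be enabled.

```php
<?php
// Hypothetical helper (not from any library): fetch a page with cURL
// while sending an explicit user agent.
function fetchHtml(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)'): string
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // the header many sites check
        CURLOPT_TIMEOUT        => 10,         // give up after 10 seconds
    ]);

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    return $body;
}
```

You'd then call it with something like $html = fetchHtml('http://example.com/'); and hand $html to whichever parser you prefer.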

There are a few PHP-native ways to access HTML data via a DOM object, but my favorite is the Simple HTML DOM Parser. It's very similar to jQuery/CSS style DOM navigation.

$domObject = new Simple_HTML_Dom($html);
// Variable names are case-sensitive in PHP, so this must be $domObject, not $domobject.
foreach ($domObject->find('div#theDataYouWant p') as $sentence)
{
    echo "<h3>{$sentence}</h3>";
}
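
If you'd rather stick with one of the PHP-native ways, here's a rough sketch using DOMDocument and DOMXPath that grabs everything with class "x" and spits it out as CSV. The helper names extractByClass and toCsv are invented for this example, and it assumes the class name is a plain token with no quotes in it.

```php
<?php
// Hypothetical helper: collect the text of every element carrying a given class.
function extractByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so "x" does not also match "xy".
    $query = sprintf('//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);

    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = trim($node->textContent);
    }

    return $values;
}

// Hypothetical helper: render the values as CSV, one value per row.
function toCsv(array $values): string
{
    $stream = fopen('php://memory', 'r+');
    foreach ($values as $value) {
        fputcsv($stream, [$value], ',', '"', '\\');
    }
    rewind($stream);
    $csv = stream_get_contents($stream);
    fclose($stream);

    return $csv;
}
```

Usage would be along the lines of echo toCsv(extractByClass(file_get_contents($url), 'x')); with $values available as a plain array in between if you want to pass the content to your own variables first.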
Robert Elwell