I have the contents of a web page assigned to a variable $html.

Here's an example of the contents of $html:

<div class="content">something here</div>
<span>something random thrown in <strong>here</strong></span>
<div class="content">more stuff</div>

Using PHP, how can I build an array holding the contents of each <div class="content"></div> region, so that (for the example above):

echo $array[0] . "\n" . $array[1]; //etc

outputs

something here
more stuff
A: 

You probably need to use preg_match_all():

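$matches = array();
// No U modifier here: combined with ".*?" it would flip the lazy quantifiers
// back to greedy and swallow everything up to the last </div>.
preg_match_all('`<div(.*?)class="content"(.*?)>(.*?)</div>`is', $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
  // $m[3] represents the content in <div class="content">
}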
thephpdeveloper
-1 Using regexes to process HTML server-side is an awful suggestion.
cletus
What happens if the xml contains two spaces between `div` and `class`, or an extra `id` field? I find this solution rather brittle.
xtofl
It's a good enough solution, depending on the task. Converting HTML to XML also has its pitfalls.
serg
Who said anything about converting HTML to XML? Dealing with an HTML DOM has **way fewer** "pitfalls" than regexes, which for this task are nothing more than a dirty hack.
cletus
A: 

There's not much you can do short of using string manipulation functions or regular expressions. You can load your HTML as XML using the DOM library and use that to traverse to your div, but that can become cumbersome if you're not careful or if the structure is complex.

http://ca3.php.net/manual/en/book.dom.php

Laurent Bourgault-Roy
'could', 'cumbersome', ... think positive, man! There's a solution to every problem!
xtofl
A: 

It looks like Kalem13 beat me to it, but I agree. You could use the DOMDocument class. I haven't used it personally, but I think it would work for you. First you instantiate a DOMDocument object, then you load your $html variable using the loadHTML() function. Then you can use the getElementsByTagName() function.

Abinadi
+2  A: 

Assuming this is just a simplified case in the OP and the real situation is more complicated, you'll want to use XPath.

If it's really complex, then you may want to use DOMDocument (with DOMXPath), but here's a simple example using SimpleXML:

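// SimpleXMLElement only accepts well-formed XML with a single root element,
// so the fragment is wrapped in a <root> element here (an assumption about $html).
$xml = new SimpleXMLElement('<root>' . $html . '</root>');

$result = $xml->xpath('//div[@class="content"]');

// each() was removed in PHP 8; foreach iterates the returned array directly.
foreach ($result as $node) {
    echo $node, "\n";
}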

Since you explicitly asked about creating an array for this, you could use:

$res_Arr = array();
foreach ($result as $node) {
    // Cast to string to store the text rather than the SimpleXMLElement object.
    $res_Arr[] = (string) $node;
}

and $res_Arr would be an array with the contents you're looking for.

See http://php.net/manual/en/simplexmlelement.xpath.php for PHP SimpleXML XPath info and http://www.w3.org/TR/xpath for the XPath specification.
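
For the DOMDocument-with-DOMXPath route mentioned above, a minimal sketch might look like the following. The helper name contentDivsByXPath() is hypothetical, and the XPath expression assumes you want divs whose class list contains "content" as a whole word, which is slightly broader than the exact-match query above.

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
function contentDivsByXPath(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely perfect; collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Match "content" as one whole token inside the class attribute.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " content ")]';

    $texts = array();
    foreach ($xpath->query($query) as $div) {
        $texts[] = $div->textContent;
    }
    return $texts;
}

$html = '<div class="content">something here</div>'
      . '<span>something random thrown in <strong>here</strong></span>'
      . '<div class="content">more stuff</div>';

$array = contentDivsByXPath($html);
echo $array[0] . "\n" . $array[1] . "\n";
```

Unlike SimpleXMLElement, loadHTML() does not need well-formed XML, so the fragment can be passed in as-is.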

Jonathan Fingland
heck, you can even use an `XSLTransform` to get the output directly! But that, of course, lifts you out of PHP completely...
xtofl
A: 

PHP has several means of processing HTML, including DomDocument and SimpleXML. See Parse HTML With PHP And DOM. Here is an example:

$dom = new DOMDocument;
// preserveWhiteSpace only takes effect if it is set before the document is loaded.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
  $class = $div->getAttribute('class');
  if ($class == 'content') {
    echo $div->nodeValue . "\n";
  }
}

Technically, the class attribute can hold several space-separated classes, so you might want to use:

$classes = explode(' ', $class);
if (in_array('content', $classes)) {
  ...
}

The SimpleXML/XPath approach is more concise, but if you don't want to go the XPath route (and learn another technology, at least well enough to do these sorts of tasks), then the above is a programmatic alternative.
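
Putting the loop and the multi-class check together so that the result is an array, as the question asks, could be sketched like this. The function name contentDivsByLoop() and its $wanted parameter are hypothetical, and splitting with preg_split() rather than explode() is a swapped-in choice so that tabs or repeated spaces between class names are tolerated.

```php
<?php
// Hypothetical helper: the name and parameters are illustrative only.
function contentDivsByLoop(string $html, string $wanted = 'content'): array
{
    $dom = new DOMDocument();

    // Keep libxml's parse warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = array();
    foreach ($dom->getElementsByTagName('div') as $div) {
        // Split the class attribute on any run of whitespace, dropping empty tokens.
        $classes = preg_split('/\s+/', $div->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array($wanted, $classes, true)) {
            $texts[] = $div->nodeValue;
        }
    }
    return $texts;
}

$html = '<div class="box content">something here</div>'
      . '<span>something random thrown in <strong>here</strong></span>'
      . '<div class="content">more stuff</div>'
      . '<div class="contents">not matched</div>';

$found = contentDivsByLoop($html);
print_r($found);
```

Note that nodeValue returns the text of all descendants, so a matching div nested inside another matching div would have its text included twice.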

cletus