Hi,

I'm trying to extract the content of specific tags from an HTML document.

My method works for some tags, but not all, and it does not work for the tag whose content I actually want.

Here is my code:

<html>
<head></head>
<body>
<?php 
     $url = "http://sf.backpage.com/MusicInstruction/";   
     $data = file_get_contents($url);
     $pattern = "/<div class=\"cat\">(.*)<\/div>/";
     preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
     var_dump($adsLinks);
     foreach ($adsLinks as $i) {
         echo "<div class='ads'>".$i[0]."</div>";
     } 

?>
</body>
</html>

The above code doesn't work, but it does work when I change $pattern to:

$pattern = "/<div class=\"date\">(.*)<\/div>/";

or

$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";

I can't see any difference between these patterns. Please help me find the error. Thanks.

+4  A: 

Use PHP DOM to parse HTML instead of regex.

For example in your case (code updated to show HTML):

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');

for ($i = 0; $i < $nodes->length; $i ++)
{
    $x = $nodes->item($i);

    // No semicolon after the condition, otherwise the echo runs for every div
    if ($x->getAttribute('class') == 'cat')
        echo htmlspecialchars($x->nodeValue) . "<hr/>"; // this is the element that you want
}
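
As a variation on the loop above, the same selection can be sketched with DOMXPath, which filters on the class attribute in one query. This is only a minimal sketch: the $html string is hypothetical sample markup standing in for the fetched page, and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<div class="cat"><a href="/a">First ad</a></div>'
      . '<div class="date">Mon. Sep. 6</div>'
      . '<div class="cat"><a href="/b">Second ad</a></div>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the div elements whose class attribute is exactly "cat".
$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//div[@class="cat"]') as $node) {
    $texts[] = trim($node->textContent);
}

print_r($texts);
```

Note that @class="cat" compares the whole attribute value, so a div with class="cat featured" would not be selected by this query.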
shamittomar
Thanks for the advice, will try.
Henry
+2  A: 

The reason your regex fails is that you are expecting . to match newlines, and it won't unless you use the s modifier, so try

$pattern = "/<div class=\"cat\">(.*)<\/div>/s";

When you do this, you might find the pattern a little too greedy, as it will try to capture everything up to the last closing div element. To make it non-greedy, so that it matches only up to the very next closing div, add a ? after the *

$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
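
To see the effect of each change without fetching the page, here is a small sketch that runs the three patterns against a hypothetical two-div snippet ($html is made-up sample markup, not the real page):

```php
<?php
// Hypothetical sample markup: two "cat" divs whose content spans several lines.
$html = "<div class=\"cat\">\n  <a href=\"/a\">First ad</a>\n</div>\n"
      . "<div class=\"cat\">\n  <a href=\"/b\">Second ad</a>\n</div>\n";

// No s modifier: "." cannot cross the newline after the opening tag.
$noDotAll = preg_match_all('/<div class="cat">(.*)<\/div>/', $html, $m1);

// s modifier, greedy: a single match running from the first opening tag
// to the last closing tag.
$greedy = preg_match_all('/<div class="cat">(.*)<\/div>/s', $html, $m2);

// s modifier, non-greedy: one match per div.
$lazy = preg_match_all('/<div class="cat">(.*?)<\/div>/s', $html, $m3);

echo $noDotAll, " ", $greedy, " ", $lazy, "\n"; // prints "0 1 2"
```

The non-greedy version still stops at the first closing div it finds, so it would cut the match short if a "cat" div contained nested divs.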

This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.

Paul Dixon
I'm new to PHP and in the middle of working with regex right now. This answer helped me solve the problem, so I'll pick it, though I'll try DOM for future tasks. Thank you all.
Henry