Hi, I'm having trouble with grep. Which four patterns should I use with PHP's preg_grep to extract every instance of the "____" parts in the strings below?

1. <h2><a ....>_____</a></h2>
2. <cite><a href="_____" .... >...</a></cite>
3. <cite><a .... >________</a></cite>
4. <span>_________</span>

The dots denote some arbitrary characters while the underscores denote what I want.

An example string is:

     </style></head>
<body><div id="adBlock"><h2><a href="https://www.google.com/adsense/support/bin/request.py?contact=afs_violation&amp;amp;hl=en" target="_blank">Ads by Google</a></h2>
<div class="ad"><div><a href="http://www.google.com/aclk?sa=L&amp;amp;ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&amp;amp;num=1&amp;amp;sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&amp;amp;q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="titleLink" target="_parent">Spider-<b>Man</b> Animated Serie</a></div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite><a href="http://www.google.com/aclk?sa=L&amp;amp;ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&amp;amp;num=1&amp;amp;sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&amp;amp;q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="domainLink" target="_parent">www.Crackle.com/Spiderman</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&amp;amp;ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&amp;amp;num=2&amp;amp;sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&amp;amp;q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="titleLink" target="_parent">Kids <b>Batman</b> Costumes</a></div>

<span>Great Selection of <b>Batman</b> &amp; Batgirl
<br>
Costumes For Kids. Ships Same Day!</span>
<cite><a href="http://www.google.com/aclk?sa=l&amp;amp;ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&amp;amp;num=2&amp;amp;sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&amp;amp;q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="domainLink" target="_parent">www.CostumeExpress.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&amp;amp;ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&amp;amp;num=3&amp;amp;sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&amp;amp;q=http://www.OfficialBatmanCostumes.com" class="titleLink" target="_parent"><b>Batman</b> Costume</a></div>
<span>Official <b>Batman</b> Costumes.

<br>
Huge Selection &amp; Same Day Shipping!</span>
<cite><a href="http://www.google.com/aclk?sa=l&amp;amp;ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&amp;amp;num=3&amp;amp;sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&amp;amp;q=http://www.OfficialBatmanCostumes.com" class="domainLink" target="_parent">www.OfficialBatmanCostumes.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&amp;amp;ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&amp;amp;num=4&amp;amp;sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&amp;amp;q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="titleLink" target="_parent">Discount <b>Batman</b> Costumes</a></div>
<span>Discount adult and kids <b>batman</b>
<br>
superhero costumes.</span>

<cite><a href="http://www.google.com/aclk?sa=l&amp;amp;ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&amp;amp;num=4&amp;amp;sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&amp;amp;q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="domainLink" target="_parent">www.discountsuperherocostumes.com</a></cite></div></div></body>
<script type="text/javascript">
      var relay = "";
    </script>
<script type="text/javascript" src="/uds/?file=ads&amp;v=1&amp;packages=searchiframe&amp;nodependencyload=true"></script></html>

Thanks!

+3  A: 

First of all, you should not use regex to extract data from an HTML string.
Instead, you should use a DOM parser!

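Here, you could use DOMDocument::loadHTML to load the HTML string, and DOMXPath to run XPath queries on the resulting document.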


For example, you could load your document and instantiate the DOMXPath class this way:

$html = <<<HTML
....
....
HTML;

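$dom = new DOMDocument();
// The @ operator suppresses the warnings libxml raises on malformed real-world HTML.
@$dom->loadHTML($html);

// DOMXPath runs XPath queries against the loaded document.
$xpath = new DOMXPath($dom);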
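
As a variation on the loading step above, here is a minimal sketch that collects the parser's complaints with libxml_use_internal_errors() instead of silencing them with @. The sample markup and the variable names $previous, $loaded and $errors are assumptions made for illustration; they are not part of the original page.

```php
<?php
// Hypothetical sample input; replace it with the real page source.
$html = '<html><body><h2><a href="#">Ads by Google</a></h2><span>Unclosed span</body></html>';

// Buffer libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser reported (an array of LibXMLError objects),
// then clear the buffer and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
```

The $errors array can be logged or ignored; the point is only that malformed markup no longer has to be hidden behind the @ operator.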

Then, use XPath to find the elements you are looking for.


For example, in the first case, you could use something like this to find all <a> tags that are children of <h2> tags:

// <h2><a ....>_____</a></h2>
$tags = $xpath->query('//h2/a');
foreach ($tags as $tag) {
    var_dump($tag->nodeValue);
}
echo '<hr />';


Then, for the second and third cases, you are searching for <a> tags that are children of <cite> tags. In your example both cases describe the same element, so for each match you can read the href attribute (when there is one) and the link text:

// <cite><a href="_____" .... >...</a></cite>
// <cite><a .... >________</a></cite>
$tags = $xpath->query('//cite/a');
foreach ($tags as $tag) {
    // Case 2: the link target, if the attribute is present.
    if ($tag->hasAttribute('href')) {
        var_dump($tag->getAttribute('href'));
    }
    // Case 3: the visible link text.
    var_dump($tag->nodeValue);
}
echo '<hr />';
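
If only the link targets are needed, XPath can also select the href attributes directly, and a predicate can narrow the match to a given class. The sketch below assumes a shortened, made-up sample; the class name domainLink is taken from the example string in the question, while the example.com URL and the $hrefs and $texts arrays are hypothetical.

```php
<?php
// Hypothetical, shortened sample: one link with an href, one without.
$html = '<div><cite><a href="http://example.com/one" class="domainLink">www.example.com</a></cite>'
      . '<cite><a class="domainLink">no-link.example</a></cite></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the href attribute nodes themselves; links without href yield nothing.
$hrefs = [];
foreach ($xpath->query('//cite/a[@class="domainLink"]/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

// Select the <a> elements to read their visible text.
$texts = [];
foreach ($xpath->query('//cite/a[@class="domainLink"]') as $a) {
    $texts[] = $a->nodeValue;
}
```

Collecting the values into arrays rather than dumping them makes it easier to pair each link target with its text afterwards.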


And, finally, for the last one, you just want <span> tags:

// <span>_________</span>
$tags = $xpath->query('//span');
foreach ($tags as $tag) {
    var_dump($tag->nodeValue);
}
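
Note that the <span> elements in the question contain line breaks, <br> and <b> tags, so nodeValue returns the text of all descendants with the original newlines left in. A minimal sketch of tidying that up, assuming you want each description as a single line (the $descriptions array is a name chosen for illustration):

```php
<?php
// One <span> copied from the example string in the question.
$html = "<div><span>Great Selection of <b>Batman</b> &amp; Batgirl\n<br>\nCostumes For Kids. Ships Same Day!</span></div>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$descriptions = [];
foreach ($xpath->query('//span') as $span) {
    // nodeValue concatenates the text of every descendant node;
    // collapse runs of whitespace (including newlines) into single spaces.
    $descriptions[] = trim(preg_replace('/\s+/', ' ', $span->nodeValue));
}
```

For this sample, $descriptions holds one string: "Great Selection of Batman & Batgirl Costumes For Kids. Ships Same Day!".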


Not that hard -- and much easier to read than regexes, isn't it? ;-)

Pascal MARTIN