Hello,

Suppose we have the HTML below. We need to get all <a href=""></a> tags which DO NOT contain an img tag inside them.

<a href="http://domain1.com"><span>Here is link</span></a>
<a href="http://domain2.com" title="">Hello</a>
<a href="http://domain3.com" title=""><img src="" /></a>
<a href="http://domain4" title=""> I'm the image <img src="" /> yeah</a>

I'm using this regular expression to find all links:

preg_match_all("!<a[^>]+href=\"?'?([^ \"'>]+)\"?'?[^>]*>(.*?)</a>!is", $content, $out);

I can modify it so that the link text may not contain any tags at all:

preg_match_all("!<a[^>]+href=\"?'?([^ \"'>]+)\"?'?[^>]*>([^<>]+?)</a>!is", $content, $out);

But how can I tell it to exclude only the results that contain the substring <img inside <a href=""></a>?

Thank you

+4  A: 

You need to use an HTML parser like the Simple DOM parser. You cannot reliably parse HTML with regular expressions.
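
To make the parser suggestion concrete, here is a minimal sketch. It assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of the Simple DOM parser named above, and the function name linksWithoutImages is made up for illustration. The XPath expression keeps anchors that have an href and no img element at any depth.

```php
<?php
// Hypothetical helper (not from the thread): returns href and text for every
// <a href> element that has no <img> descendant.
function linksWithoutImages(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // a[@href]      -> anchors that carry an href attribute
    // not(.//img)   -> reject anchors with an <img> anywhere inside
    foreach ($xpath->query('//a[@href][not(.//img)]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}
```

Called on the four sample links from the question, this should return only the domain1 and domain2 entries, since the other two anchors contain an img element.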

DisgruntledGoat
Sometimes I think this should be in the SO FAQ...
richsage
... or, a valid reason to vote to close the question.
ChrisW
+1  A: 

DOM is the way to go, but for the sake of interest, here is a regex solution:

The easiest way to exclude certain matches in a regular expression is to use a negative look-ahead or a negative look-behind. If the negative expression matches at the position where it is tested, the overall match fails at that position.

Example:

^(?!.+<img.+)<a href=\"?\'?.+\"?\'?>.+</a>$

Matches:

<a href="http://domain1.com"><span>Here is link</span></a>
<a href="http://domain2.com" title="">Hello</a>

But does not match:

<a href="http://domain3.com" title=""><img src="" /></a>
<a href="http://domain4" title=""> I'm the image <img src="" /> yeah</a>

The negative look-ahead is this part of the pattern:

(?!.+<img.+)

This says: don't match if, from this position onward, there are one or more chars followed by <img, followed by one or more chars.

<a href=\"?\'?.+\"?\'?>.+</a>

The rest is my general match for anchor tags in HTML; you might want to use an alternate match expression.

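You may need to omit the start and end ^ $ chars depending on your usage. Be aware that with the s modifier (as used in the question), the .+ in the look-ahead also crosses line breaks, so an <img in any later link would reject the match.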
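
For content that holds several links in one string, one possible adaptation is to test the look-ahead before every character of the link body, so the match can never run across an <img or past the closing </a>. This is only a sketch: the pattern is a variant of the one in the question, not the exact expression above, and the function name hrefsWithoutImages is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): returns the href values of links
// whose body contains no <img tag.
function hrefsWithoutImages(string $content): array
{
    // (?:(?!<img\b|</a>).)*  consumes the link body one char at a time,
    // refusing to step onto "<img" or "</a>". If an <img appears before the
    // closing tag, the final </a> cannot match and the link is skipped.
    $pattern = '~<a\b[^>]*\bhref=["\']?([^ "\'>]+)["\']?[^>]*>((?:(?!<img\b|</a>).)*)</a>~is';
    preg_match_all($pattern, $content, $out);
    return $out[1]; // capture group 1 holds the href values
}
```

With the four sample links from the question in one string, this should return the domain1 and domain2 URLs only. It is still a regex over HTML, so nested or malformed markup can defeat it.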

More info on look-ahead and look-behind:

http://www.codinghorror.com/blog/2005/10/excluding-matches-with-regular-expressions.html

BombDefused
<facepalm /> ...
DisgruntledGoat
..........................
BombDefused