Hello, I am attempting to match part of a string that is composed of HTML. It is an image gallery, so the markup is highly repetitive. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.

The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.

I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.

Here is my current regex, which only returns an array with two NULL indices:

preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)

As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.

Now I realize . won't match line breaks, but I've tried everything I could think of there and it still gives me either the NULL indices or nearly the whole string (from the very first occurrence of <dl to </dl></div>, which includes several other occurrences of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.

Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.

+1  A: 

Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
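As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension. The sample markup in $foo and the XPath expression are assumptions made for the example; in particular, //div/dl[last()] picks the last <dl> child of each <div>, which only approximates "the <dl> right before </div>".

```php
<?php
// Hypothetical sample markup standing in for the generated gallery HTML.
$foo = "<div><dl><dt>one</dt></dl>\n<dl><dt>two</dt>\n<dd>caption</dd></dl></div>";

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($foo);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select, for every <div>, its last <dl> child element.
$nodes = $xpath->query('//div/dl[last()]');

// Serialize the first hit back to HTML, or null if nothing matched.
$lastDl = $nodes->length > 0 ? $doc->saveHTML($nodes->item(0)) : null;
```

Whether this is fast enough for a page that is already slow to load would have to be measured; it is shown only to make the suggestion concrete.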

soulmerge
I've posted this answer so often, I wonder when Google will start providing a link to it when someone searches for 'pain' on their site.
soulmerge
Thanks for your response. You must have it saved as a template, because I've seen it in other places as well. I would certainly consider a parser, but I know exactly how the HTML is formatted because I generate it myself in another file. Since I know the general form the HTML is going to take, I took regex to be an acceptable solution. Also, I didn't want to slow down execution any more than necessary, since I already consider the load time of this particular page borderline.
Ryan
A: 

I would suggest using tidy instead. You can easily extract all the desired tags with their contents, even from broken HTML.

In general I would not recommend writing a parser using regex.

See http://www.php.net/tidy

Pierre
A: 

As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.

preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);

The [^\z] craziness is just a way I used to say "match all characters, even line breaks"
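For comparison, here is a sketch of the same idea written with the s (PCRE_DOTALL) pattern modifier, which is the documented way to let . match line breaks. The sample markup in $foo is hypothetical, and the pattern is a variant of the one above rather than a drop-in replacement tested against the real gallery HTML.

```php
<?php
// Hypothetical sample markup standing in for the generated gallery HTML.
$foo = "<div><dl><dt>one</dt></dl>\n<dl><dt>two</dt>\n<dd>caption</dd></dl></div>";

// (?:(?!<dl).)*? consumes one character at a time and refuses to step
// onto another "<dl"; the s modifier lets "." match line breaks too.
$pattern = '/<dl(?:(?!<dl).)*?<\/dl>(?=<\/div>)/s';

// $bar[0] receives the full matches.
preg_match_all($pattern, $foo, $bar);
```

Here the lookahead is checked before each character is consumed, so a match can never run across the start of another <dl, and the trailing (?=<\/div>) keeps only a <dl>...</dl> that is immediately followed by </div>.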

Ryan