Hi all,

I'm trying to match the attributes of an HTML tag, but I can't get it working :)

Let's take this tag for example:

<a href="ddd" class='sw ' w'>

Obviously the last part (the stray w') is malformed.

Now I tried to match the attributes part with this piece of code:

preg_match('/(\s+\w+=(?P<quote>(\'|\"))[^(?P=quote)]*(?P=quote))*/U', " href=\"bla\" class='sw'sw'", $a);

Here $a is empty, and that's what I expect. But if I now take my complete expression it does match the last class part, which puzzles me. It looks like this:

preg_match('/<(?P<c>[\/]?)(?P<tag>\w+)(?P<atts>(\s+\w+=(?P<quote>(\'|\"))[^(?P=quote)]*(?P=quote))*)\s*(?P<sc>[\/]?)>/U', $tag, $a);

Now $a returns:

Array
(
[0] => <a href="ddd" class='sw ' w'>
[c] => 
[1] => 
[tag] => a
[2] => a
[atts] =>  href="ddd" class='sw ' w'
[3] =>  href="ddd" class='sw ' w'
[4] =>  class='sw ' w'
[quote] => '
[5] => '
[6] => '
[sc] => 
[7] => 
)

Notice key 4, which contains the class part including the trailing w', even though I used the U (ungreedy) modifier at the end.

Any clues?

+1  A: 

It's really a bad idea to try to parse HTML with a regex. PHP has a DOM extension (DOMDocument) that can do this.

squeeks
I know, but I'm fixing an existing library.
acidtv
Attempting to patch up a bad practice is, in itself, bad practice. Instead of trying to fix the expression, replace it with a DOM handler: you'll get your data, and chances are it won't break on variations in the markup fed to it.
squeeks
hmm ok, I'll have a look into it.
acidtv
A: 

[^(?P=quote)]

You can't do that. Character classes only contain single characters, backslash-escapes and - ranges; this character class matches any of the literal characters (, ), ?, P and so on.
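To make this concrete, here is a minimal sketch that applies the character class on its own. The variable names and test strings are only illustrative; the class itself is copied from the question.

```php
<?php
// Inside [...], "(?P=quote)" is not a group reference. It is just the
// literal characters ( ? P = q u o t e ), negated here by the leading ^.
$class = '/^[^(?P=quote)]*/';

// Matching stops at the first listed character: "e" is in the set.
preg_match($class, 'hello', $m1);
echo $m1[0], "\n"; // h

// Neither quote character is in the set, so both are consumed freely.
preg_match($class, "sw ' w'>", $m2);
echo $m2[0], "\n"; // sw ' w'>
```

This is also why the U modifier does not help in the full expression: the class is allowed to step over the inner quote, so the engine simply extends the lazy match until the rest of the pattern succeeds.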

Moreover, (?P=quote) is not a backreference, it's a recursive expression. It takes the regex from the earlier definition:

(?P<quote>(\'|\"))

and so matches either ' or " regardless of which quote was used at the start of the attribute value. Backrefs are done with expressions like \1 matching the numbered () match group.
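As a sketch of what a numbered backreference looks like in practice: one common way to say "any character except the opening quote" is a negative lookahead in front of each character, rather than a character class. The pattern and variable names below are illustrative assumptions, not part of the library being fixed.

```php
<?php
// Group 1 captures the opening quote. (?!\1) refuses to step over that
// same quote character, and the final \1 requires the matching closer.
$quoted = '/(["\'])((?:(?!\1).)*)\1/';

// A double-quoted value may contain a single quote.
$r1 = preg_match($quoted, '"it\'s"', $m1);
echo $m1[2], "\n"; // it's

// The value ends at the first quote of the same kind as the opener.
$r2 = preg_match($quoted, "class='sw ' w'", $m2);
echo $m2[2], "|\n"; // sw |

// Mismatched quotes do not match at all.
$r3 = preg_match($quoted, "\"abc'");
echo $r3, "\n"; // 0
```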

But anyway, squeeks is right: parsing [X][HT]ML with regex is a total losing game. You will never come up with an expression that treats all possible markup correctly. Stop wasting your time and use an XML or HTML parser.
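For comparison, a minimal sketch of reading attributes with PHP's DOM extension. It assumes ext-dom is enabled and uses a well-formed sample tag; how the parser recovers from the malformed tag in the question is not shown here.

```php
<?php
// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<a href="ddd" class="sw">link</a>');

// Gather name => value pairs for every attribute on each <a> element.
$atts = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    foreach ($a->attributes as $attr) {
        $atts[$attr->name] = $attr->value;
    }
}

print_r($atts);
```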

bobince
Ok, but what you said about (?P=quote) not being a backreference, I can't find anything about this in the documentation. What I can find is: "Back references to the named subpatterns can be achieved by (?P=name)", and "A back reference matches whatever actually matched the capturing subpattern in the current subject string, rather than anything matching the subpattern itself." Can you explain that? I'm trying to learn here :)
acidtv
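
The documentation quoted above can be checked directly. In PCRE, (?P=name) outside a character class is a backreference, while (?P>name) is the subroutine call that re-runs the named subpattern. A small sketch, with made-up variable names and test strings:

```php
<?php
// (?P=quote): backreference, must repeat the exact text group "quote" matched.
$backref = '/^(?P<quote>[\'"])\w+(?P=quote)$/';

// (?P>quote): subroutine call, re-runs the subpattern ['"] from scratch.
$subroutine = '/^(?P<quote>[\'"])\w+(?P>quote)$/';

$mixed = "'sw\""; // opens with ' but closes with "

$r1 = preg_match($backref, $mixed);    // closing " differs from opening '
$r2 = preg_match($subroutine, $mixed); // any quote is accepted as closer
$r3 = preg_match($backref, "'sw'");    // same quote on both sides

echo "$r1 $r2 $r3\n"; // 0 1 1
```

So the named reference does behave as the documentation says; the problem in the original expression is only that it was placed inside a character class.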