I'm trying my hand at writing my very first regex. I want to match a pseudo-HTML element and extract useful information such as the tag name, attributes, etc.:

$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

if (preg_match('/<(\w+?)(\s\w+?\s*=\s*".*?")+\s*>/', $string, $matches)) {
    print_r($matches);
}

Except, I'm getting:

Array ( [0] =>  [1] => testtag [2] => gamma="abc" )

Anyone know how I can get the other attributes? What am I missing?

+2  A: 

Try this regular expression:

/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]*))*)\s*>/

But you really shouldn’t use regular expressions for a context-free language like HTML. Use a real parser instead.
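
For illustration, here is a minimal sketch of the parser route using PHP's bundled DOM extension. It assumes ext-dom is available; the helper name tag_attributes is hypothetical and not something from the question. Unknown tag names such as testtag make libxml report errors, so the sketch collects them internally and discards them.

```php
<?php
// Hypothetical helper: returns the attributes of the first $tagName element
// found in $html as a name => value array, or null if there is no such element.
function tag_attributes(string $html, string $tagName): ?array
{
    // Collect libxml errors internally so unknown tags do not emit warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $doc->getElementsByTagName($tagName)->item(0);
    if ($element === null) {
        return null;
    }

    $attributes = [];
    foreach ($element->attributes as $attr) {
        $attributes[$attr->name] = $attr->value;
    }
    return $attributes;
}

print_r(tag_attributes('<testtag alpha="value" beta="xyz" gamma="abc"  >', 'testtag'));
```

The parser deals with quoting, whitespace and unclosed tags itself, which is the part a hand-written pattern tends to get wrong.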

Gumbo
Care to elaborate on what you mean by 'real parser'?
Tim Lytle
@Tim Lytle: Regexes are not parsers. They are *part of parsers*, at most. A real parser is an XML DOM parser, for example: it can parse languages, whereas regexes can only find patterns.
Tomalak
@Tomalak Ah, did not understand what he meant. Makes perfect sense now.
Tim Lytle
+1  A: 

As has been said, don't use RegEx for parsing HTML documents.

Try this PHP parser instead: http://simplehtmldom.sourceforge.net/

Peter Boughton
A: 

Your second capturing group matches the attributes one at a time, each time overwriting the previous one. If you were using .NET regexes, you could use the Captures array to retrieve the individual captures, but I don't know of any other regex flavor that has that feature. Usually you have to do something like capture all of the attributes in one group, then use another regex on the captured text to break out the individual attributes.
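
A rough sketch of that two-step approach in PHP, using the string from the question. The function name parse_tag is hypothetical, and the sketch assumes every attribute value is double-quoted. The first pattern captures the whole attribute run in one group; preg_match_all then splits that text into individual pairs.

```php
<?php
// Hypothetical helper: returns ['tag' => ..., 'attributes' => [...]] for a
// single opening tag, or null if the input does not look like one.
// Assumption: attribute values are always double-quoted.
function parse_tag(string $tag): ?array
{
    // Step 1: group 1 is the tag name, group 2 is ALL attributes as one string.
    if (!preg_match('/<(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*>/', $tag, $m)) {
        return null;
    }

    // Step 2: run a second regex over the captured text to split out each pair.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);

    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }

    return ['tag' => $m[1], 'attributes' => $attributes];
}

print_r(parse_tag('<testtag alpha="value" beta="xyz" gamma="abc"  >'));
```

Because the repeated part sits inside a non-capturing group wrapped by a single capturing group, nothing gets overwritten between iterations.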

This is why people tend to either love regexes or hate them (or both). You can do some truly amazing things with them, but you also keep running into simple tasks like this one that are ridiculously hard, if not impossible.

Alan Moore