ansaurus

Question

Answer 1

+2 A:

Try this regular expression:

/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]*))*)\s*>/

But you really shouldn’t use regular expressions for a context-free language like HTML. Use a real parser instead.
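As a rough sketch of what "a real parser" could look like in PHP, the following uses the built-in DOM extension (`DOMDocument`) to read an element's attributes. The helper name `tagAttributes` and the sample markup are made up for illustration, and this is only one of several parsers that would do the job.

```php
<?php
// Hypothetical helper: return [attrName => attrValue, ...] for the first
// element with the given tag name found in an HTML fragment.
function tagAttributes(string $html, string $tagName): array
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of emitting them;
    // real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $attributes = [];
    $element = $doc->getElementsByTagName($tagName)->item(0);
    if ($element !== null) {
        foreach ($element->attributes as $attr) {
            $attributes[$attr->name] = $attr->value;
        }
    }
    return $attributes;
}

// Sample markup, made up for illustration.
$attrs = tagAttributes('<a href="page.html" class=\'nav\' id=top>Home</a>', 'a');
print_r($attrs);
```

The parser deals with quoting styles, entities and malformed markup itself, so no pattern has to anticipate them.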

Gumbo 2009-07-06 15:50:00

Care to elaborate on what you mean by 'real parser'?

Tim Lytle 2009-07-06 15:56:00

@Tim Lytle: Regexes are not parsers. At most, they are *part of parsers*. A real parser, such as an XML DOM parser, can parse languages, whereas regexes can only find patterns.

Tomalak 2009-07-06 16:03:40

@Tomalak Ah, did not understand what he meant. Makes perfect sense now.

Tim Lytle 2009-07-27 17:08:20

Answer 2

+1 A:

As has been said, don't use RegEx for parsing HTML documents.

Try this PHP parser instead: http://simplehtmldom.sourceforge.net/

Peter Boughton 2009-07-06 17:57:58

Answer 3

A:

Your second capturing group matches the attributes one at a time, each time overwriting the previous one. If you were using .NET regexes, you could use the Captures array to retrieve the individual captures, but I don't know of any other regex flavor that has that feature. Usually you have to do something like capture all of the attributes in one group, then use another regex on the captured text to break out the individual attributes.

This is why people tend to either love regexes or hate them (or both). You can do some truly amazing things with them, but you also keep running into simple tasks like this one that are ridiculously hard, if not impossible.
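The two-step approach described above (capture all of the attributes in one group, then run a second regex over the captured text) might look roughly like this in PHP. The first pattern is the one from Answer 1. The second pattern, the variable names and the sample input are assumptions made for illustration, and the sketch inherits all the usual limits of matching HTML with regexes.

```php
<?php
// Step 1: match the tag, capturing its name (group 1) and the whole
// attribute string (group 2). This is the pattern from Answer 1.
$tagPattern = '/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\'">\s]*))*)\s*>/';

// Step 2: split the captured attribute string into name/value pairs.
// The branch-reset group (?|...) makes every alternative store its value
// in group 2, whichever quoting style was used.
$attrPattern = '/(\w+)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\'">\s]*))/';

// Sample input, made up for illustration.
$html = '<a href="page.html" class=\'nav\' id=top>';

$attributes = [];
if (preg_match($tagPattern, $html, $tag)) {
    // $tag[1] is the tag name, $tag[2] the raw attribute text.
    preg_match_all($attrPattern, $tag[2], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }
}

print_r($attributes);
```

Because `preg_match` keeps only the last value of a repeated capturing group, the second pass with `preg_match_all` is what recovers each attribute individually.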

Alan Moore 2009-07-06 18:01:32

PHP RegEx Grouping Multiple Matches