ansaurus

Question

Need a regular expression for an HTML element where the order of attributes doesn't matter

Answer 1

+7 A:

Don't parse HTML with regular expressions. HTML isn't regular.

Paul Tomblin 2010-02-20 21:29:52

Thank you, I agree with your answer. But there are no other HTML tags present except the span tags, and it's assured that no span can be inside another span, so each span simply represents a bracket with an id. Maybe later I will process this with JavaScript and the DOM.

hansa 2010-02-23 12:00:38

Answer 2

A:

I would do it in two steps, first finding the span tag with:

<span[^>]*class=\"classname\"[^>]*>

And then dig out the id from the tags that match the first pattern with

id=\"(\d+)\"

As others have pointed out, it's not a good idea to parse HTML with regular expressions. But for dirty data processing, this is how I would do it.
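The two steps above could be wired together roughly as follows. This is only a sketch in Java: the class name `TwoStepSpanIds`, the method `extractIds`, and the sample input are made up for illustration, and it assumes the input really contains nothing but flat, non-nested span tags with double-quoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class TwoStepSpanIds {
    // Step 1: an opening span tag with class="classname" anywhere among its attributes.
    private static final Pattern SPAN_TAG =
            Pattern.compile("<span[^>]*class=\"classname\"[^>]*>");

    // Step 2: the numeric id inside a tag that step 1 already matched.
    private static final Pattern ID_ATTR = Pattern.compile("id=\"(\\d+)\"");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher tag = SPAN_TAG.matcher(html);
        while (tag.find()) {
            // Search only within the matched tag, so an id from another tag can't leak in.
            Matcher id = ID_ATTR.matcher(tag.group());
            if (id.find()) {
                ids.add(id.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample input: attribute order differs between the first two spans.
        String html = "<span id=\"12\" class=\"classname\">a</span>"
                + "<span class=\"classname\" id=\"34\">b</span>"
                + "<span class=\"other\" id=\"56\">c</span>";
        System.out.println(extractIds(html)); // prints [12, 34]
    }
}
```

Because the second pattern only ever sees the text of one matched tag, the order of class and id inside that tag makes no difference.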

Hans W 2010-02-20 21:32:29

Thank you, I used this solution and it works. For more, see my reply to Paul Tomblin.

hansa 2010-02-23 12:02:06

Answer 3

+1 A:

This should do it:

String r = "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>";

The lookahead confirms that the span is of the desired class without consuming any characters. Then the rest of the regex, starting from the same position, searches for the id attribute and captures its value. The [^<>]* takes care of any other attributes that might be present, while ensuring that all matching occurs within the tag. (Technically, angle brackets can appear in attribute values, but you probably don't have to worry about that.)
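To see the lookahead version at work, here is a minimal sketch that runs the same pattern over some made-up input. The class name `LookaheadSpanIds`, the method `extractIds`, and the sample markup are assumptions for illustration only; it also assumes attribute values are double-quoted and contain no angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class wrapping the regex from the answer.
class LookaheadSpanIds {
    // The lookahead checks for class="className" inside the tag without consuming
    // anything; the rest of the pattern then finds and captures the id.
    private static final Pattern SPAN_WITH_ID = Pattern.compile(
            "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = SPAN_WITH_ID.matcher(html);
        while (m.find()) {
            ids.add(m.group(1)); // group 1 is the digits of the id attribute
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample input: id before class, class before id, and a non-matching class.
        String html = "<span id=\"7\" class=\"className\">a</span>"
                + "<span class=\"className\" title=\"x\" id=\"8\">b</span>"
                + "<span class=\"other\" id=\"9\">c</span>";
        System.out.println(extractIds(html)); // prints [7, 8]
    }
}
```

The third span is skipped because the lookahead cannot look past the closing `>` of its own tag, so a class attribute in a neighbouring tag never satisfies it.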

Alan Moore 2010-02-20 22:25:39

+1 nice one, even though a parser is more recommended for this task.

BalusC 2010-02-20 22:43:23

@BalusC: No argument there, but I think it's more helpful to give a regex answer if it's possible without invoking the name of Cthulhu. That way I can explain in concrete terms why the task is more complicated than the OP expected. "HTML isn't regular" is no help at all.

Alan Moore 2010-02-20 23:21:43

"HTML isn't regular" is plenty of help if you understand (or bother to look up) what "regular" means in terms of parsing computer syntaxes. If you understand (or look up) what regular means, you'll understand immediately that you CANNOT write a regex that parses HTML with 100% accuracy. It's not possible, by definition.

Paul Tomblin 2010-02-21 04:30:42

Most programmers are not computer scientists and don't want to be. They know regexes can be used to solve problems like this one, and that many people do in fact use them to process (not "parse") HTML. So they post their questions here, only to be told that regexes can't do that. How is that helpful? For the majority of specific, limited problems people come here with, regexes *can* do that. Anyway, if we don't give people love here at SO, they'll just go looking for it on RegExLib.com, and you wouldn't want that on your conscience. :P

Alan Moore 2010-02-21 07:45:36
