Hello,

I need a regular expression to detect a span element where the order of id and class doesn't matter. The class name is always the same and the id is always a fixed number of digits, for example:

<span class="className" id="123">

or

<span id="321" class="className" >

My approach for a regular expression in java was:

String pattern = "<span class=\"className\" id=\"\\d*\">";

but that only matches one of the two versions. Can somebody help?

Thanks, hansa

+7  A: 

Don't parse HTML with regular expressions. HTML isn't regular.

Paul Tomblin
Thank you, I agree with your answer. But there are no other HTML tags present except the span tags, and it's guaranteed that no span can be inside another span, so the span simply represents a bracket with an id. Maybe later I will process this with JavaScript and the DOM.
hansa
A: 

I would do it in two steps, first finding the span tag with:

<span[^>]*class=\"className\"[^>]*>

And then dig out the id from the tags that match the first pattern with

id=\"(\d+)\"

As others have pointed out, it's not a good idea to parse HTML with regular expressions. But for dirty data processing, this is how I would do it.
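A minimal Java sketch of the two steps, assuming the class name from the question (className) and input that contains only flat, non-nested span tags. The class and method names (TwoStepSpanMatcher, findIds) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepSpanMatcher {
    // Step 1: a span tag that carries class="className" anywhere among its attributes.
    private static final Pattern SPAN_TAG =
            Pattern.compile("<span[^>]*class=\"className\"[^>]*>");

    // Step 2: the numeric id, searched only inside a tag that step 1 matched.
    private static final Pattern ID_ATTR = Pattern.compile("id=\"(\\d+)\"");

    // Collects the ids of all matching span tags, in document order.
    static List<String> findIds(String input) {
        List<String> ids = new ArrayList<>();
        Matcher tag = SPAN_TAG.matcher(input);
        while (tag.find()) {
            Matcher id = ID_ATTR.matcher(tag.group());
            if (id.find()) {
                ids.add(id.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String input = "<span class=\"className\" id=\"123\">a</span>"
                + "<span id=\"321\" class=\"className\" >b</span>";
        System.out.println(findIds(input)); // prints [123, 321]
    }
}
```

Because the id is looked up only in the text of an already matched tag, the attribute order no longer matters.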

Hans W
Thank you, I used this solution and it works. For more see answer to Paul Tomblin.
hansa
+1  A: 

This should do it:

String r = "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>";

The lookahead confirms that the span is of the desired class without consuming any characters. Then the rest of the regex, starting from the same position, searches for the id attribute and captures its value. The [^<>]* takes care of any other attributes that might be present, while ensuring that all matching occurs within the tag. (Technically, angle brackets can appear in attribute values, but you probably don't have to worry about that.)
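As a rough check, the pattern can be run against both attribute orders from the question. This is only a sketch: the class and method names (LookaheadSpanMatcher, firstId) are invented for the example, and it assumes there are no angle brackets inside attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSpanMatcher {
    // The lookahead checks for class="className" somewhere in the tag;
    // the rest of the pattern then captures the id from the same starting position.
    private static final Pattern SPAN = Pattern.compile(
            "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>");

    // Returns the id of the first matching span tag, or null if there is none.
    static String firstId(String input) {
        Matcher m = SPAN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstId("<span class=\"className\" id=\"123\">"));  // prints 123
        System.out.println(firstId("<span id=\"321\" class=\"className\" >")); // prints 321
        System.out.println(firstId("<span id=\"7\" class=\"other\">"));        // prints null
    }
}
```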

Alan Moore
+1 nice one, even though a parser is the recommended tool for this task.
BalusC
@BalusC: No argument there, but I think it's more helpful to give a regex answer if it's possible without invoking the name of Cthulhu. That way I can explain in concrete terms why the task is more complicated than the OP expected. "HTML isn't regular" is no help at all.
Alan Moore
"HTML isn't regular" is plenty of help if you understand (or bother to look up) what "regular" means in terms of parsing computer syntaxes. If you understand (or look up) what regular means, you'll understand immediately that you CANNOT write a regex that parses HTML with 100% accuracy. It's not possible, by definition.
Paul Tomblin
Most programmers are not computer scientists and don't want to be. They know regexes can be used to solve problems like this one, and that many people do in fact use them to process (not "parse") HTML. So they post their questions here, only to be told that regexes can't do that. How is that helpful? For the majority of specific, limited problems people come here with, regexes *can* do that. Anyway, if we don't give people love here at SO, they'll just go looking for it on RegExLib.com, and you wouldn't want that on your conscience. :P
Alan Moore