I am trying to find data within an HTML document. I don't need a full-blown parser, as it is just the data between one pair of tags.

But, I want to detect the 'select' tag and the data in between.

return Pattern.compile(pattern, 
                       Pattern.CASE_INSENSITIVE | Pattern.MULTILINE |
                       Pattern.DOTALL);

// Closing right angle bracket left off intentionally:
track_pattern_buf.append("<select");
track_pattern_buf.append("(.*?)");
track_pattern_buf.append("</select");

Is this the 'regex' that you would use?

A: 

I would use something that looked like:

"<select>([^<>]+)</select>"

I'm not sure why you left off the '>'s and I wouldn't want to match other tags (here I'm assuming we're looking for textual data and not a document fragment).

That being said, I'd really look into getting a DOM and using XPath (or similar) to do your queries, as regexes are not well suited to dealing with trees.
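As a rough illustration of the DOM-plus-XPath route, here is a minimal sketch using only the JDK's built-in XML APIs. It assumes the markup is well-formed XML (XHTML, for example); ordinary HTML generally has to go through an HTML-aware parser first. The class name `SelectXPath`, the method `optionTexts`, and the sample markup are hypothetical and not taken from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectXPath {
    // Returns the trimmed text of every <option> directly inside a <select>.
    // Assumption: the input is well-formed XML; malformed HTML will be rejected.
    static List<String> optionTexts(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//select/option", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Parsing or XPath evaluation failed; report it as bad input.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><select name=\"color\">"
                + "<option>Red</option><option>Blue</option>"
                + "</select></body></html>";
        System.out.println(optionTexts(xhtml));
    }
}
```

Attributes on the tags, nested elements, and entities are handled by the parser here, so the query itself stays a single short expression.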

Aaron Maenpaa
Leave off the '>' from the opening tag in case there are attributes. I don't think there's a reason for leaving it off the closing tag.
Bill the Lizard
This would fail to match any <option> tags inside the <select>, since you are stopping at the first <.
Sean Bright
Those are some of the many reasons why I'd highly recommend you use XPath rather than putting together a nasty regex that works provided you don't actually care about attribute values, namespaces, entities, etc.
Aaron Maenpaa
A: 

I think it would be safer to use something like:

"<\s*select\s*>(.*?)<\s*/select\s*>"

To be safer still, you should probably add \w* after the first select in case any other select options appear.

Also, the third \s* could probably be skipped if your HTML is standards-compliant.

hyperboreean
If by "other select options" you mean SELECT tag attributes, \w* won't cut it. Also, I don't see any need to allow whitespace after the opening angle bracket. Unless the OP comes up with more detailed requirements, @Gumbo's regex is the way to go.
Alan Moore
A: 

If you really want to stick with regular expressions (which are not the best choice here), I’d use:

"<select[^>]*>(.+?)</select\s*>"
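To try this pattern from Java, a minimal sketch might look like the following. It reuses the CASE_INSENSITIVE and DOTALL flags from the question; MULTILINE is left out because it only changes how ^ and $ behave, and this pattern uses neither. The class name `SelectRegex`, the method `selectBodies`, and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectRegex {
    // [^>]* skips any attributes on the opening tag; (.+?) lazily captures
    // everything up to the nearest closing </select>.
    // DOTALL lets '.' match line breaks; CASE_INSENSITIVE accepts <SELECT>.
    private static final Pattern SELECT = Pattern.compile(
            "<select[^>]*>(.+?)</select\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw text between each <select ...> and its </select>.
    static List<String> selectBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = SELECT.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<SELECT name=\"color\">\n"
                + "<option>Red</option>\n"
                + "</select>";
        System.out.println(selectBodies(html));
    }
}
```

Note that the captured group is a document fragment, <option> tags included, so pulling out the individual option values would need a second pass.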
Gumbo