ansaurus

Question

How to change this regex to properly extract tag attributes - should be simple

Answer 1

+2 A:

Your pattern is greedy. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" instead (note the ? added after each .*, which makes the quantifier lazy). Please note that this will not work correctly with nested <editable> elements.

If you don't need to check that the closing comment is present, you could simply use "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for simplicity and performance.

Edit: should be

Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );

streetpc 2009-06-17 08:36:47
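
To see what the added ? changes, here is a minimal sketch comparing the lazy pattern with a greedy variant. The class name EditableNames, the helper names(), and the sample input are assumptions made for illustration; they do not come from the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EditableNames {
    // Lazy quantifier: the group stops at the first position where "> can follow.
    static final Pattern LAZY = Pattern.compile("<editable name=\"(.*?)\">");
    // Greedy quantifier, kept only for comparison: the group runs to the last "> on the line.
    static final Pattern GREEDY = Pattern.compile("<editable name=\"(.*)\">");

    // Collects group 1 of every match of the given pattern (hypothetical helper).
    static List<String> names(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input: two editable blocks on a single line.
        String input = "<!-- <editable name=\"a\"> -->x<!-- </editable> -->"
                     + "<!-- <editable name=\"b\"> -->y<!-- </editable> -->";
        System.out.println("lazy:   " + names(LAZY, input));
        System.out.println("greedy: " + names(GREEDY, input));
    }
}
```

With two blocks on one line, the lazy pattern yields the two names separately, while the greedy variant produces a single match whose group stretches from the first name to the last closing quote.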

That doesn't work either. What are the \ for in \?\> -- why would you escape the ? and > characters?

Ankur 2009-06-17 08:41:29

Because those characters can be special in a regex. Escaping the ? was incorrect though, so I removed it. And in a Java string literal the backslash itself has to be escaped as well, giving \\>.

streetpc 2009-06-17 08:52:49

Fixed it for use in Java.

streetpc 2009-06-17 08:57:27

'<', '>' and '!' don't need to be escaped.

Alan Moore 2009-06-17 12:51:59

! is used in negative look-ahead patterns and < in look-behind. Indeed, > does not need to be escaped (yet). But escaping does no harm AFAIK, so I often do it anyway.

streetpc 2009-06-17 15:32:11
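
The escaping discussion in these comments can be checked directly. The sketch below assumes Java's documented rule that a backslash may precede any non-alphabetic character, and compares an escaped pattern with its unescaped form. The class name EscapeCheck, the helper capture(), and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeCheck {
    // Every '<', '!' and '>' escaped. Each regex backslash is doubled in the Java string literal.
    static final Pattern ESCAPED =
        Pattern.compile("\\<\\!-- \\<editable name=\"(.*?)\"\\> --\\>");
    // The same pattern with no escapes: these characters are only special right after "(?".
    static final Pattern PLAIN =
        Pattern.compile("<!-- <editable name=\"(.*?)\"> -->");

    // Returns group 1 of the first match, or null when nothing matches (hypothetical helper).
    static String capture(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<!-- <editable name=\"header\"> -->";
        System.out.println(capture(ESCAPED, input));
        System.out.println(capture(PLAIN, input));
    }
}
```

Both patterns capture the same attribute value, so in Java the extra backslashes are a matter of taste rather than correctness.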

Answer 2

+2 A:

I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of any-character matches. I'm not sure how similar JavaScript regexes are to Java's, but instead of the expression \".*\" try \"[^\"]*\". That matches any run of characters within the attribute value that aren't quotes, so the expression can't match beyond the attribute value.

Hope that helps

Andy E 2009-06-17 08:37:08
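
A minimal Java sketch of the not-quotes approach follows. The class name QuotedAttribute, the helper firstName(), and the extra class attribute in the sample input are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedAttribute {
    // [^"]* matches any run of non-quote characters, so the group cannot cross the closing quote.
    static final Pattern NAME = Pattern.compile("<editable name=\"([^\"]*)\"");

    // Returns the first name attribute value, or null when the tag is absent (hypothetical helper).
    static String firstName(String input) {
        Matcher m = NAME.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A second attribute after name would trip up a pattern that requires "> right after the value.
        String input = "<!-- <editable name=\"header\" class=\"wide\"> -->";
        System.out.println(firstName(input));
    }
}
```

Because the character class excludes the quote itself, no lazy quantifier is needed and the regex engine does not have to backtrack to find the end of the value.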

+1 for the not-quotes approach. FYI, Java regexes can do everything the JavaScript flavor can, plus a lot more.

Alan Moore 2009-06-17 12:59:32

Thanks. Yeah, I know JavaScript's regexes are lacking in some areas, lookbehinds for example. Hopefully that will improve in time.

Andy E 2009-06-18 08:21:05

Answer 3

+2 A:

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group to get only the content-goes-here bit.

This worked for me:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
// Group 1 is the quoted name attribute; group 2 is the content between the two comments.
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}

This prints:

 - content goes here -

Zarkonnen 2009-06-17 08:42:01
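
When the input contains several editable blocks, a greedy .* would run from the first opening comment to the last closing one. The sketch below combines a negated character class for the name with a lazy group for the content, and loops over all matches. The class name EditableBlocks, the helper parse(), the use of Pattern.DOTALL (so content may span lines), and the sample input are assumptions, not part of the original question. Nested editable elements are still not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EditableBlocks {
    // Group 1: attribute value without quotes. Group 2: content up to the nearest closing comment.
    // DOTALL lets '.' match line breaks, assuming block content may span several lines.
    static final Pattern BLOCK = Pattern.compile(
        "<!-- <editable name=\"([^\"]*)\"> -->(.*?)<!-- </editable> -->",
        Pattern.DOTALL);

    // Maps each block name to its content, in document order (hypothetical helper).
    static Map<String, String> parse(String input) {
        Map<String, String> blocks = new LinkedHashMap<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            blocks.put(m.group(1), m.group(2));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "A<!-- <editable name=\"title\"> -->Hello<!-- </editable> -->"
                     + "B<!-- <editable name=\"body\"> -->line1\nline2<!-- </editable> -->C";
        for (Map.Entry<String, String> e : parse(input).entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}
```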

Answer 4

A:

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the answer using TagSoup helpful.

Chas. Owens 2009-06-17 13:58:23
