I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.

A sample of XML that I need to work with is

<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->

I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below but this returns a null value.

My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"

I am trying to grab the attribute together with its quotation marks, since I figure this is the easiest and most general pattern to match. It just doesn't work; any help will help me keep my hair.

+2  A: 

Your search is greedy. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" (added ?). Please note that this one will not work correctly with nested <editable> elements.

If you don't want to perform syntax checking, you could also simply go with: "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for better simplicity and performance.

Edit: should be

Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );
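
For illustration, here is a minimal sketch of how that pattern could be used to pull out the attribute value. The class name EditableName and the helper firstName are made up for this example; the unnecessary escapes on < and > are dropped, which does not change what the pattern matches. It uses find(), which searches for the pattern anywhere in the input, rather than matches(), which requires the whole input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class EditableName {
    // Reluctant (.*?) stops at the first closing quote instead of the last one.
    private static final Pattern EDITABLE =
            Pattern.compile("<editable name=\"(.*?)\">");

    // Returns the first name attribute value, or null if nothing matches.
    static String firstName(String input) {
        Matcher m = EDITABLE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->";
        System.out.println(firstName(xml)); // prints nameValue
    }
}
```

Group 1 holds the value without the surrounding quotes; move the parentheses outside the \" pair if the quotes should be included.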
streetpc
That doesn't work either. What are the \ for in \?\> -- why would you escape the ? and > characters?
Ankur
Because those characters can be special characters in a regex. The ? is incorrect though, removed it. And actually in a Java string, I should escape the backslash as well => \\>.
streetpc
Fixed it for use in Java.
streetpc
'<', '>' and '!' don't need to be escaped.
Alan Moore
! is used in negative look-ahead pattern and < in look-behind. Indeed, > does not need to be escaped (yet). But it doesn't harm AFAIK, so I often do it anyway.
streetpc
+2  A: 

I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of any-character matches. I'm not sure how similar regexes are in Java, but instead of the expression \".*\" try \"[^\"]*\". That matches only characters within the attribute value that aren't a quote, so the expression can't match beyond the attribute value.
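
A rough Java sketch of the same idea, assuming the input may contain several editable markers on one line (the class name EditableNames and the helper allNames are invented for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the negated-class approach.
class EditableNames {
    // [^\"]* can never run past the closing quote of the attribute value.
    private static final Pattern NAME =
            Pattern.compile("<editable name=\"([^\"]*)\">");

    // Collects every name attribute value found in the input, in order.
    static List<String> allNames(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<!-- <editable name=\"a\"> --> x <!-- </editable> -->"
                   + "<!-- <editable name=\"b\"> --> y <!-- </editable> -->";
        System.out.println(allNames(xml)); // prints [a, b]
    }
}
```

With the greedy \".*\" from the question, the same input would produce a single match running from the first quote to the last one on the line, so the two names would not be separated.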

Hope that helps

Andy E
+1 for the not-quotes approach. FYI, Java regexes can do everything the JavaScript flavor can, plus a lot more.
Alan Moore
Thanks. Yeah, I know JavaScript's regexes are lacking in some areas (lookbehinds, for example). Hopefully that will improve in time.
Andy E
+2  A: 

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:

This worked for me:

String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}

This prints:

 - content goes here -
Zarkonnen
A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the answer using TagSoup helpful.

Chas. Owens