Given that the following string is embedded in text, how can I extract the whole line but not matching on the inner "<" and ">"?

<test type="yippie<innertext>" />

EDIT:
To be more specific, we need to handle both use cases below, where the "type" value may or may not contain "<" and ">" characters.

<h:test type="yippie<innertext>" />
<h:test type="yippie">

Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" '  -or-  ' type="yippie"'   (i.e., the remaining content before ">" or "/>")

So far, I have something like this, but it's a little off: Group 2 stops at the first ">". I'm still tweaking the first part of Group 2's condition.

(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)

Thanks for your help.
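
For illustration, here is a minimal Python sketch of where the attempt in the question stops short. The variable names (attempt, line, whole_match and so on) are made up for this example, and Python's re module is assumed only as a convenient way to run the pattern. Because the first alternative refuses quotes and the second one is lazy, the match ends at the first ">" inside the attribute value.

```python
import re

# The attempt from the question, unchanged.
attempt = re.compile(r'(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)')

line = '<h:test type="yippie<innertext>" />'
m = attempt.search(line)

whole_match = m.group(1)  # ends at the first ">", inside the quoted value
tag_name = m.group(2)     # the tag name is captured correctly
remainder = m.group(3)    # the attribute text is cut short

print(whole_match)
print(tag_name)
print(remainder)
```

The tag name comes out fine; only the remainder needs a rule that treats a quoted string as a single unit.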

+1  A: 

Try this:

<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>

Example usage (Python):

>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
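
As a slightly fuller sketch, the same pattern can be checked against both use cases from the question. The names TAG, samples and results are illustrative only, not part of the original answer.

```python
import re

# Pattern from the answer above: group 1 is the tag name, group 2 is the
# rest of the tag up to ">" or "/>", with quoted strings consumed whole.
TAG = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')

samples = [
    '<h:test type="yippie<innertext>" />',
    '<h:test type="yippie">',
]

# Collect (tag name, remaining content) for each sample line.
results = [TAG.search(s).groups() for s in samples]
for name, rest in results:
    print(repr(name), repr(rest))
```

Note that, as written, the pattern requires whitespace after the tag name, so a bare tag such as <h:test> with no attributes would not match.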

Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.

Mark Byers
Yep, you're on it. I should have been more clear and complete. I need the match grouped so that the tag name and the remaining content are captured separately. See above.
cwall
A: 

It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:

<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />

Furthermore, you probably need to escape the inner < and > as &lt; and &gt;, since well-formed XML does not allow a literal < inside an attribute value.
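
A rough sketch of what that escaping buys you, assuming Python's standard library and a simplified tag without the h: prefix (a prefixed name would also need a namespace declaration). The names raw_value, xml_text and elem are hypothetical. Once the value is escaped, an ordinary XML parser reads the tag and hands back the original text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

raw_value = 'yippie<innertext>'

# quoteattr escapes &, < and > and wraps the result in quotes.
xml_text = '<test type=%s />' % quoteattr(raw_value)
print(xml_text)

# A real XML parser now accepts the tag and unescapes the attribute value.
elem = ET.fromstring(xml_text)
print(elem.tag, elem.get('type'))
```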

For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

eaolson
I wish I could. Existing implementation forces my hand.
cwall