<tag value='botafogo'> botafogo is the best </tag>

I need to match only the botafogo in the element's text (botafogo is the best), not the 'botafogo' in the attribute value.

My program automatically "annotates" terms in plain text, changing:

botafogo is the best 

to

<team attr='best'>botafogo</team> is the best 

But when I then "replace all" on the word "best", I have a big problem: the occurrence inside the attribute value gets annotated too.

<team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective>

P.S.: I'm using Java.

+5  A: 

The best way to accomplish this is NOT to use regular expressions but to use a proper HTML parser. HTML is not a regular language, and doing this with regular expressions will be tedious, hard to maintain, and more than likely still contain various errors.

HTML parsers, on the other hand, are well suited for the job. Many of them are mature and reliable; they take care of every little detail for you and make your life much easier.
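To illustrate the parser approach on the question's own input, here is a minimal sketch. It is written in Python (the language of the snippet further down the thread) and uses the standard library's html.parser; in Java, a parser library would play the same role. The names TextAnnotator and annotate are made up for this example, and the sketch assumes the whole markup is fed in one call and contains no comments or doctype declarations (this version would drop them).

```python
import re
from html.parser import HTMLParser


class TextAnnotator(HTMLParser):
    """Hypothetical helper: wraps a word in a tag, but only inside text nodes."""

    def __init__(self, word, tag):
        # Keep entities such as &amp; untouched so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b%s\b" % re.escape(word))
        self.tag = tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Echo the start tag exactly as written, attributes included.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed verbatim too.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Only text nodes reach this method, so attribute values are safe.
        wrap = lambda m: "<%s>%s</%s>" % (self.tag, m.group(0), self.tag)
        self.out.append(self.pattern.sub(wrap, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def annotate(markup, word, tag):
    parser = TextAnnotator(word, tag)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


annotated = annotate("<team attr='best'>botafogo</team> is the best",
                     "best", "adjective")
print(annotated)
```

Because the replacement happens only in handle_data, the attr='best' attribute is never touched, which is the failure the "replace all" approach runs into.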

polygenelubricants
*"While you can hack around these problems with more and more regular expression cleverness, you eventually paint yourself into a corner with complexity. Regular expressions don't truly understand the code that they are colorizing-- but parsers do."* -- http://www.codinghorror.com/blog/2005/04/parsing-beyond-regex.html
John K
+4  A: 

Have you considered using DOM functions instead of a regex?

document.getElementsByTagName('tag')[0].innerHTML.match('botafogo')
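Outside a browser, the same DOM idea can be sketched with any DOM library. The example below uses Python's xml.dom.minidom and assumes the input is well-formed XML (true for the question's sample, not for HTML in general); element_text is a hypothetical helper name.

```python
from xml.dom.minidom import parseString


def element_text(markup, tag_name):
    """Return the text directly inside the first <tag_name> element."""
    doc = parseString(markup)
    element = doc.getElementsByTagName(tag_name)[0]
    # Join only the text-node children; attribute values are not included.
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


text = element_text("<tag value='botafogo'> botafogo is the best </tag>", "tag")
print("botafogo" in text)
```

The match is made against the element's text only, so a term that appears solely in the value attribute is not found.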
S.Mark
+1  A: 

An HTML parser is best: parse the markup, then loop over the text contents. (See the other answers.)

If you're in PHP, a quick solution is to run strip_tags() on the content to remove the HTML first. Whether that works depends on what you're doing: for a replace, stripping first is not an option; for a plain match, content that is not part of the match can be removed without concern.

Matchu
My program automatically "annotates" terms in plain text: "botafogo is the best" becomes "<team attr='best'>botafogo</team> is the best", and when I "replace all" on the word "best", I have a big problem: "<team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective>"
celsowm
Well. No good stripping, then. But I'll leave the answer for reference.
Matchu
A: 

@OP, in your favourite language, split on </tag>, then split each piece on > and take the last part. E.g. in Python:

>>> s = "<tag value='botafogo'> botafogo is the best </tag>"
>>> for item in s.split("</tag>"):
...     if "<tag" in item:
...         # everything after the last ">" is the element's text
...         print(item.split(">")[-1])
...
 botafogo is the best 

No regex needed

ghostdog74