Example: <a><strike>example data in here</strike></a>

I want everything inside the <a> tag, up to the closing </a>.

/<a>([^<]*)<\/a>/

It works when there are no additional tags within the <a> tag, but what if there are?

I want to know if you can tell it to grab everything up to </a> (something like [^</a>]) instead of stopping at the first < as [^<] does.

Doing it with /<a>(.*)<\/a>/ doesn't work well. Sometimes I get everything in the <a> tag and other times I get tons of lines included in that call.

+17  A: 
/<a>(.*?)<\/a>/

should work. The ? makes the quantifier lazy, so it grabs as little as possible before matching the </a> part. Using . instead of [^<] means it matches any character, nested tags included, until it finds </a>.
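
To make the greedy/lazy difference concrete, here is a minimal sketch in Python's re module (the sample string is made up for illustration, and the question does not say which regex engine is in use, so treat the flags as an assumption about a Python-like engine):

```python
import re

# Hypothetical sample: two links on one line, the first with a nested tag.
html = "<a><strike>first</strike></a> filler <a>second</a>"

# Greedy: .* runs to the LAST </a> on the line, swallowing both links.
greedy = re.findall(r"<a>(.*)</a>", html)

# Lazy: .*? stops at the FIRST </a> after each <a>.
lazy = re.findall(r"<a>(.*?)</a>", html)

# By default . does not match a newline; re.DOTALL lets a link span lines.
multiline = re.findall(r"<a>(.*?)</a>", "<a>line one\nline two</a>", re.DOTALL)

print(greedy)     # ['<strike>first</strike></a> filler <a>second']
print(lazy)       # ['<strike>first</strike>', 'second']
print(multiline)  # ['line one\nline two']
```

The greedy result is the "tons of lines" symptom from the question: the capture runs from the first <a> to the last </a> it can reach.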

Kibbee
Yes, this is much better than my response. This works.
Jeff Yates
This will work until you have an <a> inside an <a>: <a><a></a></a>. This is identical to the famous parentheses-matching problem, which has no solution with conventional regex. You're better off with a plain old stack.
wilhelmtell
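
A rough sketch of the "plain old stack" idea from the comment above, in Python. The function name outermost_a_contents is hypothetical, and the tokenizer is only a regex that finds <a ...> and </a> tags, so this is an illustration under those assumptions rather than a robust HTML handler:

```python
import re

def outermost_a_contents(html):
    """Return the inner HTML of each outermost <a>...</a> pair."""
    results = []
    stack = []  # content start offsets of the <a> tags currently open
    for m in re.finditer(r"<a\b[^>]*>|</a>", html):
        if m.group().startswith("</"):
            if not stack:
                continue  # stray closing tag: ignore it
            start = stack.pop()
            if not stack:  # we just closed an outermost <a>
                results.append(html[start:m.start()])
        else:
            stack.append(m.end())
    return results

nested = "<a>outer <a>inner</a> tail</a>"

# The lazy regex pairs the first <a> with the first </a> and cuts the link short.
print(re.findall(r"<a>(.*?)</a>", nested))  # ['outer <a>inner']

# The stack pairs each </a> with its own <a>, so the outer link comes back whole.
print(outermost_a_contents(nested))         # ['outer <a>inner</a> tail']
```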
In what messed up version of HTML does an <a> occur within another <a> tag?
Kibbee
@Kibbee: the messed up version you find scattered all over the place on the world wide web ;-)
webmat
However, even if you used a nice HTML Parser like Beautiful Soup, what kind of results could you expect in a situation with nested links? For bad input the results are undefined. Does it throw an exception, or does it make a best guess? There is no right output when bad data is fed in.
Kibbee
+4  A: 

Don't use regular expressions to parse HTML: they are notoriously difficult to get right, and HTML in the wild is notoriously unreliable about being properly structured. This is one more place where reinventing the wheel doesn't save you anything. In the Python world we always suggest the popular BeautifulSoup library, and I'm sure your language has a good, super-easy parser as well.
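
As an illustration of the parser route, here is a small sketch that uses Python's standard-library html.parser as a stand-in for BeautifulSoup, so it runs without installing anything. The class name LinkTextParser is made up for this example, and it collects the text inside each link rather than the raw inner HTML:

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collect the text inside each <a> element, dropping nested markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <a> tags are currently open
        self.parts = []  # text pieces seen inside the current link
        self.links = []  # finished link texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.links.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

parser = LinkTextParser()
parser.feed('<p><a href="#"><strike>example data in here</strike></a></p>')
print(parser.links)  # ['example data in here']
```

Unlike the /<a>(.*?)<\/a>/ pattern, this also copes with attributes on the <a> tag without any changes.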

ironfroggy
Nobody here is trying to parse an entire document. This question is specifically about getting the text between the <a> </a> tags. For that small problem set, regular expressions work perfectly fine. Loading up and parsing the entire document just for this small task is overkill.
Kibbee