ansaurus

Question

REGEX: Grabbing everything until a specific word

Answer 1

+17 A:

/<a>(.*?)<\/a>/

should work. The ? makes the match lazy, so it grabs as little as possible before matching the </a> part. Using . means it matches any character until it finds </a>.
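
As a rough illustration of the lazy versus greedy difference, here is a minimal Python sketch. The sample string is made up for the example and is not from the original question. Note that in Python . does not match a newline unless re.DOTALL is passed, so this only covers links that sit on a single line.

```python
import re

# Hypothetical sample input, invented for this illustration.
html = '<a>first</a> some text <a>second</a>'

# Lazy: .*? stops at the first </a> it can reach.
lazy = re.findall(r'<a>(.*?)</a>', html)

# Greedy: .* runs to the last </a> in the string.
greedy = re.findall(r'<a>(.*)</a>', html)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</a> some text <a>second']
```

The pattern also assumes a bare <a> with no attributes; something like <a href="..."> would need the opening part written as <a[^>]*>.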

Kibbee 2008-09-29 00:17:36

Yes, this is much better than my response. This works.

Jeff Yates 2008-09-29 00:30:50

This will work until you have an <a> inside an <a>: <a><a></a></a>. This is identical to the famous parentheses-matching regex problem, and there is no solution to it with conventional regular expressions. You're better off with a plain old stack.
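
To make the nesting problem concrete, here is a small Python sketch. The input string and the helper name outermost_links are hypothetical, and the stack approach shown is only one possible reading of "a plain old stack", assuming bare <a> tags without attributes.

```python
import re

# Hypothetical nested input, invented for this illustration.
nested = '<a>outer <a>inner</a> tail</a>'

# The lazy pattern pairs the first <a> with the first </a>.
matches = re.findall(r'<a>(.*?)</a>', nested)
print(matches)  # ['outer <a>inner']


def outermost_links(text):
    """Return the contents of each outermost <a>...</a> pair using a stack."""
    results, stack = [], []
    for m in re.finditer(r'</?a>', text):
        if m.group() == '<a>':
            stack.append(m.end())      # remember where the content starts
        elif stack:
            start = stack.pop()
            if not stack:              # just closed an outermost tag
                results.append(text[start:m.start()])
    return results


print(outermost_links(nested))  # ['outer <a>inner</a> tail']
```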

wilhelmtell 2008-09-30 00:17:06

In what messed up version of HTML does an <a> occur within another <a> tag?

Kibbee 2008-09-30 00:42:46

@Kibbee: the messed up version you find scattered all over the place on the world wide web ;-)

webmat 2008-09-30 02:06:46

However, even if you used a nice HTML Parser like Beautiful Soup, what kind of results could you expect in a situation with nested links? For bad input the results are undefined. Does it throw an exception, or does it make a best guess? There is no right output when bad data is fed in.

Kibbee 2008-09-30 12:39:54

Answer 2

+4 A:

Don't use regular expressions to parse HTML: they are notoriously difficult to get right, and HTML in the wild is notoriously unreliable about being properly structured. This is one more place where reinventing the wheel doesn't save you anything. In the Python world we always suggest the popular BeautifulSoup library, and I'm sure your language has a good, super-easy parser as well.
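
For comparison, here is a minimal parser-based sketch. It uses Python's built-in html.parser module rather than BeautifulSoup, purely to avoid a third-party dependency. The class name LinkTextCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text inside each outermost <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # how many <a> tags are currently open
        self.links = []    # text of each outermost link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            if self.depth == 0:
                self.links.append('')
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.links[-1] += data


# Hypothetical sample markup, invented for this illustration.
collector = LinkTextCollector()
collector.feed('<p>See <a href="/docs">the <b>docs</b></a> and <a>more</a>.</p>')
collector.close()
print(collector.links)  # ['the docs', 'more']
```

Unlike the regex, this handles attributes on the tag and markup nested inside the link, at the cost of scanning the whole input.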

ironfroggy 2008-09-29 00:31:44

Nobody here is trying to parse an entire document. This question is specifically about getting the text between the <a> </a> tags. For that small problem set, regular expressions work perfectly fine. Loading up and parsing the entire document just for this small task is overkill.

Kibbee 2008-09-30 12:44:58
