ansaurus

Question

Answer 1

+2 A:

You're trying to parse HTML with regexps, which works only in the simplest cases, since HTML isn't a regular language. A much more reliable solution is to use an HTML parser. Many exist, for many different languages.

Brian Agnew 2010-07-04 23:11:25

This is a pretty simple case though. You're not parsing the HTML so much as stripping a specific string (`</a>`) and any strings matching a specific pattern (`<a ... >`) from a block of text. This type of manipulation is precisely what regexps are designed for. None of the aspects of HTML that make it _non-regular_ come into play (i.e. anchors can't be nested within one another, and we don't care about any other tags).

Lèse majesté 2010-07-05 01:25:17
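A minimal sketch of the two-step stripping described in the comment above, assuming POSIX `sed` and well-formed anchors whose attributes contain no `>`. The function name `strip_anchor_tags` is made up for illustration.

```shell
# Hypothetical helper: remove anchor tags but keep the link text.
strip_anchor_tags() {
  # First expression drops opening tags (<a> or <a ...>), requiring a space
  # or '>' right after "<a" so tags such as <abbr> are left alone.
  # Second expression drops the literal closing tag </a>.
  sed -e 's/<a\( [^>]*\)\{0,1\}>//g' -e 's/<\/a>//g'
}

echo 'Foo bar <a href="http://www.foo.com">blah</a> bar foo' | strip_anchor_tags
# prints: Foo bar blah bar foo
```

Because the two substitutions are independent, this also handles several links on one line, but it makes no attempt to understand the surrounding markup.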

But do you care about tags in comments etc.?

Brian Agnew 2010-07-05 07:06:12
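To make the concern about comments concrete, here is a rough sketch, assuming POSIX `sed`. The input line and the `old.example.com` URL are invented for illustration. A regexp has no notion of HTML comments, so an anchor inside one is rewritten like any other.

```shell
# Hypothetical helper: replace each <a ...>text</a> with just its text.
strip_links() {
  sed -e 's/<a [^>]*>\([^<]*\)<\/a>/\1/g'
}

# The first anchor sits inside an HTML comment, yet it is rewritten too.
echo '<!-- <a href="http://old.example.com">old</a> --> Foo <a href="http://www.foo.com">blah</a>' | strip_links
# prints: <!-- old --> Foo blah
```

Whether that matters depends on the input. For plain text with the occasional link it probably does not, but for arbitrary HTML a parser is the safer choice.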

Answer 2

A:

Try:

sed -e 's/<a[^>]*>.*<\/a>//g' test.txt

patrick 2010-07-04 23:12:00

This would produce "Foo bar bar foo" instead of "Foo bar blah bar foo" for the example in question. See danlei's solution for the correct version.

Bolo 2010-07-04 23:28:36
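The difference is easy to see side by side. This is a sketch assuming POSIX `sed`. The sample line and its `example.com` URLs are made up, and the capturing variant is a basic-regexp rendering of danlei's pattern.

```shell
# Hypothetical sample line with two links (the example.com URLs are made up).
line='Foo <a href="http://a.example.com">one</a> bar <a href="http://b.example.com">two</a> baz'

# Pattern from the answer above: deletes the link text too, and the greedy .*
# runs from the first <a ...> to the last </a> on the line.
drop_links() { sed -e 's/<a[^>]*>.*<\/a>//g'; }

# Capturing variant: [^<]* cannot cross a tag boundary, and \1 keeps the text.
keep_link_text() { sed -e 's/<a [^>]*>\([^<]*\)<\/a>/\1/g'; }

echo "$line" | drop_links      # prints "Foo  baz" (two spaces)
echo "$line" | keep_link_text  # prints "Foo one bar two baz"
```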

Answer 3

+2 A:

sed -re 's|<a [^>]*>([^<]*)</a>|\1|g'

But Brian's answer is right: This should only be used in very simple cases.

danlei 2010-07-04 23:23:32

Answer 4

A:

$ echo 'Foo bar <a href="http://www.foo.com">blah</a> bar foo' | awk 'BEGIN{RS="</a>"; ORS=""}/<a href/{gsub(/<a href=\042.*\042>/,"")}1'

Foo bar blah bar foo

ghostdog74 2010-07-04 23:47:16


Remove links from text file
