Hi,

how can I remove links from raw HTML text? I've got:

Foo bar <a href="http://www.foo.com">blah</a> bar foo

and want to get:

Foo bar blah bar foo

afterwards.

+2  A: 

You're looking to parse HTML with regexps, and this won't work in all but the simplest cases, since HTML isn't regular. A much more reliable solution is to use an HTML parser. Numerous exist, for many different languages.
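As one illustration of the parser approach, here is a minimal sketch using Python's standard-library `html.parser`. The class name `LinkStripper` and the helper `strip_links` are made up for this example, and the code assumes you want to keep everything except the `<a>` start and end tags themselves. End tags are re-emitted in lowercase, so the output is not guaranteed to be byte-identical to the input for other tags.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Rebuilds the input, dropping <a> start/end tags but keeping their text."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # get_starttag_text() returns the tag as it appeared in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        # Comments are copied through untouched, including any tags inside them
        self.parts.append(f"<!--{data}-->")


def strip_links(html):
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_links('Foo bar <a href="http://www.foo.com">blah</a> bar foo'))
# prints: Foo bar blah bar foo
```

Because the parser reports comments separately from tags, an anchor that appears inside `<!-- ... -->` is left alone, which a plain regex substitution would not do.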

Brian Agnew
This is a pretty simple case though. You're not parsing the HTML so much as stripping a specific string (`</a>`) and any strings matching a specific pattern (`<a ... >`) from a block of text. This type of manipulation is precisely what regexp is designed for. None of the aspects of HTML which make it _non-regular_ come into play (i.e. anchors can't be nested within one another, and we don't care about any other tags).
Lèse majesté
But do you care about tags in comments etc.?
Brian Agnew
A: 

Try:

sed -e 's/<a[^>]*>.*<\/a>//g' test.txt
patrick
This would produce "Foo bar bar foo" instead of "Foo bar blah bar foo" for the example in question. See danlei's solution for the correct version.
Bolo
+2  A: 
sed -re 's|<a [^>]*>([^<]*)</a>|\1|g'

But Brian's answer is right: This should only be used in very simple cases.
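To see where "very simple cases" ends, here is a rough Python port of the same pattern (an illustration only, not the sed command itself; `strip_simple_links` is a name invented for this sketch). It handles the example from the question but leaves an anchor untouched as soon as the link text contains another tag.

```python
import re

# Python port of the sed expression: keep only the captured link text
ANCHOR = re.compile(r'<a [^>]*>([^<]*)</a>')


def strip_simple_links(text):
    # Replace each matched anchor with group 1 (the text between the tags)
    return ANCHOR.sub(r'\1', text)


simple = 'Foo bar <a href="http://www.foo.com">blah</a> bar foo'
nested = 'Foo <a href="http://www.foo.com"><b>blah</b></a> bar'

print(strip_simple_links(simple))  # Foo bar blah bar foo
print(strip_simple_links(nested))  # unchanged: [^<]* cannot cross the inner <b> tag
```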

danlei
A: 

$ echo 'Foo bar <a href="http://www.foo.com">blah</a> bar foo' | awk 'BEGIN{RS="</a>"; ORS=""}/<a href/{gsub(/<a href=\042.*\042>/,"")}1'

Foo bar blah bar foo

ghostdog74