I have the following code that strips all tags. Now I want to strip only anchor tags.

x = re.compile(r'<[^<]*?/?>')

How can I modify it so that only anchor tags are stripped?

+1  A: 
x = re.compile(r'<[aA]\b[^<]*?/?>')

This will match the 'a' or 'A' followed by a word boundary (\b in Python's re module). Note that it won't clean out the closing tag.

x = re.compile(r'</?[aA]\b[^<]*?/?>')

will remove the closing tag as well.

EDIT:
Actually, it feels more reliable to switch the [^<] to [^>], like so.

x = re.compile(r'</?[aA]\b[^>]*?/?>')
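For reference, here is a minimal sketch of how such a pattern could be used to strip the tags. It assumes Python's re module, where the word boundary is written \b (a \> in a Python pattern just matches a literal >). The names ANCHOR_TAG and strip_anchor_tags are made up for illustration, and the pattern will still trip over attribute values that contain a > character.

```python
import re

# \b is Python's word-boundary assertion, so <abbr> or <address> are not matched.
# re.IGNORECASE covers both <a> and <A>.
ANCHOR_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)

def strip_anchor_tags(html):
    """Remove <a ...> and </a> tags but keep the text between them."""
    return ANCHOR_TAG.sub('', html)

sample = '<p>See the <a href="/">homepage</a> or the <abbr title="x">FAQ</abbr>.</p>'
print(strip_anchor_tags(sample))
# prints: <p>See the homepage or the <abbr title="x">FAQ</abbr>.</p>
```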
roe
@roe I have run this, but it is not working.
A: 

I'm not sure if this Python is correct (I'm a PHP guy but am just starting to learn Python in my own time).

re.sub(r'<[aA]\b[^>]*>([^<]+)</[aA]>', r'\1', '<html><head> .... </body></html>')

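This won't necessarily remove all anchor tags: re.sub replaces every match in one pass, but the pattern only matches anchors whose contents contain no other tags, so something like <a href="/"><b>home</b></a> is left untouched. It matches the anchor tags and replaces the match with the contents of the tags. So ...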

<a href="/">homepage</a> -> homepage

Might not be the most efficient on a large body of text but works.
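A small sketch of the substitution above, to show what it does and does not catch. The names ANCHOR and unwrap_simple_anchors are hypothetical, and the pattern is the same idea with \b added so that tags such as <abbr> are not mistaken for anchors.

```python
import re

# Assumed pattern: an anchor whose content contains no other tags.
ANCHOR = re.compile(r'<a\b[^>]*>([^<]+)</a>', re.IGNORECASE)

def unwrap_simple_anchors(html):
    # r'\1' refers to the first capture group; without the r prefix,
    # '\1' would be the single control character chr(1).
    return ANCHOR.sub(r'\1', html)

print(unwrap_simple_anchors('<a href="/">homepage</a>'))
# prints: homepage

print(unwrap_simple_anchors('go <a href="1">x</a> or <a href="2">y</a>'))
# prints: go x or y

# An anchor with nested markup does not match and is left alone.
print(unwrap_simple_anchors('<a href="/"><b>home</b></a>'))
# prints: <a href="/"><b>home</b></a>
```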

Greg K
+6  A: 

following code that strip all tags.

Not really. <div title="a>b"> is valid HTML and gets mangled. <div title="<" onmouseover="script()" class="<">"> is invalid HTML but the kind of thing you will often find on real web pages. Your regexp leaves an active tag with dangerous scripting in it.
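To see both failures concretely, here is a quick sketch that runs the pattern from the question over those two inputs. The function name strip_all_tags is just for illustration.

```python
import re

# The pattern from the question, unchanged.
STRIP_ALL = re.compile(r'<[^<]*?/?>')

def strip_all_tags(html):
    return STRIP_ALL.sub('', html)

# Valid HTML: the match stops at the > inside the attribute value,
# so half of the attribute leaks into the output.
print(strip_all_tags('<div title="a>b">text</div>'))
# prints: b">text

# Invalid but realistic HTML: only the inner <"> and the closing tag match,
# so the opening tag with its onmouseover handler survives.
print(strip_all_tags('<div title="<" onmouseover="script()" class="<">">text</div>'))
# prints: <div title="<" onmouseover="script()" class="">text
```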

You can't do an HTML-processing task like tag-stripping with regex, unless your possible input set is heavily restricted. Better to use a real HTML parser and walk across the resulting document removing unwanted elements as you go.

eg. with BeautifulSoup:

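from bs4 import BeautifulSoup  # assuming BeautifulSoup 4 (the bs4 package)

def replaceWithContents(element):
    # Move the element's children into its parent at the element's position,
    # then remove the now-empty element from the tree.
    ix = element.parent.contents.index(element)
    for child in reversed(element.contents):
        element.parent.insert(ix, child)
    element.extract()

doc = BeautifulSoup(html, 'html.parser')  # maybe from_encoding='utf-8' for byte input
for link in doc.find_all('a'):
    replaceWithContents(link)
str(doc)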
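If installing BeautifulSoup isn't an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a rough illustration, not a hardened sanitizer: AnchorStripper and strip_anchors are made-up names, processing instructions are not re-emitted, and end tags are rebuilt in lower case rather than copied verbatim.

```python
from html.parser import HTMLParser

class AnchorStripper(HTMLParser):
    """Re-emit the document as parsed, except for <a> start and end tags."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; reach the
        # handle_entityref / handle_charref handlers instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':  # the parser lower-cases tag names
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the original text once.
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != 'a':
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)

def strip_anchors(html):
    parser = AnchorStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

print(strip_anchors('<div title="a>b">See <a href="/">the <b>home</b> page</a> &amp; more</div>'))
# prints: <div title="a>b">See the <b>home</b> page &amp; more</div>
```

Unlike the regex versions, this keeps the title="a>b" attribute intact and copes with markup nested inside the anchor.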
bobince
Are you sure `<div title="a>b">` is valid? I know most browsers will accept it, but I don't think it's actually valid (it should be `<div title="a&gt;b">`).
roe
Yep, it's perfectly valid in both HTML and XHTML. Chuck it into the validator and see! The `&gt;` escape is never actually needed in HTML. (It is in XML, in one specific and unusual case, but never in attribute values.)
bobince
You're right.. how odd. '<' is not valid though.
roe