ansaurus

Question

Answer 1

+1 A:

x = re.compile(r'<[aA]\b[^<]*?/?>')

This will match the 'a' or 'A' followed by a word boundary (written \b in Python's re module), so tags such as <abbr> are left alone. Note that it won't clean out the closing tag.

x = re.compile(r'</?[aA]\b[^<]*?/?>')

will remove the closing tag as well.

EDIT:
Actually, it feels more reliable to switch the [^<] to [^>], like so.

x = re.compile(r'</?[aA]\b[^>]*?/?>')
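As a rough illustration of that last pattern, here is a minimal sketch run against a made-up string. It assumes Python's re module, where the word boundary is spelled \b; the names ANCHOR_TAG, sample and stripped are invented for the example.

```python
import re

# Word boundary in Python's re is \b, so <abbr> and </abbr> are not touched.
ANCHOR_TAG = re.compile(r'</?[aA]\b[^>]*?/?>')

sample = '<p>Go <a href="/">home</a> or read <abbr title="x">y</abbr>.</p>'

# Remove opening and closing anchor tags, keeping the text between them.
stripped = ANCHOR_TAG.sub('', sample)
print(stripped)
```

This only behaves on simple input; an attribute value containing '>' would still cut the match short.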

roe 2010-04-07 10:07:13

@roe I have run this, but it is not working.

2010-04-07 10:21:30

Answer 2

A:

I'm not sure if this Python is correct (I'm a PHP guy but am just starting to learn Python in my own time).

re.sub(r'<[aA][^>]*>([^<]+)</[aA]>', r'\1', '<html><head> .... </body></html>')

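This won't remove all anchor tags in one shot: re.sub replaces every non-overlapping match, but because of [^<]+ the pattern only matches anchors whose contents contain no other tags. Anchors nested inside other anchors need another pass, so you may have to loop over the HTML string, and anchors that wrap other markup are not matched at all. It matches the anchor tags and replaces the match with the contents of the tags. So ...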

<a href="/">homepage</a> -> homepage

Might not be the most efficient on a large body of text but works.
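A small sketch of that substitution on an invented string may make the behaviour clearer. The variable names and sample markup are assumptions for illustration. Raw strings are used so that \1 reaches re as a backreference rather than as the control character \x01.

```python
import re

# ([^<]+) captures the anchor's contents, but only when they contain no other tag.
pattern = re.compile(r'<[aA][^>]*>([^<]+)</[aA]>')

html = '<p><a href="/">homepage</a> and <a href="/x"><b>bold</b></a></p>'

# r'\1' replaces each matched anchor with its captured contents.
result = pattern.sub(r'\1', html)
print(result)
```

The first anchor is unwrapped; the second is left in place because its contents start with another tag.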

Greg K 2010-04-07 10:55:38

Answer 3

+6 A:

"following code that strip all tags."

Not really. <div title="a>b"> is valid HTML and gets mangled. <div title="<" onmouseover="script()" class="<">"> is invalid HTML but the kind of thing you will often find on real web pages. Your regexp leaves an active tag with dangerous scripting in it.
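To see both failures concretely, here is a hedged sketch. The pattern under discussion is not reproduced in this thread, so the code assumes a tag stripper of the form <[^<]*?> as a stand-in; ASSUMED_TAG and the variable names are hypothetical.

```python
import re

# Assumed stand-in for the tag-stripping regex being criticised (not the asker's actual code).
ASSUMED_TAG = re.compile(r'<[^<]*?>')

valid = '<div title="a>b">text</div>'
hostile = '<div title="<" onmouseover="script()" class="<">">'

# The '>' inside the attribute value ends the match early, leaving attribute debris.
cleaned_valid = ASSUMED_TAG.sub('', valid)

# Only the innermost '<">' is removed, so the tag and its event handler survive.
cleaned_hostile = ASSUMED_TAG.sub('', hostile)

print(cleaned_valid)
print(cleaned_hostile)
```

Under that assumption the first input comes out as b">text and the second keeps its onmouseover attribute.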

You can't do an HTML-processing task like tag-stripping with regex, unless your possible input set is heavily restricted. Better to use a real HTML parser and walk across the resulting document removing unwanted elements as you go.

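e.g. with BeautifulSoup:

from bs4 import BeautifulSoup  # BeautifulSoup 4; the 2010 original targeted BeautifulSoup 3

def replaceWithContents(element):
    # Move the element's children into its parent, in order, at the element's position
    ix = element.parent.contents.index(element)
    for child in reversed(element.contents):
        element.parent.insert(ix, child)
    # Then drop the now-empty element itself
    element.extract()

# html is the markup string to clean
doc = BeautifulSoup(html, 'html.parser')  # maybe from_encoding='utf-8' for byte input
for link in doc.find_all('a'):
    replaceWithContents(link)
str(doc)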
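If BeautifulSoup is not available, roughly the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code above: AnchorStripper and strip_anchors are made-up names, and the sketch silently drops comments, doctypes and processing instructions.

```python
from html.parser import HTMLParser


class AnchorStripper(HTMLParser):
    """Re-emit the markup it is fed, skipping only <a> start and end tags."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            # get_starttag_text() returns the tag as written in the source
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, not as a start/end pair
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != 'a':
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)


def strip_anchors(markup):
    """Return markup with anchor tags removed and their contents kept."""
    parser = AnchorStripper()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts)


print(strip_anchors('<div title="a>b"><a href="/">home &amp; <b>away</b></a></div>'))
```

Because the parser tokenises attribute values properly, the quoted '>' in title="a>b" does not end the tag early.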

bobince 2010-04-07 11:09:50

Are you sure `<div title="a>b">` is valid? I know most browsers will accept it, but I don't think it's actually valid (it should be `<div title="a&gt;b">`).

roe 2010-04-07 13:04:00

Yep, it's perfectly valid in both HTML and XHTML. Chuck it into the validator and see! The `&gt;` escape is never actually needed in HTML. (It is in XML, in one specific and unusual case, but never in attribute values.)

bobince 2010-04-07 13:14:27

You're right.. how odd. '<' is not valid though.

roe 2010-04-08 09:03:13

strip only html anchor tags.

related questions