I have the following code that strips all tags. Now I want to strip only anchor tags.
x = re.compile(r'<[^<]*?/?>')
How do I modify it so that only anchor tags are stripped?
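For context, this is roughly how that pattern gets applied (the sample HTML string below is just an illustration):

import re

x = re.compile(r'<[^<]*?/?>')
print(x.sub('', '<p>See the <a href="/">homepage</a></p>'))
# -> See the homepage   (every tag is removed)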
x = re.compile(r'<[aA]\b[^<]*?/?>')
This will match the 'a' or 'A' followed by a word boundary. Note that it won't clean out the closing tag.
x = re.compile(r'</?[aA]\b[^<]*?/?>')
will remove the closing tag as well.
EDIT:
Actually, it feels more reliable to switch the [^<] to [^>], like so:
x = re.compile(r'</?[aA]\b[^>]*?/?>')
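For example (the sample string here is just an illustration), substituting with the empty string strips the anchor tags but leaves every other tag alone:

import re

x = re.compile(r'</?[aA]\b[^>]*?/?>')
print(x.sub('', '<p>See the <a href="/">homepage</a> for more.</p>'))
# -> <p>See the homepage for more.</p>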
I'm not sure if this Python is correct (I'm a PHP guy but am just starting to learn Python in my own time).
re.sub(r'<[aA][^>]*>([^<]+)</[aA]>', r'\1', '<html><head> .... </body></html>')
This may not remove all anchor tags in one shot (for example, when anchors are nested), so you may have to loop over the HTML string. It matches an anchor tag and replaces the match with the tag's contents. So ...
<a href="/">homepage</a> -> homepage
It might not be the most efficient on a large body of text, but it works.
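A rough sketch of that loop idea (the helper name and sample string are just placeholders): keep applying the substitution until the string stops changing, which also takes care of nested anchors.

import re

pattern = re.compile(r'<[aA][^>]*>([^<]+)</[aA]>')

def strip_anchors(html):
    # Re-run the substitution until nothing changes, e.g. for nested anchors.
    previous = None
    while previous != html:
        previous = html
        html = pattern.sub(r'\1', html)
    return html

print(strip_anchors('<p><a href="/">homepage</a> and <a href="/about">about</a></p>'))
# -> <p>homepage and about</p>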
"the following code that strips all tags"

Not really. <div title="a>b"> is valid HTML and gets mangled by that pattern. <div title="<" onmouseover="script()" class="<">"> is invalid HTML, but it is the kind of thing you will often find on real web pages, and your regexp leaves an active tag with dangerous scripting in it.
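To see the mangling concretely, here is the question's pattern run against those two examples (a sketch, nothing more):

import re

strip_all = re.compile(r'<[^<]*?/?>')

# The > inside the attribute value ends the match early, so attribute
# debris leaks into the output:
print(strip_all.sub('', '<div title="a>b">hello</div>'))
# -> b">hello

# Here no match covers the div itself, so the tag and its onmouseover
# handler survive:
print(strip_all.sub('', '<div title="<" onmouseover="script()" class="<">">'))
# -> <div title="<" onmouseover="script()" class="">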
You can't do an HTML-processing task like tag-stripping with regex, unless your possible input set is heavily restricted. Better to use a real HTML parser and walk across the resulting document removing unwanted elements as you go.
e.g. with BeautifulSoup:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

def replaceWithContents(element):
    # Move the element's children up into its parent, then drop the element.
    ix = element.parent.contents.index(element)
    for child in reversed(element.contents):
        element.parent.insert(ix, child)
    element.extract()

doc = BeautifulSoup(html)  # maybe fromEncoding='utf-8'
for link in doc.findAll('a'):
    replaceWithContents(link)
str(doc)
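For illustration (the sample fragment is just an example, and this assumes the old BeautifulSoup 3 API used above), feeding a small string through it unwraps the anchors while leaving every other tag in place:

from BeautifulSoup import BeautifulSoup

html = '<p>See the <a href="/">homepage</a> and <a href="/about">about</a> pages.</p>'
doc = BeautifulSoup(html)
for link in doc.findAll('a'):
    replaceWithContents(link)
print(str(doc))
# The anchors are gone but their contents remain, e.g.:
# <p>See the homepage and about pages.</p>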