I need to replace links in HTML:

<a href="http://example.com"></a>

with just the plain-text URL:

http://example.com

UPD: Some clarification: I need this to strip HTML tags from text while preserving link locations. It's purely for internal use, so there won't be any crazy edge cases. The language is Python in this case, but I don't see how that's relevant.

+1  A: 
>>> s="""blah <a href="http://example.com"></a> blah <a href="http://www.google.com">test</a>"""
>>> import re
>>> pat=re.compile(r'<a\s+href="(.*?)">.*?</a>',re.M|re.DOTALL|re.I)
>>> pat.findall(s)
['http://example.com', 'http://www.google.com']
>>> pat.sub("\\1",s)
'blah http://example.com blah http://www.google.com'

For more complex operations, use BeautifulSoup.
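BeautifulSoup is a third-party package, so as a rough illustration of the parser-based route, here is a sketch that uses only the standard library's html.parser instead (a swapped-in technique, not BeautifulSoup itself). The names LinkFlattener and flatten_links are made up for this example, and it assumes Python 3.

```python
from html.parser import HTMLParser


class LinkFlattener(HTMLParser):
    """Drops all tags and emits the href in place of each <a href=...>...</a>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # pieces of the resulting plain text
        self.skip = False  # True while inside an <a> that has an href

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Anchors without a usable href (e.g. <a name="x">) keep their text.
            if href:
                self.out.append(href)
                self.skip = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.skip = False

    def handle_data(self, data):
        # Text inside a replaced link is dropped; all other text is kept.
        if not self.skip:
            self.out.append(data)


def flatten_links(html):
    parser = LinkFlattener()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Unlike the regex above, this does not care about attribute order or quoting style, e.g. flatten_links('<a class="x" href=\'/test\'>link</a>') gives '/test'.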

ghostdog74
This isn't going to work if there are any other attributes in your anchor tags... and if you try to accommodate them, your regex will quickly get out of control.
Renesis
Simple, and it works. Thanks.
Dmitry Shevchenko
A: 

Instead of using a regex, you could try using unlink with minidom.
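One way this might look, as a minimal sketch: in minidom, unlink() only releases a node's internal references, so the actual replacement here is done with replaceChild(), and unlink() is just cleanup. The sketch assumes the input is well-formed XML (quoted attributes, closed tags, ampersands escaped), which real-world HTML often is not. The names links_to_text and _text are invented for the example.

```python
from xml.dom import minidom


def _text(node):
    # Concatenate all text beneath a node, ignoring the tags themselves.
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(_text(child) for child in node.childNodes)


def links_to_text(fragment):
    # Wrap the fragment in a dummy root so it parses as one XML document.
    doc = minidom.parseString("<root>%s</root>" % fragment)
    # Copy the node list before mutating the tree.
    for a in list(doc.getElementsByTagName("a")):
        if a.hasAttribute("href"):
            url = doc.createTextNode(a.getAttribute("href"))
            a.parentNode.replaceChild(url, a)
            a.unlink()  # release the detached <a> node
    result = _text(doc.documentElement)
    doc.unlink()
    return result
```

If the markup is not well-formed, parseString raises an exception, so this only suits input you control.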

S.Mark
Um, how would that work here? :)
Dmitry Shevchenko
+1  A: 

As I said before, if you are OK with some mistakes and/or have some control over the input, you can compromise on completeness and use a regex. Since your update says this is the case, here's a regex that should work for you:

/<a\s(?:.(?!=href))*?href="([^"]*)"[^>]*?>(.*?)</a>/gi
  • $1: The HREF
  • $2: Everything inside the tag.

This will handle all the test cases below except the last three lines:

Hello this is some text <a href="/test">This is a link</a> and this is some more text.
<a href="/test">Just a link on this line.</a>
There are <a href="/test">two links </a> on <a href="http://www.google.com">this line</a>!
Now we need to test some <a href="http://www.google.com" class="test">other attributes.</a>. They can be <a class="test" href="http://www.google.com">before</a> or after.
Or they can be <a rel="nofollow" href="http://www.google.com" class="myclass">both</a>
Also we need to deal with <a href="/test" class="myclass" style=""><span class="something">Nested tags and empty attributes</span></a>.
Make sure that we don't do anything with <a name="marker">anchors with no href</a>
Make sure we skip other <address href="/test">tags that start with a even if they are closed with an a</a>
Lastly try some other <a href="#">types</a> of <a href="">href</a> attributes.

Also we need to skip <a malformed tags.  </a>.  But <a href="#">this</a> is where regex fails us.
We will also fail if the user has used <a href='javascript:alert("the reason"))'>single quotes for some reason</a>
Other invalid HTML such as <a href="/link1" href="/link2">links with two hrefs</a> will have problems for obvious reasons.
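
Since the question mentions Python, here is a sketch of how the pattern above might be applied there. The /.../gi delimiters and flags are not Python syntax: re.I stands in for "i", and sub() already replaces every match, which covers "g". The names LINK and links_to_urls are just for illustration.

```python
import re

# Same pattern as above, written as a raw string so the backslashes
# reach the regex engine unchanged.
LINK = re.compile(r'<a\s(?:.(?!=href))*?href="([^"]*)"[^>]*?>(.*?)</a>', re.I)


def links_to_urls(text):
    # Replace each matched link with group 1 (the HREF).
    # Use r"\2" instead to keep the link text rather than the URL.
    return LINK.sub(r"\1", text)
```

For example, links_to_urls('They can be <a class="test" href="http://www.google.com">before</a> or after.') returns 'They can be http://www.google.com or after.'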
Renesis
excellent answer, thank you.
Dmitry Shevchenko