ansaurus

Question

Answer 1

+2 A:

You can load the document with a DOM/HTML parsing library (see html5lib), grab all the text nodes, match them against a regular expression, and replace each matching text node with the URI wrapped in an anchor, using a PCRE such as:

/(https?:[;\/?\\@&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\@\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g

I'm quite sure you can scour around and find some sort of utility that does this; I can't think of any off the top of my head though.

Edit: Try using the answers here: http://stackoverflow.com/questions/203859/how-do-i-get-python-markdown-to-additionally-urlify-links-when-formatting-plain

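import re

# Matches a dotted-quad IP address, a scheme:// host, or a www./ftp. host, followed by an optional port and a path.
urlfinder = re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nntp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]")

def urlify2(value):
    # \g<0> is the whole match; \1 would keep only the scheme and host and drop the port and path.
    return urlfinder.sub(r'<a href="\g<0>">\g<0></a>', value)

Call urlify2 on a string and I think that's it, if you aren't dealing with a DOM object.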

meder 2009-11-13 06:43:18

So, what is not allowed in a url?

Amarghosh 2009-11-13 06:45:37

Btw, what if the link is already inside href attribute of an anchor tag?

Amarghosh 2009-11-13 06:46:36

When you're inside the text node, make sure the parent or ancestor isn't an anchor.

meder 2009-11-13 06:49:26
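
The text-node approach and the "not inside an anchor" check discussed above can be sketched with the standard library alone. This is a minimal illustration, not what the answer's author used: it relies on html.parser instead of html5lib, and the names URL_RE, SKIP_TAGS, Linkifier and linkify_html are hypothetical. The URL pattern is deliberately crude (it will swallow a trailing period, and a URL containing an entity such as &amp; is split into separate text chunks), so treat it as a starting point.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Text inside these elements is passed through untouched.
SKIP_TAGS = {"a", "script", "style"}


class Linkifier(HTMLParser):
    """Re-emit the HTML, wrapping bare URLs found in text nodes with anchors."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element listed in SKIP_TAGS

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never contain text.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)
        else:
            self.out.append(URL_RE.sub(self._link, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    @staticmethod
    def _link(match):
        url = match.group(0)
        return f'<a href="{escape(url, quote=True)}">{url}</a>'


def linkify_html(html):
    parser = Linkifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling linkify_html on a fragment leaves existing anchors alone and wraps only the bare URLs in other text nodes. A real DOM library would also cope with malformed markup, which this sketch does not attempt.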

@Amarghosh - The coder's regex dictates what a "url" is. Of course I agree with you that it's going to be difficult to make one that's compatible with all types of URLs, but even the most basic one will support 95% of URLs.

meder 2009-11-13 06:52:16

Answer 2

A:

When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you mean an HTML document, you will want to use Beautiful Soup to parse it, then search through the body text and insert the tags.

Matching the actual URLs is probably best done with the urlparse module. Full discussion here: http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python

steveha 2009-11-13 06:48:54
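
As a rough illustration of the urlparse suggestion above: in Python 3 the same function lives in urllib.parse. The sketch below splits plain text on whitespace and keeps only tokens that parse with an allowed scheme and a host. The names ALLOWED_SCHEMES, looks_like_url and linkify_words are hypothetical, and the scheme list is an assumption to adjust. Note that splitting and rejoining collapses runs of whitespace, and that bare www.example.com tokens are not recognised.

```python
from html import escape
from urllib.parse import urlparse

# Assumption: only these schemes should become links.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(token):
    """Return True if token parses as an absolute URL with a host."""
    try:
        parts = urlparse(token)
    except ValueError:  # raised for some malformed inputs, e.g. a bad IPv6 host
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


def linkify_words(text):
    """Escape plain text for HTML and wrap URL-like tokens in anchors."""
    words = []
    for word in text.split():
        safe = escape(word, quote=True)
        if looks_like_url(word):
            safe = f'<a href="{safe}">{safe}</a>'
        words.append(safe)
    return " ".join(words)
```

urlparse is lenient, so this checks the shape of a token, not whether the URL is valid or reachable.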

Answer 3

A:

Gmail is a lot more open when it comes to URLs, but it is not always right either. For example, it will make www.a.b into a hyperlink as well as http://a.b, but it often fails because of wrapped text and uncommon (but valid) URL characters.

See "Appendix A. Collected BNF for URI" for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You'd be well advised to consider a couple of scenarios where URLs might end up.

John Leidegren 2009-11-13 06:51:02
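
One way to take the surroundings into account, as the answer above recommends, is to match generously and then trim characters that more likely belong to the sentence than to the URL. The sketch below is an assumption-laden illustration, not a grammar-based matcher: CANDIDATE_RE, TRAILING, _to_anchor and linkify_text are hypothetical names, bare www. hosts are assumed to mean http://, and URLs broken across wrapped lines are not handled.

```python
import re
from html import escape

# Hypothetical pattern: an explicit http(s) scheme or a bare "www." prefix,
# followed by any run of non-space characters.
CANDIDATE_RE = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)

# Trailing characters that usually belong to the sentence, not the URL.
TRAILING = ".,;:!?\"'"


def _to_anchor(match):
    url = match.group(0)
    trimmed = url.rstrip(TRAILING)
    # Drop a closing parenthesis only when it has no opening partner in the URL.
    while trimmed.endswith(")") and trimmed.count("(") < trimmed.count(")"):
        trimmed = trimmed[:-1].rstrip(TRAILING)
    tail = url[len(trimmed):]
    # Assumption: a bare www. host is meant to be reached over http.
    href = trimmed if "://" in trimmed else "http://" + trimmed
    return f'<a href="{escape(href, quote=True)}">{escape(trimmed)}</a>{escape(tail)}'


def linkify_text(text):
    return CANDIDATE_RE.sub(_to_anchor, text)
```

With this trimming, the comma after www.a.b and a parenthesis that closes the surrounding sentence stay outside the link, while balanced parentheses inside the URL are kept.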

Answer 4

A:

/\w+:\/\/[^\s]+/

echo 2009-11-13 07:45:19
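
For comparison, the short pattern above can be used from Python as follows. SIMPLE_URL and linkify_simple are hypothetical names. The pattern accepts any scheme followed by ://, and because it runs up to the next whitespace it also captures trailing punctuation, which may or may not be acceptable for your input.

```python
import re

# The pattern from the answer above, without the surrounding slashes.
SIMPLE_URL = re.compile(r"\w+://[^\s]+")


def linkify_simple(text):
    # \g<0> refers to the whole match.
    return SIMPLE_URL.sub(r'<a href="\g<0>">\g<0></a>', text)


# The trailing period is captured as part of the URL.
print(SIMPLE_URL.findall("Go to http://a.b/c."))  # ['http://a.b/c.']
```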

replace URLs in text with links to URLs
