ansaurus

Question

Regular expression to find URLs not inside a hyperlink

Answer 1

+2 A:

You can do it in two steps instead of trying to come up with a single regular expression:

1. Strip out (replace with nothing) each HTML anchor in its entirety: opening tag, content and closing tag.
2. Match the URL in what remains.

In Perl it could be:

my $curLine = $_; # Work on a copy so $_ is left untouched in case it is needed elsewhere.
$curLine =~ s/<a[^<]+<\/a>//g; # Remove each HTML anchor: "<a", "</a>" and everything in between.
if ( $curLine =~ /http:\/\// )
{
  print "Matched a URL outside an HTML anchor: $_\n";
}

Peter Mortensen 2009-08-22 10:06:11

If I remove (blend out) the HTML anchors, I won't be able to determine if the URL was originally inside a hyperlink, right? I'm only looking for URLs that are outside hyperlink tags.

Ben Amada 2009-08-22 10:09:53

I mean: remove *everything* from the opening anchor tag till the closing anchor tag.

Peter Mortensen 2009-08-22 10:13:57

Ah, great solution. I got it working. At first I thought you meant to just remove the beginning and ending tags, but removing the whole tag was the trick. Thank you!!

Ben Amada 2009-08-22 10:57:11

-1 You should remove the <a> elements through a proper parser, since HTML is not a regular language.

Svante 2009-08-22 11:40:06

@Svante: I don't think this is fair. Shouldn't it be directed towards the question instead? The question was about matching with regular expressions.

Peter Mortensen 2009-08-22 17:12:28

@Svante - boo to you... +1

jons911 2010-08-05 14:29:19

Answer 2

A:

You can do that using a single regular expression that matches both anchor tags and hyperlinks:

# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a[^>]+>)|(http://.*)'

Then loop over the results and only process matches where the second sub-pattern was found.
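As a rough sketch of that loop in Python: the pattern below is an assumption, not the answer's original regex. It widens the first alternative to cover the whole anchor element (opening tag through closing tag) and uses a deliberately naive URL pattern; the function name urls_outside_anchors is made up for illustration.

```python
import re

# The anchor alternative comes first, so a URL inside an anchor is consumed
# as part of the anchor match instead of being matched on its own.
# The URL alternative is a simple placeholder, not a complete URL grammar.
PATTERN = re.compile(
    r'(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"]+)',
    re.IGNORECASE | re.DOTALL,
)

def urls_outside_anchors(text):
    """Return only the matches where the second sub-pattern (the URL) was found."""
    return [m.group(2) for m in PATTERN.finditer(text) if m.group(2)]

sample = 'Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net'
print(urls_outside_anchors(sample))  # ['http://a.net']
```

Matches of the first group are simply skipped, which is what keeps linked URLs out of the result.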

Ferdinand Beyer 2009-08-22 10:38:19

This only works for those URLs that are inside the tag, not for those inside an <a> element. Also, it tries to parse a non-regular language with regular expressions.

Svante 2009-08-22 11:38:44

@Svante: First, you can easily extend the example to match everything between <a...> and </a>. Then it does the same as the accepted answer, only in a single pass. Second, no, "it" does not try to parse anything but a regular language based on occurrences of HTML-ish strings. There is no need to use a full-featured HTML parser if all you want is to find a simple pattern in the string.

Ferdinand Beyer 2009-08-22 23:09:30

Answer 3

A:

Peter has a great answer: first, remove anchors so that

Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net

is replaced by

Some text  and some more text with link http://a.net

Then run a regexp that finds URLs:

http://a.net
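The same two steps could look roughly like this in Python. Both patterns are simplified assumptions (the URL one in particular is nowhere near a full URL grammar), and the names ANCHOR, URL and find_unlinked_urls are invented for this sketch.

```python
import re

# Step 1 pattern: a whole anchor element, from "<a" up to the nearest "</a>".
ANCHOR = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
# Step 2 pattern: a deliberately simple URL matcher.
URL = re.compile(r'https?://[^\s<>"]+')

def find_unlinked_urls(text):
    """Remove every anchor element, then collect the URLs that remain."""
    stripped = ANCHOR.sub('', text)
    return URL.findall(stripped)

line = 'Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net'
print(ANCHOR.sub('', line))      # Some text  and some more text with link http://a.net
print(find_unlinked_urls(line))  # ['http://a.net']
```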

Paxinum 2009-08-22 10:55:20

Answer 4

A:

Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
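A minimal sketch of the parser-based idea in Python. Note the assumption: this uses the standard library's event-based html.parser.HTMLParser as a stand-in rather than a real DOM, and the class name TextOutsideAnchors, the function urls_outside_links and the naive URL pattern are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Deliberately simple URL pattern, applied only to text outside anchors.
URL = re.compile(r'https?://[^\s<>"]+')

class TextOutsideAnchors(HTMLParser):
    """Collects text nodes that are not nested inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <a> elements are currently open
        self.chunks = []  # text seen while no <a> element is open

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def urls_outside_links(html):
    parser = TextOutsideAnchors()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from separate nodes cannot fuse into one URL.
    return URL.findall(' '.join(parser.chunks))

page = '<p>See http://a.net and <a href="http://page.net">http://page.net <b>bold</b></a> done</p>'
print(urls_outside_links(page))  # ['http://a.net']
```

Unlike the regex-only versions, this also copes with markup nested inside the anchor, since the parser tracks the open and close tags itself.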

Svante 2009-08-22 10:59:49
