There are many regexes out there for matching a URL. However, I'm trying to match only URLs that do not appear anywhere within an <a> hyperlink tag (href attribute, inner content, etc.). So NONE of the URLs in these should match:

<a href="http://www.example.com/">something</a>
<a href="http://www.example.com/">http://www.example2.com</a>
<a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>

Any URL outside of <a></a> should be matched.

One approach I tried was to use a negative lookahead to check whether the first anchor tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think the idea was okay, but my negative lookahead didn't work (or, more accurately, I didn't write the regex correctly). Any tips are much appreciated.
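
For illustration, the lookahead idea described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the URL pattern is deliberately simplified, the name `find_urls_outside_anchors` is hypothetical, and the approach assumes anchors are not nested and are properly closed.

```python
import re

# Simplified, assumed URL pattern followed by a negative lookahead.
# The lookahead scans forward without crossing an opening "<a"; if it can
# reach "</a>" that way, the URL sits inside an anchor and is rejected.
OUTSIDE_ANCHOR_URL_RE = re.compile(
    r"https?://[^\s<>\"']+(?!(?:(?!<a\b).)*</a>)",
    re.IGNORECASE | re.DOTALL,
)

def find_urls_outside_anchors(html):
    """Return URLs whose next anchor tag is not a closing </a>."""
    return OUTSIDE_ANCHOR_URL_RE.findall(html)

sample = (
    'http://out.example <a href="http://www.example.com/">'
    "http://www.example2.com</a> http://tail.example"
)
print(find_urls_outside_anchors(sample))
```

Because every candidate URL triggers a forward scan, this can get slow on large inputs, and a stray </a> without an opening tag would hide the URLs before it.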

+2  A: 

You can do it in two steps instead of trying to come up with a single regular expression:

  1. Remove (replace with nothing) each HTML anchor element in its entirety: opening tag, content, and closing tag.

  2. Match URLs in what remains.

In Perl it could be:

my $curLine = $_; # Work on a copy so that $_ stays intact if it is needed elsewhere.
# Remove every HTML anchor element: "<a ...>", "</a>", and everything in between.
# The non-greedy .*? with /s also covers nested tags and line breaks; /i ignores case.
$curLine =~ s/<a\b[^>]*>.*?<\/a>//gis;
if ( $curLine =~ /http:\/\// )
{
  print "Matched a URL outside an HTML anchor: $_\n";
}
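
The same two steps could look something like this in Python. It is a minimal sketch, not a definitive implementation: both patterns are simplified assumptions, and the function name `urls_outside_anchors` is made up for the example.

```python
import re

# Assumed patterns, for illustration only; a real URL regex would be stricter.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def urls_outside_anchors(html):
    """Return URLs that do not appear inside an <a>...</a> element."""
    # Step 1: drop every anchor element, including its href and inner content.
    stripped = ANCHOR_RE.sub("", html)
    # Step 2: whatever URLs remain were outside all anchors.
    return URL_RE.findall(stripped)

sample = (
    'Some text <a href="http://page.net">TeXt</a> '
    "and some more text with link http://a.net"
)
print(urls_outside_anchors(sample))  # prints ['http://a.net']
```

Removing the anchors from a copy leaves the original string untouched, so the matched URLs can still be located in it afterwards if needed.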
Peter Mortensen
If I remove (blend out) the HTML anchors, I won't be able to determine if the URL was originally inside a hyperlink, right? I'm only looking for URLs that are outside hyperlink tags.
Ben Amada
I mean: remove *everything* from the opening anchor tag till the closing anchor tag.
Peter Mortensen
Ah, great solution. I got it working. At first I thought you meant to just remove the beginning and ending tags, but removing the whole tag was the trick. Thank you!!
Ben Amada
-1 You should remove the <a> elements through a proper parser, since HTML is not a regular language.
Svante
@Svante: I don't think this is fair. Shouldn't it be directed towards the question instead? The question was about matching with regular expressions.
Peter Mortensen
@Svante - boo to you... +1
jons911
A: 

You can do that using a single regular expression that matches both anchor tags and hyperlinks:

# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a[^>]+>)|(http://.*)'

Then loop over the results and only process matches where the second sub-pattern was found.
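
As a rough Python sketch of that loop: the first alternative here is extended to consume the whole anchor element rather than just the opening tag, which is an assumption beyond the dummy pattern above, and the URL pattern and the name `find_unlinked_urls` are likewise only illustrative.

```python
import re

# Group 1 consumes a whole anchor element; group 2 captures a bare URL.
# Both alternatives are simplified assumptions, not a complete URL grammar.
COMBINED_RE = re.compile(
    r"(<a\b[^>]*>.*?</a>)|(https?://[^\s<>\"']+)",
    re.IGNORECASE | re.DOTALL,
)

def find_unlinked_urls(html):
    """Return URLs matched by the second sub-pattern only."""
    results = []
    for match in COMBINED_RE.finditer(html):
        url = match.group(2)
        if url is not None:  # group 2 is None when an anchor element matched
            results.append(url)
    return results

sample = (
    'see http://a.net and <a href="http://b.net">http://c.net</a> '
    "then http://d.net"
)
print(find_unlinked_urls(sample))  # prints ['http://a.net', 'http://d.net']
```

Since the scan runs left to right, an anchor element is swallowed by the first alternative before the URL alternative ever gets a chance to match inside it.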

Ferdinand Beyer
This only works for those URLs that are inside the tag, not for those inside an <a> element. Also, it tries to parse a non-regular language with regular expressions.
Svante
@Svante: First, you can easily extend the example to match everything between <a...> and </a>. Then it does the same as the accepted answer, only in a single pass. Second, no, "it" does not try to parse anything; it only matches a regular language based on occurrences of HTML-ish strings. There is no need for a full-featured HTML parser if all you want is to find a simple pattern in the string.
Ferdinand Beyer
A: 

Peter has a great answer: first, remove anchors so that

Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net

is replaced by

Some text  and some more text with link http://a.net

THEN run a regexp that finds URLs:

http://a.net
Paxinum
A: 

Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
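
One way this might look, sketched with Python's standard-library `html.parser` (an event-based parser rather than a full DOM, which is a substitution made here to keep the example dependency-free). The class and function names and the URL pattern are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

URL_RE = re.compile(r"https?://[^\s<>\"']+")  # simplified, assumed URL pattern

class TextOutsideAnchors(HTMLParser):
    """Collects text nodes that are not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while the parser is inside an <a> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Attribute values such as href never reach handle_data.
        if self.anchor_depth == 0:
            self.chunks.append(data)

def urls_outside_anchor_elements(html):
    parser = TextOutsideAnchors()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return [url for chunk in parser.chunks for url in URL_RE.findall(chunk)]

sample = (
    'Some text <a href="http://page.net">TeXt</a> '
    "and some more text with link http://a.net"
)
print(urls_outside_anchor_elements(sample))  # prints ['http://a.net']
```

Note that this only looks at text content, so URLs in attributes of other elements (for example an image's src) are ignored as well.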

Svante