ansaurus

Question

Regular expression for parsing links from a webpage?

Answer 1

+2 A:

Look at the URI specification (RFC 3986, linked below); it should help a lot. As for performance, a regex can comfortably extract all the HTTP links from a modest web page. By modest I definitely do not mean a single page that holds an entire HTML manual, like the ELisp manual. Performance is a touchy topic, so my advice is to measure first and then decide whether to extract all the links with one single regex or with several simpler ones.

http://gbiv.com/protocols/uri/rfc/rfc3986.html
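A minimal sketch of the "measure first" advice, assuming Python and its standard `re` and `timeit` modules. The sample page, both patterns, and the function names are made up for illustration; real numbers depend on your own HTML and regex engine.

```python
import re
import timeit

# Hypothetical sample page; a real measurement should use your own HTML.
SAMPLE_HTML = '<p><a href="http://example.com/a">a</a> <img src="/img/b.png"></p>\n' * 2000

# One combined pattern that handles both attributes.
COMBINED = re.compile(r'(?:href|src)="([^"]*)"')
# Two simpler patterns, run one after the other.
HREF_ONLY = re.compile(r'href="([^"]*)"')
SRC_ONLY = re.compile(r'src="([^"]*)"')


def extract_combined(html):
    return COMBINED.findall(html)


def extract_separate(html):
    return HREF_ONLY.findall(html) + SRC_ONLY.findall(html)


if __name__ == "__main__":
    # Time both approaches on the same input before choosing one.
    for fn in (extract_combined, extract_separate):
        seconds = timeit.timeit(lambda: fn(SAMPLE_HTML), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```

Both functions find the same set of values (in a different order), so the timing is the only thing that differs between them.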

Vagmi Mudumbai 2008-08-08 17:29:50

Answer 2

A:

URLs? As in images/scripts/css/etc.?

%href="([^"]*)"%

Kevin 2008-08-08 17:29:57

Answer 3

A:

This will capture the URLs from all a tags as long as the author of the HTML used quotes:

<a[^>]+href="([^"]+)"[^>]*>

I made an example here.
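For readers who want to try the pattern locally, here is a small sketch assuming Python's `re` module. The input string is hypothetical; the pattern is the one above, unchanged.

```python
import re

# The pattern from the answer above, unchanged.
A_HREF = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>')

# Hypothetical input for illustration.
html = (
    '<a href="http://example.com/">home</a> '
    '<a class="nav" href="/about">about</a> '
    "<a href='/single'>single-quoted</a> "
    '<a href=/bare>unquoted</a>'
)

links = A_HREF.findall(html)
print(links)  # ['http://example.com/', '/about']
```

The single-quoted and unquoted links are skipped, which is the limitation the answer mentions. The pattern is also case-sensitive, so `<A HREF="...">` would need an ignore-case flag.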

Neall 2008-08-08 17:31:35

Answer 4

+1 A:

I don't have time to try and think of a regex that probably won't work, but I wanted to comment that you should most definitely break up your regex, at least if it gets to this level of ugliness:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
....*SNIP*....
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

(this supposedly matches email addresses)

Edit: I can't even fit it on one post it's so nasty....
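One way to break a regex up, sketched in Python: build it from small named pieces that can each be read and tested alone. The piece names and the `find_links` helper are hypothetical, and the pattern is a simplified link matcher, not the email regex above.

```python
import re

# Small named pieces; each one can be tested on its own.
TAG_OPEN = r'<a\b[^>]*?'                      # start of an anchor tag plus any earlier attributes
HREF_ATTR = r'\bhref\s*=\s*'                  # the attribute name and the equals sign
QUOTED_VALUE = r'(?:"([^"]*)"|\'([^\']*)\')'  # value in double or single quotes

LINK_RE = re.compile(TAG_OPEN + HREF_ATTR + QUOTED_VALUE, re.IGNORECASE)


def find_links(html):
    """Return quoted href values from <a> tags (a sketch, not a full HTML parser)."""
    links = []
    for match in LINK_RE.finditer(html):
        double_quoted, single_quoted = match.group(1), match.group(2)
        links.append(double_quoted if double_quoted is not None else single_quoted)
    return links


print(find_links('<a href="/x">x</a> <A class=\'c\' HREF=\'/y\'>y</A>'))  # ['/x', '/y']
```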

Mike Stone 2008-08-08 17:32:01

Answer 5

+7 A:

((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)

I took this from regexlib.com

[editor's note: the {1} has no real function in this regex; see this post]
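A quick way to see what this pattern does on plain text, assuming Python's `re` module. The sample sentence is made up.

```python
import re

# The pattern from the answer above, unchanged (the {1} is harmless but redundant).
URL_RE = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)')

# Hypothetical text for illustration.
text = ('Mail mailto:bob@example.com or visit '
        'https://example.com/a?b=1, then ftp://files.example.com/x.')

# The pattern has nested groups, so take group(1) (the whole URL) from each match.
matches = [m.group(1) for m in URL_RE.finditer(text)]
print(matches)
# ['mailto:bob@example.com', 'https://example.com/a?b=1,', 'ftp://files.example.com/x.']
```

Because `\S+` runs to the next whitespace, trailing punctuation ends up inside the match. In HTML source it will also run past the closing quote of an attribute and into the rest of the tag, so this pattern suits plain text better than markup.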

csmba 2008-08-08 17:32:12

Answer 6

+1 A:

All http: and mailto: links:

(["'])(mailto:|http:).*?\1

All links, including relative ones, that are called by href or src.

#Matches things in single or double quotes, but not the quotes themselves
(?<=(["']))((?<=href=['"])|(?<=src=['"])).*?(?=\1)

#Matches things in either double or single quotes, including the quotes.
(["'])((?<=href=")|(?<=src=")).*?\1

The second one will only get you links that use double quotes, however.
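As a rough illustration of the first pattern (the http:/mailto: one), here is a sketch assuming Python's `re` module and a made-up input string. The lookbehind patterns are not exercised here.

```python
import re

# First pattern from the answer above: a quote, then mailto: or http:, up to the same quote.
QUOTED_LINK = re.compile(r'''(["'])(mailto:|http:).*?\1''')

# Hypothetical input for illustration.
html = ('<a href="http://example.com/">x</a> '
        "<a href='mailto:bob@example.com'>m</a> "
        '<img src="/img/logo.png">')

# group(0) includes the surrounding quotes, so slice them off.
found = [m.group(0)[1:-1] for m in QUOTED_LINK.finditer(html)]
print(found)  # ['http://example.com/', 'mailto:bob@example.com']
```

The relative `src` is not matched, and neither would an `https:` link be, since the pattern requires the colon right after `http`.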

Grant 2008-08-08 17:52:29

Answer 7

+1 A:

With Html Agility Pack, you can use:

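HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
// (Current Html Agility Pack releases expose the root as DocumentNode.)
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    Response.Write(link.Attributes["href"].Value);
}
doc.Save("file.htm");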
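The same parser-based idea can be sketched outside .NET. The example below is not Html Agility Pack; it swaps in Python's standard-library `html.parser`, and the `LinkCollector` class and `extract_links` function are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values arrive without quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(extract_links('<a href="/a">a</a> <A HREF=\'/b\'>b</A> <a name="x">n</a>'))  # ['/a', '/b']
```

A parser handles quoting and letter case for you, which is the main reason to prefer it over a regex for real pages.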

Jon Galloway 2008-08-17 01:08:30

Syntax error on line 5

Charlie Somerville 2009-11-22 09:42:38

Answer 8

+6 A:

from the RegexBuddy library:

URL: Find in full text

The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL.

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]
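A short sketch of this pattern in use, assuming Python's `re` module and a made-up sentence. The character classes list only `A-Z`, so the pattern needs a case-insensitive flag to match lowercase URLs.

```python
import re

PATTERN = r'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'

# The pattern is written with A-Z only, so compile it case-insensitively.
URL_RE = re.compile(PATTERN, re.IGNORECASE)

# Hypothetical text for illustration.
text = 'See http://example.com/docs, or ftp://files.example.com/pub/readme.txt.'

# The scheme is a capturing group, so use group(0) to get the whole URL.
urls = [m.group(0) for m in URL_RE.finditer(text)]
print(urls)  # ['http://example.com/docs', 'ftp://files.example.com/pub/readme.txt']
```

The comma and the final full stop are left out of the matches, as the description says.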

Jeff Atwood 2008-08-17 03:00:34
