I am trying to find all of the links in the source code of a web page. Could anyone tell me the regular expression I would need to find them?


Duplicate of (among others): Regular expression for parsing links from a webpage?

Google finds more: html links regex site:stackoverflow.com

+1  A: 

Here's a good resource for it: How to match links in HTML with regex.

PEZ
Thanks for fixing the link, Tomalek.
PEZ
Every damn day a new user asks how to parse HTML with regexen, and naturally every clever-trousers jumps in with their own completely broken attempts to square the circle. Can't SO have an automatic "DON'T DO THAT, YOU FOOL" popup any time you enter a question with 'regex' and 'html' tags?
bobince
A: 

Probably not bulletproof, but I think this one will do the trick (.NET regex):

<a href[^>]*>(.*?)</a>

Some code to match all links from the HTML file:

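using System;
using System.Collections.Specialized;
using System.Text.RegularExpressions;

// subjectString is assumed to already hold the HTML source
StringCollection resultList = new StringCollection();
try {
    Regex regexObj = new Regex("<a href[^>]*>(.*?)</a>", RegexOptions.IgnoreCase);
    Match matchResult = regexObj.Match(subjectString);
    while (matchResult.Success) {
        // Value is the whole <a ...>...</a> element;
        // Groups[1].Value would be just the text between the tags
        resultList.Add(matchResult.Value);
        matchResult = matchResult.NextMatch();
    }
} catch (ArgumentException) {
    // Syntax error in the regular expression
}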
Patrik
A: 

I'm not certain how these would translate to C# (I haven't done any development in C# myself yet), but here's how I might do it in JavaScript or ColdFusion. It might give you an idea about how you want to do it in C#.

In JavaScript I think this would work:

var rex = /href="([^"]+)"/g, a = [], m;
// with the g flag, each exec() call resumes after the previous match
while ((m = rex.exec(source)) !== null) a.push(m[1]);

after which a would be an array containing the links, though I haven't tested it. The idea here is that the g flag makes each call to exec() pick up where the last match ended, and the capture group in m[1] holds just the URL between the quotes, so the loop collects one URL per href attribute.
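
The href-capturing idea above can be sketched as a self-contained function. This is only a rough illustration: the name extractLinks and the sample HTML are made up, it accepts single or double quotes, and like any regex approach it will miss unquoted attribute values and will happily match href text inside comments or scripts.

```javascript
// Hypothetical helper: collects the value of every quoted href attribute.
function extractLinks(source) {
  // "href", optional whitespace around "=", then a quoted value.
  // Group 1 is the quote character, group 2 is the URL.
  var rex = /\bhref\s*=\s*(["'])(.*?)\1/gi;
  var links = [];
  var m;
  // With the g flag, exec() walks through the source one match at a time.
  while ((m = rex.exec(source)) !== null) {
    links.push(m[2]);
  }
  return links;
}

// Made-up sample input
var html = '<a href="http://example.com/">one</a>\n' +
           "<A HREF='/two.html'>two</A>";
console.log(extractLinks(html)); // [ 'http://example.com/', '/two.html' ]
```

The same pattern should carry over to C# more or less as-is, reading Groups[2].Value from each match instead of m[2].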

By comparison in ColdFusion you would have to do something slightly different:

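a = REMatch('href="[^"]+"',source); 
// ColdFusion arrays are 1-based, so loop through ArrayLen(a) inclusive
for (i = 1; i <= ArrayLen(a); i++) {
  // skip the 6 characters of href=" and drop the closing quote
  a[i] = mid(a[i],7,len(a[i])-7); 
}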

Again, I haven't tested it, but REMatch returns an array of every match of the expression, and the for loop then strips the leading href=" and the closing quote from each one, leaving just the URL.

Isaac Dealey
The question is tagged with C#.
Patrik
D'oh! Thanks Patrik. :)
Isaac Dealey