Is there a built-in DLL that will give me a list of links from a string? I want to pass in a string of valid HTML and have it extract all the links. I seem to remember something like this being built into either .NET or an unmanaged library.

I found a couple of open source projects that looked promising, but I thought there was a built-in module. If not, I may have to use one of those. I just didn't want an external dependency at this point if it wasn't necessary.

A: 

Nothing built in that I know of, other than regular expressions. I'm sure a bit of googling will turn up a regular expression that finds all the links in a page's HTML.

Check out the ones at RegexLib.com.

Darren Kopp
A: 

Google gives me this module: http://www.majestic12.co.uk/projects/html_parser.php

It seems to be an HTML parser for .NET.

Armin Ronacher
A: 

A simple regular expression -

@"<a.*?>"

passed to Regex.Matches should do what you need. That regex may need a bit of tweaking, but it's pretty close, I think.
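As a rough sketch of what that looks like in practice (the sample HTML string and the class name are made up for illustration): the version below adds a `\b` after the `a`, since the bare `<a.*?>` pattern also matches tags like `<abbr>` or `<area>`. Treat it as a starting point, not a robust parser.

```csharp
using System;
using System.Text.RegularExpressions;

class SimpleAnchorMatch
{
    static void Main()
    {
        // Hypothetical input, only for demonstration.
        string html = "<p><a href=\"http://example.com\">one</a> and <abbr>x</abbr> <a href='/two'>two</a></p>";

        // \b requires a word boundary after "a", so <abbr> is skipped.
        // IgnoreCase lets the pattern match <A HREF=...> as well.
        foreach (Match m in Regex.Matches(html, @"<a\b.*?>", RegexOptions.IgnoreCase))
        {
            // m.Value is the whole opening tag, not just the URL.
            Console.WriteLine(m.Value);
        }
    }
}
```

This only gives you the opening anchor tags; pulling the URL out of each one still needs a second step or a more specific pattern.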

Harper Shelby
+2  A: 

I don't think there is a built-in library, but the Html Agility Pack is popular for what you want to do.

The way to do this with the raw .NET Framework and no external dependencies would be to use a regular expression to find all the 'a' tags in the string. You would probably need to handle a lot of edge cases, e.g. href = "http://url" vs. href=http://url.
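To illustrate the kind of edge-case handling involved, here is a minimal sketch that accepts double-quoted, single-quoted, and unquoted href values, with optional whitespace around the `=`. The `HrefExtractor` class, its `Extract` method, and the sample input are hypothetical names chosen for this example, and the pattern still gives up on things like a `>` inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefExtractor
{
    // Three alternatives for the value: "double quoted", 'single quoted',
    // or unquoted (runs until whitespace or '>'). .NET allows the same
    // group name to appear in several alternatives.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical input covering the quoted and unquoted forms.
        string html = "<a href = \"http://url\">a</a> <a href=http://url>b</a> <a class=x href='/rel'>c</a>";
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Every additional case (comments, script blocks, malformed markup) makes the pattern harder to maintain, which is the usual argument for a real parser once the input is not under your control.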

BrianLy
+1  A: 

SubSonic.Sugar.Web.ScrapeLinks seems to do part of what you want; however, it grabs the HTML from a URL rather than from a string. You can check out their implementation here.

Forgotten Semicolon
That is actually what I want to do, so this will work great for me. Not quite built in, but at least SubSonic has probably had some level of testing/use.
Shaun Bowe
+3  A: 

I'm not aware of anything built in, and it's a little ambiguous from your question what exactly you're looking for. Do you want the entire anchor tag, or just the URL from the href attribute?

If you have well-formed XHTML, you might be able to get away with using an XmlReader and an XPath query to find all the anchor tags (<a>) and then read the href attribute for the address. Since that's unlikely, you're probably better off using RegEx to pull out what you want.
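For the well-formed case, a minimal sketch might look like the following. It uses XmlDocument.SelectNodes rather than a raw XmlReader, simply because that is the shortest way to run an XPath query; the sample markup is made up and assumes no DOCTYPE and no default XHTML namespace (with `xmlns="http://www.w3.org/1999/xhtml"` declared, the query would need an XmlNamespaceManager and a prefix).

```csharp
using System;
using System.Xml;

class XhtmlLinks
{
    static void Main()
    {
        // Hypothetical, well-formed input with no namespace declaration.
        string xhtml = "<html><body><p><a href=\"http://example.com\">one</a></p><a href=\"/two\">two</a></body></html>";

        var doc = new XmlDocument();
        // LoadXml throws XmlException if the markup is not well-formed XML.
        doc.LoadXml(xhtml);

        // Select the href attribute node of every <a> element that has one.
        foreach (XmlNode href in doc.SelectNodes("//a/@href"))
        {
            Console.WriteLine(href.Value);
        }
    }
}
```

Typical real-world HTML (unclosed tags, unquoted attributes, named entities such as `&nbsp;`) will make LoadXml throw, which is why this route only works when you control the markup.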

Using RegEx, you could do something like:

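// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
List<Uri> findUris(string message)
{
    // The URL is captured in the named group "href", the link text in "fileName".
    string anchorPattern = "<a[\\s]+[^>]*?href[\\s]?=[\\s\\\"\']+(?<href>.*?)[\\\"\\']+.*?>(?<fileName>[^<]+|.*?)?<\\/a>";
    MatchCollection matches = Regex.Matches(message, anchorPattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
    List<Uri> uris = new List<Uri>();

    foreach (Match m in matches)
    {
        // Read the group the pattern actually defines ("href", not "url").
        string url = m.Groups["href"].Value;
        Uri testUri;
        // Keep only values that parse as a relative or absolute URI.
        if (Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out testUri))
        {
            uris.Add(testUri);
        }
    }
    // An empty list (rather than null) means no links were found.
    return uris;
}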

Note that I'd want to check the href to make sure that the address actually makes sense as a valid Uri. You can eliminate that if you aren't actually going to be pursuing the link anywhere.

Jacob Proffitt
+1 for providing an example. However, I'd like to point out that the RegEx in your sample, `"<a.*href=[\"'](?<url>[^\"]+[.\\s]*)[\"'].*>(?<name>[^<]+[.\\s]*)</a>"`, fails in the following case: `<DIR> <A HREF="..">..</A><BR>03/02/10 04:42PM [GMT] <DIR> <A HREF="/Incoming/tmp/">tmp</A>` (it captures only one hyperlink; I got this example from an FTP directory listing). Changing it to the following RegEx worked in every case I tested: `string anchorPattern = @"<a[\s]+[^>]*?href[\s]?=[\s\""\']+(?<href>.*?)[\""\']+.*?>(?<fileName>[^<]+|.*?)?<\/a>";`
Carlos Loth