ansaurus

Question

Answer 1

A:

This code might be of some help :) Taken from http://www.vogella.de/articles/JavaRegularExpressions/article.html.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGetter {
    private Pattern htmltag;
    private Pattern link;
    private final String root;

    public LinkGetter(String root) {
        this.root = root;
        htmltag = Pattern.compile("<a\\b[^>]*href=\"[^>]*>(.*?)</a>");
        link = Pattern.compile("href=\"[^>]*\">");
    }

    public List<String> getLinks(String url) {
        List<String> links = new ArrayList<String>();
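        // try-with-resources closes the reader even if reading fails
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String s;
            StringBuilder builder = new StringBuilder();
            while ((s = bufferedReader.readLine()) != null) {
                // readLine() strips the line break; append a space so text
                // on adjacent lines is not glued together
                builder.append(s).append(' ');
            }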

            Matcher tagmatch = htmltag.matcher(builder.toString());
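            while (tagmatch.find()) {
                Matcher matcher = link.matcher(tagmatch.group());
                // Skip tags the link pattern cannot parse; calling group()
                // after a failed find() throws IllegalStateException
                if (!matcher.find()) {
                    continue;
                }
                // Named href so it does not shadow the link field
                String href = matcher.group().replaceFirst("href=\"", "");
                // Keep only the text up to the closing quote of the href
                // value, dropping any attributes that follow it
                href = href.substring(0, href.indexOf('"'));
                if (valid(href)) {
                    links.add(makeAbsolute(url, href));
                }
            }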
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return links;
    }

    private boolean valid(String s) {
        if (s.matches("javascript:.*|mailto:.*")) {
            return false;
        }
        return true;
    }

    // Naive join of the page URL and the link; it does not resolve "../"
    // or treat "/path" as relative to the site root.
    private String makeAbsolute(String url, String link) {
        // Already absolute (http or https): return unchanged
        if (link.matches("https?://.*")) {
            return link;
        }
        boolean urlEndsWithSlash = url.endsWith("/");
        boolean linkStartsWithSlash = link.startsWith("/");
        if (urlEndsWithSlash && linkStartsWithSlash) {
            // Avoid a double slash at the join
            return url + link.substring(1);
        }
        if (!urlEndsWithSlash && !linkStartsWithSlash) {
            // Neither side supplies a slash, so add one
            return url + "/" + link;
        }
        // Exactly one side supplies the slash
        return url + link;
    }
}
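
The makeAbsolute method above joins strings by hand. As an alternative, here is a minimal sketch that lets java.net.URI do the resolution, which also handles root-relative and document-relative links. The class name ResolveLinkSketch and the helper resolve are hypothetical and not part of the original LinkGetter; the example.com URLs are placeholders.

```java
import java.net.URI;

class ResolveLinkSketch {
    // Hypothetical helper (not part of LinkGetter): resolves a link found on
    // a page against that page's URL using java.net.URI.
    // URI.create throws IllegalArgumentException if either string is not a
    // syntactically valid URI (for example, if it contains spaces).
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Document-relative link: replaces the last path segment
        System.out.println(resolve("http://example.com/docs/index.html", "page.html"));
        // Root-relative link: replaces the whole path
        System.out.println(resolve("http://example.com/docs/index.html", "/about"));
        // Absolute link: returned unchanged
        System.out.println(resolve("http://example.com/docs/", "https://other.example/x"));
    }
}
```

Note that this resolves links the way a browser would, so "/about" is taken relative to the site root rather than appended to the page URL.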

Chris Dennett 2010-03-20 01:44:15

Answer 2

+1 A:

Something like this should help:

    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    private List<string> GetUrlStrings(string text)
    {
        List<string> listURL = new List<string>();
        // Matches href="..." (quoted) or href=value (unquoted, up to whitespace, a quote or '>')
        Regex regex = new Regex(@"href\s*=\s*(?:""(?<url>[^""]*)""|(?<url>[^\s"">]+))");
        MatchCollection matchColl = regex.Matches(text);

        foreach (Match match in matchColl)
        {
            // Read the named group directly instead of filtering every group
            listURL.Add(match.Groups["url"].Value);
        }
        return listURL;
    }

BrianLy 2010-03-20 01:50:29

Answer 3

+1 A:

Here are two approaches, one using LINQ to XML and one using regex. Although some people frown upon parsing HTML with regex, this particular case has no nested elements, so regex is a reasonable solution. LINQ to XML only works if your HTML is well-formed; otherwise, take a look at the HTML Agility Pack.

EDIT: For your sample, Elements() works with LINQ to XML because the tags are direct children of the root element. If your HTML has nested tags, use Descendants() instead to reach all desired tags.

string input = @"<p>This is some html text.  my favourite website is <a href=""http://www.google.com"">google</a> and my favourite help site is <a href=""http://www.stackoverflow.com"">stackoverflow</a> and i check my email at <a href=""http://www.gmail.com"">gmail</a>.  the url to my site is http://www.mysite.com.   <img src=""http://www.someserver.com/someimage.jpg"" alt=""""/></p>";
var xml = XElement.Parse(input);
var result = xml.Elements()
                .Where(e => e.Name == "img" || e.Name == "a")
                .Select(e => e.Name == "img" ?
                            e.Attribute("src").Value : e.Attribute("href").Value);
foreach (string item in result)
{
    Console.WriteLine(item);
}

string pattern = @"<(?:a|img).+?(?:href|src)=""(?<Url>.+?)"".*?>";
foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Groups["Url"].Value);
}
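
For readers following the Java answer earlier in the thread, the regex approach can be sketched with java.util.regex as well. This is only an illustration under assumptions: the class UrlExtractorSketch and the method extractUrls are hypothetical names, and the pattern is a slightly stricter variant of the C# one (it requires a double-quoted value and does not cross the closing '>' of the tag). As with the C# pattern, a bare URL in the text, such as http://www.mysite.com, is not extracted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractorSketch {
    // An <a> or <img> tag, then an href or src attribute inside the same
    // tag, capturing the double-quoted value in a group named "url".
    private static final Pattern TAG_URL = Pattern.compile(
            "<(?:a|img)\\b[^>]*?\\b(?:href|src)\\s*=\\s*\"(?<url>[^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the attribute values in document order.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = TAG_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group("url"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String input = "<p>my favourite website is "
                + "<a href=\"http://www.google.com\">google</a> "
                + "<img src=\"http://www.someserver.com/someimage.jpg\" alt=\"\"/></p>";
        for (String url : extractUrls(input)) {
            System.out.println(url);
        }
    }
}
```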

EDIT #2: In response to your update about RegexBuddy, I wanted to point out the tool I use. Expresso is a good free tool (it asks for email registration but costs nothing). Its author also wrote The 30 Minute Regex Tutorial, which is included in Expresso's help file, so you can follow along inside the tool.

Ahmad Mageed 2010-03-20 01:50:42


extract all urls from a string
