I have a string of text that contains HTML, and I need to extract each URL (most likely in img or a tags) to create a generic list of strings. I only want the URLs from inside HTML tags, not those in the plain text. Is there an easy way to do this, or will I have to resort to regular expressions?

If I have to resort to regular expressions, would you mind helping me out with that as well? :)

UPDATE: To answer Seph, the input will be standard html.

<p>This is some html text.  my favourite website is <a href="http://www.google.com">google</a> and my favourite help site is <a href="http://www.stackoverflow.com">stackoverflow</a> and i check my email at <a href="http://www.gmail.com">gmail</a>.  the url to my site is http://www.mysite.com.   <img src="http://www.someserver.com/someimage.jpg" alt=""/></p>

The end result should be all URLs inside any HTML tag, ignoring those that are "plain text".

UPPERDATE: Although he deleted his answer, I want to thank Jerry Bullard for bringing Regex Buddy (http://www.regexbuddy) to my attention. I wanted to upvote your answer but it's gone. Bring it back and you get a vote!

A: 

This code might be of some help :) Taken from http://www.vogella.de/articles/JavaRegularExpressions/article.html.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGetter {
    private Pattern htmltag;
    private Pattern link;
    private final String root;

    public LinkGetter(String root) {
        this.root = root;
        // Matches a whole anchor element that carries a double-quoted href.
        htmltag = Pattern.compile("<a\\b[^>]*href=\"[^>]*>(.*?)</a>");
        // Captures only the href value, so later attributes are not swallowed.
        link = Pattern.compile("href=\"([^\"]*)\"");
    }

    public List<String> getLinks(String url) {
        List<String> links = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String s;
            StringBuilder builder = new StringBuilder();
            while ((s = bufferedReader.readLine()) != null) {
                builder.append(s);
            }

            Matcher tagmatch = htmltag.matcher(builder.toString());
            while (tagmatch.find()) {
                Matcher matcher = link.matcher(tagmatch.group());
                if (!matcher.find()) {
                    continue; // no properly quoted href in this tag
                }
                String href = matcher.group(1);
                if (valid(href)) {
                    links.add(makeAbsolute(url, href));
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return links;
    }

    private boolean valid(String s) {
        // Skip pseudo-links that do not point at a fetchable resource.
        return !s.matches("javascript:.*|mailto:.*");
    }

    private String makeAbsolute(String url, String link) {
        // Already absolute (http or https): return unchanged.
        if (link.matches("https?://.*")) {
            return link;
        }
        // Naive concatenation that only normalises the joining slash;
        // it does not resolve "../" or root-relative paths properly.
        boolean urlEndsWithSlash = url.endsWith("/");
        boolean linkStartsWithSlash = link.startsWith("/");
        if (urlEndsWithSlash && linkStartsWithSlash) {
            return url + link.substring(1); // avoid a double slash
        }
        if (!urlEndsWithSlash && !linkStartsWithSlash) {
            return url + "/" + link;
        }
        return url + link;
    }
}
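
Since the question is about URLs in a string (in both a and img tags) rather than a page fetched over the network, the same regex idea can be sketched against an in-memory string. This is only a minimal sketch: the class name TagUrlExtractor and its extract method are made-up names for illustration, and it assumes attribute values are double-quoted. It will not cope with every kind of real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
public class TagUrlExtractor {
    // Matches an <a> or <img> tag and captures a double-quoted href/src value.
    private static final Pattern TAG_URL = Pattern.compile(
            "<(?:a|img)\\b[^>]*?\\s(?:href|src)\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = TAG_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the attribute value only
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p>my favourite website is "
                + "<a href=\"http://www.google.com\">google</a>. "
                + "the url to my site is http://www.mysite.com. "
                + "<img src=\"http://www.someserver.com/someimage.jpg\" alt=\"\"/></p>";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

Because the pattern only matches inside an opening a or img tag, the plain-text http://www.mysite.com in the sample is skipped.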
Chris Dennett
+1  A: 

Something like this should help:

    private List<string> GetUrlStrings(string text)
    {
        List<string> listURL = new List<string>();
        // Matches href="..." (double-quoted) or an unquoted href value.
        Regex regex = new Regex("href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|(?<url>[^\\s\">]+))");
        MatchCollection matchColl = regex.Matches(text);

        foreach (Match match in matchColl)
        {
            // Read the named group directly instead of looping over every group.
            listURL.Add(match.Groups["url"].Value);
        }
        return listURL;
    }
BrianLy
+1  A: 

Here are two approaches, one using LINQ to XML and one using regex. Although some people frown upon parsing HTML with regex, this particular case doesn't have nested elements, so it is a reasonable solution. LINQ to XML only works if your HTML is well-formed. Otherwise, take a look at the HTML Agility Pack.

EDIT: for your sample Elements() works with LINQ to XML. However, if you have many nested HTML tags then you may want to use Descendants() to reach all desired tags.

string input = @"<p>This is some html text.  my favourite website is <a href=""http://www.google.com"">google</a> and my favourite help site is <a href=""http://www.stackoverflow.com"">stackoverflow</a> and i check my email at <a href=""http://www.gmail.com"">gmail</a>.  the url to my site is http://www.mysite.com.   <img src=""http://www.someserver.com/someimage.jpg"" alt=""""/></p>";
var xml = XElement.Parse(input);
var result = xml.Elements()
                .Where(e => e.Name == "img" || e.Name == "a")
                .Select(e => e.Name == "img" ?
                            e.Attribute("src").Value : e.Attribute("href").Value);
foreach (string item in result)
{
    Console.WriteLine(item);
}

string pattern = @"<(?:a|img).+?(?:href|src)=""(?<Url>.+?)"".*?>";
foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Groups["Url"].Value);
}

EDIT #2: in response to your update about RegexBuddy, I wanted to point out the tool I use. Expresso is a good free tool (email registration only, but it's free). The author also wrote The 30 Minute Regex Tutorial which you can use to follow along and is included in the help file of Expresso.

Ahmad Mageed