Using regex, what is the simplest way to fetch a website's HTML and find the value inside this tag (or any attribute's value, for that matter)?

<html>
  <head>
  [snip]
  <meta name="generator" value="thevalue i'm looking for" />
  [snip]
+4  A: 

Depends on how sophisticated an HTTP request you need to build (authentication, etc.). Here's one simple way I've seen used in the past.

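import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Read the whole page into memory, one line at a time.
StringBuilder html = new StringBuilder();
URL url = new URL("http://www.google.com/");
BufferedReader input = null;
try {
    input = new BufferedReader(
        new InputStreamReader(url.openStream()));

    String htmlLine;
    while ((htmlLine = input.readLine()) != null) {
        html.append(htmlLine).append('\n');
    }
}
finally {
    // input is still null if openStream() threw.
    if (input != null) {
        input.close();
    }
}

// Group 1 captures the text between the quotes of the value attribute.
Pattern exp = Pattern.compile(
    "<meta name=\"generator\" value=\"([^\"]*)\" />");
Matcher matcher = exp.matcher(html.toString());
if (matcher.find())
{
    System.out.println("Generator: " + matcher.group(1));
}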

Probably plenty of typos here to be found when compiled. (hope this wasn't homework)

Mike Haboustak
What if the meta tag is commented out? This will still read it. Is that right? What if there are two spaces between meta and name? Or a tab? Or a newline? What if the word generator is not surrounded by quotes? Because of these issues and plenty more, I suggest not writing this yourself but finding a library that will do it for you.
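To make these objections concrete, here is a rough sketch (not part of the original answer) that compares Mike's pattern with a looser one. The class name LenientMetaRegex, the LENIENT pattern and the find helper are all made up for illustration. The looser pattern only addresses the whitespace and missing-quote cases; it is still not a parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares the strict pattern with a more tolerant one.
final class LenientMetaRegex {
    // The pattern from the answer: exact spacing, double quotes, " />" ending.
    static final Pattern STRICT = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");

    // Assumed looser variant: any whitespace (spaces, tabs, newlines) between
    // tokens, optional quotes around generator, optional "/" before ">".
    static final Pattern LENIENT = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']?generator[\"']?\\s+value\\s*=\\s*\"([^\"]*)\"\\s*/?>");

    // Returns the captured value, or null when the pattern does not match.
    static String find(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

For input such as "<meta  name=generator\n\tvalue=\"x\">", STRICT finds nothing while LENIENT finds x. Both patterns still match a tag that sits inside an HTML comment, which is the strongest argument for using a library instead.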
Steve McLeod
A: 

You may want to check the documentation for Apache's org.apache.commons.httpclient package and the related packages here. Sending an HTTP request from a Java application is pretty easy to do. Poking through the documentation should get you headed in the right direction.

Justin Bennett
A: 

I haven't tried this, but wouldn't the basic framework be:

  1. Open a java.net.HttpURLConnection
  2. Get an input stream using getInputStream
  3. Use the regular expression in Mike's answer to parse out the bit you want
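The three steps above could be sketched roughly like this. It is untested against a live site; the class and method names (GeneratorFetcher, fetchGenerator, findGenerator) are invented for the example, and UTF-8 is assumed as the page encoding, which a real page may not use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper following the three steps in the list.
final class GeneratorFetcher {
    // The regular expression from Mike's answer.
    static final Pattern GENERATOR = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");

    // Steps 1 and 2: open an HttpURLConnection and get its input stream.
    static String fetchGenerator(String address) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(address).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            return findGenerator(in);
        } finally {
            conn.disconnect();
        }
    }

    // Step 3: read the stream into a string and apply the regex.
    // Returns null when no generator tag is found.
    static String findGenerator(InputStream in) {
        StringBuilder html = new StringBuilder();
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Matcher m = GENERATOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

Keeping the parsing in findGenerator separate from the network call means it can be tried on any InputStream, for example a ByteArrayInputStream holding a saved copy of the page.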
Paul Tomblin
A: 

Strictly speaking, you can't be sure you got the right value: the meta tag may be commented out, written in uppercase, and so on. It depends on how certain you are that the HTML can be considered "nice".
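If you stay with regex anyway, two of these cases can be softened a little. The sketch below is only an illustration under assumptions: CommentAwareMatch and generatorOf are invented names, comments are removed with a naive non-greedy pattern, and matching is made case-insensitive. Comment-like text inside scripts or attribute values would still fool it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: ignores commented-out tags and letter case.
final class CommentAwareMatch {
    // Naive comment matcher; DOTALL lets "." span line breaks.
    static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // CASE_INSENSITIVE accepts <META NAME=... VALUE=...> as well.
    static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"generator\"\\s+value=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Returns the generator value, or null if no uncommented tag matches.
    static String generatorOf(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        Matcher m = META.matcher(withoutComments);
        return m.find() ? m.group(1) : null;
    }
}
```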

Eek
+1  A: 

You should be using an XPath query. It's as simple as getting the value of "/html/head/meta[@name='generator']/@value".
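A minimal sketch of that query with the JDK's built-in javax.xml.xpath API, assuming the page really is well-formed XHTML with no namespace declaration and no DOCTYPE. The class and method names are made up for the example. Note the quotes around 'generator': XPath needs them to compare the attribute against a string literal.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: evaluates the XPath query against an XHTML string.
final class GeneratorXPath {
    // Returns the generator value, or "" when no such attribute exists.
    // Throws IllegalArgumentException if the input is not well-formed XML.
    static String generatorOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(
                "/html/head/meta[@name='generator']/@value", doc);
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not query document", e);
        }
    }
}
```

Ordinary tag-soup HTML will make the parser throw, so this only fits pages you know to be valid XHTML.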

a good tutorial

Vardhan Varma
How do you suggest we execute XPath against HTML, when HTML is not XML? You can't guarantee that HTML can be loaded as an XML document for XPath navigation. Now an HTML DOM is a great tool for this, but regex works and is straightforward.
Mike Haboustak
The example in the question is obviously XHTML and therefore XML, because it has a self-closing tag.
Ben James
A: 

It depends.

If you are extracting information from a site or sites that are guaranteed to be well-formed HTML, and you know that the <meta> won't be obfuscated in some way, then reading the <head> section line by line and applying a regex is a good approach.
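As a rough sketch of that line-by-line approach (HeadScanner and scanHead are invented names, and the pattern is the one from Mike's answer), the reader can stop as soon as it sees the closing head tag. It assumes the tag sits on a single line and that </head> is written in lowercase.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans only the <head> section, one line at a time.
final class HeadScanner {
    static final Pattern GENERATOR = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");

    // Returns the generator value, or null if </head> (or end of input)
    // is reached without a match.
    static String scanHead(BufferedReader reader) {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = GENERATOR.matcher(line);
                if (m.find()) {
                    return m.group(1);
                }
                // Assumption: the closing tag is lowercase and not split
                // across lines.
                if (line.contains("</head>")) {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }
}
```

Stopping at </head> avoids downloading and scanning the whole body when the page is large.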

On the other hand, if the HTML may be mangled or "tricky" then you need to use a proper HTML parser, possibly a permissive one like HTMLTidy. Beware of using a strict HTML or XML parser on stuff trawled from random websites. Lots of so-called HTML you find out there is actually malformed.

Stephen C
Well-formed HTML is even more of a reason to try and use a proper parser instead of regex. Regex should never be used to parse HTML, period.
Ben James