ansaurus

Question

Quick way to find a value in HTML (Java)

Answer 1

+4 A:

It depends on how sophisticated an HTTP request you need to build (authentication, etc.). Here's one simple way I've seen used in the past.

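import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Download the page line by line into a single string.
StringBuilder html = new StringBuilder();
URL url = new URL("http://www.google.com/");
BufferedReader input = null;
try {
    input = new BufferedReader(
        new InputStreamReader(url.openStream()));

    String htmlLine;
    while ((htmlLine = input.readLine()) != null) {
        html.append(htmlLine).append('\n');
    }
}
finally {
    // input is still null if openStream() threw, so guard the close.
    if (input != null) {
        input.close();
    }
}

// Capture the value attribute of the generator meta tag.
Pattern exp = Pattern.compile(
    "<meta name=\"generator\" value=\"([^\"]*)\" />");
Matcher matcher = exp.matcher(html.toString());
if (matcher.find())
{
    System.out.println("Generator: " + matcher.group(1));
}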

Probably plenty of typos here to be found when compiled. (hope this wasn't homework)

Mike Haboustak 2008-08-28 00:38:16

What if the meta tag is commented out? This will still read it. Is that right? What if there are two spaces between meta and name? Or a tab? Or a newline? What if the word generator is not surrounded by quotes? Because of these issues and plenty more, I suggest not writing this yourself but finding a library that will do it for you.
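A minimal sketch of the failure modes listed above, using the pattern from Mike's answer. The class name RegexPitfalls and the sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPitfalls {
    // The strict pattern from the answer above.
    static final Pattern STRICT = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");

    // Returns the captured value, or null when the pattern does not match.
    static String find(String html) {
        Matcher m = STRICT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A commented-out tag is still matched: prints "Old CMS".
        System.out.println(find("<!-- <meta name=\"generator\" value=\"Old CMS\" /> -->"));
        // Two spaces between meta and name: prints "null".
        System.out.println(find("<meta  name=\"generator\" value=\"X\" />"));
        // A tab instead of a space: prints "null".
        System.out.println(find("<meta\tname=\"generator\" value=\"X\" />"));
        // Unquoted attribute value: prints "null".
        System.out.println(find("<meta name=generator value=\"X\" />"));
    }
}
```

The first case is a false positive and the other three are misses, which is the argument for a real parser.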

Steve McLeod 2009-11-22 09:27:37

Answer 2

A:

You may want to check the documentation for Apache's org.apache.commons.httpclient package and the related packages here. Sending an HTTP request from a Java application is pretty easy to do. Poking through the documentation should point you in the right direction.

Justin Bennett 2008-08-28 01:22:32

Answer 3

A:

I haven't tried this, but wouldn't the basic framework be:

1. Open a java.net.HttpURLConnection
2. Get an input stream using getInputStream
3. Use the regular expression in Mike's answer to parse out the bit you want
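
The three steps above could be sketched roughly as follows. This is untested against any real site. The class name GeneratorFetcher, the 5-second timeouts, and the UTF-8 decoding are assumptions made for the example, not something the thread specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeneratorFetcher {
    // The regular expression from Mike's answer.
    static final Pattern GENERATOR = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");

    // Step 3: pull the value out of the downloaded text (null if absent).
    static String extractGenerator(String html) {
        Matcher m = GENERATOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Steps 1 and 2: open the connection and read its input stream.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(5000); // assumed timeout, in milliseconds
        conn.setReadTimeout(5000);
        StringBuilder html = new StringBuilder();
        // Assumes the page is UTF-8; a real client would check the charset.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java GeneratorFetcher <url>");
            return;
        }
        System.out.println("Generator: " + extractGenerator(fetch(args[0])));
    }
}
```

Keeping the extraction in its own method means the regex can be swapped for a parser later without touching the download code.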

Paul Tomblin 2008-08-28 01:26:26

Answer 4

A:

Strictly speaking, you can't be sure you got the right value: the meta tag may be commented out, written in uppercase, and so on. It depends on how confident you are that the HTML can be considered "nice".

Eek 2008-09-19 11:07:23

Answer 5

+1 A:

You should use an XPath query. It's as simple as getting the value of "/html/head/meta[@name='generator']/@value".

a good tutorial
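
For illustration, a minimal sketch of that query using the JDK's built-in javax.xml.xpath API. It only works if the page parses as well-formed XML, so treat it as an XHTML-only approach. The class name GeneratorXPath and the sample document are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class GeneratorXPath {
    // Returns the generator value, or an empty string when nothing matches.
    // Throws if the input is not well-formed XML.
    static String generatorOf(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The attribute value must be quoted inside the XPath expression.
        return xpath.evaluate("/html/head/meta[@name='generator']/@value", doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
            + "<meta name=\"generator\" value=\"WordPress 2.6\" />"
            + "</head><body></body></html>";
        System.out.println("Generator: " + generatorOf(page));
    }
}
```

Real XHTML pages usually carry a DOCTYPE and a namespace declaration, both of which may need extra handling that this sketch leaves out.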

Vardhan Varma 2008-09-26 01:09:28

How do you suggest we execute XPath against HTML, when HTML is not XML? You can't guarantee that HTML can be loaded as an XML document for XPath navigation. Now an HTML DOM is a great tool for this, but regex works and is straightforward.

Mike Haboustak 2009-01-31 04:12:19

The example in the question is obviously XHTML and therefore XML, because it has a self-closing tag.

Ben James 2009-11-22 09:39:20

Answer 6

A:

It depends.

If you are extracting information from a site or sites that are guaranteed to be well-formed HTML, and you know that the <meta> won't be obfuscated in some way, then reading the <head> section line by line and applying a regex is a good approach.
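
A rough sketch of that line-by-line approach, assuming the whole tag sits on one line and is written exactly as in the question. The class name HeadScanner is hypothetical. Reading stops at the closing head tag, so the body is never scanned.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadScanner {
    static final Pattern GENERATOR = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");
    static final Pattern HEAD_END =
        Pattern.compile("</head>", Pattern.CASE_INSENSITIVE);

    // Returns the generator value found in the head, or null if there is none.
    static String scanHead(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            Matcher end = HEAD_END.matcher(line);
            boolean lastHeadLine = end.find();
            // Only look at the part of the line that is still inside <head>.
            String inHead = lastHeadLine ? line.substring(0, end.start()) : line;
            Matcher m = GENERATOR.matcher(inHead);
            if (m.find()) {
                return m.group(1);
            }
            if (lastHeadLine) {
                return null; // left the head without finding the tag
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html>\n<head>\n"
            + "<meta name=\"generator\" value=\"WordPress 2.6\" />\n"
            + "</head>\n<body></body>\n</html>\n";
        System.out.println("Generator: " + scanHead(new StringReader(page)));
    }
}
```

When the Reader wraps a network stream, returning early also avoids downloading the rest of the page.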

On the other hand, if the HTML may be mangled or "tricky" then you need to use a proper HTML parser, possibly a permissive one like HTMLTidy. Beware of using a strict HTML or XML parser on stuff trawled from random websites. Lots of so-called HTML you find out there is actually malformed.
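
As an illustration of the permissive-parser route, here is a sketch that uses the lenient HTML parser bundled with the JDK's Swing classes. This is a stand-in chosen because it needs no extra dependency; it is not HTMLTidy, and it is an old parser that may mishandle modern markup. The class name LenientGeneratorParser is made up, and the "value" attribute follows the example in the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LenientGeneratorParser {
    // Returns the first generator value found, or null if there is none.
    static String generatorOf(Reader html) throws IOException {
        final String[] result = new String[1];
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META || result[0] != null) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                if (name != null && "generator".equalsIgnoreCase(name.toString())) {
                    Object value = attrs.getAttribute(HTML.Attribute.VALUE);
                    if (value != null) {
                        result[0] = value.toString();
                    }
                }
            }
        };
        // true = ignore charset declarations instead of throwing on them
        new ParserDelegator().parse(html, callback, true);
        return result[0];
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><head>\n"
            + "<!-- <meta name=\"generator\" value=\"Old CMS\" /> -->\n"
            + "<META  NAME=\"generator\"\tVALUE=\"WordPress 2.6\">\n"
            + "</head><body><p>hi</p></body></html>";
        System.out.println("Generator: " + generatorOf(new StringReader(page)));
    }
}
```

Because comments are reported through a separate callback, the commented-out tag is skipped, and uppercase names or odd whitespace do not matter.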

Stephen C 2009-11-22 09:23:58

Well-formed HTML is even more of a reason to try and use a proper parser instead of regex. Regex should never be used to parse HTML, period.

Ben James 2009-11-22 09:35:49
