Using regex, what is the simplest way to fetch a website's HTML and find the value inside this tag (or any attribute's value, for that matter):
<html>
<head>
[snip]
<meta name="generator" value="thevalue i'm looking for" />
[snip]
Depends on how sophisticated an HTTP request you need to build (authentication, etc.). Here's one simple way I've seen used in the past.
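import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Download the page into a single string.
StringBuilder html = new StringBuilder();
URL url = new URL("http://www.google.com/");
BufferedReader input = null;
try {
    input = new BufferedReader(
        new InputStreamReader(url.openStream()));
    String htmlLine;
    while ((htmlLine = input.readLine()) != null) {
        html.append(htmlLine).append('\n');
    }
}
finally {
    // input is still null if openStream() threw
    if (input != null) {
        input.close();
    }
}
// Capture whatever sits between the quotes of the value attribute.
Pattern exp = Pattern.compile(
    "<meta name=\"generator\" value=\"([^\"]*)\" />");
Matcher matcher = exp.matcher(html.toString());
if (matcher.find())
{
    System.out.println("Generator: " + matcher.group(1));
}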
Probably plenty of typos here to be found when compiled. (hope this wasn't homework)
You may want to check the documentation for Apache's org.apache.commons.httpclient package and the related packages. Sending an HTTP request from a Java application is pretty easy to do. Poking through the documentation should point you in the right direction.
I haven't tried this, but wouldn't the basic framework be
Strictly speaking, you can't really be sure you got the right value, since the meta tag may be commented out, written in uppercase, and so on. It depends on how certain you are that the HTML can be considered "nice".
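To make the uppercase and attribute-order concerns concrete, here is a rough sketch of a more forgiving regex approach, assuming attribute values are always quoted and never contain a ">". The names MetaRegexSketch and findMetaValue are made up for illustration. Note that it still cannot tell a commented-out tag from a live one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRegexSketch {
    // A whole <meta ...> tag, in any letter case.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // One attribute of the form key="..." or key='...'.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the value attribute of the first meta tag whose name
    // attribute equals metaName (ignoring case), or null if none is found.
    static String findMetaValue(String html, String metaName) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String value = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String key = attr.group(1).toLowerCase();
                // Group 2 is the double-quoted form, group 3 the single-quoted one.
                String val = attr.group(2) != null ? attr.group(2) : attr.group(3);
                if (key.equals("name")) {
                    name = val;
                } else if (key.equals("value")) {
                    value = val;
                }
            }
            if (value != null && metaName.equalsIgnoreCase(name)) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<HTML><HEAD><META VALUE=\"some value\" NAME='Generator'></HEAD></HTML>";
        System.out.println(findMetaValue(html, "generator"));
    }
}
```

Splitting the work into "find the tag" and "read its attributes" means the order of name and value no longer matters, which a single fixed pattern cannot handle.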
You should be using an XPath query. It's as simple as getting the value of "/html/head/meta[@name='generator']/@value".
a good tutorial
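For what it's worth, a minimal sketch of the XPath route using only the JDK's javax.xml classes might look like the following. It assumes the page is well-formed XML (XHTML) with no namespace declaration and no DOCTYPE to resolve; real-world HTML will often fail to parse this way. The class and method names (MetaXPathSketch, generatorOf) are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class MetaXPathSketch {
    // Parses the markup as XML and evaluates the XPath expression against it.
    // Returns an empty string when no matching attribute exists.
    static String generatorOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(
                "/html/head/meta[@name='generator']/@value", doc);
        } catch (Exception e) {
            // Malformed input ends up here; a strict XML parser does not recover.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<meta name=\"generator\" value=\"thevalue i'm looking for\" />"
            + "</head><body/></html>";
        System.out.println("Generator: " + generatorOf(page));
    }
}
```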
It depends.
If you are extracting information from a site or sites that are guaranteed to be well-formed HTML, and you know that the <meta> won't be obfuscated in some way, then reading the <head> section line by line and applying a regex is a good approach.
On the other hand, if the HTML may be mangled or "tricky" then you need to use a proper HTML parser, possibly a permissive one like HTMLTidy. Beware of using a strict HTML or XML parser on stuff trawled from random websites. Lots of so-called HTML you find out there is actually malformed.
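For the first case (pages you trust to be well-formed), the line-by-line scan of the <head> section could be sketched roughly as below. It assumes the meta tag sits on a single line with name before value, as in the question. HeadScanSketch and scanHead are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadScanSketch {
    private static final Pattern GENERATOR = Pattern.compile(
        "<meta\\s+name=\"generator\"\\s+value=\"([^\"]*)\"");

    // Reads line by line, returns the generator value if it appears
    // before the end of <head>, otherwise null.
    static String scanHead(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = GENERATOR.matcher(line);
                if (m.find()) {
                    return m.group(1);
                }
                // Stop once the head section closes; the body is never read.
                if (line.toLowerCase().contains("</head>")) {
                    break;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>\n<head>\n"
            + "<meta name=\"generator\" value=\"thevalue i'm looking for\" />\n"
            + "</head>\n<body></body>\n</html>";
        System.out.println(scanHead(new StringReader(page)));
    }
}
```

Because the method takes a Reader, the same code works on an InputStreamReader wrapped around a URL stream, and stopping at the closing head tag avoids downloading the rest of the page.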