The fastest way will be parsing the specific information yourself. You seem to know the HTML structure precisely beforehand. The BufferedReader
, String
and StringBuilder
methods should suffice. Here's a kickoff example which displays the first paragraph of your own question:
public static void main(String... args) throws Exception {
URL url = new URL("http://stackoverflow.com/questions/2971155");
BufferedReader reader = null;
StringBuilder builder = new StringBuilder();
try {
reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
for (String line; (line = reader.readLine()) != null;) {
builder.append(line.trim());
}
} finally {
if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
}
String start = "<div class=\"post-text\"><p>";
String end = "</p>";
String part = builder.substring(builder.indexOf(start) + start.length());
String question = part.substring(0, part.indexOf(end));
System.out.println(question);
}
Parsing is in practically all cases definitely faster than pattern matching. Pattern matching is easier, but there is a certain risk that it may yield unexpected results, certainly when using complex regex patterns.
You can also consider to use a more flexible 3rd party HTML parser instead of writing one yourself. It will not be as fast as parsing yourself with beforehand known information. It will however be more concise and flexible. With decent HTML parsers the difference in speed is pretty negligible. I strongly recommend Jsoup for this. It supports jQuery-like CSS selectors. Extracting the firsrt paragraph of your question would then be as easy as:
public static void main(String... args) throws Exception {
URL url = new URL("http://stackoverflow.com/questions/2971155");
Document document = Jsoup.parse(url, 3000);
String question = document.select("#question .post-text p").first().text();
System.out.println(question);
}
It's unclear what web page you're talking about, so I can't give a more detailed example how you could select the specific information from the specific page using Jsoup. If you still can't figure it at your own using Jsoup and CSS selectors, then feel free to post the URL in a comment and I'll suggest how to do it.