ansaurus

Question

Answer 1

+2 A:

Have a read of http://code.google.com/apis/ajaxsearch/. It's going to be a lot easier to get the data out of a JSON string than to dig through acres of HTML. There's an open source Java class for digesting JSON: http://www.json.org/java/. Transferring the JSON will require a lot less bandwidth, too!

fredley 2010-08-11 16:02:18

Hi fredley, I'm not familiar with JSON. A link to a "proof of concept" of what you advised might be helpful. Thanks.

Roey 2010-08-11 18:33:29

The great thing about it is that you don't need to know how it works. Once you've retrieved your JSON string using the appropriate call, you initialize it:

JSONObject j = new JSONObject(jsonString);

and then everything is in a nicely structured object under j, so you can make calls like:

int myInt = j.getInt("someTag");
JSONArray myArray = j.getJSONArray("Results");
String title = myArray.getJSONObject(0).getString("title");

All you need to do is read the API docs to learn the data structure; after that, there are only a few methods you actually need to use.

fredley 2010-08-11 22:41:35

Answer 2

+1 A:

If you want to do it in Java, you should consider using XPath to extract all links from the response. To do that, you first have to convert the response to XML. Then you can apply an XPath query like

//a/@href

to extract all href attributes for links. You can modify the query to only include links from the Google results and not from advertisements etc.
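As a minimal sketch of how that query could be applied, the following uses only the XML and XPath classes that ship with the JDK. The class name LinkExtractor and the method extractLinks are made up for illustration, and the input is assumed to be well-formed XML already; raw HTML from a real search page would usually need to be cleaned up first, otherwise the parser throws an exception.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class LinkExtractor {

    /** Returns the value of every href attribute found on an a element. */
    static List<String> extractLinks(String xhtml) throws Exception {
        // Parse the (already well-formed) markup into a DOM tree.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Evaluate the query from the answer and ask for a node set.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        // Each node is an attribute node; its value is the link target.
        List<String> links = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }
}
```

Calling LinkExtractor.extractLinks(xml) on a converted response would give you the link targets as plain strings. Narrowing the query to the result links only depends on the markup Google actually returns, so that part is left open here.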

Here is another Tutorial to get you started.

Happy coding.

BTW: To avoid mistakes when you create your HTTP request and (even more importantly) to avoid unnecessary work, you could use a library like Apache Commons HttpClient. This would reduce your work to:

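// Apache Commons HttpClient 3.x (org.apache.commons.httpclient); URLEncoder is java.net.URLEncoder
HttpClient client = new HttpClient();
// URL-encode the query so spaces and special characters survive
HttpMethod method = new GetMethod("http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8"));
int statusCode = client.executeMethod(method);
if (statusCode != HttpStatus.SC_OK) {
  System.err.println("Method failed: " + method.getStatusLine());
}
// Decodes the body with the charset from the Content-Type header instead of the platform default
String response = method.getResponseBodyAsString();
method.releaseConnection();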

moxn 2010-08-11 16:16:49

If you're going to parse HTML anyway instead of using a lightweight JSON webservice, then I'd recommend Jsoup over HttpClient. HttpClient is nice, but it gives you nothing to parse HTML with. You could just as well use java.net.URLConnection.

BalusC 2010-08-11 16:27:25
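For reference, fetching the page with plain java.net.URLConnection could look roughly like the sketch below. The class name PlainFetcher and both method names are made up for illustration, the User-Agent value is an assumption (some servers reject Java's default one), and the body is read as UTF-8 for brevity rather than honoring the response charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

// Hypothetical helper class, not part of any library.
class PlainFetcher {

    /** Builds the search URL, URL-encoding the query text. */
    static String buildSearchUrl(String query) throws UnsupportedEncodingException {
        return "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8");
    }

    /** Downloads the response body as a string (assumes UTF-8). */
    static String fetch(String url) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        // Assumption: a browser-like user agent, since the default may be rejected.
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");

        StringBuilder body = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            // Closing the reader also closes the underlying stream.
            reader.close();
        }
        return body.toString();
    }
}
```

PlainFetcher.fetch(PlainFetcher.buildSearchUrl(query)) would then return the raw HTML, which still has to be handed to a parser of your choice.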

@BalusC Hey, that's cool. Haven't heard of JSoup before. Thanks for the hint.

moxn 2010-08-12 09:10:53

Parse HTML links from a google query
