Hi,

First, here is the revised code, which throws javax.swing.text.ChangedCharSetException:

import java.io.*;
import java.net.*;

public class Main
{
    public static void main(String[] args) throws Exception
    {
        String query = "#pragma";
        Socket s = new Socket("google.com",80);
        PrintStream p = new PrintStream(s.getOutputStream());
        p.print("GET /search?q="+query+" HTTP/1.0\r\n");
        p.print("User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)\r\n");
        p.print("Connection: close\r\n\r\n");

        InputStreamReader in = new InputStreamReader(s.getInputStream());
        BufferedReader buffer = new BufferedReader(in);
//        String line;
//
//        while ((line = buffer.readLine()) != null)
//        {  System.out.println(line); }
        HTMLUtils.ParseLinks (buffer);
        in.close();
    }
}


import java.io.BufferedReader;
import java.io.IOException;
//import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;

import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils
{
  private HTMLUtils() {}

  public static List<String> extractLinks(Reader reader) throws IOException
  {
    final ArrayList<String> list = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback()
    {
      public void handleText(final char[] data, final int pos) { }
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos)
      {
        if (tag == Tag.A) {
          String address = (String) attribute.getAttribute(Attribute.HREF);
          list.add(address);
        }
      }
      public void handleEndTag(Tag t, final int pos) {  }
      public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
      public void handleComment(final char[] data, final int pos) { }
      public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    parserDelegator.parse(reader, parserCallback, false);
    return list;
  }

  public static void ParseLinks(BufferedReader buffer) throws Exception{
    //FileReader reader = new FileReader("buffer");
    List<String> links = HTMLUtils.extractLinks(buffer);
    for (String link : links) {
      System.out.println(link);
    }
  }
}

Notice that the user agent is for IE in this example.

Now I have 3 problems:

  1. How can I (if at all) pass the HTMLUtils.ParseLinks method a "raw buffer" instead of the HTML file it expects? (I could write the buffer to a file, but I guess that is unnecessary.)
  2. I don't know how to put quotation marks (" ") inside the query string in order to search for the whole phrase, i.e.: query=" "New York Yankees" "
  3. Is it really so complicated to get the User-Agent string from the host machine?

I have to say that the class I use is an imported one, and I don't really understand what's going on in it. I'll try to understand it once it works [-8

THNX

+2  A: 

Have a read of http://code.google.com/apis/ajaxsearch/; it's going to be a lot easier to get the data out of a JSON string than to dig through acres of HTML. There's an open source Java class for digesting JSON: http://www.json.org/java/. Transferring the JSON will require a lot less bandwidth too!

fredley
Hi fredley, I'm not familiar with JSON. A link to a "proof of concept" of what you advised would be helpful... THNX
Roey
The great thing about it is that you don't need to know how it works. Once you've retrieved your JSON string using the appropriate call, you initialize it:

JSONObject j = new JSONObject(jsonString);

and then everything is in a nicely formatted data structure under j, so you can make calls like:

int myInt = j.getInt("someTag");
JSONArray myArray = j.getJSONArray("Results");
String title = myArray.getJSONObject(0).getString("title");

All you need to do is read the docs on the API to learn the data structure; then there are only a few methods you actually need to use.
fredley
+1  A: 

If you want to do it in Java, you should consider using XPath to extract all links from the response. To do that, you first have to convert the response to XML. Then you can apply an XPath query like

//a/@href

to extract all href attributes for links. You can modify the query to only include links from the Google results and not from advertisements etc.
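As a rough illustration of the XPath step, here is a minimal sketch using only the JDK's built-in javax.xml classes. It assumes the response has already been converted to well-formed XML; real search result pages usually are not, so a tidy step would have to come first. The class name XPathLinks and the sample markup are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original post.
public class XPathLinks {

    // Returns the value of every href attribute found on an <a> element.
    // Assumption: the input is already well-formed XML (e.g. tidied XHTML).
    public static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> links = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document standing in for the converted response.
        String xhtml = "<html><body>"
            + "<a href=\"http://example.com/a\">A</a>"
            + "<p><a href=\"/b\">B</a></p>"
            + "</body></html>";
        for (String link : extractLinks(xhtml)) {
            System.out.println(link);
        }
    }
}
```

Narrowing the query to a particular part of the page is then just a matter of changing the expression string passed to evaluate.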

Here is another Tutorial to get you started.

Happy coding.

BTW: To avoid mistakes when you create your HTTP request and (even more important) to avoid unnecessary work, you could use a library like Apache Commons HTTPClient. This would reduce your work to:

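import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://www.google.com/search?q=" + query);
int statusCode = client.executeMethod(method);
if (statusCode != HttpStatus.SC_OK) {
  System.err.println("Method failed: " + method.getStatusLine());
}
// Decodes the body using the charset named in the Content-Type header
String response = method.getResponseBodyAsString();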
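
One caveat with the snippet above: the query is concatenated into the URL as-is, so characters such as quotation marks, spaces or # would not survive. A minimal sketch of encoding it first with the JDK's java.net.URLEncoder follows; the class name QueryUrl and the build method are hypothetical names for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original post.
public class QueryUrl {

    // Builds the search URL with the query encoded as form data,
    // so quotes, spaces and '#' are safe to send.
    public static String build(String query) throws UnsupportedEncodingException {
        return "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // A phrase search: the quotes are escaped in the Java literal
        // and become %22 in the URL.
        System.out.println(build("\"New York Yankees\""));
        // '#' would otherwise start a URL fragment; it becomes %23.
        System.out.println(build("#pragma"));
    }
}
```

The string returned by build can be passed to GetMethod in place of the hand-concatenated URL.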
moxn
If you're going to parse HTML anyway instead of using a lightweight JSON webservice, then I'd recommend Jsoup over HttpClient. HttpClient is nice, but it gives you nothing to parse HTML with. You could just as well use java.net.URLConnection.
BalusC
@BalusC Hey, that's cool. Haven't heard of JSoup before. Thanks for the hint.
moxn