Hi

There's a web page with a search engine:

http://www.nukat.edu.pl/cgi-bin/gw_48_1_12/chameleon?sessionid=2010010122520520752&skin=default&lng=pl&inst=consortium&search=KEYWORD&function=SEARCHSCR&SourceScreen=NOFUNC&elementcount=1&pos=1&submit=TabData

I want to use its search engine from a Java application.

Currently I'm trying to send a very simple request - only one field filled and no logical operators.

This is my code:

try {
    URL url = new URL( nukatSearchUrl );
    URLConnection urlConn = url.openConnection();
    urlConn.setDoInput( true );
    urlConn.setDoOutput( true );
    urlConn.setUseCaches( false );
    urlConn.setRequestProperty( "Content-Type", "application/x-www-form-urlencoded" );
    // send the form field as a URL-encoded POST body
    BufferedWriter out = new BufferedWriter( new OutputStreamWriter( urlConn.getOutputStream() ) );
    String content = "t1=" + URLEncoder.encode( "Duma Key", "UTF-8" );
    out.write( content );
    out.flush();
    out.close();
    // read the response line by line
    BufferedReader in = new BufferedReader( new InputStreamReader( urlConn.getInputStream() ) );

    String rcv = null;
    while ( ( rcv = in.readLine() ) != null ) {
        System.out.println( rcv );
    }
    in.close();
} catch ( Exception ex ) {
    throw new SearchEngineException( "NukatSearchEngine.search() : " + ex.getMessage() );
}

Unfortunately, what I keep getting is the main site, which looks like this:

<can't post the link to the main site :/>

Not the search results I'm expecting.

What could be wrong here?

+2  A: 

The URL may be wrong, or your request is likely incomplete. Check the HTML source (right-click the page > View Source), use the same URL as defined in the <form action>, and gather all request parameters (including those from hidden input fields and the button which you intend to "press"!) for use in your query string.
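For instance, the collected fields could be assembled like this (a sketch only; apart from t1, which comes from your code, the field names function and search are guesses taken from the URL above and must be verified against the page source):

    // Build the body from ALL form fields, not just the search term.
    // Field names below are assumptions, not verified against the form.
    String content = "function=" + URLEncoder.encode( "SEARCHSCR", "UTF-8" )
                   + "&search=" + URLEncoder.encode( "KEYWORD", "UTF-8" )
                   + "&t1=" + URLEncoder.encode( "Duma Key", "UTF-8" );
    out.write( content );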

That said, doing so is in most cases a policy violation and may result in your IP becoming blacklisted. Please check their robots.txt and the "Terms of use", if any; I don't understand Polish. Their robots.txt at least says that everyone is disallowed from accessing the entire website programmatically. Use it at your own risk. You've been warned. Better to contact them and ask if they have a public webservice, and then use that instead.

You can always spoof the user-agent request header with a real-looking string as extracted from a real webbrowser to minimize the risk of being recognized as a bot, as pointed out by Bozho here, but you can still get caught based on visitor patterns/statistics.

BalusC
+1  A: 

An easy way to see all the activity that you need to replicate is the Live HTTP Headers Firefox extension. To see all form elements on the page, Firebug is useful. Finally, I often use a fake server that I control to see what the browser is sending and compare it to what my application sends. I rolled my own: just a small Java server that prints out everything sent to it - inverse telnet, if you will.
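A rough sketch of such a server (it just prints whatever the client sends; for POST bodies you would read raw bytes instead of lines, and the loop ends when the client closes the connection):

    // One-shot debugging server: point the browser (or your app) at
    // http://localhost:8000/ and compare what each one sends.
    ServerSocket server = new ServerSocket(8000); // port is arbitrary
    Socket client = server.accept();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(client.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    client.close();
    server.close();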

Another note: some sites deny access based on the User-Agent, i.e. you might need to get your application to pretend it's Firefox. This is very bad practice and a little dishonest. As BalusC mentioned, check their usage policy and robots.txt! I would also recommend asking permission if you intend to distribute your application.

Finally, I happen to be working on something similar, and you might find the following code useful (it serializes a mapping of keys to lists of values into the correct POST format):

    StringBuilder builder = new StringBuilder();
    try {
        boolean first = true; // no "&" before the first pair
        for (Entry<String, List<String>> entry : data.entrySet()) {
            for (String value : entry.getValue()) {
                if (first) {
                    first = false;
                } else {
                    builder.append("&");
                }
                builder.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                       .append("=")
                       .append(URLEncoder.encode(value, "UTF-8"));
            }
        }
    } catch (UnsupportedEncodingException e1) {
        return false;
    }
    conn.setDoOutput(true);
    try {
        OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
        wr.write(builder.toString());
        wr.flush();
        conn.connect();
    } catch (IOException e) {
        return false;
    }
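Hypothetical usage, assuming data is a Map<String, List<String>> of form fields and conn is the URLConnection from your code:

    Map<String, List<String>> data = new HashMap<String, List<String>>();
    data.put("t1", Collections.singletonList("Duma Key"));
    // ...add every other field from the form here before building the body...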
ZoFreX
+2  A: 

I wouldn't go any further with this after reading BalusC's answer. Here are, however, a few pointers, if you aren't worried about being blacklisted:

  • set the User-Agent header to pretend to be a browser, for example:

    urlConn.setRequestProperty("User-Agent", 
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB6");
    
  • you can simulate a human user in Firefox using Selenium WebDriver, as in the sketch below
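
A minimal sketch of the WebDriver approach (the field name t1 is taken from the question's code; the real element names would need to be checked against the page):

    // drives a real Firefox instance; names below are assumptions
    WebDriver driver = new FirefoxDriver();
    driver.get(nukatSearchUrl);
    WebElement searchBox = driver.findElement(By.name("t1")); // assumed name
    searchBox.sendKeys("Duma Key");
    searchBox.submit(); // submits the enclosing form
    System.out.println(driver.getPageSource());
    driver.quit();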

Bozho
Nice information there :)
James P.
A: 

As well as the user-agent, the site could also be using cookies to check that the search is being sent from the search page.

HttpClient is good for automating form submission, including handling any cookies and pretending to be a browser.
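For example, with Apache HttpClient 4 (a sketch; t1 comes from the question's code, and the rest must be matched to the real form):

    // DefaultHttpClient keeps a cookie store, so cookies set when you
    // first GET the search page are sent back with the POST.
    DefaultHttpClient client = new DefaultHttpClient();
    client.getParams().setParameter(CoreProtocolPNames.USER_AGENT,
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) Firefox/3.5.6");
    client.execute(new HttpGet(nukatSearchUrl)).getEntity().getContent().close();

    HttpPost post = new HttpPost(nukatSearchUrl);
    List<NameValuePair> params = new ArrayList<NameValuePair>();
    params.add(new BasicNameValuePair("t1", "Duma Key")); // assumed field
    post.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
    System.out.println(EntityUtils.toString(client.execute(post).getEntity()));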

objects