I've made two versions of a script that submits a (https) web page form and collects the results. One version uses Snoopy.class in php, and the other uses urllib and urllib2 in python. Now I would like to make a java version.

Snoopy makes the php version exceedingly easy to write, and it runs fine on my own (OS X) machine. But it allocated too much memory, and was killed at the same point (during curl execution), when run on the pair.com web hosting service. It runs fine on the dreamhost.com web hosting service.

So I decided to try a python version while I looked into what could cause the memory problem, and urllib and urllib2 made this very easy. The script runs fine. It gets about 70,000 database records, using several hundred form submissions, saving to a file of about 10MB, in about 7 minutes.

Looking into how to do this with java, I get the feeling it will not be the same walk-in-the-park as it was with php and python. Is form submission in java not for mere mortals?

I spent most of the day just trying to figure out how to set up Apache HttpClient. That is, before I gave up. If it takes me more than a few more days to sort that out, then it will be the subject of another question, I suppose.

The HTTPClient from innovation.ch does not support https.

And WebClient looks like it will take me at least a few days to figure out.

So, php and python versions were a breeze. Can a java version be made in a few simple lines as well? If not, I'll leave it for a later day since I'm only a novice. If so, can some kind soul please point me toward the light?

Thanks.

For comparison, the essential lines of code from the two versions:


python version

import urllib
import urllib2

submitVars = {}  # form fields to submit
submitVars['firstName'] = "John"
submitVars['lastName'] = "Doe"
submitUrl = "https URL of form action goes here"
referer = "URL of referring web page goes here"

submitVarsUrlencoded = urllib.urlencode(submitVars)
req = urllib2.Request(submitUrl, submitVarsUrlencoded)
req.add_header('Referer', referer)
response = urllib2.urlopen(req)
thePage = response.read()


php version

require('Snoopy.class.php');
$snoopy = new Snoopy;

$submit_vars["first_name"] = "John";
$submit_vars["last_name"] = "Doe";
$submit_url = "https URL of form action goes here";
$snoopy->referer = "URL of referring web page goes here"; 

$snoopy->submit($submit_url,$submit_vars);
$the_page = $snoopy->results;
+2  A: 

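Use HttpComponents http://hc.apache.org/. You need:

  • HttpComponents Core (httpcore-4.0.1.jar)
  • HttpComponents Client (httpclient-4.0-alpha4.jar)
  • Commons Logging (commons-logging-1.1.1.jar)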

Example code:

import org.apache.http.message.BasicNameValuePair;
import org.apache.http.NameValuePair;
import org.apache.http.HttpResponse;
import org.apache.http.HttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.HttpClient;

import java.util.ArrayList;
import java.util.List;
import java.io.OutputStream;
import java.io.ByteArrayOutputStream;

public class HttpClientTest {
    public static void main(String[] args) throws Exception {

        // request parameters
        List<NameValuePair> formparams = new ArrayList<NameValuePair>();
        formparams.add(new BasicNameValuePair("q", "quality"));
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(formparams, "UTF-8");
        HttpPost httppost = new HttpPost("http://stackoverflow.com/search");
        httppost.setEntity(entity);

        // execute the request
        HttpClient httpclient = new DefaultHttpClient();
        HttpResponse response = httpclient.execute(httppost);

        // display the response status code
        System.out.println(response.getStatusLine().getStatusCode());

        // display the response body
        HttpEntity responseEntity = response.getEntity();
        OutputStream out = new ByteArrayOutputStream();
        responseEntity.writeTo(out);
        System.out.println(out);
    }
}

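Save it to HttpClientTest.java. Put this java file, httpcore-4.0.1.jar, httpclient-4.0-alpha4.jar and commons-logging-1.1.1.jar in the same directory. Supposing you have the sun java 1.6 jdk installed, compile it (on Windows the classpath separator is ";"; on OS X or Linux use ":" instead):

javac -cp httpcore-4.0.1.jar;httpclient-4.0-alpha4.jar;commons-logging-1.1.1.jar HttpClientTest.java

Execute it. Note that the options go before the class name, the class name is given without ".class", and the classpath must include the current directory ("."):

java -cp .;httpcore-4.0.1.jar;httpclient-4.0-alpha4.jar;commons-logging-1.1.1.jar HttpClientTest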

I would argue that is as simple in java as it is in php or python (your examples). In all cases you need:

  • the sdk configured
  • a library (with dependencies)
  • sample code
Mercer Traieste
Thanks for the tips. As mentioned in my post, it was my lack of success with that product that led to my question. On the page you cite, there is a link "Download" that leads to a page with only one relevant link, "4.0-beta2". This link leads to a page with no fewer than 18 possible downloads. As a novice, I could not figure out which I needed, so I tried some at random. By the end of the day, I had given up.
Lasoldo Solsifa
Ok, i've updated the answer with a direct download link, and some tips on how you can use the library.
Mercer Traieste
Thank you very much for the effort. I would rather not have to learn an IDE for this task, though. My program is a page long, and I have always gotten by quite well in java with a text editor. Your link is to HttpCore. As a novice, I did not anticipate that, when searching for the correct version of HttpClient, I should choose HttpCore. I am fairly certain the answer to my posted question is No, this is obviously not a task for a novice. Apache HttpClient (Core, whatever) is the recommended solution, and it remains, even now, a quagmire just getting it on my machine.
Lasoldo Solsifa
Instead of using Eclipse, you can just write your files as usual, and then pass '-cp httpcore-4.0.1.jar' to javac when you compile.
notnoop
Thanks, and in fact that's precisely what I've been doing. Only it's not enough to have httpcore-4.0.1.jar in the CLASSPATH. httpclient-4.0-beta2.jar appears to be required as well, and I suspect this is what Tarnschaf meant in his response by "make sure you add the dependencies also to classpath". I'll know more when I've finally sorted out the NoClassDefFoundError I've been getting for the last couple of hours. An Apache example file compiles after I added httpclient-4.0-beta2.jar to the CLASSPATH, but I get NoClassDefFoundError when I try to run it.
Lasoldo Solsifa
What does your CLASSPATH variable look like? Where are the required JAR files?
laz
$ echo $CLASSPATH
.:/Users/lasoldo/java:/Users/lasoldo/java/httpcore-4.0.1.jar:/Users/lasoldo/java/httpclient-4.0-beta2.jar:/Users/lasoldo/java/commons-logging-1.1.1.jar

I now know that the required files are httpcore-4.0.1.jar, httpclient-4.0-beta2.jar, and commons-logging-1.1.1.jar. Now I can use HttpClient.
Lasoldo Solsifa
Many thanks, MercerTraieste, for all the extra effort you've put into this question. Your code compiled, but would not run: NoClassDefFoundError: org/apache/commons/logging/LogFactory. Google led to: http://commons.apache.org/downloads/download_logging.cgi. I added the jar file to my classpath. On retrying, TestHTTPClient ran fine. I believe that if you add a link to commons-logging-1.1.1.jar under "You need:", and make it clear that this jar file must also be added to the classpath, then your answer will enable a novice to get a start submitting forms with java.
Lasoldo Solsifa
Ok, i'll do that :)
Mercer Traieste
As a novice, I would notice that direct links were provided for HttpComponents Core and HttpComponents Client, but not for Commons Logging, and I would assume there was something significant about that, which would probably throw me off the track, since there is no significance there. Luckily, the words together are unusual enough that a Google search displays commons.apache.org/logging at the top, and I am spared one more wild goose chase.
Lasoldo Solsifa
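
The NoClassDefFoundError discussed in the comments above comes down to a jar missing from the classpath at run time. A minimal sketch of a diagnostic, using only the JDK, is given here. The class name ClasspathCheck and its method isAvailable are hypothetical names chosen for illustration; the three class names it probes are the ones that appear in this thread (one per required jar).

```java
import java.io.File;

// Sketch only: ClasspathCheck and isAvailable are illustrative names, not part of any library.
class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    // The class is looked up but not initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // List every classpath entry and flag the ones that do not exist on disk.
        String classpath = System.getProperty("java.class.path");
        for (String entry : classpath.split(File.pathSeparator)) {
            boolean exists = new File(entry).exists();
            System.out.println((exists ? "found    " : "MISSING  ") + entry);
        }

        // One representative class per jar mentioned in the thread.
        String[] required = {
            "org.apache.http.HttpResponse",          // httpcore
            "org.apache.http.client.HttpClient",     // httpclient
            "org.apache.commons.logging.LogFactory"  // commons-logging
        };
        for (String className : required) {
            System.out.println((isAvailable(className) ? "ok       " : "NOT FOUND ") + className);
        }
    }
}
```

Run it with the same -cp value (or the same CLASSPATH variable) that you use for your own program; any class reported as not found points to the jar that still needs to be added.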
+1  A: 

What would be so wrong with Apache HttpClient?

Just make sure you also add its dependencies to the classpath, that is, HttpComponents.

PostMethod post = new PostMethod("https URL of form action goes here");
NameValuePair[] data = {
  new NameValuePair("first_name", "joe"),
  new NameValuePair("last_name", "Doe")
};
post.setRequestBody(data);

post.addRequestHeader("Referer", "URL of referring web page goes here");

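// execute the method (this snippet uses the older Commons HttpClient 3.x API,
// org.apache.commons.httpclient.*); executeMethod returns the HTTP status code
HttpClient client = new HttpClient();
int status = client.executeMethod(post);
// TODO: handle any error responses.

InputStream inPage = post.getResponseBodyAsStream();
// handle response, then call post.releaseConnection().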
Tarnschaf
Thanks. As mentioned, I could not figure out how to install Apache HttpClient. Hence my post.
Lasoldo Solsifa
+1  A: 

Using HttpClient is certainly the more robust solution, but this can be done without an external library dependency. See here for an example of how.
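
A minimal sketch of the library-free approach, assuming only the JDK's java.net.HttpURLConnection. It is not the example linked above. The class name FormPostSketch and the methods encodeForm and post are hypothetical names for illustration, the field names are borrowed from the question, and the response is assumed to be UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the class and method names are illustrative, not from any library.
class FormPostSketch {

    // URL-encode one name or value as UTF-8.
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Build an application/x-www-form-urlencoded body from the fields, in insertion order.
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(field.getKey())).append('=').append(enc(field.getValue()));
        }
        return body.toString();
    }

    // POST the fields with a Referer header and return the response body (assumed UTF-8).
    // getInputStream() throws an IOException if the server answers with an error status.
    static String post(String submitUrl, String referer, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(submitUrl).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            conn.setRequestProperty("Referer", referer);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            ByteArrayOutputStream page = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    page.write(buf, 0, n);
                }
            }
            return new String(page.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("firstName", "John");
        fields.put("lastName", "Doe");
        if (args.length == 2) {
            // args[0] = form action URL, args[1] = referring page URL
            System.out.println(post(args[0], args[1], fields));
        } else {
            // Without arguments, only show the encoded request body; no network access.
            System.out.println(encodeForm(fields));
        }
    }
}
```

Compile with javac FormPostSketch.java and run with java FormPostSketch followed by the form action URL and the referring page URL. No jars are needed on the classpath, which is the trade-off being discussed: more lines of code, but nothing to install.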

laz
Thanks for that perspective. I might be willing to trade robustness for simplicity. But this example is so not simple -- 50 lines of code, compared to the few lines required by the python and php scripts. My thought is that a novice shouldn't have to do without an external library because it's too wacky an experience to get it working on his machine. With python, urllib is included, and with php, Snoopy.class just goes in your current directory. Both require a few easy lines of code to submit a form. I call this novice-friendly. Hours later, HttpClient is still not working...
Lasoldo Solsifa
That's pretty much why everyone recommends using HttpClient instead of the core Java libraries. There are some other areas where the core Java libraries (I'm looking at you, Calendar) have much better 3rd party replacements. This is certainly one of those areas.
laz
+1  A: 

MercerTraieste and Tarnschaf kindly offered partial solutions to the problem. It took me a few more days, and untold hours of brain-splitting nightmare, before I gave up trying to figure out how to add a referer to the http post, and sent a new question to stackoverflow.

Jon Skeet answered instantly that I only needed...

httppost.addHeader("Referer", referer);

...which makes me look pretty dumb. How did I overlook that one?

Here is the resulting code, based almost entirely on MercerTraieste's suggestion. In my case, I needed to download, and place in my classpath:

HttpComponents

  • httpclient-4.0-beta2.jar
  • httpcore-4.0.1.jar

Apache Commons

  • commons-logging-1.1.1.jar


import org.apache.http.NameValuePair;
import org.apache.http.HttpResponse;
import org.apache.http.HttpEntity;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.impl.client.DefaultHttpClient;

import java.util.ArrayList;
import java.util.List;
import java.io.OutputStream;
import java.io.ByteArrayOutputStream;

public class HttpClientTest
{
    public static void main(String[] args) throws Exception
    {
        // initialize some variables
        String referer = "URL of referring web page goes here";
        String submitUrl = "https URL of form action goes here";
        List<NameValuePair> formparams = new ArrayList<NameValuePair>();
        formparams.add(new BasicNameValuePair("firstName", "John"));
        formparams.add(new BasicNameValuePair("lastName", "Doe"));

        // set up httppost
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(formparams, "UTF-8");
        HttpPost httppost = new HttpPost(submitUrl);
        httppost.setEntity(entity);

        // add referer
        httppost.addHeader("Referer", referer);

        // create httpclient
        DefaultHttpClient httpclient = new DefaultHttpClient();

        // execute the request
        HttpResponse response = httpclient.execute(httppost);

        // display the response body
        HttpEntity responseEntity = response.getEntity();
        OutputStream out = new ByteArrayOutputStream();
        responseEntity.writeTo(out);
        System.out.println(out);
    }
}
Lasoldo Solsifa
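
One detail of the code above: printing the ByteArrayOutputStream directly decodes the bytes with the platform's default charset, which may not match the page's encoding. A minimal sketch of decoding with an explicit charset is given here, using only the JDK. The class name ResponseDecodeSketch and the method decodeUtf8 are hypothetical, and UTF-8 is an assumption about the server's response, not something stated in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch only: illustrative names; UTF-8 is an assumed response encoding.
class ResponseDecodeSketch {

    // Decode the collected response bytes as UTF-8 instead of the platform default.
    static String decodeUtf8(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for responseEntity.writeTo(out): the UTF-8 bytes of "caf\u00e9".
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(raw, 0, raw.length);

        String page = decodeUtf8(out);
        System.out.println(page.length() + " characters decoded from " + raw.length + " bytes");
    }
}
```

In the program above, the same idea would mean declaring out as a ByteArrayOutputStream and printing the decoded string rather than the stream itself.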