I want to develop an HTTP client in Java for a college project. It needs to log in to a site, extract data from the HTML, and fill in and submit forms. I don't know which HTTP library to use. Apache HTTP Client doesn't build a DOM model, but it handles HTTP redirects and multithreading. HTTPUnit builds a DOM model and makes it easy to work with forms, fields, tables, etc., but I don't know how well it handles multithreading and proxy settings.

Any advice?

+1  A: 

HTTPUnit is for unit testing. Unless you mean a "testing client", I don't think it's appropriate for creating an application.

> I want to develop an HTTP client in Java

You realize, of course, that the Apache HTTP client is not your answer either. It sounds like you want to create your first web app.

You'll need servlets and JSPs. Get Apache's Tomcat and learn enough JSP and JSTL to do what you need to do. Don't bother with frameworks, since it's your first one.

When you have it running, then try a framework like Spring.

duffymo
The question seems to be quite clearly client-side. Servlets and JSPs aren't relevant for the client-side functionality.
lexicore
Doesn't sound like jorik1000 is trying to develop a server-side application, but rather a specialised web client that scrapes and submits information. HttpUnit is designed to make unit testing of web pages easy, but as a consequence it's also a good tool for working with a web page at a high level to do general stuff like pulling out information and filling in forms.
isme
JSPs aren't client side?
duffymo
JSP (JavaServer Pages) is a server-side technology like PHP and Perl. A client only ever sees the result of the server's processing of JSP directives.
isme
I realize that they're compiled into servlets that run on the server side, but the fact that the client "sees" the result sure has a client-side flavor to me.
duffymo
+1  A: 

There seems to be cURL support for Java:
http://curl.haxx.se/libcurl/java/

Vitalyson
I like cURL, but why depend on a native C library when there's a pure Java library such as Apache HTTPClient?
R. Kettelerij
+1  A: 

Depends on how complex your websites are. Options are Apache HttpClient (plus something like JTidy) or testing-oriented packages like HtmlUnit or Canoo WebTest. HtmlUnit is quite powerful - you'd be able to process JavaScript, for instance.
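
Roughly, the HttpClient + JTidy combination would look something like this (just a sketch, assuming HttpClient 4.x and JTidy on the classpath; http://some_url is a placeholder):

import java.io.InputStream;

import javax.xml.xpath.XPathFactory;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class ScrapeSketch {
    public static void main(String[] args) throws Exception {
        // Fetch the raw HTML with HttpClient (handles redirects, cookies, etc.)
        DefaultHttpClient client = new DefaultHttpClient();
        HttpResponse response = client.execute(new HttpGet("http://some_url"));
        InputStream in = response.getEntity().getContent();

        // Clean the tag soup into a well-formed DOM with JTidy
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        Document doc = tidy.parseDOM(in, null);

        // Query the cleaned DOM with standard XPath
        String title = XPathFactory.newInstance().newXPath().evaluate("//title", doc);
        System.out.println(title);
    }
}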

lexicore
+1 for pointing out [Canoo WebTest](http://webtest.canoo.com/webtest/manual/WebTestHome.html). It's new to me. But it looks like it's designed more specifically for testing pages, and not suitable for general page manipulation and data extraction. How does it compare to HtmlUnit?
isme
+2  A: 

It sounds like you are trying to create a web-scraping application. For this purpose, I recommend the HtmlUnit library.

It makes it easy to work with forms, proxies, and data embedded in web pages. Under the hood I think it uses Apache's HttpClient to handle HTTP requests, but this is probably too low-level for you to be worried about.

With this library you can control a web page in Java the same way you would control it in a web browser: clicking a button, typing text, selecting values.

Here are some examples from HtmlUnit's getting started page:

Submitting a form:

import org.junit.Test;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

@Test
public void submittingForm() throws Exception {
    final WebClient webClient = new WebClient();

    // Get the first page
    final HtmlPage page1 = webClient.getPage("http://some_url");

    // Get the form that we are dealing with and within that form, 
    // find the submit button and the field that we want to change.
    final HtmlForm form = page1.getFormByName("myform");

    final HtmlSubmitInput button = form.getInputByName("submitbutton");
    final HtmlTextInput textField = form.getInputByName("userid");

    // Change the value of the text field
    textField.setValueAttribute("root");

    // Now submit the form by clicking the button and get back the second page.
    final HtmlPage page2 = button.click();

    webClient.closeAllWindows();
}

Using a proxy server:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

@Test
public void homePage_proxy() throws Exception {
    final int myProxyPort = 8080;  // placeholder proxy port
    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_2, "http://myproxyserver", myProxyPort);

    //set proxy username and password 
    final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
    credentialsProvider.addProxyCredentials("username", "password");

    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

    webClient.closeAllWindows();
}

The WebClient class is single threaded, so every thread that deals with a web page will need its own WebClient instance.
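
So if you want to fetch pages in parallel, the pattern might look something like this (a sketch using a fixed thread pool; the URLs are placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ParallelScrape {
    public static void main(String[] args) {
        String[] urls = { "http://some_url/page1", "http://some_url/page2" };
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    // One WebClient per task: instances must not be shared between threads
                    final WebClient webClient = new WebClient();
                    try {
                        final HtmlPage page = webClient.getPage(url);
                        System.out.println(page.getTitleText());
                    } catch (Exception e) {
                        e.printStackTrace();
                    } finally {
                        webClient.closeAllWindows();
                    }
                }
            });
        }
        pool.shutdown();
    }
}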

If you don't need to process JavaScript or CSS, you can also disable them when you create the client:

WebClient client = new WebClient();
client.setJavaScriptEnabled(false);
client.setCssEnabled(false);
isme
A: 

Jetty has a nice client-side library. I like to use it because I often need to create a server along with the client. The Apache HTTP Client is also really good and seems to have a few more features, like being able to work through a proxy using SSL.
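
A basic GET with Jetty's client looks something like this (a sketch based on the Jetty 9 API; the URL is a placeholder):

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class JettyClientSketch {
    public static void main(String[] args) throws Exception {
        final HttpClient client = new HttpClient();
        client.start();  // the client must be started before use

        final ContentResponse response = client.GET("http://some_url");
        System.out.println(response.getStatus());
        System.out.println(response.getContentAsString());

        client.stop();
    }
}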

Joshua
+1  A: 

HTTPUnit is meant for testing purposes; I don't think it's well suited to being embedded inside your application.

When you want to consume HTTP resources (like webpages) I'd recommend Apache HTTPClient. But you may find this framework too low-level for your use case, which is webpage scraping. So I'd recommend an integration framework like Apache Camel for this purpose. For example, the following route reads a webpage (using Apache HTTPClient), transforms the HTML to well-formed HTML (using TagSoup), and transforms the result to an XML representation for further processing.

from("http://mycollege.edu/somepage.html).unmarshall().tidyMarkup().to("xslt:mystylesheet.xsl")

You can further process the resulting XML using XPath, or transform it into a POJO using JAXB, for example.
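
For instance, pulling a single value out of the resulting XML with the standard javax.xml.xpath API might look like this (a sketch; the XML snippet and the expression are placeholders):

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        final String xml = "<html><head><title>Some College</title></head></html>";

        // Evaluate an XPath expression directly against the XML input
        final XPath xpath = XPathFactory.newInstance().newXPath();
        final String title = xpath.evaluate("/html/head/title",
                new InputSource(new StringReader(xml)));

        System.out.println(title);  // prints "Some College"
    }
}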

R. Kettelerij
I use HtmlUnit because it's easy. I can pull out the info I need from a page by XPath and then run away. What you are suggesting sounds like overkill. Why do you recommend this way? What's wrong with using HtmlUnit in an application?
isme
+1 for mentioning the HttpClient + TagSoup combo. When I rolled my own scraping library, these worked great together, and were faster than the full-fat HtmlUnit.
isme
Note the 'Unit' part; these libraries are primarily focused on (unit) testing. Nevertheless, I've removed the reference to HtmlUnit since it provides more general scraping functions.
R. Kettelerij
I would say that unit-testing a web site is a use case of web scraping. Both HttpUnit and HtmlUnit make it easy to scrape sites for information. I confess I haven't used HttpUnit, but their [unit-testing howto](http://httpunit.sourceforge.net/doc/cookbook.html) reads as a scraping howto just as well. As I understand it, HtmlUnit has better DOM support (through the magic `getByXPath` method), but HttpUnit exposes more HTTP concepts, like the raw requests and responses. Whether the HTTP stuff is useful depends on the site you're trying to scrape.
isme
A: 

If you really want to simulate a browser, then try Selenium RC.

flybywire