I want to write an app that communicates with a web application and acts like a human user (a bot).
Which programming language would you suggest?

Things the app has to do:
  • Send and receive information over HTTP (GET and POST methods).
  • Change any HTTP header field (User-Agent, Content-Type, etc.).
  • Handle received data in gzip-compressed format.
  • Authentication (I guess this one falls under HTTP POST? I listed it separately just in case).
  • Find the name of a form's input field from the content next to it (easy with regex?).
  • Basic math operations on parts of the received content and waiting for some time (sleep). Random numbers are also needed, but I think those are easy in every language.

Things that would be optional (in order of preference):
  1. Ability to run in a remote shell (Linux, no X).
  2. Portable (with no need to install 66 separate libraries to run it).
  3. Working behind a proxy.
  4. Ability to change the course of action at any time (maybe added later).
  5. OO would be nice.

Maybe you have written something similar and can give me some advice. I am not afraid to learn new things.
Thanks in advance :)

+1  A: 

I've recently started working on a similar project, and I've been using the Python port of mechanize as a stateful web browser. It's pretty slick, and while the documentation is rather sparse, it's simple and easy to use.

http://wwwsearch.sourceforge.net/mechanize/

bradtgmurray
A: 

You could try PHP with the cURL library.

http://www.php.net/curl

Peter GA.
Can I set it to run for hours, maybe days, and see the progress made?
Is that client side?
monkey_boys
+3  A: 

Use libcurl with C/C++, Python, Perl, or any language of your choice and convenience. Good luck.

kiwi
+2  A: 

Perl + LWP::UserAgent and WWW::Mechanize.

squadette
A: 

Python with urllib2 and lxml.
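
For the "find an input's name from the text next to it" requirement, lxml would let you do this with XPath. To stay free of third-party installs, the rough sketch below swaps in the standard library's html.parser (Python 3) instead. The names FieldFinder and find_field_name and the sample form are invented for illustration, and real pages will likely need more careful matching.

```python
from html.parser import HTMLParser


class FieldFinder(HTMLParser):
    """Record, for each named <input>, the visible text seen just before it."""

    def __init__(self):
        super().__init__()
        self.last_text = ""
        self.fields = []  # (preceding text, input name) pairs

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.last_text = text

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.fields.append((self.last_text, name))


def find_field_name(html, label):
    """Return the name of the first input whose preceding text contains label."""
    finder = FieldFinder()
    finder.feed(html)
    finder.close()
    for text, name in finder.fields:
        if label.lower() in text.lower():
            return name
    return None


# Hypothetical page with randomized field names.
sample = ('<form><label>Username:</label> <input type="text" name="f_83a1"><br>'
          'Password: <input type="password" name="f_9c2e"></form>')
print(find_field_name(sample, "password"))
```

A parser tends to survive changes in attribute order and whitespace better than a single regex, which is the main reason to prefer it here.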

Marcel
+1  A: 

Pretty much any popular language can deal with this. Use the one you are most familiar with.

Jim
Hey, I understand that any language that is Turing complete can deal with this :) I am asking which is best suited for this.
None is especially suitable compared with the others. The effectiveness of your solution will be determined primarily by your familiarity with the language you use, not by the language itself.
Jim
+5  A: 

Python seems like a good choice. I'll go through your requirements and link to the things in the standard library which meet your needs:

  1. Send and receive information with GET and POST: the urllib2 module, specifically the urlopen function (http://docs.python.org/lib/module-urllib2.html)

    from urllib2 import urlopen  # Python 2 standard library
    print urlopen("http://www.google.com/").read() # GET
    print urlopen("http://icanhascheezburger.com/", "s=funny").read() # POST

  2. Changing HTTP headers: the urllib2.Request object can set headers

    from urllib2 import Request, urlopen
    urlopen(Request("http://slashdot.org/", headers={"User-agent": "Python"}))

  3. Deal with gzip-ed data: the GzipFile class of the gzip module (http://www.python.org/doc/lib/module-gzip.html)

  4. Authentication: urllib2 also does this, either by manually setting a header to indicate your session cookie, or keeping track of all of your cookies for you using the HTTPCookieProcessor class

  5. Page processing: Python has the re module for regular expressions (http://docs.python.org/lib/module-re.html), a sleep function in the time module (http://docs.python.org/lib/module-time.html), and nice random numbers in the random module (http://docs.python.org/lib/module-random.html)

    import re, time, random  # page is the HTML text fetched earlier
    if re.search("input", page): time.sleep(random.randint(0, 10))

  6. As for your optional requirements: all of this works at the command line and can be run remotely. Everything mentioned here is in the standard library, so you don't need any third-party modules. The urllib2.ProxyHandler class deals with proxies very nicely, and Python is an extremely object-oriented language.
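
For anyone on Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar, the pieces above might fit together roughly as follows. This is only a sketch under those assumptions: the function names (make_opener, make_request, decode_body, fetch), the ExampleBot/0.1 user agent, and any proxy address you pass in are made up for illustration. Nothing is sent over the network until fetch is called.

```python
import gzip
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_opener(proxy_url=None):
    """Return an opener that remembers cookies and optionally uses a proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy_url:
        # Route both schemes through the same (hypothetical) proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)


def make_request(url, form=None, user_agent="ExampleBot/0.1"):
    """Build a GET request, or a POST request when form data is given."""
    data = urllib.parse.urlencode(form).encode("ascii") if form else None
    headers = {"User-Agent": user_agent, "Accept-Encoding": "gzip"}
    return urllib.request.Request(url, data=data, headers=headers)


def decode_body(raw, content_encoding):
    """Decompress the body if the server says it is gzip-encoded."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(opener, request):
    """Send the request and return the decoded body as bytes."""
    with opener.open(request) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Logging in would then be a POST through the same opener, for example fetch(opener, make_request(login_url, {"user": "...", "password": "..."})), after which the cookie jar should carry the session into later requests. The form field names depend entirely on the target site.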

Eli Courtwright
+1  A: 

What you need is HtmlUnit:

HtmlUnit is a "browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.

It is typically used for testing purposes or to retrieve information from web sites.

kmilo