I often find myself needing to do some simple screen scraping for internal purposes (e.g., a third-party service I use only publishes reports via HTML). I have at least two or three cases of this now. I could use Apache HttpClient and write all the necessary screen-scraping code, but it takes a while. Here is my usual process:
- Open up Charles Proxy, browse the site, and see what's going on.
- Start writing Java code using Apache HttpClient, dealing with cookies and multiple requests.
- Use Jericho HTML to parse the resulting HTML (a rough sketch of this kind of code follows below).
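For context, the hand-rolled code ends up looking roughly like this. This is just a minimal sketch assuming HttpClient 4.x and the Jericho parser; the URLs, form fields, and the fact that the report lives in table cells are all made up for illustration:

```java
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

import java.util.Arrays;

public class ReportScraper {

    public static void main(String[] args) throws Exception {
        // Shared cookie store so the session cookie from the login
        // request carries over to the later report request.
        BasicCookieStore cookieStore = new BasicCookieStore();
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build()) {

            // Request 1: log in (hypothetical URL and form fields).
            HttpPost login = new HttpPost("https://example.com/login");
            login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("user", "me"),
                    new BasicNameValuePair("password", "secret"))));
            EntityUtils.consume(client.execute(login).getEntity());

            // Request 2: fetch the report page observed in Charles.
            HttpGet report = new HttpGet("https://example.com/reports?month=2024-01");
            String html = EntityUtils.toString(client.execute(report).getEntity());

            // Parse the HTML with Jericho and pull out the table cells.
            Source source = new Source(html);
            for (Element cell : source.getAllElements(HTMLElementName.TD)) {
                System.out.println(cell.getTextExtractor().toString());
            }
        }
    }
}
```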
I wish I could just "record my session" quickly and then parametrize the things that vary from session to session. Imagine using Charles to capture all the raw HTTP requests and then parametrizing the relevant query string or POST params; voila, I'd have a reusable HTTP script (something like the template sketched below).
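The closest I've gotten to this by hand is keeping the recorded request around as a template and substituting the parts that vary. A tiny sketch of the idea, with a made-up URL and parameter names (a real version would also URL-encode the values):

```java
import java.util.HashMap;
import java.util.Map;

public class RecordedRequest {

    // A request captured in Charles, with the varying pieces
    // marked as {placeholders}.
    private static final String TEMPLATE =
            "https://example.com/reports?month={month}&format={format}";

    // Replace each {name} placeholder with the value supplied for this run.
    static String fill(String template, Map<String, String> params) {
        String url = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            url = url.replace("{" + e.getKey() + "}", e.getValue());
        }
        return url;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("month", "2024-02");
        params.put("format", "html");
        System.out.println(fill(TEMPLATE, params));
        // -> https://example.com/reports?month=2024-02&format=html
    }
}
```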
Is there anything that does this already? When I worked at a big company, we used a tool called LoadRunner by Mercury Interactive that had a nice way to record an HTTP session and make it reusable (for testing purposes). Unfortunately, that tool is very expensive.