views: 586
answers: 7

I need to build a content-gathering program that will simply read numbers on specified web pages and save that data for later analysis. I don't need it to follow links or gather related data, just to collect all the data from websites whose content will change daily.

I have very little programming experience, and I am hoping this will be a good learning project. Speed is not a huge issue; I estimate the crawler would have to load at most 4000 pages in a day.

Thanks.

Edit: Is there any way to test ahead of time whether the websites I am gathering data from are protected against crawlers?

+3  A: 

Python probably, or Perl.

Perl has the very nice LWP, and Python has urllib2.

Both are easy scripting languages, available on most Linux/Unix systems.
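For illustration, a minimal Python 2 sketch of the fetch-and-extract step with urllib2 might look like the following; the URL and the number pattern are placeholders you would adapt to the actual pages:

# Minimal sketch: fetch one page with urllib2 (Python 2) and log every
# number found on it. The URL and the regex are illustrative placeholders.
import re
import urllib2

url = "http://example.com/daily-numbers"        # placeholder page
html = urllib2.urlopen(url, timeout=30).read()

numbers = re.findall(r"\d+(?:\.\d+)?", html)    # all integers/decimals on the page
with open("numbers.log", "a") as log:
    log.write("%s\t%s\n" % (url, ",".join(numbers)))

Loop that over your list of URLs and run it from cron once a day, and you have the whole collection step.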

I've written a crawler in Perl quite a few times; it's an evening of work.

And no, websites can't really protect themselves from crawlers, except by using a CAPTCHA of some sort - everything else is easier to crack than to set up.

There was a question about Java: Java is fine. It's more verbose and requires some development environment setup, so you wouldn't do it in one evening - more likely a week. For the small task the question author described, that might be overkill. On the other hand, there are very useful libraries like lint, tagsoup (DOM traversal for the messy HTML out there) and lucene (full-text indexing and search), so you might want Java for more serious projects. In that case, I'd recommend the Apache commons-httpclient library for web crawling (or nutch, if you're crazy :).

Also: there are off-the-shelf products that monitor changes on specified websites and present them in useful ways, so you might just grab one of those.

alamar
websites can't protect themselves against crawlers, but crawlers are honour-bound to obey the Robots Exclusion Protocol - and clients are honour-bound to abide by the terms of service for any website.
DDaviesBrackett
I agree with you.
alamar
Most things on the web bring up Java for programming crawlers. Is Java too complicated, or what's the problem with it?
Alex
Added a section about Java.
alamar
Python also has Beautiful Soup.
Lucas Jones
Can't describe what I don't know.
alamar
Roughly one year later: I am now fluent in Python. I would highly recommend Python to first-time programmers, simply because the syntax is attractive, and that was the biggest problem for me in learning to program. Python for life.
Alex
A: 

I'd say PHP. It's a web-oriented language, meaning it has lots of library functions to do all the odd little things you'll need in a project like this. It has a good library for this built in (cURL), and it's a dead simple language. You will outgrow it relatively fast if you continue programming in it, but for something simple like this, it's a good choice.

dsimcha
I wouldn't recommend PHP for client-side scripting. It can be done, but it's backwards.
alamar
+1  A: 

Is there any way to test ahead of time if the websites from which I am gathering data are protected against crawlers?

Other than CAPTCHAs, it is good etiquette to respect the contents of the robots.txt file if it exists.
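If you want to check programmatically, Python's standard library ships a robots.txt parser (robotparser in Python 2), so the test is only a few lines; a minimal sketch, with the site and user-agent as placeholders:

# Minimal sketch: ask a site's robots.txt whether a page may be crawled.
# The URLs and the user-agent string are illustrative placeholders.
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
    print "robots.txt allows fetching this page"
else:
    print "robots.txt asks crawlers to stay away"

Note that robots.txt is advisory: it tells you what the site operator wants, not whether anything will technically stop you.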

Kevin Loney
+1  A: 

The language you are most comfortable with is more than likely the best language to use.

I have very little programming experience

You might find that a web crawler is a bit of a baptism of fire, and that you need to build a few simpler applications first to become familiar with your chosen language (and framework, if applicable).

Good luck!

Greg B
A: 

Perl or Python are the obvious choices; it really depends on what suits you best at the end of the day. Neither is that difficult, but in general, if you prefer a flowing, flexible, linguistic language, Perl will suit you better, whereas if you prefer a more rigid language with a more mathematical mindset (especially the belief that there is only one right way to do something), you'll probably feel more at home in Python. Other languages can do the job pretty well, but those two are the obvious choices because of their portability and their strength at CLI scripting tasks, especially text manipulation. They are also strong web-development languages, which means large numbers of useful modules are available for web-oriented tasks (giving the benefit of PHP mentioned above, without its downsides for client-side work). If the number of available modules matters to you, Perl has far more for this kind of task than any other language (on CPAN); it might be worth checking whether there is code you can reuse before deciding which language to use. In certain areas one is faster than the other (Python generally excels at complex maths; Perl can generally process text quicker, though it depends how you do it).

Other language choices are out there: a compiled language is less portable, and so generally more of a pain to set up on a server, but it executes faster. Scripting languages are generally designed to manipulate text and files with greater ease than compiled languages, though that's not always true. I feel more comfortable with Perl, so I'd use it, but that's not the basis you should make your decision on. Find out which has more resources you can use and which you like the feel of better (read some code and see which style makes more sense to you), and then decide.

Oh, and O'Reilly has a book, Programming Collective Intelligence, aimed at beginners to the subject. I've never read it, but it's supposed to be pretty good; flick through it in a shop and give it some consideration, as it's largely about web-crawler algorithms... It uses Python for its examples.

Toby
A: 

I did create a web crawler once, but it was built to search through sites for links to other sites and follow them. It had to remember those links and make sure I wouldn't visit a site twice, so I needed a very quick way to check for duplicate URLs. To do this, I created my own hash table in Delphi 2007. With some additional knowledge of how to use the Internet Explorer COM interface, I managed to read quite a lot of pages in a short amount of time. I used Delphi to write this crawler because I wanted a lot of performance.
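For comparison, in a language with a built-in hash-based set you get the same duplicate check without writing your own hash table. A rough Python 2 sketch follows; the start URL and the crude href regex are simplifications for illustration:

# Rough sketch: follow links while skipping URLs already seen.
# Python's set is hash-based, so membership tests are cheap.
import re
import urllib2

def fetch_links(url):
    # Download a page and pull out absolute http links with a crude regex.
    html = urllib2.urlopen(url, timeout=30).read()
    return re.findall(r'href="(http[^"]+)"', html)

visited = set()
queue = ["http://example.com/"]     # placeholder start page

while queue:
    url = queue.pop()
    if url in visited:
        continue                    # already crawled, skip it
    visited.add(url)
    try:
        links = fetch_links(url)
    except IOError:
        continue                    # skip pages that fail to load
    for link in links:
        if link not in visited:
            queue.append(link)

Here the set plays the role of the hand-written hash table: checking "url in visited" is a constant-time hash lookup.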

Then again, I also chose Delphi because it's the language that I'm most comfortable with, plus it helped me to learn a lot about several interesting topics, including how to write your own hash-table algorithms. Besides, it was a very interesting challenge for an experienced programmer like me.

My advice has already been given: use the tools that you're most comfortable with.

Workshop Alex
A: 

If you're a beginner, I would suggest an "easy" language such as REBOL. In REBOL, a basic script to check a bunch of Wikipedia pages for modifications would look like the code below. Obviously, "easy" is subjective, and you'd still need to make some basic changes to this code to meet your requirements.

records: load %records.txt
; (content of records.txt file looks like this - indentation not important)
[
    [en.wikipedia.org/wiki/Budget_deficit
    "US Budget Deficit (wikipedia)"
    {<li id="lastmod">This page was last modified on }
    "1 June 2009 at 11:26."]

    [en.wikipedia.org/wiki/List_of_U.S._states_by_unemployment_rate
    "US Unemployment Rate (wikipedia)"
    {<li id="lastmod">This page was last modified on }
    "25 May 2009 at 20:15."]
]

; Now loop through the records and check the web for changes
foreach rec records [
    html: read rec/1   ; add error-checking here for 404s or timeouts
    parse/all html [any [thru rec/3 copy curr-mod-date to </li>]]
    unless rec/4 = curr-mod-date [
        print ["CHANGE DETECTED:" rec/2]
        ; parse again to collect and save specific page data here
        ; update %records.txt file with updated timestamps
    ]
]

REBOL is not well-known, but it's friendly, very tiny, cross-platform and GUI-enabled. I've had a lot of success with it for quick and dirty scripts.

Edoc