I need some help comparing programming languages such as C++, Java, Python, Ruby, and PHP for a task related to web data mining (developing a web crawler, string manipulation, etc.). I have a bit of experience with PHP, and I think its advantages for this particular task are simple syntax, in-depth string parsing capabilities, networking functions, and portability. I don't know much about the other languages, though, or their advantages and disadvantages for this task.

A: 

Google's first crawler was written in Python 1.5.

I'm no expert on other languages, but I would go with Python and html5lib or BeautifulSoup.

Kugel
+1  A: 

The specific language will not matter nearly as much as your familiarity with it. These days, all high-level languages come with the basics. Unless you need it to be super-fast (you're probably going to be limited by download speed, not by the speed at which you parse the HTML) or have other constraints not listed, the choice of language won't make much difference.

Just make sure that you use libraries: in particular, an HTML parsing library that copes well with invalid markup (not an XML parser), and regular expressions where appropriate.
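To illustrate that split, here is a minimal sketch in Python, assuming only the standard library. The names `LinkExtractor`, `extract_links`, `extract_prices`, and the price pattern are made up for this example; `html.parser` is used as a stand-in for whichever forgiving HTML parser you prefer.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, even if it is never closed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in possibly malformed HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# A regular expression suits small, well-defined patterns in text,
# such as prices. The pattern itself is an assumption for illustration.
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")


def extract_prices(text):
    """Return all substrings that look like dollar amounts."""
    return PRICE_RE.findall(text)


if __name__ == "__main__":
    # Unclosed tags and an unquoted attribute: an XML parser would reject this.
    broken = '<p>Unclosed <a href="/a">A<b>bold <a href=/b>B</a> <div>$19.99 and $5'
    print(extract_links(broken))
    print(extract_prices(broken))
```

The parser handles the document structure and the regular expression handles a narrow text pattern; using a regular expression for the structure itself tends to break on real-world pages.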

Daniel Grace
A: 

As a previous post implies, familiarity makes a big difference. I would also look at what each language was originally designed to do; it gives a good idea of what it's best at.

PHP - Designed for server-side scripting; not really ideal for this use.

Perl - Designed to pull text apart (a good start), with excellent libraries: look at LWP and the HTML modules such as HTML::TreeBuilder. A good choice, with an unrivalled selection of modules to plug in.

Python - A good choice; look at BeautifulSoup and urllib.

Ruby - Also a good choice; look at Hpricot. Ruby is a lot less mature than Perl or Python in terms of modules available.

I have written quite a bit of web spider/data mining software and have always used Perl. If I were starting from scratch today, I might choose Python.
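For a feel of what the Python route looks like, here is a rough sketch of a small same-host crawler using only the standard library (`urllib` plus `html.parser`). The function names `fetch` and `crawl` and the `fetch_page` parameter are invented for this example, and it deliberately leaves out things a real spider needs, such as robots.txt handling, rate limiting, and retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _Links(HTMLParser):
    """Gather raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch(url):
    """Download a page and decode it as text (needs network access)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl limited to the start URL's host.

    fetch_page is injectable so the crawl logic can be tried offline.
    Returns the URLs that were fetched successfully, in visit order.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling `crawl("http://example.com/", max_pages=5)` would fetch up to five pages on that host. Passing your own `fetch_page` (for instance, a dictionary lookup) lets you exercise the queueing and link resolution without touching the network.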

Kevin Philp