views: 1150

answers: 7

I would like to know what the best language is for a lightweight and fast web crawler. Is it better to write it in C99, C++, or some other language?

A: 

That's extremely subjective. I could write the fastest web-crawler in the world using only assembly and a custom-engineered motherboard and chipset.

This will be totally dependent on how you want to write it, what you want it to do, and what your underlying hardware and data support structures are.

I would imagine your choice of data storage mechanism would have a much bigger impact on your overall performance than your programming language.

zombat
+4  A: 

These questions are always difficult to answer because, in order to give an appropriate answer, we need to know what your timeline is, what resources you have, and what type of performance you deem necessary. As always, it's not a question of "what is the best language" but rather of finding the balance between how fast YOU need to be and how fast the program needs to be.

I think a nice sweet spot is C or C++. If you need it sooner, C#/Java are better solutions. If you want to make the ultimate fastest crawler ever, do it in assembler.

Paperino
+1  A: 

As already stated, this is somewhat of a moot question. A web crawler should be relatively language-independent, so pick whatever you're most comfortable with. However, one of my roommates built one in Python a few months ago and seemed pretty happy with it.

Evan Meagher
+3  A: 

You'll probably get things done much more quickly with a scripting language like Perl or Python, both of which have the advantage of easy-to-use HTTP libraries and text/HTML parsing.

I'm not saying it will run the fastest, but you'll probably have to work much harder in a language like C++ just to get it working. Also, the performance advantage C/C++ would have over such scripting languages is negligible, since a web crawler is likely to be more I/O-bound than CPU-bound.

So I say, use Python. Even if you don't know it, it'll take you less time to learn than to implement the same functionality in C/C++.
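
To make the "easy HTTP and HTML parsing" point concrete, here is a minimal sketch of a single fetch-and-extract-links step using only the Python standard library. The start URL is a placeholder, and a real crawler would add error handling, politeness delays, and robots.txt checks.

    # Minimal single-page fetch-and-parse step, standard library only.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects the href value of every <a> tag seen while parsing."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch_links(url):
        """Download one page and return the links it contains."""
        with urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(html)
        return parser.links

    print(fetch_links("http://example.com/"))  # placeholder start URL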

Assaf Lavie
A: 

C# would be a quick and powerful language to use if you are on the Windows platform. Java would be a good way to go on Linux/LAMP. Of course, it is really up to you and what you are comfortable with, but either of these would be excellent choices.

alchemical
+1  A: 

A web crawler is going to be limited by network latency, not CPU, so the language isn't likely to be the deciding factor in program speed. Depending on the size of your data-mining task, you might run into CPU issues at a later phase, after the data have been downloaded.

If you mentioned C and C++ because they are the only languages you know, it might be time to learn another language such as Ruby or Python, which have different style considerations than C-based languages. Python has libraries built in to support this type of project (Ruby may as well; I don't know it as well).
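
As an illustration of the I/O-bound point (not part of the original answer), here is a rough Python sketch: with a handful of worker threads the fetches overlap, so total wall time is dominated by network round trips rather than by how fast the language executes. The URLs are placeholders.

    # Overlapping fetches with a thread pool; the URLs are placeholders
    # and error handling is omitted for brevity.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        """Download one page and return (url, size in bytes)."""
        with urlopen(url) as response:
            return url, len(response.read())

    urls = ["http://example.com/", "http://example.org/", "http://example.net/"]

    # Three worker threads spend most of their time waiting on the network,
    # so the total wall time is roughly one round trip instead of three.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for url, size in pool.map(fetch, urls):
            print(url, size)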

acrosman
+1  A: 

I have written a crawler entirely in PHP, using MySQL for storage. The major issue with PHP is that arrays (don't get me started on what counts as an array) become very slow as they grow in size (number of rows).

However, I would suggest Java; even if it's "heavy" at startup, it's pretty darn fast once it's done initializing.

Oh, and to speed up your search pages, I would advise taking a look at http://www.sphinxsearch.com/

Good luck!