views: 178

answers: 6

I have substantial PHP experience, although I realize that PHP probably isn't the best language for a large-scale web crawler because a process can't run indefinitely. What languages do people suggest?

+1  A: 

C++ - if you know what you're doing. You will not need a web server and a web application, because a web crawler is just a client, after all.

cripox
True! I guess I just need to interface it with a DB of sorts... any thoughts on Python?
Shamoon
PHP does not require a web server to use the CLI version of it. Just an "extra information" tidbit.
Brad F Jacobs
Well.. PHP can run in CLI, but it'll still overflow memory shortly, no?
Shamoon
No clue, never tried to create such an application in PHP. I would imagine so, however.
Brad F Jacobs
I think PHP would work pretty well, really. You can raise the amount of memory a process is allowed to consume, the time it's allowed to run, etc. You can let threads die after a while, just keep track of what still needs to be indexed in a database or something, and have new threads pick up where the old ones left off.
no
Unless you encounter a bug in PHP which causes a memory leak somewhere, you shouldn't have any problem with it running for extended periods of time. You just need to keep the used memory to a minimum, something you'd need to do in any other language as well.
deceze
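Picking up the suggestion in the comments above about tracking what still needs to be indexed in a database so that a fresh process can resume where an old one stopped, here is a minimal Python sketch of that idea (the table and helper names are purely illustrative):

    # Sketch: keep the crawl frontier in SQLite so a process can exit
    # (or be killed) and a new one can resume exactly where it left off.
    import sqlite3
    import urllib.request

    db = sqlite3.connect("frontier.db")
    db.execute("CREATE TABLE IF NOT EXISTS frontier (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

    def enqueue(url):
        # INSERT OR IGNORE keeps each URL in the queue at most once.
        db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
        db.commit()

    def next_url():
        row = db.execute("SELECT url FROM frontier WHERE done = 0 LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_done(url):
        db.execute("UPDATE frontier SET done = 1 WHERE url = ?", (url,))
        db.commit()

    enqueue("http://example.com/")
    while True:
        url = next_url()
        if url is None:
            break
        html = urllib.request.urlopen(url).read()
        # ... parse `html`, enqueue() any discovered links, store results ...
        mark_done(url)

Because the queue lives on disk rather than in memory, memory limits and process lifetimes stop being a hard constraint; the same pattern works in PHP with any database driver.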
+4  A: 

Any language you can easily use with a good network library and support for parsing the formats you want to crawl. Those are really the only qualifications.

Chuck
Those are the qualifications for it being *possible*, but I'd think the qualifications for it being a *good* language to do so in are a bit stricter (speed comes to mind especially).
peachykeen
@peachykeen: It's possible without the latter two — it would just be more work. As for speed, I suppose INTERCAL is probably a poor choice for a crawler, but I don't see why speed is more important for a webcrawler than any other kind of program (especially given that a Web-anything is extremely likely to be IO-bound). Your crawler would have to be pretty slow for its execution time to overwhelm the latency of the Web.
Chuck
@Chuck: Probably, but it's still something I'd consider. I wouldn't run the crawler on its own server, so it would need to share time with my desktop or server. Plus, I've seen some nasty pages that take a measurable time to load in most HTML parsers, so you'd have to be careful there. Network is probably your bottleneck, but IMO speed is worth a brief mention. :)
peachykeen
And, no less important, another qualification is "a language that lets you comfortably do everything other than the crawling itself". It makes a difference whether you're just collecting the data, processing it, doing *heavy* processing, or whatever else you plan to do with it.
Chubas
+5  A: 

Most languages would probably be a reasonable fit; the critical components are:

  1. Libraries to deal with the Internet protocols
  2. Libraries to deal with regular expressions
  3. Libraries to parse HTML content

Today most languages have libraries with good support for the above. Of course, you will also need some way to persist the results, which will probably mean a database of some sort.

More important than the language is understanding all the concepts you need to deal with. Here are some Python examples that might help get you started.

http://www.example-code.com/python/pythonspider.asp
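Separately from that link, here is a minimal sketch of the fetching and parsing pieces using only the Python standard library (URLs and limits are illustrative; persistence would slot in where marked):

    # Sketch: fetching (network library) plus link extraction (HTML parsing)
    # with only the Python standard library. Persisting results to a
    # database would go where the page body becomes available.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collect href attributes from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=10):
        queue, seen = [start_url], set()
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                body = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip pages that fail to fetch
            # ... persist (url, body) to the database of your choice here ...
            parser = LinkParser()
            parser.feed(body)
            queue.extend(urljoin(url, link) for link in parser.links)

    crawl("http://example.com/")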

Chris Taylor
+1  A: 

You could consider using a combination of python and PyGtkMozEmbed or PyWebKitGtk plus javascript to create your spider.

The spidering could be done in javascript after the page and all other scripts have loaded.

You'd have one of the few web spiders that supports javascript, and might pick up some hidden stuff the others don't see :)
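As a rough, untested illustration of that idea, assuming the old pywebkitgtk bindings (webkit.WebView): execute_script() returns nothing, so a commonly described workaround is to pass data back through document.title.

    # Rough sketch (untested): load a page in an embedded WebKit view, run
    # JavaScript once loading has finished, and read the result back via
    # document.title, since execute_script() itself returns nothing.
    import gtk
    import webkit

    def on_load_finished(view, frame):
        # Collect every href from the live DOM, after scripts have run.
        js = ("document.title = Array.prototype.map.call("
              "document.querySelectorAll('a[href]'), "
              "function (a) { return a.href; }).join(' ');")
        view.execute_script(js)
        print(view.get_main_frame().get_title())  # the joined list of links
        gtk.main_quit()

    window = gtk.Window()
    view = webkit.WebView()
    window.add(view)
    window.show_all()
    view.connect('load-finished', on_load_finished)
    view.open('http://example.com/')
    gtk.main()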

no
A: 

C# and C++ are probably the best two languages for this; it's just a matter of which you know better and which is faster (C# is probably easier).

I wouldn't recommend Python, Javascript, or PHP. They will usually be slower in text processing compared to a C-family language. If you're looking to crawl any significant chunk of the web, you'll need all the speed you can get.

I've used C# and the HtmlAgilityPack to do this before; it works relatively well and is pretty easy to pick up. The ability to use a lot of the same commands to work with HTML as you would with XML makes it nice (I had experience working with XML in C#).

You might want to test the speed of available C# HTML parsing libraries vs C++ parsing libraries. I know in my app, I was running through 60-70 fairly messy pages a second and pulling a good bit of data out of each (but that was a site with a pretty constant layout).

Edit: I notice you mentioned accessing a database. Both C++ and C# have libraries to work with most common database systems, from SQLite (which would be great for a quick crawler on a few sites) to midrange engines like MySQL and MSSQL up to the bigger DB engines (I've never used Oracle or DB2 from either language, but it's possible).

peachykeen
-1. A web crawler is primarily an IO-bound application. Writing it in C++ will not make the network work any faster.
aaronasterling
[citation needed] on Javascript and Python's excessive slowness for this task. The Googlebot was written in Python, and that was in an era when both Python and computer hardware were considerably slower than they are today.
Chuck
@aaronasterling: But writing it in PHP will make it *run* slower than C++. Depending on the complexity of the processing that needs to be done and what data needs to be saved to disk (say you have to walk the DOM and save every img src to a database), C++/C# could give you quite a performance increase. The actual parsing is all text operations, and plenty of them. Overall, C++ or C# provide better processing performance, regardless of network speed.
peachykeen
@Chuck: edited that sentence to not demonize them so much. They might not be too slow, but they'll be slower.
peachykeen
@peachykeen. you missed my point. There is plenty of time for python/javascript/php to process a page _while_ it is waiting on a network connection. As I said, it is an IO bound application.
aaronasterling
Agreed and disagreed. C *will* probably be faster, but probably not significantly so. Most XML/DOM parsing libraries are just interfaces for compiled C libraries anyway; the thin PHP/Python/Pwhatever wrapper shouldn't cause any significant slowdown. The main speed bump is network latency, and it doesn't really matter whether it's PHP or C++ code that's waiting for it.
deceze
@aaronasterling: But what are the chances of the crawler being the only thing running and able to consume as much CPU time as it wants? I've seen some pages that took half a second to load in C#, they were so messy. While it may not be critical in this case, parsing speed is still something I'd take into consideration.
peachykeen
@peachykeen. as deceze points out, most parsing libraries for the high level languages are written in c anyways. If they prove inadequate, then another one can be selected or that _one portion_ of the program can be written in c _after_ it has been shown to be too slow.
aaronasterling
There are actually historical examples of code written in "slow" languages ending up beating code written in C. Take Apache (not Apache2; both are written in C) and Tclhttpd (written in Tcl, which at the time was even slower than Perl). Tclhttpd beat Apache hands down for static file transfers. Apache2 learned from this and changed its I/O algorithm to be more like the way the Tcl interpreter does it. So C/C++ may not be fastest simply because it is fast at adding two numbers.
slebetman
@aaronasterling: I think you're missing my point in turn. ;) If I were doing this, *I* wouldn't use a high-level language that might waste *more* time, after seeing how some pages can cause even C-family languages to have a fit. If it goes slowly in those, it usually goes even slower in others. Call me anal, but I still stand by C++/C# for the processing end. It may just be opinion, though. :)
peachykeen
@slebetman: The part that will test the language comes after the file transfer, though. When you try to parse the file and the strings within it, then language speed will come into play. C/C++ isn't always faster, but compared to some (especially PHP) it usually gives you a leg up.
peachykeen
@peachykeen: On the other hand, a lot of interpreters (especially for old languages like Perl and Tcl) have optimized string manipulation routines, while C/C++ by contrast have very, very slow string manipulation routines. Compare `strlen()` in C, which is linear in time, to `string length` in Tcl, which is constant time. This is because "strings" in higher-level languages are typically implemented as proper data structures in the back end, not raw C strings. Of course, there are no guarantees. Microsoft's JavaScript, for instance, behaves as if the String object were merely a thin wrapper around C strings.
slebetman
A: 

Why write your own when you can copy this one: http://code.activestate.com/recipes/576551-simple-web-crawler/

You might need to fix a few things here and there, like using proper entity encoding (an htmlentities-style function) instead of just replacing & with &amp;.
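As an aside (not part of the linked recipe or the comment above), the Python standard library can handle the entity encoding rather than hand-replacing individual characters:

    # Escaping and unescaping HTML entities with the standard library
    # (Python 3), instead of hand-rolled single-character replacements.
    import html

    print(html.escape('Tom & Jerry <b>'))     # Tom &amp; Jerry &lt;b&gt;
    print(html.unescape('Fish &amp; Chips'))  # Fish & Chips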

bronzebeard