views: 178

answers: 6

I have substantial PHP experience, although I realize that PHP probably isn't the best language for a large-scale web crawler because a process can't run indefinitely. What languages do people suggest?

+1  A: 

C++ - if you know what you're doing. You will not need a web server and a web application, because a web crawler is just a client, after all.

cripox
True! I guess I just need to interface it with a DB of sorts... any thoughts on Python?
Shamoon
PHP does not require a web server to use the CLI version of it. Just an "extra information" tidbit.
Brad F Jacobs
Well.. PHP can run in CLI, but it'll still overflow memory shortly, no?
Shamoon
No clue, never tried to create such an application in PHP. I would imagine so, however.
Brad F Jacobs
I think PHP would work pretty well, really. You can raise the amount of memory a process is allowed to consume, the time it's allowed to run, etc. You can let threads die after a while, just keep track of what still needs to be indexed in a database or something, and have new threads pick up where the old ones left off.
no
Unless you encounter a bug in PHP which causes a memory leak somewhere, you shouldn't have any problem with it running for extended periods of time. You just need to keep the used memory to a minimum, something you'd need to do in any other language as well.
deceze
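Picking up the suggestion in the comments above about tracking what still needs to be indexed in a database so that a fresh process can resume where an old one stopped, here is a minimal Python sketch of that idea (the table and helper names are purely illustrative):

    # Sketch: keep the crawl frontier in SQLite so a process can exit
    # (or be killed) and a new one can resume exactly where it left off.
    import sqlite3
    import urllib.request

    db = sqlite3.connect("frontier.db")
    db.execute("CREATE TABLE IF NOT EXISTS frontier (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

    def enqueue(url):
        # INSERT OR IGNORE keeps each URL in the queue at most once.
        db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
        db.commit()

    def next_url():
        row = db.execute("SELECT url FROM frontier WHERE done = 0 LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_done(url):
        db.execute("UPDATE frontier SET done = 1 WHERE url = ?", (url,))
        db.commit()

    enqueue("http://example.com/")
    while True:
        url = next_url()
        if url is None:
            break
        html = urllib.request.urlopen(url).read()
        # ... parse `html`, enqueue() any discovered links, store results ...
        mark_done(url)

Because the queue lives on disk rather than in memory, memory limits and process lifetimes stop being a hard constraint; the same pattern works in PHP with any database driver.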
+4  A: 

Any language you can easily use with a good network library and support for parsing the formats you want to crawl. Those are really the only qualifications.

Chuck
Those are the qualifications for it being *possible*, but I'd think the qualifications for it being a *good* language to do so in are a bit stricter (speed comes to mind especially).
peachykeen
@peachykeen: It's possible without the latter two — it would just be more work. As for speed, I suppose INTERCAL is probably a poor choice for a crawler, but I don't see why speed is more important for a webcrawler than any other kind of program (especially given that a Web-anything is extremely likely to be IO-bound). Your crawler would have to be pretty slow for its execution time to overwhelm the latency of the Web.
Chuck
@Chuck: Probably, but it's still something I'd consider. I wouldn't run the crawler on its own server, so it would need to share time with my desktop or server. Plus, I've seen some nasty pages that take a measurable time to load in most HTML parsers, so you'd have to be careful there. Network is probably your bottleneck, but IMO speed is worth a brief mention. :)
peachykeen
And, no less important, another qualification is "a language that lets you comfortably do everything other than the crawling itself". It makes a difference whether you're just collecting the data, processing it, doing *heavy* processing, or whatever else you plan to do with it.
Chubas
+5  A: 

Most languages would probably be a reasonable fit; the critical components are:

  1. Libraries to deal with the Internet protocols
  2. Libraries to deal with regular expressions
  3. Libraries to parse HTML content

Today most languages have libraries with good support for the above. Of course, you will also need some way to persist the results, which will probably mean a database of some sort.

More important than the language is understanding all the concepts you need to deal with. Here are some Python examples that might help get you started.

http://www.example-code.com/python/pythonspider.asp
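Separately from that link, here is a minimal sketch of the fetching and parsing pieces using only the Python standard library (URLs and limits are illustrative; persistence would slot in where marked):

    # Sketch: fetching (network library) plus link extraction (HTML parsing)
    # with only the Python standard library. Persisting results to a
    # database would go where the page body becomes available.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collect href attributes from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=10):
        queue, seen = [start_url], set()
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                body = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip pages that fail to fetch
            # ... persist (url, body) to the database of your choice here ...
            parser = LinkParser()
            parser.feed(body)
            queue.extend(urljoin(url, link) for link in parser.links)

    crawl("http://example.com/")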

Chris Taylor
+1  A: 

You could consider using a combination of python and PyGtkMozEmbed or PyWebKitGtk plus javascript to create your spider.

The spidering could be done in javascript after the page and all other scripts have loaded.

You'd have one of the few web spiders that supports javascript, and might pick up some hidden stuff the others don't see :)
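As a rough, untested illustration of that idea, assuming the old pywebkitgtk bindings (webkit.WebView): execute_script() returns nothing, so a commonly described workaround is to pass data back through document.title.

    # Rough sketch (untested): load a page in an embedded WebKit view, run
    # JavaScript once loading has finished, and read the result back via
    # document.title, since execute_script() itself returns nothing.
    import gtk
    import webkit

    def on_load_finished(view, frame):
        # Collect every href from the live DOM, after scripts have run.
        js = ("document.title = Array.prototype.map.call("
              "document.querySelectorAll('a[href]'), "
              "function (a) { return a.href; }).join(' ');")
        view.execute_script(js)
        print(view.get_main_frame().get_title())  # the joined list of links
        gtk.main_quit()

    window = gtk.Window()
    view = webkit.WebView()
    window.add(view)
    window.show_all()
    view.connect('load-finished', on_load_finished)
    view.open('http://example.com/')
    gtk.main()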

no
A: 

C# and C++ are probably the best two languages for this; it's just a matter of which you know better and which is faster (C# is probably easier).

I wouldn't recommend Python, Javascript, or PHP. They will usually be slower in text processing compared to a C-family language. If you're looking to crawl any significant chunk of the web, you'll need all the speed you can get.

I've used C# and the HtmlAgilityPack to do this before; it works relatively well and is pretty easy to pick up. The ability to use a lot of the same commands to work with HTML as you would with XML makes it nice (I had experience working with XML in C#).

You might want to test the speed of available C# HTML parsing libraries vs C++ parsing libraries. I know in my app, I was running through 60-70 fairly messy pages a second and pulling a good bit of data out of each (but that was a site with a pretty constant layout).

Edit: I notice you mentioned accessing a database. Both C++ and C# have libraries to work with most common database systems, from SQLite (which would be great for a quick crawler on a few sites) to midrange engines like MySQL and MSSQL up to the bigger DB engines (I've never used Oracle or DB2 from either language, but it's possible).

peachykeen
-1. A web crawler is primarily an IO-bound application. Writing it in C++ will not make the network work any faster.
aaronasterling
[citation needed] on Javascript and Python's excessive slowness for this task. The Googlebot was written in Python, and that was in an era when both Python and computer hardware were considerably slower than they are today.
Chuck
@aaronasterling: But writing it in PHP will make it *run* slower than C++. Depending on the complexity of the processing that needs to be done and what data needs to be saved to disk (say you have to walk the DOM and save every img src to a database), C++/C# could give you quite a performance increase. The actual parsing is all text operations, and plenty of them. Overall, C++ or C# provide better processing performance, regardless of network speed.
peachykeen
@Chuck: edited that sentence to not demonize them so much. They might not be too slow, but they'll be slower.
peachykeen
@peachykeen. you missed my point. There is plenty of time for python/javascript/php to process a page _while_ it is waiting on a network connection. As I said, it is an IO bound application.
aaronasterling
Agreed and disagreed. C *will* probably be faster, but probably not significantly so. Most XML/DOM parsing libraries are just interfaces for compiled C libraries anyway; the thin PHP/Python/Pwhatever wrapper shouldn't cause any significant slowdown. The main speed bump is network latency, and it doesn't really matter whether it's PHP or C++ code that's waiting for it.
deceze
@aaronasterling: But what are the chances of the crawler being the only thing running and able to consume as much CPU time as it wants? I've seen some pages that took half a second to load in C#, they were so messy. While it may not be critical in this case, parsing speed is still something I'd take into consideration.
peachykeen
@peachykeen. as deceze points out, most parsing libraries for the high level languages are written in c anyways. If they prove inadequate, then another one can be selected or that _one portion_ of the program can be written in c _after_ it has been shown to be too slow.
aaronasterling
There are actually historical examples of code written in "slow" languages ending up beating code written in C. Take Apache (not Apache2; both are written in C) and Tclhttpd (written in Tcl, which at the time was even slower than Perl). Tclhttpd beat Apache hands down for static file transfers. Apache2 learned from this and changed its I/O algorithm to be more like the way the Tcl interpreter does it. So C/C++ may not be fastest simply because it is fast at adding two numbers.
slebetman
@aaronasterling: I think you're missing my point in turn. ;) If I were doing this, *I* wouldn't use a high-level language that might waste *more* time, after seeing how some pages can cause even C-family languages to have a fit. If it goes slowly in those, it usually goes even slower in others. Call me anal, but I still stand by C++/C# for the processing end. It may just be opinion, though. :)
peachykeen
@slebetman: The part that will test the language comes after the file transfer, though. When you try to parse the file and the strings within it, then language speed will come into play. C/C++ isn't always faster, but compared to some (especially PHP) it usually gives you a leg up.
peachykeen
@peachykeen: On the other hand, a lot of interpreters (especially for old languages like Perl and Tcl) have optimized string manipulation routines, while C/C++ by contrast have very, very slow string manipulation routines. Compare `strlen()` in C, which is linear in time, to `string length` in Tcl, which is constant time. This is because "strings" in higher-level languages are typically implemented as proper data structures in the back end, not raw C strings. Of course, there are no guarantees. Microsoft's JavaScript, for instance, behaves as if the String object were merely a thin wrapper around C strings.
slebetman
A: 

Why write your own when you can copy this one: http://code.activestate.com/recipes/576551-simple-web-crawler/

You might need to fix a few things here and there, like using proper entity encoding (an htmlentities-style function) instead of just replacing & with &amp;.
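As an aside (not part of the linked recipe or the comment above), the Python standard library can handle the entity encoding rather than hand-replacing individual characters:

    # Escaping and unescaping HTML entities with the standard library
    # (Python 3), instead of hand-rolled single-character replacements.
    import html

    print(html.escape('Tom & Jerry <b>'))     # Tom &amp; Jerry &lt;b&gt;
    print(html.unescape('Fish &amp; Chips'))  # Fish & Chips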

bronzebeard