I have substantial PHP experience, although I realize that PHP probably isn't the best language for a large-scale web crawler because a process can't run indefinitely. What languages do people suggest?
C++ - if you know what you're doing. You will not need a web server or a web application, because a web crawler is just a client, after all.
Any language you can easily use with a good network library and support for parsing the formats you want to crawl. Those are really the only qualifications.
Most languages would probably be a reasonable fit; the critical components are:
- Libraries to deal with the Internet protocols
- Libraries to deal with regular expressions
- Libraries to parse HTML content
Today most languages have libraries with good support for the above. Of course, you will also need some way to persist the results, most likely a database of some sort.
More important than the language is understanding all the concepts you need to deal with. Here is a short Python example that might help get you started.
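A minimal sketch using only the standard library, one piece per component above: urllib for the protocol work, a regex for the title, html.parser for link extraction, and sqlite3 for persistence. The seed URL, table schema, and page limit are just placeholders.

```python
import re
import sqlite3
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    conn = sqlite3.connect("crawl.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    queue, seen = [seed_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            # Protocol library: fetch the page over HTTP(S).
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable or non-HTTP link; skip it
        # Regular expressions: pull out the page title.
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        title = match.group(1).strip() if match else ""
        # Persistence: store what we found.
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
        conn.commit()
        # HTML parsing: collect outgoing links and queue them.
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    conn.close()

if __name__ == "__main__":
    crawl("https://example.com")  # placeholder seed URL
```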
You could consider using a combination of Python and PyGtkMozEmbed or PyWebKitGtk, plus JavaScript, to create your spider.
The spidering could be done in JavaScript after the page and all of its scripts have loaded.
You'd have one of the few web spiders that supports JavaScript, and it might pick up some hidden stuff the others don't see :)
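A rough sketch of the PyWebKitGtk route, assuming the old pywebkitgtk bindings (the gtk and webkit modules). The title-swap trick for getting data back out of the page's JavaScript is a well-known workaround, not an official API, so treat this as illustrative only:

```python
import gtk
import webkit

def on_load_finished(view, frame):
    # Run JavaScript inside the rendered page. The easiest way to get a
    # result back out is a side channel such as the document title.
    view.execute_script(
        "document.title = Array.prototype.map.call("
        "document.links, function(a) { return a.href; }).join(' ');")
    print(frame.get_title())  # the harvested links
    gtk.main_quit()

view = webkit.WebView()
view.connect("load-finished", on_load_finished)

# The view needs to live in a realized window for the page to render.
window = gtk.Window()
window.add(view)
window.show_all()

view.open("http://example.com")  # placeholder URL
gtk.main()
```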
C# and C++ are probably the best two languages for this; it's just a matter of which you know better and which is faster (C# is probably easier).
I wouldn't recommend Python, Javascript, or PHP. They will usually be slower in text processing compared to a C-family language. If you're looking to crawl any significant chunk of the web, you'll need all the speed you can get.
I've used C# and the HtmlAgilityPack to do so before; it works relatively well and is pretty easy to pick up. Being able to work with HTML through much the same API as you would use for XML makes it nice (I had experience working with XML in C#).
You might want to test the speed of available C# HTML parsing libraries vs C++ parsing libraries. I know in my app, I was running through 60-70 fairly messy pages a second and pulling a good bit of data out of each (but that was a site with a pretty constant layout).
Edit: I notice you mentioned accessing a database. Both C++ and C# have libraries to work with most common database systems, from SQLite (which would be great for a quick crawler on a few sites) to midrange engines like MySQL and MSSQL up to the bigger DB engines (I've never used Oracle or DB2 from either language, but it's possible).
Why write your own when you can copy this one: http://code.activestate.com/recipes/576551-simple-web-crawler/
You might need to fix a few things here and there, like decoding HTML entities properly instead of just replacing &amp; with &.
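For instance, in Python 3 (which the recipe predates, so treat this as an illustrative sketch) a proper unescape versus the naive replace:

```python
import html

fragment = "Fish &amp; Chips &lt;b&gt;bold&lt;/b&gt;"

# Naive replacement fixes one entity and misses all the others.
print(fragment.replace("&amp;", "&"))  # Fish & Chips &lt;b&gt;bold&lt;/b&gt;
# html.unescape handles named and numeric character references.
print(html.unescape(fragment))         # Fish & Chips <b>bold</b>
```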