Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. What would be a good database to use as a backend for this project?

Thanks in advance!

+5  A: 

Personally, I love PostgreSQL -- but other free databases such as MySQL (or, if you have reasonably small amounts of data -- a few GB at most -- even the SQLite that comes with Python) will be fine too.
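For the SQLite route, a minimal sketch of an article store using the `sqlite3` module from the standard library might look like the following. The table name `articles`, its columns, and the helpers `open_store` and `save_article` are assumptions made up for illustration, not part of any crawler library.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url        TEXT PRIMARY KEY,   -- one row per article URL
            title      TEXT NOT NULL,
            body       TEXT NOT NULL,
            fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_article(conn, url, title, body):
    """Insert an article; return True if it was new, False if already stored."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO articles (url, title, body) VALUES (?, ?, ?)",
            (url, title, body),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_store()
    print(save_article(conn, "https://example.com/a", "Title", "Text"))
    print(save_article(conn, "https://example.com/a", "Title", "Text"))
```

Using the URL as the primary key means re-crawling the same page does not create duplicate rows. Passing a file path instead of `":memory:"` keeps the data between runs.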

Alex Martelli
+1 Beat me to it. I would personally go with MySQL over Postgres, but that's just because I'm already familiar with it.
Chinmay Kanchi
+2  A: 

I think the database itself will probably be one of the easier aspects of a web crawler like this.

If you expect a high read or write load on the database (for example, if you intend to run many crawlers at the same time), then you will want to steer in the direction of MySQL; otherwise, something like SQLite will probably do you just fine.
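To illustrate the concurrency concern: SQLite allows only one writer at a time, so if you stay with it while running several crawlers, one common workaround is to funnel all writes through a single writer thread. The sketch below assumes that pattern; the functions `crawler`, `writer`, and `run_crawl`, the table layout, and the example.com URLs are all hypothetical, and the crawlers only produce placeholder rows instead of fetching pages.

```python
import queue
import sqlite3
import threading


def writer(db_path, q, result):
    """Single writer thread: the only code that touches the database."""
    conn = sqlite3.connect(db_path)  # connection is created and used in this thread only
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT)"
    )
    while True:
        item = q.get()
        if item is None:  # sentinel: all crawlers are finished
            break
        conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?)", item)
    conn.commit()
    result["count"] = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    conn.close()


def crawler(worker_id, q, pages=3):
    """Stand-in for a real crawler: queues placeholder (url, title) rows."""
    for n in range(pages):
        q.put((f"https://example.com/{worker_id}/{n}", f"Article {n}"))


def run_crawl(num_crawlers=4, db_path=":memory:"):
    """Run several crawler threads feeding one writer; return the stored row count."""
    q = queue.Queue()
    result = {}
    w = threading.Thread(target=writer, args=(db_path, q, result))
    w.start()
    crawlers = [
        threading.Thread(target=crawler, args=(i, q)) for i in range(num_crawlers)
    ]
    for t in crawlers:
        t.start()
    for t in crawlers:
        t.join()
    q.put(None)  # tell the writer to stop once every crawler has finished
    w.join()
    return result["count"]


if __name__ == "__main__":
    print(run_crawl())
```

With a client-server database such as MySQL, each crawler could instead hold its own connection and write directly, which is the main reason to prefer it under heavy concurrent load.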

Kragen
+3  A: 

This could be a great project to use a document database like CouchDB, MongoDB, or SimpleDB.

MongoDB has a hosted solution: http://mongohq.com

SimpleDB is a great choice if you are hosting this on Amazon Web Services.

CouchDB is an open source project from the Apache Software Foundation.

Jackson Miller
If the number of records increases, will these databases be able to cope?
Prabhu
That is part of why I think a crawler would be well suited to these databases. Google's underlying database is Bigtable, which is similar in design to the databases I mentioned. SimpleDB has a 10 GB limit per domain and a 2500-result limit on SELECT statements. I don't know of any size limitations for CouchDB or MongoDB (that doesn't mean they aren't there, just that I couldn't find them with a Google search).
Jackson Miller
A: 

You can take a look at Firebird.

The Firebird Python driver is developed by the core team.

Hugues Van Landeghem