I'm setting up a large database that will generate statistical reports from incoming data.
The system will, for the most part, operate as follows (a rough SQL sketch of this cycle follows the list):

1. Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. It's approximately 60MB in flat file form, but grows steeply in the DB once suitable indexes are added.
2. Various statistics will be generated from the current day's data.
3. Reports from these statistics will be generated and stored.
4. The current data set will be copied into a partitioned history table.
5. Throughout the day, the current data set (which was copied, not moved) can be queried by end users - typically not for constant lookups, but for relationships between fields.
6. Users may request specialized searches against the history table, but those queries will be crafted by a DBA.
7. Before the next day's upload, the current data table is truncated.
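To make the cycle concrete, here's a minimal sketch in MySQL syntax. All table, column, and file names here (daily_data, report_stats, history, the file path) are hypothetical placeholders, not our actual schema:

```sql
-- Step 1: morning bulk load into the current-day table
LOAD DATA INFILE '/data/incoming/daily.txt'
INTO TABLE daily_data
FIELDS TERMINATED BY '\t';

-- Steps 2-3: generate and store the day's statistics
INSERT INTO report_stats (report_date, metric, value)
SELECT CURDATE(), 'row_count', COUNT(*) FROM daily_data;

-- Step 4: copy (not move) the day's rows into the partitioned history table
INSERT INTO history SELECT * FROM daily_data;

-- Step 7: just before the next upload, clear the current-day table
TRUNCATE TABLE daily_data;
```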
This will essentially be version 2 of our existing system.
Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. The history table (#4) is currently not partitioned, as 5.0 doesn't support partitioning. To get around the tremendous amount of time (hours and hours) it's taking to insert records into history, we write each day to an unindexed history_queue table, and then on the weekends, during our slowest time, write the queue to the history table (sketched below). The problem is that any historical queries generated during the week can be several days behind by then. We can't reduce the indexes on the history table or its queries become unusable.
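For clarity, the workaround looks roughly like this; history_queue is our real table name, the rest is illustrative:

```sql
-- Weekdays: append to a queue table with no indexes, so inserts stay fast
INSERT INTO history_queue SELECT * FROM daily_data;

-- Weekend: drain the week's queue into the heavily indexed history table
INSERT INTO history SELECT * FROM history_queue;
TRUNCATE TABLE history_queue;
```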
We're definitely moving to at least MySQL 5.1 for the next release (if we stay with MySQL), but we're strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering whether anybody had advice relevant to this situation; most of the published comparisons revolve around web site usage. Indexing is really our main beef with MySQL, and it seems like PostgreSQL could help us out through partial indexes and indexes based on functions (examples below).
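For anyone unfamiliar with those two PostgreSQL features, here's roughly what they look like; the index, column, and value names are made up for illustration:

```sql
-- Partial index: index only the slice of rows a report actually hits,
-- keeping the index small and cheap to maintain
CREATE INDEX idx_history_errors
    ON history (event_time)
    WHERE status = 'ERROR';

-- Expression (function-based) index: queries filtering on the same
-- expression, e.g. WHERE lower(product_code) = 'abc', can use it
CREATE INDEX idx_history_code_lower
    ON history (lower(product_code));
```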
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3, or is it more balanced now?
Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle were.
NOTE on MyISAM vs InnoDB for us: We were running InnoDB, and we found it MUCH slower - like 3-4 times slower. BUT, we were also much newer to MySQL, and frankly I'm not sure we had the DB tuned appropriately for InnoDB.
We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.
In regards to 5.1: I've heard the concerns about stability issues in 5.1. Generally, I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up, given the chance to re-engineer the project.
In regards to PostgreSQL gotchas: COUNT(*) without any WHERE clause is a pretty rare case for us, so I don't anticipate it being an issue. COPY FROM isn't nearly as flexible as LOAD DATA INFILE, but an intermediate loading table will fix that. My biggest concern is the lack of INSERT IGNORE. We've often used it when building a processing table so that we could avoid inserting duplicate records and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used just infrequently enough for its absence to be tolerable (one possible workaround is sketched below).
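For what it's worth, here's the sort of approach I have in mind for both gaps, assuming PostgreSQL 8.3-era syntax; the table and column names (staging_data, processing_table, id, col_a, col_b) are hypothetical:

```sql
-- Load the raw file into a permissive, unindexed staging table first
-- (server-side COPY; from psql, \copy works without superuser rights)
COPY staging_data FROM '/data/incoming/daily.txt' WITH DELIMITER E'\t';

-- Emulate INSERT IGNORE: DISTINCT ON drops duplicates within the batch,
-- and NOT EXISTS skips keys already present in the target table
INSERT INTO processing_table (id, col_a, col_b)
SELECT DISTINCT ON (s.id) s.id, s.col_a, s.col_b
FROM staging_data s
WHERE NOT EXISTS (
    SELECT 1 FROM processing_table p WHERE p.id = s.id
);
```

This isn't safe against concurrent writers to the same table, but for a single nightly batch job that shouldn't matter.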