I'm setting up a large database that will generate statistical reports from incoming data.
The system will for the most part operate as follows:

  1. Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. It's approximately 60MB in flat-file form, but grows steeply in the DB with the addition of suitable indexes.
  2. Various statistics will be generated from the current day's data.
  3. Reports from these statistics will be generated and stored.
  4. Current data set will get copied into a partitioned history table.
  5. Throughout the day, the current data set (which was copied, not moved) can be queried by end users; their queries are not likely to filter on constants, but rather on relationships between fields.
  6. Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
  7. Before the next day's upload, the current data table is truncated.

This will essentially be version 2 of our existing system.

Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. #4 is currently not a partitioned table, as 5.0 doesn't support it. To get around the tremendous amount of time (hours and hours) it's taking to insert records into history, we write each day to an unindexed history_queue table and then, on the weekends during our slowest time, write the queue to the history table. The problem is that any historical queries generated during the week can be several days out of date. We can't reduce the indexes on the historical table or its queries become unusable.

We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release, but we're strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering if anybody had any advice relevant to this situation. Most of the research out there revolves around web-site usage. Indexing is really our main beef with MySQL, and it seems like PostgreSQL may help us out through partial indexes and indexes on functions.

I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?

Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle were.

NOTE on MyISAM vs. InnoDB for us: We were running InnoDB and, for us, we found it MUCH slower - like 3-4 times slower. BUT we were also much newer to MySQL, and frankly I'm not sure we had the db tuned appropriately for InnoDB.

We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.

In regards to 5.1: I've heard the stability concerns with 5.1. Generally I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up, given the chance to re-engineer the project.

In regards to PostgreSQL gotchas: COUNT(*) without any WHERE clause is a pretty rare case for us, so I don't anticipate it being an issue. COPY FROM isn't nearly as flexible as LOAD DATA INFILE, but an intermediate loading table will fix that. My biggest concern is the lack of INSERT IGNORE. We've often used it when building a processing table so that we could avoid inserting records twice and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used infrequently enough for its absence to be tolerable.
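
A rough sketch of the kind of workaround I'd expect to use instead (the table and column names below are just placeholders):

-- emulate INSERT IGNORE by only inserting keys that don't already exist
INSERT INTO processing (id, val)
SELECT DISTINCT s.id, s.val
FROM   staging s
WHERE  NOT EXISTS (SELECT 1 FROM processing p WHERE p.id = s.id);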

+1  A: 

In my experience, InnoDB is slightly faster for really simple queries, pg for more complex queries. MyISAM is probably even faster than InnoDB for retrieval, but perhaps slower for indexing/index repair.

These mostly-varchar fields - are you indexing them with char(n) indexes?

Can you normalize some of them? It'll cost you on the rewrite, but may save time on subsequent queries, as your row size will decrease, thus fitting more rows into memory at one time.
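
For example (the table and column names here are purely hypothetical), one of the wide varchars could be replaced by a small integer key:

-- hypothetical: move a wide varchar into a lookup table keyed by an int
CREATE TABLE shippers (
    shipper_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    shipper_name VARCHAR(30) NOT NULL,
    UNIQUE KEY (shipper_name)
) ENGINE=MyISAM;

-- the daily table then stores a 4-byte key instead of a 30-byte string
ALTER TABLE daily_data ADD COLUMN shipper_id INT UNSIGNED, ADD KEY (shipper_id);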

ON EDIT:

OK, so you have two problems: query time against the daily data, and updating the history, yes?

As to the second: in my experience, MySQL MyISAM is bad at re-indexing. On tables the size of your daily table (0.5 to 1M rather wide records - denormalized flat input), I found it was faster to re-write the table than to insert and wait for the re-indexing and the attendant disk thrashing.

So this might or might not help:

CREATE TABLE new_table SELECT * FROM old_table;

copies the table but not the indexes.

Then insert the new records as normal. Then create the indexes on the new table (and wait a while). Finally, drop the old table and rename the new table to the old table's name.
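
Something like this, end to end (the table, column, and index names are placeholders):

CREATE TABLE history_new SELECT * FROM history;       -- copies the rows, no indexes
INSERT INTO history_new SELECT * FROM current_day;    -- append the new day's records
ALTER TABLE history_new ADD INDEX (some_col), ADD INDEX (other_col);  -- build indexes once
DROP TABLE history;
RENAME TABLE history_new TO history;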

Edit: In response to the fourth comment: I don't know that MyISAM is always that bad. I know that in my particular case I was shocked at how much faster copying the table and then adding the indexes was. As it happened, I was doing something similar to what you're doing - copying large denormalized flat files into the database and then renormalizing the data. But that's an anecdote, not data. ;)

(I also think I found that, overall, InnoDB was faster, given that I was doing as much inserting as querying. A very special case of database use.)

Note that copying with a SELECT a.*, b.value AS foo ... JOIN was also faster than an UPDATE a.foo = b.value ... JOIN, which makes sense, since the update was to an indexed column.

tpdi
They are indexed char(n), yes. It's normalized where it makes sense, but most of the values that are limited in nature are already CHAR(1), so normalizing doesn't gain us much. Unfortunately, most of the queries end up getting run against silly varchar(30) or sometimes even varchar(100) fields.
rfusca
Have you checked if indexing CHAR(1) fields gets you anything? A table scan may be faster... And that will certainly speed up the inserts.
derobert
I'd have to check for certain, but I believe we already took out all the indexes on the CHAR(1) fields for just that reason. The data is basically logistics- and inventory-type stuff - shippers, product descriptions, cities, states - but with a huge, huge variety.
rfusca
Hmmm... I'll give the create-table thing a shot, but if MyISAM is THAT bad at reindexing - WOW. That kind of seals the deal on moving to nearly anything else. We're doing OK - not great, but OK - on the current-day query time. It's the trade-off between history inserts and history query time that kills us.
rfusca
+1  A: 

What is not clear to me is how complex the analytical processing is. In my opinion, having 500K records to process should not be such a big problem; in terms of analytical processing, it is a small record set.

Even if it is a complex job, if you can leave it to complete overnight (since it is a daily process, as I understood from your post), that should still be enough.

Regarding the resulting table, I would not reduce its indexes. Again, you can do the loading overnight, including the index refresh, and have the resulting, updated data set ready for use in the morning, with quicker access than with raw (non-indexed) tables.

I saw PostgreSQL used in a data-warehouse-like environment, working with the setup I've described (data transformation jobs overnight), and with no performance complaints.

Cătălin Pitiș
The static daily processing (script/db time included) actually runs impressively fast. It uploads the flat file, shifts some tables around, calculates stats (~100 different numbers), generates report tables, generates PNG charts from the stats, generates PDFs from the charts and stats, and emails them, all in ~3 min. :)
rfusca
The situation around the history is the issue - 500k is not a problem; 400 million is much more of a problem.
rfusca
That's a different story, and you probably need partitioning support for the history table(s). Indexing, for sure.
Cătălin Pitiș
+1  A: 

I'd go for PostgreSQL. You need, for example, partitioned tables, which have been in stable Postgres releases since at least 2005 - in MySQL they are a novelty. I've heard about stability issues in the new features of 5.1. With MyISAM you have no referential integrity or transactions, and concurrent access suffers a lot - read the blog entry "Using MyISAM in production" for more.
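
In the 8.x line, partitioning is done with table inheritance plus CHECK constraints; a minimal sketch (the table and column names are only examples):

-- parent table holds the common definition
CREATE TABLE history (
    event_date date NOT NULL,
    shipper    varchar(30)
    -- ... the rest of your ~30 columns
);

-- one child table per month, constrained to its date range
CREATE TABLE history_2009_01 (
    CHECK (event_date >= DATE '2009-01-01' AND event_date < DATE '2009-02-01')
) INHERITS (history);

CREATE INDEX history_2009_01_date_idx ON history_2009_01 (event_date);

-- with constraint_exclusion = on, a query filtering on event_date
-- only scans the matching child tables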

And Postgres is much faster on complicated queries, which will be good for your #6. There is also a very active and helpful mailing list, where you can get support for free, even from core Postgres developers. It does have some gotchas, though.

Tometzky
See updated post.
rfusca
+3  A: 

In my practical experience I have to say that PostgreSQL had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases, in some instances 2x-3x faster), and from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect there is some performance improvement there too; I haven't tested it so far.

Regarding indexes, I would recommend dropping them and only creating them again after filling the database with your data - it is much faster.
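
For example (the index name, column, and file path are placeholders):

DROP INDEX history_shipper_idx;
COPY history FROM '/tmp/daily_upload.csv' WITH CSV;     -- bulk load with no index maintenance
CREATE INDEX history_shipper_idx ON history (shipper);  -- rebuild once, at the end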

Further, tune the crap out of your PostgreSQL settings; there is so much to gain from it. The default settings are at least sensible now - in pre-8.2 times, pg was optimized for running on a PDA.

In some cases, especially if you have complicated queries, it can help to deactivate nested loops in your settings, which forces pg to use better-performing approaches for your queries.
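
For what it's worth, these are the kinds of knobs involved; the values below are purely illustrative and depend entirely on your RAM and workload:

# postgresql.conf (illustrative values only)
shared_buffers       = 1GB     # the default is tiny; give pg real memory
work_mem             = 64MB    # per-sort/hash memory, helps big aggregates
effective_cache_size = 4GB     # tell the planner how much the OS is caching
checkpoint_segments  = 32      # fewer checkpoints during bulk loads
enable_nestloop      = off     # or SET enable_nestloop = off; per session for a problem query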

Ah, yes, did I say that you should go for PostgreSQL?

(An alternative would be Firebird, which is not as flexible, but in my experience it performs much better than MySQL and PostgreSQL in some cases.)

Mauli
+1 for mention of firebird.
Avery Payne
+2  A: 

My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only 60GB, covering ~21 million rows, with the largest table having 16 million rows. There are an additional ~15 million rows waiting to come into the pipe, but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "job" facility to schedule queries that would regenerate data on a daily basis, suitable for use in analytics.

Running simple aggregates over the large 16-million-record table, the first thing I noticed is how sensitive performance is to the amount of RAM available. An increase in RAM at one point allowed a year's worth of aggregates to run without resorting to sequential table scans.

If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little bit, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.

Once you have the server-side tuning done (and it's all about memory, surprise!), you'll turn your attention to your indexes. Indexing and query planning also require a little effort, but once set you'll find them effective. Partial indexes are a nice feature for isolating records that have "edge-case" data in them; I highly recommend this feature if you are looking for exceptions in a sea of similar data.
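
A partial index looks something like this (the column names and predicate are hypothetical):

-- index only the "edge-case" rows, e.g. late shipments
CREATE INDEX history_late_ship_idx ON history (ship_date) WHERE status = 'LATE';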

Lastly, use the tablespace feature to relocate the data onto a fast drive array.
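
For example (the path and names are just placeholders):

CREATE TABLESPACE fastdisk LOCATION '/mnt/fast_array/pg_data';
ALTER TABLE history SET TABLESPACE fastdisk;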

Avery Payne
I'm really thinking partial indexes may help significantly for some of our historical queries - it's one of the biggest reasons we're looking at PostgreSQL. We've started trying to tune, but we don't have the db fully set up, so we've tabled tuning until we get some full datasets in.
rfusca
+1  A: 

The Infobright people appear to be doing some interesting things along these lines:

http://www.infobright.org/

-- psj

psj
A: 

If Oracle is not considered an option because of cost issues, then Oracle Express Edition is available for free (as in beer). It has size limitations, but if you do not keep history around for too long anyway, it should not be a concern.

andri
Oracle Express doesn't even come close to holding the amount of data I'll need.
rfusca
A: 

Check your hardware. Are you maxing out the I/O? Do you have buffers configured properly? Is your hardware sized correctly? Memory for buffering and fast disks are key.

If you have too many indexes, it'll slow inserts down substantially.

How are you doing your inserts? If you're doing one record per INSERT statement:

INSERT INTO blah VALUES (?, ?, ?, ?)

and calling it 500K times, your performance will suck. I'm surprised it's finishing in hours. With MySQL you can insert hundreds or thousands of rows at a time:

INSERT INTO blah VALUES
  (?, ?, ?, ?),
  (?, ?, ?, ?),
  (?, ?, ?, ?)

If you're doing one insert per web request, you should consider logging to the file system and doing bulk imports via a crontab. I've used that design in the past to speed up inserts. It also means your webpages don't depend on the database server.

It's also much faster to use LOAD DATA INFILE to import a CSV file. See http://dev.mysql.com/doc/refman/5.1/en/load-data.html
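
For example (the file path, table name, and delimiters are placeholders):

LOAD DATA INFILE '/data/daily_upload.csv'
INTO TABLE daily_data
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;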

The other thing I can suggest is to be wary of the SQL hammer - you may not have SQL nails. Have you considered using a tool like Pig or Hive to generate optimized data sets for your reports?

EDIT

If you're having trouble batch-importing 500K records, you need to compromise somewhere. I would drop some indexes on your master table, then create optimized views of the data for each report.
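
For example (the table and column names are made up), a per-report view might look like:

CREATE VIEW report_daily_volume AS
SELECT ship_date, shipper, COUNT(*) AS shipments
FROM   history
GROUP BY ship_date, shipper;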

Gary Richardson
Data is initially loaded into the database with a LOAD DATA INFILE - it's plenty fast. Going to the history table is an INSERT INTO history SELECT columns... FROM current.
rfusca
It's definitely the indexes that are slowing down the insert, but we can't remove those - else the historical queries are waaaay too slow. We don't load any data from a webpage. The slow insert is from one table to another.
rfusca
I've not tried Pig or Hive; I'll have to investigate. I'm pretty sure our solution is going to have to be SQL-dependent - the whole analysis db is just too integrated into several other systems to suffer THAT much of a change.
rfusca
You can use Pig/Hive to pre-process the data into optimized formats for each report. You're basically doing a mini-ETL process at that point.
Gary Richardson