views: 148 · answers: 6

Hi,

I have an application where I receive 40,000 rows of data per day, and I have 5 million rows to handle in total (a 500 MB MySQL 5.0 database).

Currently, those rows are all stored in the same table, which makes it slow to update, hard to back up, and so on.

What kind of schema is used in this sort of application to keep the data accessible over the long term, without the problems of oversized tables, and with easy backups and fast reads/writes?

Is PostgreSQL better than MySQL for this purpose?

Thanks in advance

Best regards

A: 

First, huge data volumes are not always handled well in a relational database.

What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.

The files are formatted so that the database bulk loader will work quickly.

Second, no one analyzes huge data volumes in full. People rarely summarize all 5,000,000 rows; usually they want a subset.

So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
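
For example, a pre-filtered flat file can be pushed into a mart table with MySQL's bulk loader. A minimal sketch, where the mart_measurements table, its columns, and the file path are only illustrative assumptions:

    -- Hypothetical mart table holding only the subset a user asked for.
    CREATE TABLE mart_measurements (
        asset_id     INT NOT NULL,
        measured_at  DATETIME NOT NULL,
        value        DOUBLE,
        INDEX idx_asset_time (asset_id, measured_at)
    );

    -- Bulk-load a pre-filtered, tab-delimited flat file into the mart;
    -- LOAD DATA INFILE is far faster than row-by-row INSERTs.
    LOAD DATA INFILE '/data/subset_2010_03.tsv'
    INTO TABLE mart_measurements
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (asset_id, measured_at, value);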

This is one way to handle "Data Warehousing", which is what your problem sounds like.

S.Lott
+2  A: 

We have log tables of 100-200 million rows now, and it is quite painful.

  • backups are practically impossible; they would require several days of downtime.

  • purging old data is becoming too painful; it usually ties up the database for several hours.

So far we've only seen these solutions:

  • Backups: set up a MySQL slave, since backing up the slave doesn't impact the main DB. (We haven't done this yet, as the logs we load and transform come from flat files; we back up those files and can regenerate the DB in case of failure.)

  • Purging old data: the only painless way we've found is to introduce an integer column that identifies the current date and partition the tables on that key, one partition per day (this requires MySQL 5.1). Dropping old data is then just a matter of dropping a partition, which is fast.
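
A minimal sketch of that per-day partitioning, assuming a hypothetical logs table whose day_key column is filled in by the loader (MySQL 5.1+ syntax):

    -- Log table partitioned by an integer "day key" (e.g. 20100315).
    CREATE TABLE logs (
        day_key   INT NOT NULL,          -- e.g. 20100315, set at load time
        logged_at DATETIME NOT NULL,
        asset_id  INT NOT NULL,
        message   TEXT
    )
    PARTITION BY RANGE (day_key) (
        PARTITION p20100315 VALUES LESS THAN (20100316),
        PARTITION p20100316 VALUES LESS THAN (20100317)
    );

    -- Add the next day's partition ahead of time (e.g. from a daily cron job)...
    ALTER TABLE logs ADD PARTITION
        (PARTITION p20100317 VALUES LESS THAN (20100318));

    -- ...and purge an old day by dropping its partition instead of running
    -- a long DELETE that ties up the database.
    ALTER TABLE logs DROP PARTITION p20100315;

New partitions have to be created before that day's data arrives, but dropping a partition is quick, unlike the multi-hour purges mentioned above.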

If, in addition, you need to run transactions continuously against these tables (as opposed to just loading data every now and then and mostly querying it), you should probably look into InnoDB rather than the default MyISAM tables.
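
For instance, checking and converting an existing table (the logs name is just an example; rebuilding a large table this way takes a while):

    -- See which storage engine the table currently uses.
    SHOW TABLE STATUS LIKE 'logs';

    -- Convert from MyISAM to InnoDB for row-level locking and transactions.
    ALTER TABLE logs ENGINE = InnoDB;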

nos
The problem is that replication doesn't stay consistent with the master over long periods... inconsistencies are more frequent with monster tables...
A: 

First, make sure that your logging table is not over-indexed. By that I mean that every time you insert, update, or delete from a table, any indexes you have also need to be updated, which slows down the process. If you have a lot of indexes on your log table, take a critical look at them and decide whether they are really necessary. If not, drop them.
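
In MySQL, something like the following shows what indexes a table has and removes one that no query actually needs (log_events and idx_rarely_used are placeholder names):

    -- List every index on the logging table.
    SHOW INDEX FROM log_events;

    -- Drop an index nothing uses; every INSERT/UPDATE/DELETE gets cheaper.
    DROP INDEX idx_rarely_used ON log_events;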

You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
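
A rough sketch of such a monthly archive job, assuming a hypothetical log_events table with a logged_at column and a separate archive_db database:

    -- Create the archive table with the same structure as the live one.
    CREATE TABLE archive_db.log_events_2010_02 LIKE log_events;

    -- Copy last month's rows into the archive database...
    INSERT INTO archive_db.log_events_2010_02
    SELECT * FROM log_events
    WHERE  logged_at >= '2010-02-01' AND logged_at < '2010-03-01';

    -- ...then remove them from the live table.
    DELETE FROM log_events
    WHERE  logged_at >= '2010-02-01' AND logged_at < '2010-03-01';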

Paul Sasik
I need indexes on asset, datetime, and a treated flag due to the application...
+1  A: 

The general answer is: you probably don't need all that detail around all the time.

For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
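
A sketch of what such a roll-up could look like; Sales, DailySalesByLocation, and their columns are assumed names, not anything from the question:

    -- One pre-aggregated row per location per day instead of every raw sale.
    CREATE TABLE DailySalesByLocation (
        sale_date    DATE NOT NULL,
        location_id  INT  NOT NULL,
        total_amount DECIMAL(12,2) NOT NULL,
        sale_count   INT  NOT NULL,
        PRIMARY KEY (sale_date, location_id)
    );

    -- Nightly roll-up of one day's detail rows into the summary table.
    INSERT INTO DailySalesByLocation (sale_date, location_id, total_amount, sale_count)
    SELECT DATE(sold_at), location_id, SUM(amount), COUNT(*)
    FROM   Sales
    WHERE  sold_at >= '2010-03-15' AND sold_at < '2010-03-16'
    GROUP  BY DATE(sold_at), location_id;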

egrunin
Yes, sadly... the customers have the ability to export the data for the whole period. But a table per location could be a good solution in our case! Thanks for the idea.
+2  A: 

1 - 40,000 rows/day is not that big.

2 - Partition your data by insert date: you can easily delete old data this way.

3 - Don't hesitate to add a data mart step (compute frequently requested metrics in intermediate tables).

FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning); INSERT/UPDATE times stayed constant.

chburd
1. At this time it is not very big, but the number of assets is growing very fast... I have to find an architecture that will still work at 1,000,000 rows/day for this project. 2. Isn't it hard to select data across different partitions? I think Postgres will be more stable for that kind of project.
I am not an expert, but have a look at http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html for examples with partitions: it doesn't seem too complicated to set up, and selecting across partitions is easy.
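
For illustration, a very condensed version of the inheritance-based scheme those docs describe (the measures table and its columns are invented for the example):

    -- Parent table holds no data itself; each child carries one day.
    CREATE TABLE measures (
        asset_id    integer   NOT NULL,
        measured_at timestamp NOT NULL,
        value       double precision
    );

    -- One child table per day, with a CHECK constraint describing its range
    -- (the docs also show a trigger to route INSERTs to the right child).
    CREATE TABLE measures_2010_03_15 (
        CHECK (measured_at >= '2010-03-15' AND measured_at < '2010-03-16')
    ) INHERITS (measures);

    -- With constraint exclusion on, a query against the parent skips any
    -- child whose CHECK constraint rules it out.
    SET constraint_exclusion = on;
    SELECT * FROM measures WHERE measured_at >= '2010-03-15';

    -- Dropping an old day is just dropping its child table.
    DROP TABLE measures_2010_03_15;
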
chburd
A: 

This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.

CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.

They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.

From this SO post: http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/

Chris
Is it common to use that kind of database for the big tables and a traditional database for the rest of the application?
An app that uses a NoSQL database will often be integrated with standard relational databases. (In fact, I think one of the speakers at NoSQL Live said this last week.) You might find the above link of interest.
Chris