I have a project in which I'm doing data mining on a large database. I currently store all of the data in text files, and I'm trying to understand the costs and benefits of storing the data in a relational database instead. The points look like this:

CREATE TABLE data (
    source1 CHAR(5),
    source2 CHAR(5),
    idx11   INT,
    idx12   INT,
    idx21   INT,
    idx22   INT,
    point1  FLOAT,
    point2  FLOAT
);

How many points like this can I have with reasonable performance? I currently have ~150 million data points, and I probably won't have more than 300 million. Assume that I am using a box with 4 dual-core 2GHz Xeon CPUs and 8GB of RAM.

+7  A: 

PostgreSQL should be able to amply accommodate your data -- up to 32 Terabytes per table, etc, etc. If I understand correctly, you're talking about 5 GB currently, 10 GB max (about 36 bytes/row and up to 300 million rows), so almost any database should in fact be able to accommodate you easily.
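Once the data is loaded, the actual on-disk footprint can be checked directly rather than estimated; a minimal PostgreSQL sketch, assuming the table is named data as in the question:

SELECT pg_size_pretty(pg_total_relation_size('data'));

pg_total_relation_size includes the table's indexes and TOAST storage, so it reports the full footprint.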

Alex Martelli
+1 for postgres. If you're going to do any stat work on the data (and "data mining" implies you will), then with postgres you can use PL/R, and it can make your life easier.
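A rough sketch of what that can look like, assuming the PL/R extension is installed; r_median is a hypothetical wrapper name, and arg1 is how PL/R exposes the first unnamed argument:

-- Wrap R's median() so it can be called from SQL (requires PL/R).
CREATE OR REPLACE FUNCTION r_median(float8[]) RETURNS float8 AS '
    median(arg1)
' LANGUAGE 'plr';

-- Example use against the question's table (array_agg needs PostgreSQL 8.4+).
SELECT source1, r_median(array_agg(point1)) AS median_point1
FROM data
GROUP BY source1;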
rfusca
+2  A: 

MySQL is more than capable of serving your needs, as is Alex's suggestion of PostgreSQL. Reasonable performance shouldn't be difficult to achieve, but if the table is going to be heavily accessed and see a large amount of DML, you will want to know more about the locking used by the database you end up choosing.

I believe PostgreSQL uses row-level locking out of the box, whereas MySQL depends on the storage engine you choose. MyISAM only locks at the table level, and thus concurrency suffers, but storage engines such as InnoDB can and will use row-level locking to increase throughput. My suggestion would be to start with MyISAM and move to InnoDB only if you find you need row-level locking. MyISAM works well in most situations and is extremely lightweight. I've had tables of over 1 billion rows in MySQL using MyISAM, and with good indexing and partitioning you can get great performance. You can read more about storage engines in MySQL at MySQL Storage Engines and about table partitioning at Table Partitioning. Here is an article on partitions in practice on a table of 113M rows that you may find useful as well.
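To make the MyISAM/partitioning suggestion concrete, here is a hedged sketch of the question's table declared with an explicit storage engine and range partitioning (requires MySQL 5.1+; the index name and partition boundaries are made up for illustration):

CREATE TABLE data (
    source1 CHAR(5),
    source2 CHAR(5),
    idx11   INT,
    idx12   INT,
    idx21   INT,
    idx22   INT,
    point1  FLOAT,
    point2  FLOAT,
    KEY idx_source_range (source1, idx11)   -- supports lookups by source and index
) ENGINE=MyISAM
PARTITION BY RANGE (idx11) (
    PARTITION p0 VALUES LESS THAN (1000),
    PARTITION p1 VALUES LESS THAN (2000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);

Queries that filter on idx11 can then prune to a single partition instead of scanning all 300 million rows.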

I think the benefits of storing the data in a relational database far outweigh the costs. There are so many things you can do once your data is within a database: point-in-time recovery, data integrity enforcement, finer-grained security access, partitioning of data, availability to other applications through a common language (SQL), and so on.

Good luck with your project.

RC
+3  A: 

FYI: Postgres scales better than MySQL on multi-processor / overlapping requests, from a review I was reading a few months back (sorry, no link).

I assume from your profile this is some sort of bioinformatics (codon sequences, enzyme vs. protein amino acid sequences, or some such) problem. If you are going to attack this with concurrent requests, I'd go with Postgres.

OTOH, if the data is going to be loaded once, then scanned by a single thread, maybe MySQL in its "ACID not required" mode would be the best match.
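For that load-once pattern, bulk loading beats row-by-row inserts; a minimal MySQL sketch, assuming the text files are tab-delimited and live at the hypothetical path /data/points.tsv:

-- Skip index maintenance during the bulk load (MyISAM), then rebuild once.
ALTER TABLE data DISABLE KEYS;

LOAD DATA INFILE '/data/points.tsv'
INTO TABLE data
FIELDS TERMINATED BY '\t'
(source1, source2, idx11, idx12, idx21, idx22, point1, point2);

ALTER TABLE data ENABLE KEYS;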

You've got some planning to do around your access patterns and use case(s) before you can select the "best" stack.

Roboprog
There will almost certainly be no concurrent requests; this is a database only for myself. I'd just like to replace a lot of my hacky loops over text files with SQL queries, because it will make things smaller and less likely to contain bugs. Thanks for the tip!
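As an illustration of that kind of loop replacement (the filter values below are made up), a single aggregate query can stand in for a hand-rolled pass over the text files:

SELECT source1, source2, COUNT(*) AS n, AVG(point1) AS avg_point1
FROM data
WHERE idx11 BETWEEN 100 AND 200
GROUP BY source1, source2;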
James Thompson