I was wondering if InnoDB would be the best storage engine for the table? The table contains one field, the primary key, and will get 816k rows a day (est.). This will get very large very quickly! I'm also working on a flat-file storage approach (would this be faster?). The table is going to store the Twitter IDs that have already been processed.

Also, is there any estimate of the memory usage of a SELECT min(id) statement? Any other ideas are greatly appreciated!
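Roughly, this is the table I have in mind (names are just placeholders):

    CREATE TABLE processed_tweets (
        id BIGINT UNSIGNED NOT NULL,  -- Twitter ID that has already been processed
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;  -- engine choice (InnoDB? MyISAM? flat file?) is the question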

Thanks, James Hartig

+1  A: 

If these ID numbers are monotonically increasing and your writes only append data (never modify it), it'll probably be a lot faster to use a single file. A SELECT min(id) then just becomes reading the first line of the file, and anything else is a binary search.

Ant P.
+3  A: 

I'd recommend you start partitioning your table by ID or date. Partitioning splits a large table into several smaller tables according to some defined logic (like splitting it by date ranges), which makes them much more manageable performance- and memory-wise. MySQL 5.1 has this feature built in, or you can implement it using custom solutions.
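For example, partitioning by ID range might look something like this (MySQL 5.1+; table and column names are assumed, and the boundaries are arbitrary):

    ALTER TABLE processed_tweets
    PARTITION BY RANGE (id) (
        PARTITION p0   VALUES LESS THAN (1000000000),
        PARTITION p1   VALUES LESS THAN (2000000000),
        PARTITION pmax VALUES LESS THAN MAXVALUE  -- catch-all for newer, higher IDs
    );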

If you implement storage in a flat file, you lose all the advantages of a database - you can no longer perform queries involving the data.

Eran Galperin
A: 

If you have an index on your id column, select min(id) should be O(1); there shouldn't be much of a memory requirement for this.

If your primary key is on the Twitter ID, then you already have an index on it.
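You can check this yourself with EXPLAIN (table name assumed; exact output varies by version):

    EXPLAIN SELECT MIN(id) FROM processed_tweets;
    -- With an index on id this typically reports "Select tables optimized away",
    -- i.e. the value is read straight from the index rather than by scanning rows.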

A: 

There is a good comparison of storage engines in the MySQL Dev Zone:

From your description I would say MyISAM would be better, but it depends quite a lot on the relative read and write patterns of your app.
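As a quick starting point, you can also list the engines your own server supports with:

    SHOW ENGINES;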

Christian Lescuyer
+2  A: 

The only definitive answer is to try both and test and see what happens.

Generally, MyISAM is faster for writes and reads, but not both at the same time. When you write to a MyISAM table, the entire table gets locked until the insert completes. InnoDB has more overhead but uses row-level locking, so reads and writes can happen concurrently without the problems that MyISAM's table locking incurs.
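A minimal way to run that comparison is to load the same data into one table per engine and benchmark your actual insert/select mix against each (table names assumed):

    CREATE TABLE processed_tweets_myisam (
        id BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=MyISAM;

    CREATE TABLE processed_tweets_innodb (
        id BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;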

However, your problem, if I understand it correctly, is a little different. Having only one column, and that column being the primary key, raises an important consideration: MyISAM and InnoDB handle primary key indexes in different ways.

In MyISAM, the primary key index is just like any other secondary index. Internally each row has a row id and the index nodes just point to the row ids of the data pages. A primary key index is not handled differently than any other index.

In InnoDB, however, primary keys are clustered, meaning they stay attached to the data pages and ensure that the row contents remain in physically sorted order on disk according to the primary key (but only within single data pages, which themselves could be scattered in any order.)

This being the case, I would expect that InnoDB might have an advantage, in that MyISAM would essentially have to do double work -- write the integer once in the data pages, and then write it again in the index pages. InnoDB wouldn't do this: the primary key index would be identical to the data pages, so each value would only be written once. It would only have to manage the data in one place, whereas MyISAM would needlessly have to manage two copies.

For either storage engine, doing something like min() or max() on an indexed column, or just checking whether a number exists in the index, should be trivial. Since the table has only one column, no bookmark lookups would even be necessary, as the data is represented entirely within the index itself. This should be a very efficient index.
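For example, all of these should be answerable from the index alone (table name assumed):

    SELECT MIN(id) FROM processed_tweets;
    SELECT MAX(id) FROM processed_tweets;
    SELECT 1 FROM processed_tweets WHERE id = 1234567890;  -- existence check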

I also wouldn't be all that worried about the size of the table. Where the width of a row is only one integer, you can fit a huge number of rows per index/data page.

ʞɔıu
A: 

With a single field that is the primary key, and records only ever being added, this is not really well suited to a regular database.

For a start, you're storing twice as much information as you need to, with every value going into both the data table and the index.

As an aside, relational databases are so called because, for one, they store related data in a single row; it's hard to see how your data qualifies :-) If you were storing other stuff as well, a database would be worth it.

You don't mention whether the data will be accessed by multiple processes at once - if not, then you don't need all the advantages conferred by database ACID principles. Even if you do want ACID, that can still be achieved without a full blown database.

My first thought would be to construct your own B-tree or B+-tree data file to store the Twitter IDs and avoid the data duplication. The only queries I can see you doing (based on the question) are:

  • select min(id) from tbl; and
  • select id from tbl where id = ?

The first can be made O(1) by simply storing the lowest ID in another file outside of the B-tree structure (and replacing it when you get a lower one). I'm not sure of the business case for this one, unless it's to quickly find out that a certain Twitter ID isn't in the table (in which case you'd probably want max as well).

The second is a standard tree search, which is what a database generally uses under the covers anyway.

paxdiablo
Well, I need to fill in any gaps in the table, which is easier with MySQL because the data will be completed by multiple scripts.
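Something like this (table and column names assumed) could find where the first gap starts:

    SELECT t1.id + 1 AS gap_start
    FROM processed_tweets t1
    LEFT JOIN processed_tweets t2 ON t2.id = t1.id + 1
    WHERE t2.id IS NULL
    ORDER BY t1.id
    LIMIT 1;
    -- Note: this also returns max(id) + 1 when there is no gap at all.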
James Hartig