I have a few large databases, each with more than 100 million records. They consist of the following:

  1. A unique key.
  2. An integer value, not unique, but used for sorting the query.
  3. A VARCHAR(200).

I have them in a MySQL MyISAM table now. My thought was, hey, I'll just set up a covering index on the data, and the results should come out reasonably fast. Queries are of the form...

select valstr,account 
    from datatable 
    where account in (12349809, 987987223,...[etc]) 
    order by orderPriority;
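
(For reference, the covering index I have in mind looks roughly like this; the names match the query above, so treat it as a sketch of the idea rather than my exact DDL.)

    -- Sketch of the covering index described above (names as in the query,
    -- column types are guesses). With (account, orderPriority, valstr) in one
    -- index, the query can in principle be answered from the index alone,
    -- without touching the table rows.
    alter table datatable
        add index idx_cover (account, orderPriority, valstr);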

This seemed OK in some tests, but on our newer installation it's terribly slow. It actually seems faster with no index at all, which seems odd.

In any case, I'm thinking maybe a different database? We use a data-warehousing DB for other parts of the system, but it's not well suited for anything involving text. Any free, or fairly cheap, DBs are an option, as long as they have reasonably useful API access. SQL optional.

Thanks in advance.

-Kevin

+1  A: 

Here's a reasonably sized example of a MySQL database using the InnoDB engine. It takes advantage of clustered indexes on a table with approx. 125 million rows, and the query runtime is 0.021 seconds, which seems fairly reasonable.

http://stackoverflow.com/questions/3534597/rewriting-mysql-select-to-reduce-time-and-writing-tmp-to-disk/3535735#3535735

http://pastie.org/1105206
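
As a rough sketch of the idea (column names taken from the question, not the schema in the links above): make account the InnoDB primary key, so the clustered index holds the rows and the key lookup and the row fetch become a single index dive.

    -- Rough sketch only, using the question's column names; types are guesses.
    -- With InnoDB, the PRIMARY KEY is the clustered index, so the rows are
    -- stored in key order and a lookup by account lands directly on the data.
    create table datatable (
        account       bigint unsigned not null,
        orderPriority int not null,
        valstr        varchar(200) not null,
        primary key (account)
    ) engine=InnoDB;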

Other useful links:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html

Hope it proves of interest.

f00
A: 

CouchDB will give you storage by key, and you can create views to do the query/sorting. A second option could be Cassandra, but there's a pretty big learning curve.

Matt Williamson
A: 

CouchDB, MongoDB, and Riak are all going to be good at finding the key (account) relatively quickly.

The problems you're going to have (with any solution) are tied to the "order by" and "account in" clauses.

Problem #1: account in

120M records likely means gigabytes of data, and probably an index of over a gig. The reason this is a problem is that your "in" clause can easily span the whole index: if you search for accounts "0000001" and "9999581", you probably need to load a lot of that index.

So just to find the records, your DB first has to load potentially a gig of index into memory. Then to actually fetch the data it has to go back to the disk again. If the "accounts" in your in clause are not "close together", then you're going back multiple times to fetch various blocks. At some point it may be quicker to just do a table scan than to load the index and then the table.
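
As a rough illustration with standard MySQL commands (table name taken from the question), you can compare how big that index actually is against the caches the server has for it:

    -- Data_length and Index_length show how much data/index is in play.
    show table status like 'datatable';

    -- Compare against the caches actually available to the server.
    show variables like 'key_buffer_size';          -- MyISAM index cache
    show variables like 'innodb_buffer_pool_size';  -- InnoDB buffer pool, if you switch engines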

Then you get to problem #2...

Problem #2: order by

If you have a lot of data coming back from the "in" clause, then order by is just another layer of slowness. With an "order by", the server can't stream you the data as it finds it. Instead it has to load all of the matching records into memory, sort them, and then stream them.
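
As a rough illustration (again using the question's names), EXPLAIN shows both problems at once: the type column tells you how much of the index or table gets read, and "Using filesort" in the Extra column means the server is buffering and sorting the whole result before it can send the first row.

    -- Sketch only: run EXPLAIN on the real query to see what the optimizer does.
    explain select valstr, account
        from datatable
        where account in (12349809, 987987223)
        order by orderPriority;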

Solutions:

  1. Have lots of RAM. If the RAM can't fit the entire index, then the loads will be slow.
  2. Try limiting the number of "in" items. Even 20 or 30 items in this clause can make the query really slow (see the sketch after this list for one way around a huge list).
  3. Try a Key-Value database?
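
As one possible way to tame a very long "in" list (a sketch of my own; the temp-table name is made up, the rest follows the question), load the wanted accounts into a temporary table and join against it instead of passing a giant literal list:

    -- Sketch: replace a huge IN (...) list with a join against a keyed temp table.
    -- The table "wanted_accounts" is made up for this example.
    create temporary table wanted_accounts (account bigint unsigned primary key);

    insert into wanted_accounts values (12349809), (987987223);  -- etc.

    select d.valstr, d.account
        from datatable d
        join wanted_accounts w on w.account = d.account
        order by d.orderPriority;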

I'm a big fan of K/V databases, but you have to look at point #1. If you don't have a lot of RAM and you have lots of data, then the system is going to run slowly no matter what DB you use. That RAM / DB size ratio is really important if you want good performance in these scenarios (small look-ups in big datasets).

Gates VP