After an unfortunate misadventure with MySQL, I finally gave up on using it.

What do I have?

Large set of files in the following format:

ID1: String String String String
ID2: String String String String
ID3: String String String String
ID4: String String String String

What did I do?

Used MySQL on a powerful machine to import everything into a database in the following form:

ID1 String
ID1 String
ID1 String
ID1 String
...
...
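
The flattening itself is straightforward; here is a minimal Python sketch of that step, assuming the input format above (the filename is a placeholder):

def flatten(path):
    # Yield one (ID, string) pair per string on each "ID: s1 s2 ..." line.
    with open(path) as f:
        for line in f:
            ident, _, rest = line.partition(":")
            for s in rest.split():
                yield ident.strip(), s

for ident, s in flatten("input_part_001.txt"):
    print(ident, s)   # e.g. "ID1 String"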

What happened?

The database import was successful, but indexing is failing: apparently it needs more than 200 GB for 2 billion records. That is a reasonable requirement, but I simply don't have that much space, because the table itself already occupies about 240 GB after normalizing.

What am I planning to do?

I have a cluster of 20 nodes with about 80 GB of storage available to all of them combined (they all share an NFS mount). I have set up the nodes for distributed computing using Parallel Python, and I am planning to rewrite my logic to utilize the power of the cluster.

My Question:

I need to do a lot of the following type of lookups:

What IDs contain a given string?

For instance, given an arbitrary string "String1", I need to learn that, say, "ID1" and "ID2234" contain it.

I know of two methods for now:

  • Using Python to call grep
  • Having each of the 20 nodes take ownership of a subset of the files and, upon a search request, search only its own files (see the sketch after this list)
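
As a rough illustration of the second method, here is a hedged sketch that uses Parallel Python to fan a grep out over per-node file assignments. The node addresses and file paths are placeholders, and pp.Server/submit should be checked against your Parallel Python version (pp is a Python 2-era library):

import pp

def grep_ids(pattern, paths):
    # Run grep over this node's files; collect the ID of each matching line.
    import subprocess
    ids = set()
    for path in paths:
        out = subprocess.Popen(["grep", "-w", pattern, path],
                               stdout=subprocess.PIPE).communicate()[0]
        for line in out.splitlines():
            ids.add(line.split(":", 1)[0])   # "ID1: ..." -> "ID1"
    return ids

ppservers = ("node01:35000", "node02:35000")                  # placeholder nodes
assignments = [["/nfs/part_000.txt"], ["/nfs/part_001.txt"]]  # placeholder files
server = pp.Server(ppservers=ppservers)
jobs = [server.submit(grep_ids, ("String1", paths)) for paths in assignments]
print(set().union(*(job() for job in jobs)))   # e.g. set(['ID1', 'ID2234'])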

Can someone suggest a good approach to speed up this otherwise inefficient task?

A: 

I'd suggest looking at a non-relational database to support this. There are a number of key/value stores you could use to hold your data, which should be more efficient than a relational database for this kind of lookup. The NoSQL article on Wikipedia is a good place to start.
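
To make the key/value idea concrete, here is a minimal sketch using Python's built-in dbm module (Python 3) as a stand-in for BerkeleyDB or whichever store you choose; the index path and sample records are placeholders:

import dbm

# Build: key = string, value = comma-separated IDs that contain it.
with dbm.open("/nfs/string_index", "c") as db:
    for ident, s in [("ID1", "String1"), ("ID2234", "String1")]:
        old = db.get(s, b"")
        if old:
            db[s] = old + b"," + ident.encode()
        else:
            db[s] = ident.encode()

# Query: a single disk-backed hash probe instead of a table scan.
with dbm.open("/nfs/string_index", "r") as db:
    print(db.get("String1", b"").split(b","))   # [b'ID1', b'ID2234']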

EDIT: Are you using the most compact data types possible for the data in your database? Are your IDs stored in the smallest integer type that can hold their range? If your strings are ASCII, are you storing them as ASCII strings rather than Unicode (VARCHAR rather than NVARCHAR)?

SamStephens
@SamStephens: Thanks for the reply. I have been looking at them, but assuming I picked BerkeleyDB for my needs (as it has nice Python bindings), how useful would it be given that my cluster doesn't have much space and relies on an NFS mount? I mean, what would the workflow look like? Can you give me some insight from this perspective?
Legend
@Legend: I'm afraid I don't have experience with BerkeleyDB; my knowledge is theoretical. I simply imagined you'd use the NoSQL database instead of the SQL database. Having said that, now that I think about it further, a NoSQL database may not index your data any more compactly, and so may not solve your issue. I've voted Don's answer up; his point is certainly valuable. I am also editing my answer to ask more.
SamStephens
+1  A: 

For the requirement of looking up which IDs are associated with a given string, I suggest inverting the ID/string relation so that the records are keyed by unique strings, and the data associated with each key is the sequence of IDs containing that string. A string lookup can then be implemented with a binary search if the keys are sorted, or with a hash algorithm. This may shrink your data considerably if the same strings repeat often.
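
A minimal in-memory sketch of this inverted layout, with hypothetical sample data (on 2 billion records the sorted index would live on disk, but the lookup logic is the same):

import bisect

pairs = [("ID1", "String1"), ("ID2", "String2"), ("ID2234", "String1")]

# Invert: one record per unique string, holding every ID that contains it.
inverted = {}
for ident, s in pairs:
    inverted.setdefault(s, []).append(ident)
records = sorted(inverted.items())   # [('String1', ['ID1', 'ID2234']), ...]
keys = [k for k, _ in records]

def lookup(s):
    # Binary-search the sorted keys; return the IDs for string s.
    i = bisect.bisect_left(keys, s)
    if i < len(keys) and keys[i] == s:
        return records[i][1]
    return []

print(lookup("String1"))   # ['ID1', 'ID2234']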

Don O'Donnell